Data mining from the web with Python Mechanize Browser Automation - Big data tutorial
In this example the Python Mechanize package is used for browser automation. Selenium is much more feature-rich (and also a bit more difficult to use) and is the better choice when data has to be mined from websites that rely heavily on Javascript and Ajax, or when automated test cases are to be built. A Selenium browser automation example can be found here. This richly commented Python Mechanize Browser Automation example does the following:
- Logs in to a webpage with custom credentials
- Opens another page, reusing the session created at login
- Finds a form and two of its input fields
- Adds multiple list elements in a for cycle to the first field
- Adds multiple list elements in another for cycle to the second field
- Lists the results of the above query
- The list can have multiple sub pages (the list shows limited results and has a pager)
- Checks all the links of the query matching a pattern
- LABEL: PAGE - Opens all the matching links from the current result page
- Mines all the unfiltered data from the opened pages
- Follows the link to the next result page and goes back to the LABEL: PAGE step recursively
- End of the for cycle over the second field
- End of the for cycle over the first field
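The two for cycles in the steps above visit every combination of the two fields. A minimal sketch of that order, using the same placeholder values as the example; the helper name `combinations_to_query` is hypothetical and is not part of Mechanize:

```python
import itertools

# Placeholder dropdown values, the same ones the example uses
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']


def combinations_to_query(first_values, second_values):
    # itertools.product yields the pairs in the same order as two nested
    # for cycles: the second field changes fastest
    return list(itertools.product(first_values, second_values))


# One form submission would be made for each printed pair
for first, second in combinations_to_query(dropdownelements, dropdownelements2):
    print(first, second)
```

With two values in each list this gives four form submissions, and every one of them can produce several result pages.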
# Import Python Mechanize Browser Automation Library
import mechanize
# Import regular Expressions
import re
import collections
# Dropdown values for the forms
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']
# Initialize the browser
browser = mechanize.Browser()
browser.set_handle_robots(False)
# Simulate Firefox
browser.addheaders = [('User-agent', 'Firefox')]
# Open the page
browser.open("http://www.tobecheckedpage")
# Select the form named login, e.g. <form name="login"> in the HTML
browser.select_form(name="login")
# Find the form element named login and add myloginid to it
browser["login"] = "myloginid"
# Find the form element named password and add mypassword to it
browser["password"] = "mypassword"
# Click on the submit button
response = browser.submit()
# Recursive checker for multi-page search results. It is defined before the
# main loop because Python needs the function to exist when it is called.
def checklinks(browser, dropdownelement, dropdownelement2):
    # Compile the patterns once; '/RegExp' is a placeholder for the real pattern
    detailpattern = re.compile('/RegExp')
    nextpattern = re.compile('Next >')
    # Get all the links on the search result page as a list, because
    # following a link changes the page the browser is on
    links = list(browser.links())
    # For all links do:
    for link in links:
        # If the href of the link contains RegExp
        if detailpattern.search(link.url):
            # Then open the link
            resp = browser.follow_link(link)
            # And store its content
            content = resp.get_data()
            # Do data mining in the content of the detail page resulting from the query result page
            # Go back to the result page (browser2 = browser would only be
            # a second name for the same browser, not a second browser)
            browser.back()
    # Another for cycle checking all the links on the result page (again)
    for link in links:
        # Check if the link has the text 'Next >' (the text can be missing)
        if link.text and nextpattern.search(link.text):
            # Then follow the link (go to next page of the results)
            browser.follow_link(link)
            # Call recursive checker for the next page of the results
            checklinks(browser, dropdownelement, dropdownelement2)
            # There is only one next page, so stop looking
            break

# For all dropdownelements
for dropdownelement in dropdownelements:
    # For all dropdownelements2
    for dropdownelement2 in dropdownelements2:
        # Open a page
        page = browser.open("http://www.tobecheckedpage/page")
        # Get the response to console
        print(page.read())
        # Select the form named code in the HTML
        browser.select_form(name="code")
        # Find the dropdown named options and select dropdownelement
        # (Mechanize expects a list of values for dropdowns)
        browser.form['options'] = [dropdownelement]
        # Find input element keywords and add dropdownelement2 as value
        # (use a one-item list here too if keywords is a dropdown)
        browser.form['keywords'] = dropdownelement2
        # Submit the form
        browser.submit()
        # Call recursive checker - as form has been sent, we have the results of the query
        checklinks(browser, dropdownelement, dropdownelement2)
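The link filtering can be tried out without a website. The sketch below is only an illustration under a few assumptions: `FakeLink` is a hypothetical stand-in that has the same `url` and `text` attributes as a Mechanize link, `split_links` is a hypothetical helper name, and '/RegExp' is still the placeholder pattern from the example.

```python
import collections
import re

# Hypothetical stand-in for a Mechanize link (only .url and .text are used)
FakeLink = collections.namedtuple('FakeLink', ['url', 'text'])

DETAIL_PATTERN = re.compile('/RegExp')  # placeholder pattern from the example
NEXT_PATTERN = re.compile('Next >')


def split_links(links):
    """Return (detail_links, next_link) for one result page."""
    # Links whose href matches the pattern lead to detail pages
    detail_links = [link for link in links if DETAIL_PATTERN.search(link.url)]
    next_link = None
    for link in links:
        # The link text can be missing, e.g. for image links
        if link.text and NEXT_PATTERN.search(link.text):
            next_link = link
            break
    return detail_links, next_link


# A made-up result page with two detail links and a pager link
page = [
    FakeLink('/RegExp/item1', 'Item 1'),
    FakeLink('/about', 'About'),
    FakeLink('/RegExp/item2', None),
    FakeLink('/results?page=2', 'Next >'),
]
details, next_link = split_links(page)
print([link.url for link in details])
print(next_link.url)
```

When there is no pager link, `next_link` is None, and that is the point where the recursion over the result pages stops.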