Data mining from webpages with Python Mechanize Browser Automation - a Big data tutorial


This example uses the Python Mechanize package for browser automation. Selenium is much more feature-rich (and also somewhat harder to use); it is the better choice for mining data from JavaScript- and Ajax-heavy websites or for building automated test cases. A Selenium browser automation example can be found here. This richly commented Python Mechanize browser automation example does the following:

- Logs in to a webpage with custom credentials
- Opens another page, reusing the session established at login
- Finds a form and two of its input fields
  - Sets each list element on the first field in a for loop
    - Sets each list element on the second field in a nested for loop
      - Lists the results of the above query
      - The list can span multiple sub-pages (it shows a limited number of results and has a pager)
        - LABEL: PAGE - Checks all the links on the result page against a pattern
        - Opens every matching link
        - Mines all the unfiltered data from the opened pages
        - Follows the link to the next page and goes back to the LABEL: PAGE step recursively
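
The paging logic in the last steps can be sketched without a browser. This is only a minimal illustration, not the Mechanize code itself: the `PAGES` dictionary, the `/detail/` pattern, and the `crawl` function are hypothetical stand-ins for the result pages, the link pattern, and the recursive checker.

```python
import re

# Hypothetical in-memory "site": page name -> list of (link text, link URL).
PAGES = {
    "results?page=1": [("Item A", "/detail/a"), ("About", "/about"),
                       ("Next >", "results?page=2")],
    "results?page=2": [("Item B", "/detail/b"), ("Next >", "results?page=3")],
    "results?page=3": [("Item C", "/detail/c")],
}

# Assumed pattern that identifies links to detail pages.
DETAIL_PATTERN = re.compile(r"/detail/")


def crawl(page, mined=None):
    """Collect matching link URLs from `page` and from every following page."""
    if mined is None:
        mined = []
    links = PAGES[page]
    # Check all links on the result page and "mine" the ones that match.
    for text, url in links:
        if DETAIL_PATTERN.search(url):
            mined.append(url)
    # LABEL: PAGE - follow the pager link, if there is one, and repeat.
    for text, url in links:
        if text == "Next >":
            crawl(url, mined)
            break
    return mined


print(crawl("results?page=1"))  # ['/detail/a', '/detail/b', '/detail/c']
```

The recursion ends on the first page that has no 'Next >' link, which is how the real checker stops as well.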

# Import the Python Mechanize browser automation library
import mechanize
# Import regular expressions
import re
import collections



# Dropdown values for the forms
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']



# Initialize the browser
browser = mechanize.Browser()
# Do not fetch or obey robots.txt
browser.set_handle_robots(False)
# Simulate Firefox
browser.addheaders = [('User-agent', 'Firefox')]
# Open the page
browser.open("http://www.tobecheckedpage")
# Select the form named login, e.g. <form name="login"> in the HTML
browser.select_form(name="login")



# Find the form element named login and set it to myloginid
browser["login"] = "myloginid"
# Find the form element named password and set it to mypassword
browser["password"] = "mypassword"
# Click on the submit button
response = browser.submit()



# The recursive checker that walks the multi-page search results.
# It is defined before the main loop because Python must know a function before it is called.
def checklinks(browser, dropdownelement, dropdownelement2):
    # Take a snapshot of the links on the result page, because following a link
    # changes the page the browser is on
    links = list(browser.links())
    # Compile the pattern that identifies the links of interest
    pattern = re.compile('/RegExp')
    # For all links on the search query page do:
    for link in links:
        # If the href of the link contains RegExp
        if pattern.search(link.url):
            # Then open the link
            resp = browser.follow_link(link)
            # And store its content
            content = resp.get_data()
            # Do data mining in the content of the detail page resulting from the query result page
            # Go back to the result page before opening the next link
            browser.back()
    # Another for loop checking all the links on the result page (again)
    for link in links:
        # Check if the link has the text 'Next >' (link.text can be None, e.g. for image links)
        if link.text and re.search('Next >', link.text):
            # Then follow the link (go to the next page of the results)
            browser.follow_link(link)
            # Call the recursive checker on the next page of the results
            checklinks(browser, dropdownelement, dropdownelement2)
            # The remaining pages were handled by the recursive call
            break


# For all dropdownelements
for dropdownelement in dropdownelements:
    # For all dropdownelements2
    for dropdownelement2 in dropdownelements2:
        # Open a page
        page = browser.open("http://www.tobecheckedpage/page")
        # Print the response to the console
        print(page.read())
        # Find the form by name, e.g. <form name="code"> in the HTML
        browser.select_form(name="code")
        # Find input element options and set dropdownelement as its value
        # (if the control is a <select>, Mechanize expects a list: [dropdownelement])
        browser.form['options'] = dropdownelement
        # Find input element keywords and set dropdownelement2 as its value
        browser.form['keywords'] = dropdownelement2
        # Submit the form
        browser.submit()
        # Call the recursive checker - as the form has been sent, we have the results of the query
        checklinks(browser, dropdownelement, dropdownelement2)
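
The code above leaves the data mining step on the detail page open. One possible shape for it, using only the Python 3 standard library, is sketched here. It is an assumption for illustration, not part of the original example: the `CellCollector` class and the `mine` function are hypothetical names, the detail page is assumed to hold its data in table cells, and the content is assumed to be already decoded to a string.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # A new cell starts: open an empty text buffer for it.
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept.
        if self.in_cell:
            self.cells[-1] += data


def mine(content):
    """Return the stripped text of all table cells found in `content`."""
    parser = CellCollector()
    parser.feed(content)
    parser.close()
    return [cell.strip() for cell in parser.cells]


sample = "<table><tr><td>Name</td><td> Alice </td></tr></table>"
print(mine(sample))  # ['Name', 'Alice']
```

In the recursive checker, a call such as `mine(content)` would go at the place of the data mining comment, after decoding the bytes returned by `resp.get_data()` with the encoding of the page.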