Data mining from the web with Python Mechanize Browser Automation – Big data tutorial
In this example the Python Mechanize package is used for browser automation. Selenium is much more feature-rich (and also somewhat harder to use); it is the better choice when mining data from JavaScript- and Ajax-heavy websites or when building automated test cases. A Selenium browser automation example can be found here.
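For contrast, here is a minimal Selenium sketch of the same login step (a sketch only; it assumes Selenium 4 with a locally available Chrome, and reuses the hypothetical page URL and field names of the Mechanize code below):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a Chrome session (Selenium 4 resolves the driver automatically)
driver = webdriver.Chrome()
driver.get("http://www.tobecheckedpage")
# Fill the same hypothetical login form as in the Mechanize example
driver.find_element(By.NAME, "login").send_keys("myloginid")
driver.find_element(By.NAME, "password").send_keys("mypassword")
# Submit the form that encloses the password field
driver.find_element(By.NAME, "password").submit()
driver.quit()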
This richly commented Python Mechanize browser automation example does the following:
- Logs in to a webpage with custom credentials
- Opens another page using the session established at login
- Finds a form and two of its input fields (a sketch for discovering the form and field names follows this list)
- Adds multiple list elements to the first field in a for loop
- Adds multiple list elements to the second field in another for loop
- Lists the results of the above query
- The list can have multiple sub pages (the list shows a limited number of results and has a pager)
- Checks all links of the query result that match a pattern
- LABEL: PAGE – Opens all the matching links of the current result page
- Mines all the unfiltered data from the opened pages
- Follows the pager's 'Next >' link and goes back to the LABEL: PAGE step recursively
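The form and field names used below (login, code, options, keywords) are placeholders. When the real names are not known, Mechanize can list every form and control on the current page, which is a quick way to discover them:

# List all forms and their controls on the current page to find their names
for form in browser.forms():
    print(form.name)
    for control in form.controls:
        print('  %s: %s' % (control.type, control.name))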
# Import the Python Mechanize browser automation library
import mechanize
# Import regular expressions
import re

# Dropdown values for the search form
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']

# The recursive checker that walks all sub pages of the search results;
# read the main part of the code below first
def checklinks(browser, dropdownelement, dropdownelement2):
    # Snapshot the links of the result page first: following a link changes
    # the browser state, so browser.links() must not be iterated while navigating
    links = list(browser.links())
    for link in links:
        # Filter the links that have RegExp in their href
        if re.search('/RegExp', link.url):
            # Open the matching link
            resp = browser.follow_link(link)
            # And store its content
            content = resp.get_data()
            # Do the data mining in the content of the detail page here
            # Go back to the result page before checking the next link
            browser.back()
    # A second pass over the same links, this time looking for the pager
    for link in links:
        # Check if the link has the text 'Next >'
        if re.search('Next >', link.text):
            # Follow the link (go to the next page of the results)
            browser.follow_link(link)
            # LABEL: PAGE - check the next result page recursively
            checklinks(browser, dropdownelement, dropdownelement2)

# Initialize the browser
browser = mechanize.Browser()
browser.set_handle_robots(False)
# Simulate Firefox
browser.addheaders = [('User-agent', 'Firefox')]

# Open the login page
browser.open("http://www.tobecheckedpage")
# Select the form named login, e.g. in the HTML <form name="login">
browser.select_form(name="login")
# Find the form element named login and add myloginid to it
browser["login"] = "myloginid"
# Find the form element named password and add mypassword to it
browser["password"] = "mypassword"
# Click on the submit button
response = browser.submit()

# For all combinations of the two dropdown value lists
for dropdownelement in dropdownelements:
    for dropdownelement2 in dropdownelements2:
        # Open the search page - the login session is kept by the browser
        page = browser.open("http://www.tobecheckedpage/page")
        # Print the response to the console
        print(page.read())
        # Select the form named code, e.g. in the HTML <form name="code">
        browser.select_form(name="code")
        # Find the input element named options and set dropdownelement as its value
        browser.form['options'] = dropdownelement
        # Find the input element named keywords and set dropdownelement2 as its value
        browser.form['keywords'] = dropdownelement2
        # Submit the form
        browser.submit()
        # Call the recursive checker - the form has been sent, so the browser
        # now holds the results of the query
        checklinks(browser, dropdownelement, dropdownelement2)
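The "do the data mining" step above is left open on purpose, since it depends entirely on the target pages. As a minimal sketch, assuming the interesting values sit in ordinary HTML tags (the tag choice below is hypothetical), the stored content can be mined with the already imported re module:

def mine(content):
    # resp.get_data() may return bytes depending on the Python/mechanize version
    if isinstance(content, bytes):
        content = content.decode('utf-8', 'replace')
    # Hypothetical example: grab the page title and all table cell values
    title = re.search(r'<title>(.*?)</title>', content, re.S)
    cells = re.findall(r'<td[^>]*>(.*?)</td>', content, re.S)
    return title.group(1).strip() if title else None, cells

For anything beyond trivial patterns, a real HTML parser such as BeautifulSoup is the safer choice.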