Data mining from the web with Selenium Python WebDriver Browser Automation - Big data tutorial
In this example, the Selenium web test automation framework drives Firefox for browser automation. Selenium is much more feature-rich than Python Mechanize (which is covered in a separate example), and it is also somewhat harder to use. Because it controls a complete web browser, it can automate the parsing of pages that rely heavily on JavaScript and AJAX; for such pages it has to be used instead of Python Mechanize for data mining and automated testing. This richly commented Selenium Python WebDriver browser automation example does the following:
- Logs in to a web page with custom credentials
- Clicks on a link
- Finds a form and two of its input fields
- Loops over a list of values for the first field
- In a nested loop, loops over a list of values for the second field
- Submits the form and lists the results of the query
- The result list can span multiple pages (each page shows a limited number of results and has a pager)
- Collects all the result links that are inside a given container
- LABEL: PAGE - Opens every link on the current result page
- Mines specific data from the opened pages
- Moves to the next result page and goes back to the LABEL: PAGE step recursively
- Continues with the next value for the second field (end of the inner loop)
- Continues with the next value for the first field (end of the outer loop)
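The two loops in the steps above visit every combination of a first-field value and a second-field value. A minimal sketch of that iteration order, using the placeholder values from the script and `itertools.product` from the standard library; the helper name `field_combinations` is made up for this illustration and is not part of the original script:

```python
from itertools import product

# Placeholder dropdown values, the same ones the script uses
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']


def field_combinations(first_values, second_values):
    """Return every (first, second) pair in the order two nested for loops visit them."""
    return list(product(first_values, second_values))


for element, element2 in field_combinations(dropdownelements, dropdownelements2):
    # In the real script, this is where the form is filled in and submitted
    print(element, element2)
```

Each pair corresponds to one form submission, so two values per field mean four queries in total.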
# Selenium Python documentation at: http://selenium-python.readthedocs.org/en/latest/
import collections
import time
# Import Selenium and its supporting packages; Selenium has to be installed first
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
from selenium.webdriver.common.keys import Keys
# Recursive helper that checks every page of the search results; read the main part of the code first
def checklinks(driver, element, element2):
    # Store the URL of the current search result page
    mainurl = driver.current_url
    # Get all the links on the search result page; links is a Python list of elements
    # The XPath can be found with the browser's developer tools, e.g. Firebug in Firefox
    # (the XPath values in this function are truncated placeholders; complete them for the real page)
    links = driver.find_elements(By.XPATH, "/html/body/")
    # Remember each link's URL and text before navigating away:
    # the link elements become stale as soon as another page is loaded
    targets = [(link.get_attribute("href"), link.text) for link in links]
    # For all links
    for href, linktext in targets:
        # Open the link in the browser
        driver.get(href)
        # If there is an element on the page at the given XPath
        if driver.find_elements(By.XPATH, "/html/body/div[4]/div/div[2]/div/"):
            # Print the elements, the link text and the text of the container at the specified XPath
            print(element, element2, linktext, driver.find_element(By.XPATH, "/html/body/div[4]/div/div[2]/div/div[2]/div/").text)
        else:
            # Otherwise print the elements and a 0
            print(element, element2, linktext, '0')
    # Go back to the search result page after all of its links have been parsed
    driver.get(mainurl)
    # Get the Next > link on the pager of the search result page;
    # find_elements returns an empty list instead of raising when there is no such link
    nextlinks = driver.find_elements(By.LINK_TEXT, 'Next >')
    # If there is a Next > link, i.e. we are not on the last page
    if nextlinks:
        # Open the next page
        driver.get(nextlinks[0].get_attribute("href"))
        # Recursively check its links by calling this function again
        checklinks(driver, element, element2)
# Dropdown values for the forms
dropdownelements = ['Selection1', 'Selection2']
dropdownelements2 = ['Selection10', 'Selection20']
# Open Firefox
driver = webdriver.Firefox()
# Open the webpage
driver.get("http://www.tobecheckedpage/")
# Find the form element named login, i.e. the one with name="login" in the HTML, and store it in the username variable
username = driver.find_element(By.NAME, 'login')
# Type your user name into the login box
username.send_keys("myloginid")
# Find the form element named password
password = driver.find_element(By.NAME, 'password')
# Type your password into the password box
password.send_keys("password")
# Click on the button named signin
driver.find_element(By.NAME, 'signin').click()
# Find the link whose visible text on the page is Link and click on it
driver.find_element(By.LINK_TEXT, 'Link').click()
# For all dropdownelements
for element in dropdownelements:
    # For all dropdownelements2
    for element2 in dropdownelements2:
        # Open a page
        driver.get("http://www.tobecheckedpage/page")
        # Find the dropdown by its ID, i.e. the element with id="code" in the HTML
        form = driver.find_element(By.ID, 'code')
        # The form is a dropdown; go through all the selectable options
        for option in form.find_elements(By.TAG_NAME, 'option'):
            # If the text of the current option matches element
            if option.text == element:
                # Then click on the option to select it
                option.click()
        # Find another form input element, the one with the id 'keywords'
        keywordsform = driver.find_element(By.ID, 'keywords')
        # May be needed if Ajax is on the page and the form changes;
        # for more complex waits, e.g. waiting for an element to show up, see http://selenium-python.readthedocs.org/en/latest/waits.html
        time.sleep(1)
        # Type element2 into the keywords field
        keywordsform.send_keys(element2)
        # Click on the button that has the class submit, i.e. class="submit" in the HTML
        driver.find_element(By.CLASS_NAME, 'submit').click()
        # Call the recursive checker - the form has been sent, so the results of the query are shown
        checklinks(driver, element, element2)
# Close the browser
driver.close()
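
The recursive `checklinks` call adds one stack frame per result page, so a very long result list could run into Python's recursion limit. A minimal sketch of the same paging logic written as a loop follows. It is deliberately independent of Selenium: `fetch_page` is a hypothetical callback that, in the real script, would load the URL with `driver.get` and return the result links together with the URL of the `Next >` link (or `None` on the last page). The names `crawl_result_pages` and `fake_site` are illustrative only.

```python
def crawl_result_pages(first_url, fetch_page):
    """Walk a paged result list iteratively and return all result links in order.

    fetch_page(url) is a hypothetical callback returning (links, next_url),
    where next_url is None on the last page.
    """
    collected = []
    visited = set()
    url = first_url
    # Stop on the last page, or if the pager loops back to a page already seen
    while url is not None and url not in visited:
        visited.add(url)
        links, next_url = fetch_page(url)
        collected.extend(links)
        url = next_url
    return collected


# Stand-in for a real site: page URL -> (result links, next page URL)
fake_site = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], "page3"),
    "page3": (["d"], None),
}

all_links = crawl_result_pages("page1", fake_site.__getitem__)
print(all_links)
```

The `visited` set is an extra safeguard that the original script does not have; it keeps the loop from running forever if a pager ever links back to an earlier page.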