Selenium - web browser automation

Recently I was playing around with Selenium, the powerful front-end automation testing tool. Here are some examples I created to automate simple routine work.

First, we need a browser driver registered on the PATH; ChromeDriver (for Chrome) and geckodriver (for Firefox) are common ones. Typically the chromedriver executable goes in /usr/local/bin/chromedriver (or chromedriver.exe in C:/Users/%USERNAME/AppData/Local/Google/Chrome/Application/ on Windows); don’t forget to register this path, or specify it explicitly when creating the driver.

# prerequisite: sudo apt-get install unzip
wget -N https://chromedriver.storage.googleapis.com/2.38/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver

# move the binary to a shared location and symlink it onto the PATH
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
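To confirm the driver is wired up correctly, a quick smoke test like the one below (a minimal sketch assuming the Linux path above; plain webdriver.Chrome() also works once the driver is on the PATH) should open Chrome, print the page title, and exit:

#!/usr/bin/env python3
from selenium import webdriver

# launch Chrome through the driver installed above and load a test page
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.get('https://example.com')
print(browser.title)
browser.quit()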

Second, write a Python script. Selenium can simulate user browsing actions. Here I provide three templates for sample usage: logging in with a username and password, getting content URLs, and downloading all URLs. By digging for patterns on the page, one can write scripts that make Selenium automate a sequence of actions on web pages and extract useful information.

Template 1:

#!/usr/bin/env python3
# Log in to Facebook
from selenium import webdriver

usrStr, pwdStr = 'foo', 'bar'
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.get('http://www.facebook.com')

# fill in the login form and submit it
browser.find_element_by_id("email").send_keys(usrStr)
browser.find_element_by_id("pass").send_keys(pwdStr)
browser.find_element_by_id('loginbutton').click()
browser.close()
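
One caveat: click() returns as soon as the click is sent, so closing the browser immediately can cut the login short. Selenium's explicit waits are more robust than a fixed sleep; as a sketch, the last two lines of Template 1 could be replaced with something like the following (the 10-second timeout and the URL-change heuristic are my own assumptions):

from selenium.webdriver.support.ui import WebDriverWait

# capture the URL before clicking, then wait (up to 10 s) for it to change,
# which is a rough signal that the login round-trip has finished
login_url = browser.current_url
browser.find_element_by_id('loginbutton').click()
WebDriverWait(browser, 10).until(lambda d: d.current_url != login_url)
browser.close()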

Template 2:

# Log in Pinterest, get image url
import time, re
from selenium import webdriver

def getImageURL(url):
    # wire the download preferences into the Chrome options before launching
    options = webdriver.ChromeOptions()
    prefs = {'profile.default_content_settings.popups': 0, 'download.default_directory': "C:\\Downloads\\"}
    options.add_experimental_option('prefs', prefs)
    browser = webdriver.Chrome("C:/Users/%USERNAME/AppData/Local/Google/Chrome/Application/chromedriver.exe",
                               options=options)
    browser.get(url)

    # log in
    usr, pwd = 'foo', 'bar'
    browser.find_element_by_xpath("//*[contains(text(), 'Log in')]").click()
    browser.find_element_by_css_selector("input[type=email]").send_keys(usr)
    browser.find_element_by_css_selector("input[type=password]").send_keys(pwd)
    browser.find_element_by_css_selector("button[type=submit]").click()

    # wait until the login finishes (an explicit WebDriverWait would be more robust)
    time.sleep(2)
    images = browser.find_elements_by_tag_name("img")
    pattern = re.compile(r"com/236x/")
    picURL = []

    # collect only the image URLs that match the 236x thumbnail pattern
    for img in images:
        ima = img.get_attribute('src')
        if ima and pattern.search(ima):  # guard against images without a src
            picURL.append(ima)
    browser.quit()
    return picURL
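
Since the goal also includes downloading the URLs, here is a sketch of how the returned list could be fetched with the standard library (the board URL, the file names, and the Downloads directory are placeholders, not part of the original template):

import os
import urllib.request

# hypothetical usage: collect the thumbnail URLs, then fetch each one to disk
urls = getImageURL('https://www.pinterest.com/foo/bar/')
for i, picture in enumerate(urls):
    target = os.path.join('C:\\Downloads\\', 'pin_%03d.jpg' % i)
    urllib.request.urlretrieve(picture, target)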
    

Template 3:

# Find all URLs on a page and print each one to PDF (with wkhtmltopdf installed)
import os
from selenium import webdriver

def getFileName(url):
    # some regex here
    return url

def isArticle(url):
    # some regex here
    return True

def urlToPDF(url):
    # hand the URL to wkhtmltopdf, which renders the page to a PDF file
    os.popen('wkhtmltopdf ' + url + ' ' + getFileName(url) + '.pdf')

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get('http://foo.com')

# grab every anchor that carries an href and convert the article pages
hrefs = driver.find_elements_by_xpath("//a[@href]")
for href in hrefs:
    url = href.get_attribute("href")
    if isArticle(url):
        print('processing:', getFileName(url))
        urlToPDF(url)
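
For illustration only, the two stubs could be filled in along these lines; the slug rule and the /articles/ path filter are assumptions about a typical site layout, not the patterns actually used:

import re

def getFileName(url):
    # hypothetical: drop the scheme and turn the rest into a filesystem-friendly slug
    return re.sub(r'[^A-Za-z0-9]+', '_', re.sub(r'^https?://', '', url)).strip('_')

def isArticle(url):
    # hypothetical: keep only links that live under /articles/ on the same site
    return re.search(r'foo\.com/articles/', url) is not None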

Bonus: scheduler

The last step is to run the script as a routine job. I use cron on Linux: open the crontab with crontab -e and add a new entry that fires every 24 hours:

0 0 * * * ~/routineJobs/script.py

(The five fields are: minute, hour, day of month, month, day of week; 0 0 means daily at midnight.)

Don’t forget to check that the cron service is running (/etc/init.d/cron status) and to log the job results inside the Python script.
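
For logging the job results inside the script, the standard logging module is enough; a minimal sketch (the log path is a placeholder) that could sit at the top of script.py:

import logging, os

# write timestamped results next to the script so cron runs can be audited later
logging.basicConfig(filename=os.path.expanduser('~/routineJobs/script.log'),
                    level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logging.info('scraping job started')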