Introduction to Selenium
Get familiar with Selenium headless browser and discover its capabilities.
Now that we have a solid understanding of dynamic websites, it's time to delve deeper into this topic and explore how we can adapt our scripts to handle their dynamic nature effectively.
Dynamic sites mechanism
As we said before, dynamic websites require specific interactions, such as clicking or scrolling, to trigger data display. These interactions activate JavaScript code that fetches or generates the content to display.
Note: We can view this JS code using the developer tools ("Debugger" tab).
To illustrate, we can observe the movie data that appears upon choosing a specific year on the ScrapeThisSite website. By analyzing the code, we can identify a function named showfilms()
that is responsible for generating the table containing the essential details.
The JavaScript code either makes an API request to retrieve data, or the data is pre-fetched and awaits browser execution to be structured in the DOM. The former method is straightforward to capture using the Network tool in the developer tools, as it allows us to replicate the request and obtain the data. In this chapter, we will focus on the latter, where we must wait for the browser to execute and load the data in the DOM before scraping it using the previously discussed methods.
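Before moving on, here is a sketch of the former approach for comparison: once the Network tab reveals the request that the page's JavaScript makes, we can often replicate it directly. The query parameters below (ajax=true, year=2013) are an assumption based on what the Network tab typically shows for this page and should be verified in the browser first.

```python
import requests

# Hypothetical replication of the AJAX call observed in the Network tab;
# confirm the exact endpoint and parameters in the developer tools first.
url = "https://www.scrapethissite.com/pages/ajax-javascript/"
params = {"ajax": "true", "year": "2013"}

response = requests.get(url, params=params)
films = response.json()  # the endpoint returns the movie data as JSON

print([film["title"] for film in films])
```

When a site exposes such an endpoint, this is usually faster and lighter than driving a browser, which is why it is worth checking the Network tab before reaching for Selenium.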
Selenium headless browser
A headless browser is a browser implementation that runs without a user interface. It enables automated scripts to interact with a web page as if a user were performing the actions. The headless browser runs in the background, allowing the script to interact with the page and retrieve data or perform actions without a visible browser window. In simpler terms, it is a browser without a GUI.
Working mechanism
When a headless browser loads a web page, it sends a request to the web server, receives the HTML document in response, parses and renders the page, and executes any JavaScript code. In this sense, it's no different from a standard browser. However, instead of rendering web pages visually, a headless browser exposes the page's content and features through a command-line interface (CLI) or an application programming interface (API) for executing actions on the page.
Examples
There are multiple available open-source headless browsers, such as:
Chromium
Google Chrome
Firefox
Splash
Selenium is the most widely used framework for driving these browsers from code, and it is the tool we will use throughout this chapter.
Installation
To install Python-Selenium, follow these steps:
Run the command
pip install selenium
to install Python-Selenium. Selenium requires a driver to control the browser; we can download the appropriate driver for our browser from the Selenium documentation website.
Visit the official Selenium website and download the driver that matches the version of our browser.
To determine the version of Google Chrome, open the browser, click the three dots in the top menu, choose "Help," and then select "About Google Chrome."
Once we have downloaded the driver, we must place it in a location accessible to our Python environment.
If we prefer to maintain a consistent environment, especially considering the operating system and version dependencies, installing Selenium within a Docker image is recommended. Here is an example Dockerfile that demonstrates the installation of Selenium Chrome on Linux using Python:
```dockerfile
# Can be any version we want
FROM python:3.8

RUN apt-get update && apt-get install -y --no-install-recommends wget gnupg curl

# Add chrome repository
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc-dev libmagic-dev unzip google-chrome-stable
RUN rm -rf /var/lib/apt/lists/*
RUN apt-get clean

# Install the chromedriver version matching the latest release
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sL http://chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port to avoid crash on chrome
ENV DISPLAY=:99
```
Usage
Let's apply this concept to the movies page. After executing the code, observe the output tab to see how Selenium navigates to the specified website and clicks the defined elements. The results will be printed in the terminal.
```python
from selenium import webdriver
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# defining driver options
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
prefs = {"download.default_directory": "."}
options.add_experimental_option("prefs", prefs)
# initializing chrome instance
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# send a GET request to the site using the driver
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
time.sleep(1)
print("\nPage's title is: \n", driver.title)

# Find the element with ID=2013
elem_2013 = driver.find_element(By.ID, "2013")
# click on that element
elem_2013.click()
# wait for results
driver.implicitly_wait(3)
# find the movie titles using CSS selectors
films = driver.find_elements(By.CSS_SELECTOR, "tr.film td.film-title")
print("\n2013 Movies: \n")
print([x.text for x in films])
time.sleep(2)
driver.close()
```
Code explanation
Lines 1–6: First, we import the necessary Selenium modules.
Lines 9–13: We create the options object, which allows us to pass specific options to each driver (such as Chrome, Firefox, or Edge). The current options prevent Selenium from crashing inside Docker containers.
Another important option is --headless; it prevents Chrome from displaying a visible window while it works, but we have not included it in this code for educational purposes.
Line 15: We create a Chrome driver object using the webdriver.Chrome() class. Typically, this would require a specified driver path, but in this case, we use a manager service to download the driver each time the code is run to ensure compatibility with the Educative environment.
Line 17: We then send a GET request using the driver to get the HTML DOM.
Line 18: We apply time.sleep(1) to observe the page loading in Selenium.
Line 22: We use one of Selenium's navigation functions,
find_element()
. This function operates similarly to the Beautiful Soup library, allowing users to provide filters using the By class to obtain the element(s) that match the specified filter:
By.ID: Returns the element(s) that match the provided id.
By.CLASS_NAME: Returns the element(s) that match the class name.
By.NAME: Returns the element(s) that match the name attribute.
By.CSS_SELECTOR: Returns the element(s) that match the CSS selector query.
By.TAG_NAME: Returns the element(s) that match the provided tag name.
By.LINK_TEXT: Returns the <a> tags that have specific text inside them.
By.XPATH: Returns elements that match an XPath query.
The element object returned by the function has several other attributes useful for obtaining information.
.text: Returns the text content of the tag.
.get_attribute(attr_name): Returns the value of the specified attribute for the tag.
Line 24: We use one of Selenium's interaction functions, .click(). It instructs the driver to click on the selected element.
The .send_keys() function can be used to press specific keys on the keyboard. For instance, elem.send_keys(Keys.ARROW_DOWN) would simulate the user pressing the "down" arrow key.
Line 26: We use another useful function from Selenium, .implicitly_wait(seconds). It tells the driver to keep retrying element lookups for up to the specified number of seconds, giving the page time to finish loading before a search fails.
Note: While Selenium can send GET requests, it cannot perform all of the same actions as the Python requests library. Although we can still add cookies using driver.add_cookie({"name": "key", "value": "value"}), we cannot send a POST request or supply custom headers. In most cases, however, these limitations will not pose an issue, because Selenium acts as a real browser and is recognized as one by websites.
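As a sketch, a cookie can only be added after the driver has visited a page on the domain the cookie belongs to; the name and value below are placeholders:

```python
# Placeholder cookie; Selenium only accepts cookies for the domain
# of the page currently loaded in the driver.
cookie = {"name": "session_id", "value": "abc123"}

# With a live driver:
# driver.get("https://www.scrapethissite.com/")
# driver.add_cookie(cookie)
# print(driver.get_cookies())  # the new cookie appears in the list
```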
To demonstrate, we can request the ShellHacks website. Normally, the website requires the request to have a "User-Agent" header value similar to a browser's. However, in this case, Selenium will handle this internally without requiring specific headers to be specified.
```python
from selenium import webdriver
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
prefs = {"download.default_directory": "."}
options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.shellhacks.com/")
time.sleep(3)
driver.close()
```
If we ever need to perform header spoofing or send a different type of request, the Selenium-requests library can be utilized to integrate Selenium with the requests library.
Note: As previously mentioned, Selenium was primarily designed to test browser functions, rather than for web scraping. While there are many other useful functions available in the documentation, we may not need to utilize all of them for our purposes.
Conclusion
In summary, we have delved into the inner workings of dynamic websites and discussed how they operate. We have also introduced Selenium as a helpful tool for loading JavaScript code and fetching data on our behalf. By utilizing this tool, we can more effectively scrape dynamic websites and extract the information we need.