Introduction to Selenium
Get familiar with Selenium headless browser and discover its capabilities.
Now that we have a solid understanding of dynamic websites, it's time to delve deeper into this topic and explore how we can adapt our scripts to handle their dynamic nature effectively.
Dynamic sites mechanism
As we said before, dynamic websites require specific interactions, such as clicking or scrolling, to trigger data display. These interactions activate JavaScript code that fetches or generates the content to display.
Note: We can view this JS code using the developer tools ("Debugger" tab).
To illustrate, we can observe the movie data that appears upon choosing a specific year on the ScrapeThisSite website. By analyzing the code, we can identify a function named showfilms()
that is responsible for generating the table containing the essential details.
The JavaScript code either makes an API request to retrieve data, or the data is pre-fetched and awaits browser execution to be structured in the DOM. The former method is straightforward to capture using the Network tool in the developer tools, as it allows us to replicate the request and obtain the data. In this chapter, we will focus on the latter, where we must wait for the browser to execute and load the data in the DOM before scraping it using the previously discussed methods.
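Before moving on, here is a sketch of the former approach for comparison: once the Network tab reveals the request that the page's JavaScript makes, we can often replicate it directly. The query parameters below (ajax=true, year=2013) are an assumption based on what the Network tab typically shows for this page and should be verified in the browser first.

```python
import requests

# Hypothetical replication of the AJAX call observed in the Network tab;
# confirm the exact endpoint and parameters in the developer tools first.
url = "https://www.scrapethissite.com/pages/ajax-javascript/"
params = {"ajax": "true", "year": "2013"}

response = requests.get(url, params=params)
films = response.json()  # the endpoint returns the movie data as JSON

print([film["title"] for film in films])
```

When a site exposes such an endpoint, this is usually faster and lighter than driving a browser, which is why it is worth checking the Network tab before reaching for Selenium.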
Selenium headless browser
A headless browser is a browser implementation that runs without a user interface. It enables automated scripts to interact with a web page as if a user were performing the actions. The headless browser runs in the background, allowing the script to interact with the page and retrieve data or perform actions without a visible browser window. In simpler terms, it is a browser without a GUI.
Working mechanism
When a headless browser loads a web page, it sends a request to the web server, receives the HTML document in response, parses and renders the page, and executes any JavaScript code. In this sense, it's no different from a standard browser. However, instead of rendering web pages visually, a headless browser exposes the page's content and features through a command-line interface (CLI) or an application programming interface (API) for executing actions on the page.
Examples
There are multiple available open-source headless browsers, such as:
Chromium
Google Chrome
Firefox
Splash
Selenium is the most widely used framework for driving these browsers from code, and it is the tool we will use throughout this chapter.
Installation
To install Python-Selenium, follow these steps:
Run the command
pip install selenium
to install Python-Selenium. Selenium requires a driver to control the browser; we can download the appropriate driver for our browser from the Selenium documentation website.
Visit the official Selenium website and download the driver that matches the version of our browser.
To determine the version of Google Chrome, open the browser, click the three dots in the top menu, choose "Help," and then select "About Google Chrome."
Once we have downloaded the driver, we must place it in a location accessible to our Python environment.
If we prefer to maintain a consistent environment, especially considering the operating system and version dependencies, installing Selenium within a Docker image is recommended. Here is an example Dockerfile that demonstrates the installation of Selenium Chrome on Linux using Python:
```dockerfile
# Can be any version we want
FROM python:3.8

RUN apt-get update && apt-get install -y --no-install-recommends wget gnupg curl

# Add chrome repository
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc-dev libmagic-dev unzip google-chrome-stable
RUN rm -rf /var/lib/apt/lists/*
RUN apt-get clean

# Install the chromedriver version matching the latest release
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sL http://chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port to avoid crash on chrome
ENV DISPLAY=:99
```
Usage
Let's apply this concept to the movies page. After executing the code, observe the output tab to see how Selenium navigates to the specified website and clicks the defined elements. The results will be printed in the terminal.
```python
from selenium import webdriver
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# defining driver options
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
prefs = {"download.default_directory": "."}
options.add_experimental_option("prefs", prefs)
# initializing chrome instance
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# send a GET request to the site using the driver
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
time.sleep(1)
print("\nPage's title is: \n", driver.title)

# Find the element with ID=2013
elem_2013 = driver.find_element(By.ID, "2013")
# click on that element
elem_2013.click()
# wait for results
driver.implicitly_wait(3)
# find the movie titles using CSS selectors
films = driver.find_elements(By.CSS_SELECTOR, "tr.film td.film-title")
print("\n2013 Movies: \n")
print([x.text for x in films])
time.sleep(2)
driver.close()
```
Code explanation
Lines 1–6: First, we import the necessary Selenium modules.
Lines 9–13: We create the options object, which allows us to pass specific options to each driver (such as Chrome, Firefox, or Edge). The current options prevent Selenium from crashing inside Docker containers.
Another important option is --headless; it prevents Chrome from displaying a visible window while it works, but we have not included it in this code for educational purposes.
Line 15: We create a Chrome driver object using the webdriver.Chrome() class. Typically, this would require a specified driver path, but in this case, we use a manager service to download the driver each time the code is run to ensure compatibility with the Educative environment.
Line 17: We then send a GET request using the driver to get the HTML DOM.
Line 18: We apply time.sleep(1) to observe the page loading in Selenium.
Line 22: We use one of Selenium's navigation functions,
find_element()
. This function operates similarly to the Beautiful Soup library, allowing users to provide filters using the By class to obtain the element(s) that match the specified filter:
By.ID: Returns the element(s) that match the provided id.
By.CLASS_NAME: Returns the element(s) that match the class name.
By.NAME: Returns the element(s) that match the name attribute.
By.CSS_SELECTOR: Returns the element(s) that match the CSS selector query.
By.TAG_NAME: Returns the element(s) that match the provided tag name.
By.LINK_TEXT: Returns the <a> tags that have specific text inside them.
By.XPATH: Returns elements that match an XPath query.
The element object returned by the function has several other attributes useful for obtaining information.
.text: Returns the text content of the tag.
.get_attribute(attr_name): Returns the value of the specified attribute for the tag.
Line 24: We use one of Selenium's interaction functions, .click(). It instructs the driver to click on the selected element.
The .send_keys() function can be used to press specific keys on the keyboard. For instance, elem.send_keys(Keys.ARROW_DOWN) would simulate the user pressing the "down" arrow key.
Line 26: We use another useful function from Selenium, .implicitly_wait(seconds). It tells the driver to keep retrying element lookups for up to the specified number of seconds, giving the page time to finish loading before a search fails.
Note: While Selenium can send GET requests, it cannot perform all of the same actions as the Python requests library. Although we can still add cookies using driver.add_cookie({"name": "key", "value": "value"}), we cannot send a POST request or supply custom headers. In most cases, however, these limitations will not pose an issue, because Selenium acts as a real browser and is recognized as one by websites.
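As a sketch, a cookie can only be added after the driver has visited a page on the domain the cookie belongs to; the name and value below are placeholders:

```python
# Placeholder cookie; Selenium only accepts cookies for the domain
# of the page currently loaded in the driver.
cookie = {"name": "session_id", "value": "abc123"}

# With a live driver:
# driver.get("https://www.scrapethissite.com/")
# driver.add_cookie(cookie)
# print(driver.get_cookies())  # the new cookie appears in the list
```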
To demonstrate, we can request the ShellHacks website. Normally, the website requires the request to have a "User-Agent" header value similar to a browser's. However, in this case, Selenium will handle this internally without requiring specific headers to be specified.
```python
from selenium import webdriver
import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
prefs = {"download.default_directory": "."}
options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.shellhacks.com/")
time.sleep(3)
driver.close()
```
If we ever need to perform header spoofing or send a different type of request, the Selenium-requests library can be utilized to integrate Selenium with the requests library.
Note: As previously mentioned, Selenium was primarily designed to test browser functions, rather than for web scraping. While there are many other useful functions available in the documentation, we may not need to utilize all of them for our purposes.
Conclusion
In summary, we have delved into the inner workings of dynamic websites and discussed how they operate. We have also introduced Selenium as a helpful tool for loading JavaScript code and fetching data on our behalf. By utilizing this tool, we can more effectively scrape dynamic websites and extract the information we need.