Scrapy with Selenium

Explore the integration between Selenium and Scrapy.

Now that we have added middleware to our stack, it is time to learn how to utilize it with Selenium.

Scrapy with dynamic sites

While Scrapy provides excellent modules for optimizing web scraping operations, it lacks built-in functionality to handle dynamic websites. To tackle this challenge, we need to integrate Selenium or another library alongside it. As we have already covered Selenium in previous lessons, we will use it in this module.

To efficiently scrape JavaScript-based websites, we will follow a three-step process:

  1. We will use Scrapy to make the initial request.

  2. We will pass this request to Selenium to load the DOM on our behalf.

  3. Finally, we will use selectors to extract the data from the fully loaded DOM.

We learned that downloader middlewares are used to intercept and manipulate requests. Consequently, they are the ideal components to facilitate the sequence outlined above.
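To make that concrete, here is a minimal, generic sketch of the hook a downloader middleware exposes (the class name and log message are illustrative only, not part of this lesson's code). Scrapy calls process_request() for every outgoing request, and returning a Response from it bypasses the built-in downloader, which is exactly the opening Selenium needs.

class ExampleMiddleware:
    def process_request(self, request, spider):
        # Called once for every request the spider schedules.
        spider.logger.debug("Intercepting %s", request.url)
        # Returning None lets Scrapy download the page as usual;
        # returning a Response object instead short-circuits the
        # built-in downloader.
        return None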

[Figure: Scrapy with Selenium using middlewares]

Extracting data using Scrapy and Selenium

Let's implement custom downloader middleware to scrape movie data from ScrapeThisSite using Scrapy and Selenium:

from scraper.scraper.spiders.movies import MovieSpider
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # Run the spider programmatically instead of via the scrapy CLI.
    process = CrawlerProcess()

    process.crawl(MovieSpider)
    process.start()  # blocks until the crawl finishes
Scraping movies using custom downloader middleware and Selenium

Code explanation

  • middlewares.py: Defines a custom downloader middleware called LoadingMiddleware. Scrapy invokes this class for every request, and its process_request() method delegates to a function defined on the spider. That spider function loads the web page through Selenium and returns a response whose content is ready for parsing (illustrative sketches of both middlewares.py and movies.py follow this list).

    • Lines 7–9: We implement the from_crawler() class method, which Scrapy calls to construct the middleware and which gives it access to the crawler and, through it, the spider's methods and attributes. It returns an instance of the middleware class.

    • Lines 11–17: This section contains the middleware's process_request() method. It invokes spider.process_request() and returns either None (if an error occurs) or the response instance produced by the spider.

  • movies.py:

    • Lines 1–10: Importing the necessary modules for Scrapy and Selenium to kickstart our spider.

    • Lines 12–18: Setting up Selenium driver options crucial for running within our specific environment.

    • Line 24: Integrating our custom downloader middleware by adding it to the custom_settings.

    • Lines 31–35: Defining the parse function responsible for extracting movie titles from the DOM. The titles are logged at the warning level so they stand out from Scrapy's more verbose debug output in the terminal.

    • Line 37: This is the custom process_request() function that every spider using this middleware must define. It loads the page with Selenium, waits for the required elements, and then returns the rendered DOM to the parse function.

    • Lines 40–41: Initialize the Chrome driver and extract the URL to be loaded.

    • Lines 44–48: Defining a custom waiting condition to ensure the page is fully loaded and all the required information is in the DOM.

    • Lines 50–51: To retrieve specific titles, we find a particular year button and click it to load the new table.

    • Line 52: To stay in sync with the page's loading logic, we introduce a short delay with time.sleep(2) so the newly requested table has time to render.

    • Lines 55–56: After loading, we retrieve the page content and the current URL from the driver to feed back to Scrapy for continued processing.

    • Line 58: Crucially, we close the driver after each request to prevent potential crashes from multiple concurrent requests.

    • Line 61: Lastly, we return an HtmlResponse object to the downloader middleware's process_request() function, which forwards it to the parse() function. This enables Scrapy to continue the scraping process seamlessly.
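Since the middlewares.py file itself is not reproduced above, here is a minimal sketch of what the LoadingMiddleware described in the walkthrough might look like. The exact line numbers, signals, and error handling in the lesson's file may differ.

# middlewares.py -- illustrative sketch of LoadingMiddleware
class LoadingMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this factory to build the middleware and hands it
        # the crawler, through which the running spider can be reached.
        return cls()

    def process_request(self, request, spider):
        # Delegate the actual page load to the spider's own
        # process_request(), which drives Selenium and returns an
        # HtmlResponse for parsing.
        try:
            return spider.process_request(request)
        except AttributeError:
            # Spiders without a process_request() fall back to the
            # default downloader.
            return None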

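Likewise, a sketch of the movies.py spider is shown below. The selectors, the waited-for element ID, the middleware path, and the target URL are assumptions based on ScrapeThisSite's AJAX movie page and the runner script's import path, so they may not match the lesson's file exactly.

# movies.py -- illustrative sketch of MovieSpider
import time

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://www.scrapethissite.com/pages/ajax-javascript/"]

    # Register the custom downloader middleware for this spider only.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scraper.scraper.middlewares.LoadingMiddleware": 543,
        },
    }

    def parse(self, response):
        # Extract the movie titles from the fully rendered table and log
        # them at the warning level so they stand out from debug output.
        for title in response.css("td.film-title::text").getall():
            self.logger.warning(title.strip())

    def process_request(self, request):
        # Load the page with Selenium so the JavaScript-rendered table
        # exists before Scrapy parses it.
        options = Options()
        options.add_argument("--headless=new")
        driver = webdriver.Chrome(options=options)
        driver.get(request.url)

        # Wait until a year button is clickable, then load that year's table.
        wait = WebDriverWait(driver, 10)
        wait.until(EC.element_to_be_clickable((By.ID, "2015"))).click()
        time.sleep(2)  # give the AJAX-loaded table a moment to render

        body = driver.page_source
        url = driver.current_url
        driver.quit()  # close the browser after every request

        # Hand the rendered DOM back to Scrapy as an HtmlResponse.
        return HtmlResponse(url=url, body=body, encoding="utf-8", request=request)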
Note: Selenium is not the only way to handle dynamic sites with Scrapy. Alternatives include scrapy-splash, which integrates Scrapy with the Splash JavaScript-rendering service, and scrapy-playwright, which integrates the Playwright browser-automation library with Scrapy. However, we won't cover them in detail.
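For readers who want a quick taste of the Playwright route, the minimal, illustrative spider below shows how a request is typically routed through scrapy-playwright. The settings and names follow the library's documented defaults and are not part of this lesson's code; the package must be installed separately.

# Illustrative only: a minimal scrapy-playwright spider.
import scrapy


class PlaywrightExampleSpider(scrapy.Spider):
    name = "playwright_example"

    custom_settings = {
        # Route downloads through Playwright's browser-backed handler.
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # The "playwright" meta flag asks the handler to render the page
        # in a real browser before parse() receives it.
        yield scrapy.Request(
            "https://www.scrapethissite.com/pages/ajax-javascript/",
            meta={"playwright": True},
        )

    def parse(self, response):
        self.logger.info("Rendered %s (%d bytes)", response.url, len(response.body))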

Conclusion

In conclusion, custom downloader middleware and the seamless interplay it enables between Scrapy and Selenium equip us with powerful tools for scraping dynamic websites in Python.
