Scrapy with Selenium
Explore the integration between Selenium and Scrapy.
Now that we have added middleware to our stack, it is time to learn how to utilize it with Selenium.
Scrapy with dynamic sites
While Scrapy provides excellent modules for optimizing web scraping operations, it lacks built-in functionality to handle dynamic websites. To tackle this challenge, we need to integrate Selenium or another library alongside it. As we have already covered Selenium in previous lessons, we will use it in this module.
To efficiently scrape JavaScript-based websites, we will follow a three-step process:

1. We use Scrapy to make the initial request.
2. We pass this request to Selenium, which loads the DOM on our behalf.
3. Finally, we use selectors to extract the data from the fully loaded DOM.
We learned that downloader middleware is used to manipulate requests, which makes it the ideal component to facilitate the sequence outlined above.
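As a quick refresher, a downloader middleware is enabled by mapping its import path to an order number in Scrapy's settings (lower numbers run first). The path below is a hypothetical one for our project layout; in this lesson, we will register ours through the spider's `custom_settings` instead:

```python
# settings.py: enable a custom downloader middleware. The number sets its
# order relative to Scrapy's built-in middleware (lower runs first).
DOWNLOADER_MIDDLEWARES = {
    "scraper.middlewares.LoadingMiddleware": 543,
}
```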
Extracting data using Scrapy and Selenium
Let's implement a custom downloader middleware to scrape movie data from ScrapeThisSite using Scrapy and Selenium. The entry point below runs the spider programmatically:

```python
from scraper.scraper.spiders.movies import MovieSpider
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # Run the spider in-process instead of via the `scrapy crawl` CLI.
    process = CrawlerProcess()
    process.crawl(MovieSpider)
    process.start()
```
Code explanation
middlewares.py: Defines a custom downloader middleware called `LoadingMiddleware`. This class is invoked with each Scrapy request, and its `process_request()` method engages the spider's own page-loading function. That spider function is responsible for loading the web page through Selenium and generating a response with the content ready for parsing. (A sketch follows this list.)

- Lines 7–9: We must implement the `from_crawler` class method to access the spider class's methods and variables; it returns the middleware instance that Scrapy will use.
- Lines 11–17: This section encapsulates the `process_request()` method of the middleware. Its purpose is to invoke `spider.process_request()` and return the resulting response instance, or `None` if an error occurs.
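As a rough sketch, assuming the structure described above, `LoadingMiddleware` might look like this (the method bodies are illustrative, not the lesson's exact code):

```python
class LoadingMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware; it gives us
        # access to the crawler (and, through it, the spider and settings).
        return cls()

    def process_request(self, request, spider):
        # Delegate page loading to the spider's Selenium helper. Returning
        # an HtmlResponse short-circuits Scrapy's own downloader, while
        # returning None lets Scrapy handle the request normally.
        try:
            return spider.process_request(request)
        except Exception:
            return None
```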
movies.py (a sketch combining these steps appears after this list):

- Lines 1–10: Import the modules Scrapy and Selenium need to kickstart our spider.
- Lines 12–18: Set up the Selenium driver options required to run within our specific environment.
- Line 24: Integrate our custom downloader middleware by adding it to `custom_settings`.
- Lines 31–35: Define the `parse` function responsible for extracting movie titles from the DOM. The titles are logged at the `warning` level so they stand out from the noisier debugging output in the terminal.
- Line 37: This is the custom `process_request()` function that each of our spiders must define. It loads the page using the driver, waits for the required elements, and returns the DOM to the `parse` function.
- Lines 40–41: Initialize the Chrome driver and extract the URL to be loaded.
- Lines 44–48: Define a custom wait condition to ensure the page is fully loaded and all the required information is in the DOM.
- Lines 50–51: To retrieve specific titles, we find a particular year button and click it to load the new table.
- Line 52: To synchronize with the loading logic, we introduce a slight delay (`time.sleep(2)`) to ensure the page finishes loading.
- Lines 55–56: After loading, we retrieve the page content and the URL from the driver to feed back to Scrapy for continued processing.
- Line 58: Crucially, we close the driver after each request to prevent potential crashes from multiple concurrent requests.
- Line 61: Lastly, we return an `HtmlResponse` object to the downloader middleware's `process_request()` function, which forwards it to the `parse()` function. This enables Scrapy to continue the scraping process seamlessly.
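Putting these steps together, a spider along these lines might look like the sketch below. The CSS selectors, the year button's locator, and the middleware path are assumptions for illustration, not the lesson's exact code:

```python
import time

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://www.scrapethissite.com/pages/ajax-javascript/"]

    # Register the custom middleware for this spider only.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scraper.middlewares.LoadingMiddleware": 543,
        },
    }

    def parse(self, response):
        # Log titles at the warning level so they stand out from debug output.
        for title in response.css("td.film-title::text").getall():
            self.logger.warning(title.strip())

    def process_request(self, request):
        # Configure Chrome for the environment (headless, no sandbox).
        options = Options()
        options.add_argument("--headless=new")
        options.add_argument("--no-sandbox")

        driver = webdriver.Chrome(options=options)
        try:
            driver.get(request.url)

            # Custom wait condition: block until the year buttons exist.
            WebDriverWait(driver, 10).until(
                lambda d: d.find_elements(By.CSS_SELECTOR, "a.year-link")
            )

            # Click a specific year button to load its table of titles.
            driver.find_element(By.ID, "2015").click()
            time.sleep(2)  # give the AJAX-loaded table time to render

            body = driver.page_source
            url = driver.current_url
        finally:
            driver.quit()  # close the driver after every request

        # Hand the rendered DOM back to Scrapy for parsing.
        return HtmlResponse(url=url, body=body, encoding="utf-8", request=request)
```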
Note: Selenium is not the only way to deal with dynamic sites using Scrapy. There are other options, such as the scrapy-splash library, which integrates the Splash rendering service with Scrapy, or scrapy-playwright, which renders pages with Playwright through a similar integration. However, we won't cover them in detail.
Conclusion
In conclusion, a custom downloader middleware that orchestrates the interplay between Scrapy and Selenium equips us with a powerful tool for scraping dynamic websites in Python.