What Is Middleware?
Learn about Scrapy middleware and explore how to attach it to requests and responses.
Now, we will explore Scrapy’s middleware, a crucial framework component. Middleware plays a pivotal role in modifying and controlling Scrapy’s request and response objects: we attach middleware components that perform specific processing as requests go out and responses come back.
Downloader middleware
Downloader middleware allows us to manipulate requests and responses, add custom headers, handle proxies, or modify how Scrapy interacts with websites. To enable a downloader middleware component, we define it in the spider settings, just as we did with Pipelines. To do that, we add this code inside custom_settings in the spider class:
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "ScrapyProject.middlewares.CustomDownloaderMiddleware": 543,
    }
}
Much like Pipelines, middleware components are applied in a specific order determined by the numbers assigned to them: lower values run closer to the engine, and higher values run closer to the downloader.
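For reference, here is a minimal sketch of what such a custom component might look like, placed in ScrapyProject/middlewares.py to match the setting above (the header it adds is just an illustration):

class CustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every outgoing request; here we attach a custom header
        request.headers["X-Example"] = "middleware-demo"
        return None  # returning None lets processing continue as usual

    def process_response(self, request, response, spider):
        # Called for every incoming response; log it and pass it through
        spider.logger.debug("Received %s from %s", response.status, response.url)
        return response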
Built-in downloader middleware
Built-in downloader middleware are components that sit between Scrapy’s engine and the websites we scrape. Scrapy ships with several built-in downloader middleware that cover common use cases. Some of the notable ones include:
CookiesMiddleware: This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers and sends them back on subsequent requests (from that spider), just like web browsers do.
To enable this middleware, we set COOKIES_ENABLED and COOKIES_DEBUG to True in the spider settings.
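For example, inside the spider’s custom_settings (note that COOKIES_ENABLED is on by default, while COOKIES_DEBUG adds logging):

custom_settings = {
    "COOKIES_ENABLED": True,  # enables CookiesMiddleware (the default)
    "COOKIES_DEBUG": True,    # log every Cookie/Set-Cookie header
}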
UserAgentMiddleware: This middleware sets the User-Agent header on outgoing requests so the spider is not identified by Scrapy’s default user agent. We can customize it to send a specific User-Agent, overriding the default one.
We customize this middleware by setting self.user_agent in the spider class, which overrides the USER_AGENT value in the spider settings.
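For instance (the spider name and User-Agent string here are illustrative):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Overrides the USER_AGENT setting for this spider only
    user_agent = "Mozilla/5.0 (compatible; MyScraper/1.0)"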
RetryMiddleware: This middleware manages request retries in case of network errors or HTTP error codes. Failed pages are collected during the scraping process and rescheduled once the spider has finished crawling all regular (non-failed) pages.
This middleware can be configured using RETRY_ENABLED and RETRY_TIMES in the spider settings.
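For example, inside the spider’s custom_settings:

custom_settings = {
    "RETRY_ENABLED": True,  # retries are on by default
    "RETRY_TIMES": 3,       # retry a failed request up to 3 times
}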
Let’s see an example of utilizing some of these middleware.
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
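The file above is the project’s auto-generated scrapy.cfg. For concreteness, here is a minimal sketch of a spider that uses these middleware; the spider name, login URL, and the explicit middleware priorities (Scrapy’s defaults for these components) are assumptions, and the line references in the explanation below follow this sketch:

import scrapy


class LoginSpider(scrapy.Spider):
    name = "login_spider"  # illustrative name

    # Practice site with a login form (illustrative URL)
    login_url = "http://quotes.toscrape.com/login"

    # Spider-specific settings: middleware, cookies, and retries.
    # Both middleware are enabled by default; listing them makes it explicit.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        },
        "COOKIES_ENABLED": True,
        "COOKIES_DEBUG": True,
        "RETRY_TIMES": 3,
    }

    def start_requests(self):
        # Send the first request to the login page.
        # CookiesMiddleware stores the session cookie the server sets,
        # and RetryMiddleware re-issues the request on transient errors.
        yield scrapy.Request(
            url=self.login_url,
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        self.logger.info("Login page returned %s", response.status)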
Code explanation
Lines 12–20: We define custom settings for the spider, including middleware and retry settings.
Line 13: Configures downloader middleware such as CookiesMiddleware and RetryMiddleware.
Lines 17–18: Enable cookie handling (COOKIES_ENABLED) and cookie debugging (COOKIES_DEBUG).
Line 19: Sets the maximum number of retries to 3 (RETRY_TIMES).
Lines 22–30: We send a start request to the login page. ...