What Is Middleware?
Learn about Scrapy middleware and explore how to attach them to requests and responses.
Now, we will explore Scrapy’s middleware, a crucial framework component. Middleware lets us modify and control Scrapy’s request and response objects by attaching components that perform specific processing as requests go out and responses come back.
Downloader middleware
Downloader middleware allows us to manipulate requests and responses, add custom headers, handle proxies, or modify how Scrapy interacts with websites. To enable a downloader middleware component, we should define it in the spider settings, the same as we did with Pipelines. To do that, we add this code inside custom_settings in the spider class:
custom_settings = {'DOWNLOADER_MIDDLEWARES':{"ScrapyProject.middlewares.CustomDownloaderMiddleware": 543}}
Much like Pipelines, middleware components are applied in a specific order determined by the numbers assigned to them: for downloader middleware, lower numbers run closer to the engine and higher numbers closer to the downloader.
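For example, a settings sketch like the following (the custom class path is illustrative) registers a custom middleware at priority 543 and disables a built-in one by mapping it to None:

```python
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        # Lower numbers run closer to the engine, higher numbers
        # closer to the downloader.
        "ScrapyProject.middlewares.CustomDownloaderMiddleware": 543,
        # Setting a middleware to None disables it entirely.
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    }
}
```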
Built-in downloader middleware
Built-in downloader middleware are components between Scrapy’s engine and the website we are scraping. There are several built-in downloader middleware that cover common use cases. Some of the notable built-in middleware include:
CookiesMiddleware
This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers and sends them back on subsequent requests (from that spider), just like web browsers do.
To enable this middleware, we set COOKIES_ENABLED to True in the spider settings; setting COOKIES_DEBUG to True additionally logs every cookie sent and received.
UserAgentMiddleware
It sets the User-Agent header on outgoing requests, overriding Scrapy’s default user agent. We can customize it to identify our spider or mimic a regular browser; rotating between several user agents requires a custom middleware, as we’ll see later in this lesson.
This middleware is enabled by default; setting a user_agent attribute in the spider class overrides the USER_AGENT value in the spider settings, as shown in the sketch below.
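A minimal sketch (the spider name and user-agent string are placeholders):

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    # Overrides the USER_AGENT setting for every request this spider sends.
    user_agent = "Mozilla/5.0 (compatible; MyScraper/1.0)"
```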
RetryMiddleware
Manages request retries in case of network errors or HTTP error codes. Failed pages are collected during the scraping process and rescheduled once the spider has finished crawling all regular (non-failed) pages.
This middleware can be configured using RETRY_ENABLED and RETRY_TIMES in the spider settings.
Let’s see an example of utilizing some of these middleware.
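Here is a minimal sketch of a spider along these lines, assuming the quotes.toscrape.com login page; the spider name, credentials, and QuoteItem fields are illustrative, and a site that uses CSRF tokens would need FormRequest.from_response() instead of a bare FormRequest:

```python
import scrapy
from scrapy import FormRequest

# QuoteItem is assumed to be defined in the project's items.py
# with text, author, and tags fields.
from ..items import QuoteItem


class QuotesLoginSpider(scrapy.Spider):
    name = "quotes_login"

    custom_settings = {
        # Explicitly list the built-in middleware we rely on.
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        },
        "COOKIES_ENABLED": True,   # keep the session cookie between requests
        "COOKIES_DEBUG": True,     # log every cookie sent and received
        "RETRY_ENABLED": True,
        "RETRY_TIMES": 3,          # retry a failed request up to 3 times
    }

    def start_requests(self):
        # Return a list with a single FormRequest; Scrapy schedules it
        # automatically. FormRequest issues a POST with the form data.
        return [
            FormRequest(
                "https://quotes.toscrape.com/login",
                formdata={"username": "user", "password": "pass"},
                callback=self.logged_in,
            )
        ]

    def logged_in(self, response):
        # A "Logout" link in the page means the login succeeded.
        if "Logout" in response.text:
            self.logger.info("Login succeeded")
            yield scrapy.Request(
                "https://quotes.toscrape.com/", callback=self.parse
            )
        else:
            self.logger.error("Login failed")

    def parse(self, response):
        # Make sure the session is still valid before scraping.
        if "Logout" not in response.text:
            self.logger.error("Not logged in anymore")
            return
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item
```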
Code explanation
Lines 12–20: We define custom settings for the spider, including middleware and retry settings.
Line 13: Configures downloader middleware such as CookiesMiddleware and RetryMiddleware.
Lines 17–18: Enables cookies handling (COOKIES_ENABLED) and cookie debugging (COOKIES_DEBUG).
Line 19: Sets the maximum number of retries to 3 (RETRY_TIMES).
Lines 22–30: We send a start request to the login page.
Line 23: This method returns a list containing a FormRequest, which will be scheduled automatically by Scrapy.
Line 25: The FormRequest performs a POST request to the login page with the username and password form data.
Line 29: The callback argument specifies that the logged_in method should be called once the login response arrives.
Lines 32–39: Define the logged_in method to handle the response after logging in.
Line 33: Checks if the word Logout is present in the response text, indicating a successful login.
Lines 34–35: If logged in, it logs a success message and initiates requests to the main page(s) to start scraping quotes.
Line 39: If login fails, it logs an error message.
Lines 41–50: Define the parse method to extract data from the web pages.
Lines 42–43: Checks if the word Logout is present in the response text, indicating that the spider is still logged in.
Line 44: Iterates through each quote element in the HTML using CSS selectors.
Lines 45–49: Extracts the quote text, author, and tags.
Line 50: Yields a QuoteItem with the extracted data.
When examining the logs, we can observe the cookies being transmitted between requests, indicating the validity of the login session.
Custom downloader middleware
They offer a way to inject custom logic into the request handling and response processing flows of our Scrapy spider. This can include altering requests before they are sent, changing responses before they are passed to spiders, handling exceptions in a specific way, and more. Creating a custom downloader middleware involves defining a Python class with a from_crawler class method as its main entry point, which Scrapy calls when initiating the spider, along with one or more of the following methods, depending on the middleware’s intended function:
process_request(request, spider): This method is called for each request that goes through the downloader middleware. It should either return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy continues processing the request through the other middleware until it reaches the appropriate downloader handler, which performs the request and downloads its response.

If it returns a Response object, Scrapy skips the remaining process_request() methods and the download itself; the installed middleware’s process_response() methods are then called on that response.

If it returns a Request object, Scrapy stops calling process_request() methods and schedules the new request. Once the new request is performed, the middleware chain is applied to the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware are called. If none of them handle the exception, the request's errback function is invoked. Unlike other exceptions, if no code handles IgnoreRequest, it is silently ignored and not logged.
process_response(request, response, spider): This method is called with the response returned from the downloader, for each request that goes through the downloader middleware. It should return a Response object (the same one received or a new one), return a Request object, or raise IgnoreRequest.

process_exception(request, exception, spider): Scrapy calls process_exception() when a download handler or a process_request() method (from a downloader middleware) raises an exception, including an IgnoreRequest exception. It can return None to continue processing the exception, a Response object to stop the processing chain and return that response, or a Request object to stop the processing chain and schedule a new request.
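Putting these methods together, a custom downloader middleware has roughly the following shape (a minimal sketch; the class name is illustrative and each method simply passes things through):

```python
class CustomDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this once when the spider starts; the crawler object
        # gives access to settings, signals, and stats if we need them.
        return cls()

    def process_request(self, request, spider):
        # Returning None lets the request continue down the middleware chain.
        return None

    def process_response(self, request, response, spider):
        # Must return a Response or a Request, or raise IgnoreRequest.
        return response

    def process_exception(self, request, exception, spider):
        # Returning None lets other middleware handle the exception.
        return None
```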
Here’s an example of a custom downloader middleware that modifies the User-Agent header of outgoing requests:
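A minimal sketch of such a middleware, together with the spider settings that enable it (the class path, the USER_AGENTS setting name, and the user-agent strings are illustrative):

```python
# middlewares.py
import random


class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Retrieve the USER_AGENTS list from the settings before
        # constructing the middleware instance.
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        if self.user_agents:
            # Attach a randomly chosen User-Agent to the outgoing request.
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

```python
# books.py (inside the spider class)
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "ScrapyProject.middlewares.RandomUserAgentMiddleware": 543,
    },
    "USER_AGENTS": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0",
    ],
}
```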
Code explanation
books.py: The scraping logic itself is unchanged. We first enable the custom middleware in custom_settings, as previously discussed, and then define a list of fictitious USER_AGENTS in the settings; Scrapy will select one of these user agents for each request it sends.

Lines 5–9: We define a custom middleware called RandomUserAgentMiddleware. This middleware randomly selects a user agent from the provided list and includes the chosen value in the request header.

Lines 11–15: The from_crawler() method is defined here. It serves as a constructor for an instance of the middleware: before creating the instance, it retrieves the USER_AGENTS list from the settings and passes it to the constructor. This is the main method Scrapy calls when starting the spider to initialize the middleware.

Lines 17–20: This part implements the process_request() method, which is invoked for every request generated by the spider. Within it, we attach the chosen User-Agent to the request headers.
Spider middleware
Spider middleware is a set of hooks into Scrapy's spider processing mechanism. It allows us to integrate custom functionality for handling the responses sent to spiders during processing and for managing the requests and items generated by those spiders.
Built-in spider middleware
Built-in spider middleware in Scrapy are components that sit between the Scrapy engine and our spiders. They are designed to process spider input (responses) and output (items and requests). Spider middleware can modify, drop, or add new requests or items. They are an essential part of Scrapy's architecture, allowing for the extension and customization of the framework's functionality. Some of the commonly used built-in middleware components include:
DepthMiddleware
The DepthMiddleware tracks the depth of each request within a website, which is helpful for limiting crawling depth and managing request priority. Its configuration settings include DEPTH_LIMIT to set a maximum depth and DEPTH_STATS_VERBOSE for collecting request counts at each depth.
HttpErrorMiddleware
This middleware filters out unsuccessful (error) HTTP responses so that spiders don't have to handle them. Such responses typically add overhead, consume extra resources, and complicate spider logic.
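If the spider does need to see particular error responses, they can be allowed explicitly; a small sketch using the HTTPERROR_ALLOWED_CODES setting (the code list is illustrative):

```python
custom_settings = {
    # Let 404 responses reach the spider callbacks instead of being dropped.
    "HTTPERROR_ALLOWED_CODES": [404],
}
```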
OffsiteMiddleware
This middleware filters out requests to domains other than those specified in the spider's allowed_domains attribute. This helps keep the crawl focused on the relevant domains and avoids wasting resources on crawling external sites.
Let’s see an example of utilizing some of these middleware.
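Here is a minimal sketch of a spider along these lines, assuming the books.toscrape.com catalogue (the spider name and selectors are illustrative):

```python
# books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    # OffsiteMiddleware drops requests to any other domain.
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "DEPTH_LIMIT": 2,             # don't follow links deeper than 2 levels
        "DEPTH_STATS_VERBOSE": True,  # collect request counts per depth
    }

    def parse(self, response):
        # DepthMiddleware records the current depth in response.meta.
        depth = response.meta.get("depth", 0)
        self.logger.info("Parsing %s at depth %d", response.url, depth)

        # If the request was redirected, the intermediate URLs are stored
        # under the redirect_urls key by the redirect middleware.
        redirects = response.meta.get("redirect_urls", [])
        if redirects:
            self.logger.info("Redirected from %s", redirects)

        # Follow category links; off-site links would be filtered out.
        for href in response.css("ul.nav a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```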
Code explanation
books.py

Line 8: We define the allowed_domains parameter, which enables OffsiteMiddleware to filter out off-site requests.
Lines 10–13: We define custom settings for the spider; DEPTH_LIMIT configures the maximum depth the spider will adhere to while following URLs.
Line 21: To get the current request depth, we read the depth key from response.meta.
Line 25: Lastly, we can get the redirected URLs from the redirect_urls key in response.meta.
Custom spider middleware
Custom spider middleware can modify, drop, or enrich requests and items, or handle exceptions that occur during the processing of responses. They provide hooks into the Scrapy engine for pre- and post-processing spider input (responses) and output (items and requests). A custom spider middleware is a Python class that defines one or more of the following methods:
process_spider_input(response, spider): This method is invoked for each response passing through the spider middleware into the spider for further processing. It can either return None or raise an exception.

If it returns None, Scrapy continues processing the response through the other middleware and ultimately passes it to the spider.

If it raises an exception, the remaining spider middleware's process_spider_input() methods are skipped, and Scrapy calls the request's errback (if it has one) or starts the process_spider_exception() chain. The output of the errback is then chained back in the other direction through process_spider_output(), or through process_spider_exception() if the errback itself raised an exception.
process_spider_output(response, result, spider): This method is called with the results returned by the spider after it has processed a response. It must return an iterable of Request objects and/or item objects (dicts or Item instances); in practice, it is usually written as a generator that yields them.
process_spider_exception(response, exception, spider): This method is called when a spider or a process_spider_input() method (from another spider middleware) raises an exception. It should return either None, to continue processing the exception through other middleware, or an iterable of Request or item objects.
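Taken together, a custom spider middleware has roughly this shape (a minimal sketch; the class name is illustrative and each method simply passes data through):

```python
class PassThroughSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # Returning None lets the response continue on to the spider.
        return None

    def process_spider_output(self, response, result, spider):
        # Yield every item and request produced by the spider, unchanged.
        for item_or_request in result:
            yield item_or_request

    def process_spider_exception(self, response, exception, spider):
        # Returning None lets other middleware handle the exception.
        return None
```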
Here is an example of a custom Spider Middleware in Scrapy that logs the response body size for each request:
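A minimal sketch of this middleware, plus the settings entry that enables it (the class path and priority number are illustrative):

```python
# middlewares.py
class ResponseSizeMiddleware:
    def process_spider_output(self, response, result, spider):
        # Log the size of the response body, in bytes.
        spider.logger.info(
            "Response from %s is %d bytes", response.url, len(response.body)
        )
        # Pass the spider's items and requests through unchanged.
        for item_or_request in result:
            yield item_or_request
```

```python
# books.py (inside the spider class)
custom_settings = {
    "SPIDER_MIDDLEWARES": {
        "ScrapyProject.middlewares.ResponseSizeMiddleware": 543,
    }
}
```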
Code explanation
books.py: No changes are made to the code. Initially, we enable the custom middleware in the custom_settings, as previously discussed.

middlewares.py

Line 1: We define a custom spider middleware called ResponseSizeMiddleware. This middleware logs the size of every response body in bytes.
Lines 3–6: The process_spider_output() method is implemented here to handle the spider's output. It simply calculates the length of the response body and logs it.
Note: While spider middleware is important, most of the time we won't need to write a custom one and can rely on the built-in ones.
Downloader vs. Spider middleware
Aspect | Downloader middleware | Spider middleware |
Purpose | Primarily concerned with handling the HTTP requests and responses, such as modifying headers, handling proxies, and compression. | Primarily responsible for processing and manipulating spider-specific logic, including response parsing and error handling. |
Order of Execution | Executes before sending requests and after receiving responses. | Executes after receiving responses and before passing data to spiders. |
Typical Use Cases | Rotating user agents, managing cookies and proxies, retrying failed requests. | Filtering off-site requests, limiting crawl depth, dropping error responses. |
Configuration | Configured using the DOWNLOADER_MIDDLEWARES setting. | Configured using the SPIDER_MIDDLEWARES setting. |
Conclusion
In conclusion, spider middleware and downloader middleware are essential components of the Scrapy framework, providing developers with powerful tools to customize and enhance their scraping workflows.