Now, we will explore Scrapy’s middleware, a crucial framework component. Middleware plays a pivotal role in modifying and controlling Scrapy’s request and response objects: we attach middleware components that perform specific processing as requests and responses flow through the engine.

Middleware types

Downloader middleware

Downloader middleware allows us to manipulate requests and responses, add custom headers, handle proxies, or modify how Scrapy interacts with websites. To enable a downloader middleware component, we define it in the spider settings, just as we did with pipelines. To do that, we add the following to custom_settings in the spider class:

custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "ScrapyProject.middlewares.CustomDownloaderMiddleware": 543,
    },
}

Much like pipelines, middleware components are executed in a specific order determined by their assigned numbers: lower values sit closer to the engine, and higher values sit closer to the downloader.

Built-in downloader middleware

Built-in downloader middleware components sit between Scrapy’s engine and the downloader that fetches pages from the websites we are scraping. Several built-in components cover common use cases. Some of the notable ones include:

  • CookiesMiddleware

    • This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers and sends them back on subsequent requests (from that spider), just like web browsers do.

    • To enable this middleware, we set COOKIES_ENABLED to True in the spider settings (it is enabled by default); setting COOKIES_DEBUG to True additionally logs the cookies sent and received.

  • UserAgentMiddleware

    • It sets the User-Agent header on outgoing requests. We can customize it to use a specific user agent (for example, to avoid being identified as the default Scrapy client), and it overrides Scrapy’s default user agent.

    • This middleware is enabled by default; setting self.user_agent in the spider class overrides the USER_AGENT value in the spider settings.

  • RetryMiddleware

    • Manages request retries in case of network errors or HTTP error codes. Failed pages are collected during the scraping process and rescheduled once the spider has finished crawling all regular (non-failed) pages.

    • This middleware can be configured using RETRY_ENABLED and RETRY_TIMES in the spider settings.

Let’s see an example of utilizing some of these middleware.

Scraping quotes by utilizing built-in middleware

Code explanation

  • Lines 12–20: We define custom settings for the spider, including middleware and retry settings.

    • Line 13: Configures downloader middleware such as CookiesMiddleware and RetryMiddleware.

    • Lines 17–18: Enables cookie handling (COOKIES_ENABLED) and cookie debugging (COOKIES_DEBUG).

    • Line 19: Sets the maximum number of retry times to 3 (RETRY_TIMES).

  • Lines 22–30: We send a start request to the login page.

    • Line 23: The start_requests() method, which Scrapy calls automatically, returns a list containing a FormRequest.

    • Line 25: The FormRequest is used to perform a POST request to the login page with username and password form data.

    • Line 29: The callback argument specifies that the logged_in method should be called after a successful login.

  • Lines 32–39: We define the logged_in method to handle the response after logging in.

    • Line 33: Checks if the word Logout is present in the response text, indicating a successful login.

    • Lines 34–35: If logged in, it logs a success message and initiates requests to the main page(s) to start scraping quotes.

    • Line 39: If login fails, it logs an error message.

  • Lines 41–50: We define the parse method to extract data from the web pages.

    • Lines 42–43: Checks if the word Logout is present in the response text, indicating that the spider is still logged in.

    • Line 44: Iterates through each quote element in the HTML using CSS selectors.

    • Lines 45–49: Extracts the quote text, author, and tags.

    • Line 50: Yields a QuoteItem with the extracted data.

When examining the logs, we can observe the cookies being transmitted between requests, indicating the validity of the login session.
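
The line numbers above refer to the interactive code widget. As a reference, here is a condensed sketch of what such a spider might look like; the spider name, credentials, and selectors are assumptions, and a real login form (such as the one on quotes.toscrape.com) may also require extra fields like a CSRF token.

import scrapy
from scrapy.http import FormRequest

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

class QuotesLoginSpider(scrapy.Spider):
    name = "quotes_login"

    custom_settings = {
        # These built-in components are enabled by default; listing them makes the setup explicit
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        },
        "COOKIES_ENABLED": True,   # keep the session cookie between requests
        "COOKIES_DEBUG": True,     # log cookies sent and received
        "RETRY_TIMES": 3,          # retry failed requests up to three times
    }

    def start_requests(self):
        # POST the login form; Scrapy calls this method automatically
        return [FormRequest(
            "https://quotes.toscrape.com/login",
            formdata={"username": "user", "password": "password"},  # placeholder credentials
            callback=self.logged_in,
        )]

    def logged_in(self, response):
        if "Logout" in response.text:
            self.logger.info("Login succeeded")
            yield scrapy.Request("https://quotes.toscrape.com/", callback=self.parse)
        else:
            self.logger.error("Login failed")

    def parse(self, response):
        if "Logout" in response.text:
            for quote in response.css("div.quote"):
                item = QuoteItem()
                item["text"] = quote.css("span.text::text").get()
                item["author"] = quote.css("small.author::text").get()
                item["tags"] = quote.css("a.tag::text").getall()
                yield item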

Custom downloader middleware

Custom downloader middleware offers a way to inject custom logic into the request handling and response processing flows of our Scrapy spider. This can include altering requests before they are sent, changing responses before they are passed to spiders, handling exceptions in a specific way, and more. Creating a custom downloader middleware involves defining a Python class with a from_crawler() class method as its primary entry point, which Scrapy calls when initializing the spider, plus one or more of the following methods, depending on the middleware's purpose:

  • process_request(request, spider): This method is called for each request that goes through the downloader middleware.

    • It should either return None, return a Response object, return a Request object, or raise IgnoreRequest.

      • If it returns None, Scrapy continues processing the request through other middleware until it reaches the appropriate downloader handler to perform the request and download its response.

      • If it returns a Response object, Scrapy won't call any other process_request() or process_exception() methods (or download the page at all); instead, the installed middleware's process_response() methods are called on that response.

      • If it returns a Request object, Scrapy stops calling process_request() methods and schedules the new request. Once this new request is performed, the middleware chain is applied to the downloaded response.

      • If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware are called. If none handle the exception, the request's errback function is invoked. Unlike other exceptions, if no code handles the exception, it is silently ignored and not logged.

  • process_response(request, response, spider): This method is called with the response returned from the downloader, for each request that goes through the downloader middleware. It should return a Response object (the same response it received or a new one), return a Request object, or raise an IgnoreRequest exception.

  • process_exception(request, exception, spider):

    • Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

    • It can return None to continue processing the exception through other middleware, a Response object to stop the exception chain and start the process_response() chain with that response, or a Request object to stop the chain and reschedule the returned request. A minimal skeleton of these three hooks follows below.
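
To make these return conventions concrete, here is a minimal skeleton of a downloader middleware; the ErrorHandlingMiddleware name and its behavior (dropping 404 responses and logging download errors) are purely illustrative assumptions.

from scrapy.exceptions import IgnoreRequest

class ErrorHandlingMiddleware:
    def process_request(self, request, spider):
        # Returning None lets the request continue through the remaining middleware
        return None

    def process_response(self, request, response, spider):
        # Raising IgnoreRequest drops the response; otherwise pass it along unchanged
        if response.status == 404:
            raise IgnoreRequest(f"Skipping missing page: {request.url}")
        return response

    def process_exception(self, request, exception, spider):
        # Returning None lets other middleware (e.g., RetryMiddleware) handle the error
        spider.logger.warning("Download failed for %s: %s", request.url, exception)
        return None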

Here’s an example of a custom downloader middleware that modifies the User-Agent header of outgoing requests:

Scraping books by utilizing custom downloader middlewares

Code explanation

  • books.py: The spider logic itself is unchanged. We first enable the custom middleware in custom_settings, as previously discussed, and then define a list of fictitious USER_AGENTS; the middleware will select one of them for each request it sends.

  • middlewares.py

    • Lines 5–9: We define a custom middleware called RandomUserAgentMiddleware. This middleware randomly selects a user agent from the provided list and includes the chosen value in the request header.

    • Lines 11–15: The from_crawler() method is defined here. It serves as a constructor for an instance of the middleware. Before creating the instance, it retrieves the USER_AGENTS list and passes it to the constructor. This is the main method Scrapy calls when starting the spider to initialize any middleware.

    • Lines 17–20: This part implements the process_request() method, which is invoked for every request generated by the spider. Within it, we attach the chosen User-Agent to the request headers (see the sketch below).
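
For reference, a minimal sketch of such a middleware is shown below; the exact line numbers differ from the walkthrough above, and the USER_AGENTS setting name assumes the list is defined in the spider or project settings.

# middlewares.py
import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list from the (spider or project) settings
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        # Attach a randomly chosen User-Agent to every outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request

The middleware is then enabled under DOWNLOADER_MIDDLEWARES in custom_settings, exactly as shown earlier for CustomDownloaderMiddleware.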

Spider middleware

Spider middleware provides hooks into Scrapy's spider processing mechanism. It allows us to plug in custom functionality for handling the responses sent to spiders and for managing the requests and items generated by those spiders.

Built-in spider middlewares

Built-in spider middleware components in Scrapy sit between the Scrapy engine and our spiders. They are designed to process spider input (responses) and output (items and requests). Spider middleware can modify, drop, or add new requests or items. It is an essential part of Scrapy's architecture, allowing for the extension and customization of the framework's functionality. Some of the commonly used built-in middleware components include:

  • DepthMiddleware

    • The DepthMiddleware tracks request depth within a website, helpful in limiting crawling depth and managing request priority.

    • Its configuration settings include DEPTH_LIMIT to set a maximum depth and DEPTH_STATS_VERBOSE for collecting request counts at each depth.

  • HttpErrorMiddleware

    • It filters out unsuccessful (error) HTTP responses so that spiders don't have to handle them. Such responses typically add overhead, consume extra resources, and complicate spider logic.

  • OffsiteMiddleware

    • This middleware filters out requests to domains other than those specified in the spider's allowed_domains attribute. This helps keep the crawl focused on the relevant domains and avoids wasting resources on crawling external sites.

Let’s see an example of utilizing some of these middlewares.

Scraping books by utilizing spider middlewares

Code explanation

  • books.py

    • Line 8: We define the allowed_domains attribute, which enables OffsiteMiddleware filtering.

    • Lines 10–13: We define custom settings for the spider; DEPTH_LIMIT configures the maximum depth the spider will adhere to while following URLs.

    • Line 21: To get the current request depth, we query the depth key from response.meta.

    • Line 25: Lastly, we can get the redirected URLs using the redirect_urls key from response.meta (a condensed sketch of this spider follows below).
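
As a reference, here is a condensed sketch of such a spider; the start URL, depth limit value, and selectors are assumptions based on books.toscrape.com, and exact line numbers differ from the walkthrough above.

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    # allowed_domains activates OffsiteMiddleware filtering of requests to other domains
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "DEPTH_LIMIT": 2,             # DepthMiddleware stops following links beyond this depth
        "DEPTH_STATS_VERBOSE": True,  # collect the number of requests per depth in the stats
    }

    def parse(self, response):
        # DepthMiddleware stores the current depth in response.meta
        depth = response.meta.get("depth", 0)
        # Any intermediate redirect URLs are recorded under redirect_urls
        redirects = response.meta.get("redirect_urls", [])
        self.logger.info("depth=%s redirects=%s url=%s", depth, redirects, response.url)

        # Follow book links; requests deeper than DEPTH_LIMIT are dropped automatically
        for href in response.css("h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)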

Custom spider middleware

Custom spider middleware can modify, drop, or enrich requests and items, or handle exceptions that occur during the processing of responses. It provides hooks into the Scrapy engine for pre- and post-processing spider input (responses) and output (items and requests). A custom spider middleware is a Python class that defines one or more of the following methods:

  • process_spider_input(response, spider)

    • This method is invoked for each response passing through the spider middleware into the spider for further processing.

    • It can either return None or raise an exception.

      • If it returns None, Scrapy continues processing the response through other middlewares and ultimately passes it to the spider.

      • If it raises an exception, the process_spider_input() methods of the other spider middleware are skipped, and Scrapy calls the request's errback if there is one; otherwise, it starts the process_spider_exception() chain. The output of the errback is chained back in the other direction for process_spider_output() to process, or for process_spider_exception() if the errback raised an exception.

  • process_spider_output(response, result, spider)

    • This method is called with the results returned from the Spider after the response is processed.

    • This method must return an iterable of Request, dict, or Item objects, typically by yielding them.

  • process_spider_exception(response, exception, spider)

    • This method is called when a spider or a process_spider_input() method (from another spider middleware) raises an exception. It should return either None, to continue processing the exception through other middleware, or an iterable of Request, dict, or Item objects.

Here is an example of a custom Spider Middleware in Scrapy that logs the response body size for each request:

Scraping books by utilizing custom spider middlewares

Code explanation

  • books.py: The spider logic is unchanged; we only enable the custom middleware in custom_settings, as previously discussed.

  • middlewares.py

    • Line 1: We define a custom spider middleware called ResponseSizeMiddleware. This middleware logs the size of every response body in bytes.

    • Lines 3–6: The process_spider_output() method is implemented here to handle the spider's output. It calculates the length of the response body, logs it, and passes the results through unchanged (a sketch follows below).
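
A minimal sketch of this middleware might look as follows; the log message format is an assumption.

# middlewares.py
class ResponseSizeMiddleware:
    def process_spider_output(self, response, result, spider):
        # Log the size of the response body in bytes
        spider.logger.info("Response from %s: %d bytes", response.url, len(response.body))
        # Yield the spider's items and requests through unchanged
        for item_or_request in result:
            yield item_or_request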

Note: While spider middleware is important, most of the time we won't need to write a custom one and can rely on the built-in components.

Working of middleware

Downloader vs. Spider middleware

| Aspect | Downloader middleware | Spider middleware |
| --- | --- | --- |
| Purpose | Primarily concerned with handling HTTP requests and responses, such as modifying headers, handling proxies, and compression. | Primarily responsible for processing spider-specific logic, including response parsing and error handling. |
| Order of execution | Executes before requests are sent and after responses are received. | Executes after responses are received and before data is passed to spiders. |
| Typical use cases | Modifying request headers; handling proxies and IP rotation; managing HTTP compression; implementing custom download logic; retry policies | Parsing and extracting data from responses; handling redirects; filtering and prioritizing requests; handling exceptions and errors; authentication (e.g., login); custom processing as per spider requirements |
| Configuration | Configured using DOWNLOADER_MIDDLEWARES | Configured using SPIDER_MIDDLEWARES |
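
In both cases, enabling a component differs only in the settings key used. For example (the component paths below are assumptions based on the sketches in this lesson):

custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        # hooks between the engine and the downloader
        "ScrapyProject.middlewares.RandomUserAgentMiddleware": 543,
    },
    "SPIDER_MIDDLEWARES": {
        # hooks between the engine and the spider
        "ScrapyProject.middlewares.ResponseSizeMiddleware": 545,
    },
}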

Conclusion

In conclusion, spider middleware and downloader middleware are essential components in the Scrapy framework, providing developers with powerful tools to customize and enhance Scrapy workflows.
