Scrapy Data Pipeline

Learn how Scrapy organizes the data pipeline and exports it in any structured format.

Having familiarized ourselves with Scrapy's fundamental modules, which empower us to extract information from various websites, it's time to explore exporting our scraper's output in a structured format.

Core modules

Scrapy offers a systematic approach to organizing the data we scrape into structured formats that can be easily used for various purposes. It achieves this through three core modules: Items, Item Loaders, and the Item Pipeline.

[Figure: Scrapy output modules]

The diagram below illustrates the fundamental connections between these modules:

[Figure: Scrapy output modules]

Spider.py is the core scraping spider code. It uses Items.py with ItemLoader to containerize the scraped data, and then uses ItemPipeline.py to perform final processing on the data and save it in a structured format.

Items

Items are simple containers that hold the data we want to extract from a website. They serve as a structured data representation and help us maintain consistency in our scraped results.

Items are defined as Python classes that inherit from scrapy.Item inside the items.py file. Each attribute of the item class represents a piece of data we want to extract. By defining the fields in the item class, we specify the structure of the data we will scrape.

Here’s a basic example of defining a Scrapy item for scraping quotes from the Quotes to Scrape website:

import scrapy


class QuoteItem(scrapy.Item):
    # One Field per piece of data we want to extract
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

In this example, the QuoteItem class represents a quote with its corresponding text, author, and tags. Each Field object can carry metadata for its field, passed as keyword arguments. There is no restriction on the keys or values that a Field object accepts.

Once we've defined our item class, we can start using it. Within our spider's parsing methods, we can create instances of the item class, assign values to its fields, and yield the populated item.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
Scraping data from quotes website by utilizing Items

Inspecting the code output, we find the data yielded in a more structured, dictionary-like form. Scrapy handles every item the same way and rejects any field that isn't declared in our item definition.

Item Loaders

Scrapy Item Loaders are a powerful tool for streamlining the process of populating Scrapy items. They offer a more organized approach to data manipulation by allowing us to apply input and output processors to item fields, making it easier to clean, validate, and format the scraped data before it's stored in the item.

Let’s dive into an example to see how Scrapy Item Loaders work:

Scraping data from quotes website by utilizing Item Loaders

In this example, we use an ItemLoader to populate the QuoteItem. The ItemLoader is initialized with the selector of the HTML element containing the data. Then, add_css is used to specify how to extract data from the selector and assign it to item fields. We can also apply custom processors, such as cleaning or formatting functions, to fields during this process. Finally, loader.load_item() returns the populated item.

How do Item Loaders work?

Item Loaders can be exceptionally helpful when extracting the same field from various sources. Let's consider an example:

We're scraping a product item, and the product's name can be obtained either from the name value or the title. In such a case, the ItemLoader will look like this:

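from scrapy.loader import ItemLoader

def parse(self, response):
    # Product is assumed to be an item class defined in items.py
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath("name", '//div[@class="product_name"]')
    l.add_xpath("name", '//div[@class="product_title"]')
    return l.load_item()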

Here's how it works:

  1. We extract the name from //div[@class="product_name"] and pass it through the input processor of the name field.

  2. We extract the name from //div[@class="product_title"], and it also goes through the same input processor used in step (1).

  3. The result of the input processor is appended to the data collected in Step 1 if there is any data.

  4. The data gathered in Steps 1 and 2 is then passed through the output processor of the name field. The result of the output processor becomes the value assigned to the name field in the item.
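The four steps above can be mimicked in plain Python, without Scrapy, to make the data flow concrete. The strip_input and first_output functions and the sample strings are hypothetical stand-ins for a field's input processor, its output processor, and the values the two XPath expressions might extract.

```python
def strip_input(values):
    """Stand-in input processor: trim whitespace from each extracted value."""
    return [value.strip() for value in values]


def first_output(values):
    """Stand-in output processor: keep the first non-empty collected value."""
    for value in values:
        if value:
            return value
    return None


collected = []

# Step 1: values from the first XPath go through the input processor.
collected += strip_input(["  Plasma TV "])

# Steps 2 and 3: values from the second XPath go through the same
# input processor, and the result is appended to what was collected.
collected += strip_input(["Plasma TV 42in  "])

# Step 4: everything collected goes through the output processor,
# and its result becomes the value of the "name" field.
name = first_output(collected)
```

Here collected ends up holding both cleaned values, and name is the first of them.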

Scrapy offers several built-in processors, such as Identity, TakeFirst, Join, and MapCompose, which can serve as input or output processors. We can use them according to our specific needs.

Note: While Item Loaders are valuable tools provided by Scrapy, they are not necessary for every web scraping task.

Item pipeline

After extracting items, Scrapy allows us to apply additional processing using item pipelines. A pipeline is a series of processing steps that every item passes through after the spider yields it.

Each item pipeline component is a Python class, written in the pipelines.py file, that implements a few simple methods. The class receives an item, performs an action on it, and decides whether the item continues through the pipeline or is dropped, which ends its processing.

Typical applications of item pipelines include:

  • Cleansing HTML data

  • Validating scraped data (checking that the items contain specific fields)

  • Checking for duplicates (and dropping them)

  • Storing the scraped item in a database

An item pipeline class can implement the following methods:

  • process_item(item, spider): This method is called for every item the spider yields. It must be implemented, and it must either return an item or raise a DropItem exception.

  • open_spider(spider): We don't need to implement this one, but we can use it if we have logic to run when the spider opens.

  • close_spider(spider): Same as open_spider(spider), but called when the spider is closed.

Let's try adding a duplicate-dropping pipeline and a JSON writer pipeline to our quotes scraper and see how they work in Scrapy.

Scraping data from quotes website by utilizing Item pipeline

Code explanation

Here is the breakdown of the above code files:

  1. quotes.py

    1. Lines 8–13: Here we define custom_settings, which override the project-wide settings for this spider only.

      1. To enable the item pipelines, we list them in the spider's settings. In our case, they are scraper.scraper.pipelines.DuplicateItemPipeline and scraper.scraper.pipelines.JsonWriterPipeline, which are the full import paths of the Python classes inside our project directory.

      2. Item pipelines run sequentially, so one pipeline's output is the next pipeline's input. That's why we assign each pipeline an arbitrary number (here, 200 and 300) that defines the execution order; the pipeline with the lower number runs first.

    2. Lines 22–28: This part repeats the original parsing code so that duplicate items are produced and the duplicate pipeline is triggered.

  2. pipelines.py

    1. Lines 5–16: Here, we define the DuplicateItemPipeline, which initializes a Python set that stores the items scraped so far and uses it to check whether an incoming item already exists.

      1. Line 9: We implement the process_item() method, which every pipeline must have in order to process each item.

      2. The method checks whether the item exists in the set; if it does, it raises a DropItem error. Otherwise, it returns the item so that the other pipelines can process it.

    2. Lines 19–29: This is the second pipeline we define, responsible for writing each item to a JSON file.

      1. Line 21: We first open the file we will write to. This happens directly after the spider starts.

      2. Lines 23–24: We implement close_spider(), which closes the file after the spider finishes.

      3. Lines 26–29: Finally, we implement process_item() again, this time to write the item to the file. To do that, we convert the Scrapy item to a Python dictionary using the ItemAdapter class from the itemadapter library.

Feed exports

Feed exports are another Scrapy module that stores scraped items in structured output formats. This can save time compared to creating a custom item pipeline for popular formats like JSON, CSV, etc.

Scrapy, by default, supports multiple serialization formats and storage backends. The built-in formats include:

  1. JSON

  2. JSON Lines

  3. CSV

  4. XML

The built-in storage backends include:

  1. Local filesystem

  2. S3

  3. Google Cloud Storage (GCS)

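We only need to add two values to the spider settings to use feed exports. (Scrapy 2.1 and later deprecate these two settings in favor of the single FEEDS setting, but the idea is the same.)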

  • FEED_FORMAT: Format can be json, csv, etc.

  • FEED_URI: Output file location.
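As a rough sketch, these settings are plain dictionary entries that could be assigned to a spider's custom_settings. The variable names below are hypothetical, and the second dictionary shows what is assumed to be the equivalent configuration using the newer FEEDS setting.

```python
# The two settings described above, writing JSON to data.json.
legacy_feed_settings = {
    "FEED_FORMAT": "json",
    "FEED_URI": "data.json",
}

# Assumed equivalent with FEEDS: each key is an output location,
# and its value holds the options for that feed.
feeds_settings = {
    "FEEDS": {
        "data.json": {"format": "json", "encoding": "utf8"},
    },
}
```

One advantage of the FEEDS form is that several output files, each with its own format, can be declared side by side.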

Let’s test this as well with an example:

Scraping data from quotes website by utilizing Feed exports

Running this spider will save the scraped data in JSON format to the file data.json.

Conclusion

In this section, we covered the essential components of Scrapy that enable us to harness the full power of web scraping. We began by exploring Scrapy Items, structured containers for organizing the data we scrape from websites. We then took a closer look at Scrapy Item Loaders, a valuable tool for streamlining the process of populating items. Additionally, we discussed the importance of the item pipeline in Scrapy, which allows us to apply additional processing steps to our scraped data. Lastly, we explored feed exports, which use built-in exporters to write scraped items to common output formats.
