Scrapy Data Pipeline
Learn how Scrapy organizes the data pipeline and exports scraped data in structured formats.
Having familiarized ourselves with Scrapy's fundamental modules, which empower us to extract information from various websites, it's time to explore exporting our scraper's output in a structured format.
Core modules
Scrapy offers a systematic approach to organizing the unstructured data we scrape into formats that can easily be used for various purposes. It achieves this through three core modules: Items, Item Loaders, and Item Pipelines.
The diagram below illustrates the fundamental connections between these modules:
Spider.py is the core scraping spider code. It uses Items.py, together with an ItemLoader, to containerize the scraped data, and then ItemPipeline.py performs final processing on the data and saves it in a structured format.
Items
Items are simple containers that hold the data we want to extract from a website. They serve as a structured data representation and help us maintain consistency in our scraped results.
Items are defined using Python classes that inherit from scrapy.Item inside the Items.py file. Each attribute of the item class represents a piece of data we want to extract. By defining the fields in the item class, we specify the structure of the data we will scrape.
Here’s a basic example of defining a Scrapy item for scraping quotes from the Quotes to Scrape website:
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
In this example, the QuoteItem class represents a quote with its corresponding text, author, and tags. Field objects are used to specify metadata for each field; we can attach any metadata to a field, and there is no restriction on the values Field objects accept.
Once we've defined our item class, we can start using it. Within our spider's parsing methods, we can create instances of the item class, assign values to its fields, and yield the populated item.
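As a minimal sketch, assuming the QuoteItem defined above lives in the project's items.py and that we are scraping the Quotes to Scrape website, a spider that yields populated items might look like this:

import scrapy

from ..items import QuoteItem  # assumes QuoteItem is defined in the project's items.py


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each div.quote block on the page holds one quote
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item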
Inspecting the output, we will find the data yielded in a more structured, dictionary-like form; Scrapy automatically handles the data and ensures it is processed consistently and matches our item definition.
Item Loaders
Scrapy Item Loaders are a powerful tool for streamlining the process of populating Scrapy items. They offer a more organized approach to data manipulation by allowing us to apply input and output processors to item fields, making it easier to clean, validate, and format the scraped data before it’s stored in the item.
Let’s dive into an example to see how Scrapy Item Loaders work:
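The example below is a minimal sketch of how this could look for our quotes spider; the processors shown (MapCompose, TakeFirst, and Join) are just one reasonable choice, not the only possible setup:

import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class QuoteItem(scrapy.Item):
    # Input processors clean each extracted value; output processors shape the final field value
    text = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    author = scrapy.Field(output_processor=TakeFirst())
    tags = scrapy.Field(output_processor=Join(","))


class QuotesSpider(scrapy.Spider):
    name = "quotes_loader"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            # The loader is initialized with the selector of the element holding the data
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css("text", "span.text::text")
            loader.add_css("author", "small.author::text")
            loader.add_css("tags", "div.tags a.tag::text")
            yield loader.load_item()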
In this example, we use an ItemLoader to populate the QuoteItem. The ItemLoader is initialized with the selector of the HTML element containing the data. Then, add_css is used to specify how to extract data from the selector and assign it to item fields. We can also apply custom processors, such as cleaning or formatting functions, to fields during this process. Finally, loader.load_item() returns the populated item.
How do Item Loaders work?
Item Loaders can be exceptionally helpful when extracting the same field from various sources. Let's consider an example:
We're scraping a product item, and the product's name can be obtained from either a product name element or a product title element on the page. In such a case, the ItemLoader will look like this:
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    # Both XPath expressions feed the same "name" field
    l.add_xpath("name", '//div[@class="product_name"]')
    l.add_xpath("name", '//div[@class="product_title"]')
    return l.load_item()
Here's how it works:
1. We extract the name from //div[@class="product_name"] and pass it through the input processor of the name field.
2. We extract the name from //div[@class="product_title"], and it also goes through the same input processor used in step 1. The result of the input processor is appended to the data collected in step 1, if there is any.
3. The data gathered in steps 1 and 2 is then passed through the output processor of the name field, and the result of the output processor becomes the value assigned to the name field in the item.
Scrapy offers various types of output processors, such as Identity, Join, and MapCompose. We can use them according to our specific needs.
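As a quick, standalone illustration of how a couple of these processors behave (run on plain Python lists, outside of any spider):

from itemloaders.processors import Join, MapCompose

# MapCompose applies each function to every value in the list, in order
clean_names = MapCompose(str.strip, str.title)
print(clean_names(["  hello world ", " scrapy "]))  # ['Hello World', 'Scrapy']

# Join concatenates the collected values into a single string
print(Join(", ")(["python", "web", "scraping"]))  # 'python, web, scraping'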
Note: While Item Loaders are valuable tools provided by Scrapy, they are not necessary for every web scraping task.
Item pipeline
After extracting items, Scrapy allows us to apply additional processing using item pipelines. Pipelines are a series of processing steps that items pass through after being yielded by a spider.
Each item pipeline component is essentially a Python class that implements a straightforward method and lives in the pipelines.py file. These classes receive an item, execute a specific action on it, and determine whether the item should proceed through the pipeline or be discarded, thereby ending its processing.
Typical applications of item pipelines include:
Cleansing HTML data
Validating scraped data (checking that the items contain specific fields)
Checking for duplicates (and dropping them)
Storing the scraped item in a database
The class implements the following methods (a minimal skeleton is sketched after this list):
process_item(item, spider): This method is called for every yielded item. It must either return an item or raise a DropItem exception, and it must be implemented.
open_spider(spider): We don't need to implement this one, but we can use it if we have logic to apply when the spider opens.
close_spider(spider): Same as open_spider(spider), but called when the spider is closed.
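For example, a minimal pipeline skeleton showing all three methods might look like this (ValidateQuotePipeline and its check are purely illustrative):

from scrapy.exceptions import DropItem


class ValidateQuotePipeline:
    def open_spider(self, spider):
        # Optional: run setup logic when the spider starts
        spider.logger.info("Validation pipeline started")

    def close_spider(self, spider):
        # Optional: run teardown logic when the spider finishes
        spider.logger.info("Validation pipeline finished")

    def process_item(self, item, spider):
        # Required: drop items missing a text field, pass everything else along
        if not item.get("text"):
            raise DropItem("Missing text in item")
        return item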
Let's try adding a duplicates-dropping pipeline and a JSON writer pipeline to our quotes scraper and see how item pipelines work in Scrapy.
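Here is a minimal sketch of what the two files could contain, assuming the quotes spider from earlier and the project paths referenced in the explanation below (the output file name items.json is arbitrary):

quotes.py

import scrapy

from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    # Enable both pipelines for this spider only; lower numbers run first
    custom_settings = {
        "ITEM_PIPELINES": {
            "scraper.scraper.pipelines.DuplicateItemPipeline": 200,
            "scraper.scraper.pipelines.JsonWriterPipeline": 300,
        }
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item
            # Yield a copy of the same item on purpose to trigger the duplicates pipeline
            yield item.copy()

pipelines.py

import json

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicateItemPipeline:
    def __init__(self):
        # Set of quote texts we have already seen
        self.items_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter["text"] in self.items_seen:
            raise DropItem(f"Duplicate item found: {adapter['text']!r}")
        self.items_seen.add(adapter["text"])
        return item


class JsonWriterPipeline:
    def open_spider(self, spider):
        # Open the output file as soon as the spider starts
        self.file = open("items.json", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Convert the item to a plain dict and write it as one JSON object per line
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item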
Code explanation
Here is the breakdown of the above code files:
quotes.py
We define a custom_settings attribute that is merged into the project's general settings but applies only to this spider.
To enable the item pipelines, we register them in the spider's settings by specifying which pipelines the scraper will use; in our case, these are scraper.scraper.pipelines.DuplicateItemPipeline and scraper.scraper.pipelines.JsonWriterPipeline, the explicit paths of the Python classes inside our project directory.
Item pipelines work sequentially, so one pipeline's output is the next pipeline's input. That's why we assign each one an arbitrary number (here 200 and 300) that defines which pipeline is executed first; the lower number takes priority.
The parse() method deliberately yields a copy of each item in order to trigger the duplicates pipeline.
pipelines.py
We define the DuplicateItemPipeline, which initializes a Python set that stores the scraped items and lets us check whether each new item already exists.
Its process_item() function, which every pipeline must implement to process each item, checks whether the item exists in the set; if so, it raises a DropItem exception. Otherwise, it returns the item so it can be processed by the following pipelines.
The second pipeline we define, JsonWriterPipeline, is responsible for writing each item to a JSON file.
Its open_spider() method opens the file we will write to; this happens directly after the spider starts.
Its close_spider() method closes the file after the spider finishes.
Finally, its process_item() method writes the item to the file. To do that, we convert the Scrapy item to a Python dictionary using the ItemAdapter class from the itemadapter library.
Feed exports
Feed exports are another Scrapy feature for storing scraped items in structured output formats. This can save time compared to creating a custom item pipeline for popular formats like JSON, CSV, etc.
Scrapy, by default, supports multiple serialization formats and storage backends:
Serialization formats: JSON, JSON Lines, CSV, and XML
Storage backends: the local filesystem, S3, and Google Cloud Storage (GCS)
We only need to add two values to the spider settings to use feed exports.
FEED_FORMAT: The output format, such as json or csv.
FEED_URI: The output file location.
Let’s test this as well with an example:
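A minimal sketch of such a spider, assuming the same quotes site and writing the output to data.json, might look like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes_feed"
    start_urls = ["https://quotes.toscrape.com"]

    # Feed export settings, applied to this spider only
    custom_settings = {
        "FEED_FORMAT": "json",
        "FEED_URI": "data.json",
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }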
Running this spider will save the scraped data in JSON format to the file data.json.
Conclusion
In this section, we covered the essential components of Scrapy that enable us to harness the full power of web scraping. We began by exploring Scrapy Items, structured containers for organizing the data we scrape from websites. We then took a closer look at Scrapy Item Loaders, a valuable tool for streamlining the process of populating items. Additionally, we discussed the importance of item pipelines in Scrapy, which allow us to apply additional processing steps to our scraped data. Lastly, we explored feed exports, which use predefined exporters to write scraped data in common structured formats.