Introduction to Scrapy
Get introduced to the Scrapy application framework and its capabilities.
By now, we've developed a firm grasp of web scraping concepts and how to apply this knowledge to extract information from various websites. Yet, the true potential of scraping shines when we implement it on a larger scale. Our focus will now shift toward mastering the art of constructing expansive scraping projects.
Scrapy
Scrapy is a powerful and popular web crawling and scraping framework in Python. It provides a convenient and flexible way to extract data from websites, with capabilities including:
- Scrapy allows us to send HTTP requests to websites, including GET, POST, PUT, DELETE, etc. We can specify parameters such as headers, cookies, and form data (a request sketch follows this list).
- Scrapy provides a Spider class that allows us to define how to scrape information from a website. We can follow links, parse HTML pages, and extract data using XPath or CSS selectors.
- Once data is extracted, Scrapy pipelines allow us to process and store it. This could involve cleaning, validating, and storing the data in various formats such as JSON or CSV, or in databases like MySQL or MongoDB.
- Scrapy provides a flexible system for processing requests and responses using middleware. This allows us to customize and extend Scrapy's functionality, such as adding custom headers, handling proxies, or implementing custom caching mechanisms.
- Scrapy is built on top of the Twisted asynchronous networking library, allowing it to perform asynchronous requests and handle multiple requests concurrently. This makes it efficient for scraping large amounts of data from various websites.
- Scrapy integrates easily with other Python libraries and tools for data analysis, such as pandas, NumPy, and Matplotlib, allowing us to perform further analysis on the scraped data.
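As a quick illustration of the request capabilities above, here is a minimal sketch of a spider that sends a GET request with custom headers and cookies and a POST request with form data. The spider name, URLs, and field values are placeholders for illustration, not part of any real project:

```python
import scrapy


class LoginSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate request options
    name = "login_example"

    def start_requests(self):
        # GET request with custom headers and cookies
        yield scrapy.Request(
            url="https://example.com/",  # placeholder URL
            headers={"User-Agent": "my-scraper/1.0"},
            cookies={"session_id": "abc123"},
            callback=self.parse,
        )
        # POST request carrying form data
        yield scrapy.FormRequest(
            url="https://example.com/login",  # placeholder URL
            formdata={"username": "user", "password": "pass"},
            callback=self.parse,
        )

    def parse(self, response):
        # Log the status code of each response we receive
        self.logger.info("Received %s from %s", response.status, response.url)
```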
Scrapy vs. Beautiful Soup
Here are some common differences between Scrapy and Beautiful Soup:
| Features | Beautiful Soup | Scrapy |
|---|---|---|
| Purpose | Primarily designed for parsing data | Designed for large-scale web scraping and crawling |
| Best for | Small to medium-sized projects (fewer than ~100 sites) | Better suited for large web projects (1000+ sites) |
| Concurrency | Not supported | Supported |
| Data storage | Needs external libraries and tools | Provides built-in support for storing data in different formats |
| Parsing | HTML | HTML and XML |
| Learning curve | Low learning curve; easier for beginners in web scraping | Steeper learning curve; requires more coding experience and familiarity with web scraping |
Advantages vs. disadvantages
Scrapy brings many benefits, surpassing straightforward tools like Beautiful Soup or Selenium. This becomes especially evident when tackling intricate websites, juggling multiple requests, and managing data pipelines. While Scrapy offers numerous advantages for web scraping, it's also essential to be aware of some potential disadvantages.
Here are some advantages and disadvantages of using Scrapy:
| Advantages | Disadvantages |
|---|---|
| Asynchronous processing: Scrapy operates on an asynchronous foundation, allowing it to send multiple requests and process data simultaneously. | Complexity for simple scraping: Scrapy's power and flexibility might be overkill for simple scraping tasks. |
| Built-in request management: Scrapy's built-in system takes care of sending and overseeing HTTP requests, handling cookies, managing redirects, and dealing with various response codes. | Limited JavaScript support: Scrapy is primarily designed for scraping static web pages and might struggle with dynamic websites that rely heavily on JavaScript-generated content. |
| Scalability: Scrapy's architecture is engineered to handle large-scale scraping endeavors efficiently. | Resource intensive: Scrapy's concurrent request capabilities can lead to increased resource consumption, especially when dealing with a large number of requests. |
| Logging and debugging: Scrapy offers comprehensive logging and debugging tools, making it easier to identify and troubleshoot issues during the scraping process. | Maintenance overhead: While Scrapy encourages a modular design, creating and managing custom components such as middlewares and pipelines can introduce additional maintenance overhead, especially as a project evolves. |
| Item pipelines: Scrapy provides a powerful item pipeline system for processing scraped data before it's stored. | Learning curve: Scrapy's advanced features and architectural concepts can present a steep learning curve for beginners. |
| Crawl control: With Scrapy, we can easily control the rate of our requests, manage crawl delays, and set rules to avoid overloading websites or getting blocked by anti-scraping mechanisms. | Data integrity: Due to Scrapy's asynchronous nature, managing data integrity and consistency across multiple requests and pipelines can be more complex. |
Installation
We can install the Scrapy library in any Python environment by running the command `pip install Scrapy`.
Usage
Scrapy's structure differs slightly from the other libraries we've explored so far. When creating a project with Scrapy, we initiate the process by executing the command `scrapy startproject project_name`. For instance, if we opt for the project name `scraper` and execute this command, it will generate the following project structure:

```
scraper/
    scrapy.cfg
    scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```
Run the command `scrapy startproject scraper` in the terminal below, then list the contents of the `scraper` directory with `ls scraper` to see the same files generated.
- `scrapy.cfg`: This file serves as the configuration hub for our Scrapy project. It contains settings about the project structure, including global configurations such as the location of `settings.py` and the project's default name.
- `scraper/`: The main project directory.
- `items.py`: Defines the data structures (items) we want to scrape and extract from websites. These items could represent information such as titles, descriptions, prices, and more (a sketch of a possible item and pipeline follows this list).
- `middlewares.py`: Implements custom middleware components that attach to each request and response during the scraping process.
- `pipelines.py`: Sets up data processing pipelines to handle scraped items (e.g., clean, validate, and store the scraped data, such as saving it to a database or exporting it to a file).
- `settings.py`: The central hub for configuring our project. It allows us to customize Scrapy's behavior, set download delays, manage user agents, configure pipelines, etc.
- `spiders/`: The directory where we create and store our scraping scripts, which define how to scrape specific websites. We can write a separate spider for each website, and all of them share the same Scrapy configuration and settings.
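To make the roles of `items.py` and `pipelines.py` more concrete, here is a hedged sketch of what they might contain for a book-scraping project. The `BookItem` fields and the `PriceCleanupPipeline` class are illustrative assumptions, not files that `scrapy startproject` generates:

```python
# items.py -- a possible item definition (field names are illustrative)
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rate = scrapy.Field()
    image = scrapy.Field()


# pipelines.py -- a minimal pipeline that cleans each scraped item
class PriceCleanupPipeline:
    def process_item(self, item, spider):
        # Strip the currency symbol so the price can later be stored as a number
        if item.get("price"):
            item["price"] = item["price"].lstrip("£")
        return item
```

For the pipeline to actually run, it would also need to be registered under `ITEM_PIPELINES` in `settings.py`, as shown later in this lesson.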
Example
Let's dive deeper into these concepts with a practical example. We will scrape the Books to Scrape site using Scrapy. To accomplish this, we'll create a file named `books.py` and place it in the designated `spiders` directory within our project.
For reference, the `scrapy.cfg` generated at the root of the project looks like this:

```
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
```
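The spider code itself isn't reproduced in this extract, so below is a minimal sketch of what `books.py` could look like, reconstructed from the walkthrough and the log output further down. The XPath expressions and exact line layout are assumptions; the line numbers referenced in the explanation that follows refer to the lesson's version of this file:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    # Unique name used to run the spider, e.g. `scrapy crawl books`
    name = "books"

    def start_requests(self):
        # Starting URL(s); each request's response is passed to parse()
        urls = ["https://books.toscrape.com/"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each book on the page sits inside an <article class="product_pod">
        for book in response.xpath('//article[@class="product_pod"]'):
            # The star rating is encoded in the class attribute, e.g. "star-rating Three"
            rating_class = book.xpath('.//p[contains(@class, "star-rating")]/@class').get(default="")
            yield {
                "title": book.xpath(".//h3/a/@title").get(),
                "image": response.urljoin(book.xpath(".//img/@src").get()),
                "rate": rating_class.replace("star-rating", "").strip(),
                "price": book.xpath('.//p[@class="price_color"]/text()').get(),
            }
```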
The structure of the scraping code remains consistent with what we've previously explored using XPath. However, the key distinction lies in how the script is organized and presented.
- Line 3: In Scrapy, every spider is defined as a Python class. This class must inherit from the `scrapy.Spider` class, a crucial step that enables Scrapy to recognize and interact with our spider.
- Line 4: We assign our spider a unique identity through the `name` attribute. Each spider must have a distinct name; Scrapy uses it to decide which spider to run when we pass this name to the run command.
- Lines 6–8: `start_requests()` is one of the Scrapy methods we should implement in our spider. It initiates the crawling process by returning an iterable of `scrapy.Request` objects, which encapsulate the URLs from which our spider will commence its crawl. Scrapy is asynchronous by default, so function calls like this one must return iterables, enabling Scrapy to progress seamlessly from one URL to the next without blocking or waiting. `scrapy.Request()` issues an HTTP request to the specified URL; the result is then channeled to the callback function determined by the `callback` argument.
- Line 10: Here we define the callback function referenced in `start_requests()`, namely `parse()`. Scrapy calls this function by default if we don't specify a callback in `Request()`. The response object this function receives is simply the HTTP response we used to obtain with the `requests` library, and we can search within it as before using XPath or CSS selectors.
- Line 26: We yield from this function because Scrapy treats it as a generator. Each result is yielded to the terminal since we didn't define any output files.
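To actually run the spider, we pass its `name` to the `scrapy crawl` command (assuming the name is `books`, as in the sketch above). Adding `-o` would additionally export the items to a file rather than only printing them:

```
scrapy crawl books                 # print the scraped items to the terminal
scrapy crawl books -o books.json   # optionally also export them to a JSON file
```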
Expected output: We'll notice abundant text displayed in the terminal alongside the extracted content when executing the code. Scrapy's built-in logging module is set to `DEBUG` mode by default, which is why so much debug information is shown. However, this flood of data is surprisingly useful, as it allows us to track and understand the behavior of the scraping process.
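If this amount of `DEBUG` output becomes distracting, the verbosity can be dialed down through the standard `LOG_LEVEL` setting in `settings.py`; the value below is just one possible choice:

```python
# settings.py (excerpt)
LOG_LEVEL = "INFO"  # hide DEBUG messages; WARNING or ERROR would be quieter still
```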
Our focus centers on specific segments of this log, exemplified below:
```
2023-08-18 21:41:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-18 21:41:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-18 21:41:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
```
This log snippet provides valuable insights into Scrapy's current settings, including the default middleware and item pipelines. Since we haven't employed any item pipelines, they remain empty.
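If we later register a pipeline (such as the `PriceCleanupPipeline` sketched earlier), it would appear in this section of the log. Registration happens in `settings.py`; the class path below assumes the project is named `scraper`:

```python
# settings.py (excerpt)
ITEM_PIPELINES = {
    "scraper.pipelines.PriceCleanupPipeline": 300,  # lower numbers run earlier
}
```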
```
2023-08-18 21:51:10 [scrapy.core.engine] INFO: Spider opened
2023-08-18 21:51:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-18 21:51:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-18 21:51:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/> (referer: None)
2023-08-18 21:51:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/>
{'title': 'A Light in the Attic', 'image': 'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'rate': 'Three', 'price': '£51.77'}
2023-08-18 21:51:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/>
{'title': 'Tipping the Velvet', 'image': 'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'rate': 'One', 'price': '£53.74'}
```
Subsequently, we'll see the extracted items printed to the terminal in dictionary format, containing all the fields we gathered during the scraping process. Finally, Scrapy displays the closing statistics, covering various key metrics of the run:
```
2023-08-18 21:51:11 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-18 21:51:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 51555,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.402809,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 18, 21, 51, 11, 124639),
 'item_scraped_count': 20,
 'log_count/DEBUG': 22,
 'log_count/INFO': 10,
 'log_count/WARNING': 2,
 'memusage/max': 57954304,
 'memusage/startup': 57954304,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 8, 18, 21, 51, 10, 721830)}
2023-08-18 21:51:11 [scrapy.core.engine] INFO: Spider closed (finished)
```
Scrapy offers a comprehensive overview of vital statistics concerning the scraping endeavor in this phase. These statistics encompass a multitude of factors, including:
- The volume of data requested and received, in bytes.
- The number of requests initiated.
- A breakdown of request methods, like `GET`.
- The total bytes of responses received.
- The response count per HTTP status, such as `200`.
- The elapsed time of the entire process, in seconds.
- The reason for finishing the operation (usually "finished").
- The precise date and time when the operation concluded.
- The number of items successfully scraped.
- Counts of the various log levels, such as `DEBUG`, `INFO`, and `WARNING`.
- Memory usage at startup and the maximum memory usage during the run.
- The count of received responses.
- The count of requests enqueued and dequeued by the scheduler.