Scrapy Core Modules
Delve into Scrapy's foundational concepts, which include essential elements like Spiders, Requests, Responses, and the LinkExtractor class.
Now that we've learned about Scrapy, let's dive into its core modules in more detail.
The Spider Class
The scrapy.Spider class is the heart of any Scrapy project. It defines how to crawl a website and extract information from it. Let's delve into some of the key parameters we can use within this class to fine-tune our web scraping process.
name
It uniquely identifies our spider, which is crucial when running multiple spiders within a single project. This name differentiates the output files, logs, and other spider-related resources. We should choose a descriptive and meaningful name for our spider.
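For context, here is a minimal sketch of a spider that sets name; the target site, spider name, and selectors are illustrative assumptions, not the lesson's exact example:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    # Unique identifier for this spider, used e.g. when running `scrapy crawl books`.
    name = "books"

    # URLs where crawling starts (assumed site for illustration).
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Extract each book title from the listing page.
        for title in response.css("article.product_pod h3 a::attr(title)"):
            yield {"title": title.get()}
```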
allowed_domains
The allowed_domains parameter is a list of domains that our spider is allowed to crawl. Any links outside these domains will be ignored. This handy feature ensures our spider stays focused on the relevant content.
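For example, a spider that follows links but stays on a single domain might look like the sketch below; the domain and spider name are assumptions for illustration:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    # Requests to URLs outside these domains are dropped by Scrapy's offsite filtering.
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        yield {"url": response.url}
        # Follow every link found on the page; off-domain links are ignored.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```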
start_urls
This parameter is a list of URLs where the spider begins crawling. start_urls serves as a shortcut for the start_requests() method: if this parameter is defined and we don't define start_requests() ourselves, Scrapy implements that method for us internally using this list of URLs.
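In other words, defining start_urls is roughly equivalent to writing a start_requests() method like the one in this sketch (the URL is an assumption):

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    # Roughly what Scrapy does for us when only start_urls is defined.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```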
custom_settings
Since Scrapy is designed to run multiple spiders, it lets us override the project's default settings for each spider. This is done through the custom_settings parameter.
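As a sketch, per-spider overrides might look like this; the specific settings and values are illustrative assumptions:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    # These values override the project-wide settings, but only for this spider.
    custom_settings = {
        "DOWNLOAD_DELAY": 1,        # wait one second between requests
        "CONCURRENT_REQUESTS": 4,   # limit parallelism for this spider
    }

    def parse(self, response):
        yield {"url": response.url}
```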
logger
This is a Python logger that Scrapy creates using the spider's name. We can access it anytime through self.logger and log any custom messages we want.
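A quick sketch of logging from inside a spider callback (site and message are assumptions):

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Log messages are tagged with the spider's name ("books") in the output.
        self.logger.info("Parsing page: %s", response.url)
        yield {"url": response.url}
```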
Let's implement these parameters with the same books spider example we built before.
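The lesson's own implementation isn't reproduced here, but a combined sketch using all of these parameters might look like the following; the target site, selectors, and settings values are assumptions for illustration.

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    # Spider-specific overrides of the project settings.
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "FEEDS": {"books.json": {"format": "json", "overwrite": True}},
    }

    def parse(self, response):
        self.logger.info("Scraping %s", response.url)
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # Follow pagination links, staying within the allowed domain.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```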