Scrapy Cores
Delve into Scrapy's foundational concepts, which encompass essential elements like the Spider, Request, Response, and LinkExtractor classes.
Now that we've learned about Scrapy, let's dive into more detail about its core modules.
The Spider Class
The scrapy.Spider class is the heart of any Scrapy project. It defines how to crawl and extract information from a website. Let's delve into some of the critical parameters that can be utilized within this class to fine-tune our web scraping process.
name
It uniquely identifies our spider, which is crucial when running multiple spiders within a single project. This name differentiates the output files, logs, and other spider-related resources. We should choose a descriptive and meaningful name for our spider.
allowed_domains
The allowed_domains parameter is a list of domains that our spider is allowed to crawl. Any links outside these domains will be ignored. This handy feature ensures our spider stays focused on the relevant content.
start_urls
This parameter is a list of URLs where the spider begins crawling. start_urls serves as a shortcut for the start_requests() method. If this parameter is defined and we didn't define the start_requests() function, Scrapy will internally generate the initial requests for us from this list of URLs.
custom_settings
Since Scrapy is designed to run multiple spiders, it gives us the capability to modify the project's default settings for each spider. This is achieved through the custom_settings parameter.
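For example, a spider could lower its own log verbosity without changing the project-wide settings (the values below are just an illustration):

custom_settings = {
    "LOG_LEVEL": "WARNING",  # show only warnings and errors for this spider
    "DOWNLOAD_DELAY": 1,     # wait one second between requests
}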
logger
This is a Python logger created with the spider's name. We can access it anytime using self.logger and log any custom messages we want.
Let's implement these parameters with the same books spider example we built before.
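A minimal sketch of what that spider might look like is shown below. The start URL and CSS selectors assume the books.toscrape.com demo site used in the earlier example, so adjust them to match the actual project:

import scrapy


class BooksSpider(scrapy.Spider):
    # Unique name used to run the spider, e.g., `scrapy crawl books`
    name = "books"

    # Links outside these domains are ignored during the crawl
    allowed_domains = ["books.toscrape.com"]

    # Scrapy builds the initial requests from this list because we
    # don't define start_requests() ourselves
    start_urls = ["https://books.toscrape.com/"]

    # Per-spider overrides of the project's default settings
    custom_settings = {"LOG_LEVEL": "WARNING"}

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # A custom message logged through the spider's own logger
        self.logger.warning("Finished parsing %s", response.url)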
We can see in the logs that Scrapy identified the overridden settings, and since we changed the log level from DEBUG to WARNING, it will no longer show the scraped items. We can explore other parameters Scrapy provides by visiting the official documentation.
The Request Class
In Scrapy, the Request class is a fundamental component that allows us to request HTTP web pages. It provides a wide range of parameters that we can customize to control how our requests are made and how the responses are handled. Let's explore some key parameters we can use with the class.
url
The URL parameter specifies the address of the web page we want to scrape.
callback
The callback parameter determines which function will be called to process the response once it's received.
cb_kwargs
A dictionary of parameters that will be passed as keyword arguments to the Request’s callback.
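Example (assuming a hypothetical parse_page callback defined on the spider): scrapy.Request(url=url, callback=self.parse_page, cb_kwargs={'page_number': 1}) passes page_number to parse_page as a keyword argument.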
method
This determines the type of request Scrapy will make to the URL; it can be GET, POST, PUT, etc.
body
When using the POST method, we can pass data to the server using the body parameter. Note that the body must be a string or bytes, so structured data is usually serialized first (e.g., with json.dumps).
Example: scrapy.Request(url=url, method='POST', body=json.dumps({'username': 'my_username'}), callback=self.parse)
headers
Scrapy allows us to customize headers using the headers parameter.
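Example (the user-agent string is just a placeholder): scrapy.Request(url=url, headers={'User-Agent': 'my-custom-agent'}, callback=self.parse) sends the request with a custom User-Agent header.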
meta
The meta parameter is a dictionary where we can store additional information that is available in the response callback.
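Example (the key name is illustrative): scrapy.Request(url=url, meta={'page_number': 1}, callback=self.parse) stores a value that the callback can later read as response.meta['page_number'].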
dont_filter
Scrapy, by default, filters duplicate requests to prevent going into an infinite loop. If the dont_filter parameter is set to True, it indicates that this request should not be filtered.
An example of these parameters in use is shown below with a quotes spider that scrapes the quotes site.
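One way that spider could look, assuming the quotes.toscrape.com demo site and its standard markup, is sketched below:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]

    def start_requests(self):
        # Build the first request ourselves so we can set Request parameters explicitly
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            callback=self.parse,
            headers={"User-Agent": "my-scrapy-bot"},  # custom request header
            meta={"page_number": 1},                  # extra data for the callback
            dont_filter=True,                         # never drop this request as a duplicate
        )

    def parse(self, response):
        page_number = response.meta["page_number"]
        for quote in response.css("div.quote"):
            yield {
                "page": page_number,
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Queue the next page, carrying the incremented page number in meta
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield scrapy.Request(
                url=response.urljoin(next_page),
                callback=self.parse,
                meta={"page_number": page_number + 1},
            )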
Many other parameters can be useful in different scenarios and can be explored in the official documentation.
The Response Class
The Response class holds valuable information about the server's response to a request made by a spider. This information includes both the content of the response and various metadata. Let's delve into some of the essential parameters of the Response class:
response.url
The URL from which the response was obtained.
response.status
The HTTP status code of the response (e.g., 200 for success, 404 for not found).
response.headers
The headers sent by the server in the response.
response.body
The raw content of the response as bytes; for a web page, this is the raw HTML.
response.text
The decoded text content of the response.
response.xpath()
A method to extract data using XPath expressions.
Example: response.xpath('//h1/text()').get() extracts the text from the first <h1> element.
response.css()
A method to extract data using CSS selectors.
Example: response.css('p::text').getall() retrieves the text from all <p> elements.
response.follow()
A method to initiate a new request based on a link in the current response.
Example: response.follow(href, callback) follows a link and calls the provided callback.
response.follow_all()
The same as the follow() function, but it handles multiple links at the same time rather than requiring a for loop in the code.
response.meta
The dictionary we passed using request.meta.
Try it yourself
Let's test our knowledge and apply these parameters by scraping the Quotes site and utilizing the Response class functions.
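A possible sketch, again assuming the quotes.toscrape.com demo site, that exercises several of the Response attributes and methods described above:

import scrapy


class QuotesResponseSpider(scrapy.Spider):
    name = "quotes_response"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Metadata about the response
        self.logger.info("Fetched %s with status %s", response.url, response.status)
        self.logger.info("Content-Type: %s", response.headers.get("Content-Type"))

        # Extract data with XPath and CSS selectors
        page_title = response.xpath("//title/text()").get()
        quotes = response.css("span.text::text").getall()
        yield {"title": page_title, "quotes": quotes}

        # Follow all tag links on the page with a single call
        yield from response.follow_all(css="a.tag", callback=self.parse_tag)

    def parse_tag(self, response):
        yield {"tag_url": response.url, "quote_count": len(response.css("div.quote"))}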
The LinkExtractor class
Scrapy's LinkExtractor class is a powerful tool for extracting links from web pages. It's often used in conjunction with spiders to navigate websites and scrape information. Let's look at some of the key parameters of the LinkExtractor class and their functionalities:
allow (str or List[str])
This parameter specifies a regex pattern or a list of patterns that URLs must match to be extracted.
Example: allow=['/books/'] will extract links only with URLs containing /books/.
deny (str or List[str])
Works inversely to allow, preventing URLs matching the specified pattern(s) from being extracted.
Example: deny=['/private/'] will exclude links with URLs containing /private/.
allow_domains (str or List[str])
Extracts links only from the specified domains.
Example: allow_domains=['quotes.toscrape.com'] limits extraction to links from the given domain.
deny_domains (str or List[str])
Prevents links from specified domains from being extracted.
Example: deny_domains=['example.com'] avoids extracting links from the example.com domain.
restrict_xpaths (str or List[str])
Extracts links only from elements matching the provided XPath expression(s).
Example: restrict_xpaths=['//div[@class="article"]'] restricts links to those within the specified div element.
restrict_css (str or List[str])
Similar to restrict_xpaths, but based on CSS selectors.
Example: restrict_css=['.main-content'] narrows down links to those within elements with the main-content class.
tags (str or List[str])
Specifies HTML tag(s) to consider when extracting links.
Example: tags=['a', 'img'] will extract links from both anchor (<a>) and image (<img>) tags.
attrs (str or List[str])
Defines which tag attributes should be searched for links (the default is href).
Example: attrs=['href', 'src'] extracts links from both the href and src attributes.
process_value (Callable)
Allows a function to be applied to each extracted link, modifying its value.
Example: process_value=lambda x: x.replace(' ', '-') replaces spaces with dashes in the extracted links.
An example of a spider using LinkExtractor parameters would be as follows:
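A sketch of such a spider, assuming a CrawlSpider crawling the quotes.toscrape.com demo site, might be:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TagsSpider(CrawlSpider):
    name = "tags"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    rules = (
        # Follow only links whose URLs contain /tag/ and skip anything containing /login
        Rule(
            LinkExtractor(allow=["/tag/"], deny=["/login"]),
            callback="parse_tag_page",
            follow=True,
        ),
    )

    def parse_tag_page(self, response):
        for quote in response.css("div.quote"):
            yield {
                "tag_page": response.url,
                "text": quote.css("span.text::text").get(),
            }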
In this example, the spider extracts links containing /tag/ and excludes those with /login. The extracted links are then used to navigate to tag pages for further scraping. The allow_domains, deny_domains, restrict_xpaths, and other parameters can also be integrated as needed based on the scraping requirements.