Scrapy Core Modules

Delve into Scrapy's foundational concepts, which encompass essential elements like the Spider, Request, Response, and LinkExtractor classes.

Now that we've learned about Scrapy, let's dive into more detail about its core modules.

The Spider Class

The scrapy.Spider class is the heart of any Scrapy project. It defines how to crawl and extract information from a website. Let's delve into some of the critical parameters that can be utilized within this class to fine-tune our web scraping process.

  • name

    • It uniquely identifies our spider, which is crucial when running multiple spiders within a single project. This name differentiates the output files, logs, and other spider-related resources. We should choose a descriptive and meaningful name for our spider.

  • allowed_domains

    • The allowed_domains parameter is a list of domains that our spider is allowed to crawl. Any links outside these domains will be ignored. This handy feature ensures our spider stays focused on the relevant content.

  • start_urls

    • This parameter is a list of URLs where the spider begins crawling.

    • start_urls serves as a shortcut for the start_requests() method. If this attribute is defined and we haven't defined start_requests() ourselves, Scrapy will internally generate the initial requests from this list of URLs.

  • custom_settings

    • Since Scrapy is designed to run multiple spiders, it gives us the capability to modify the project's default settings for each spider. This is achieved through the utilization of the custom_settings parameter.

  • logger

    • This is a Python logger created with the spider's name. We can access it anytime using self.logger to log any custom messages we want.

Let’s implement these parameters with the same books spider example we built before.
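Here is a minimal sketch of such a spider. It assumes the books site from the earlier lesson is books.toscrape.com, and the CSS selectors are illustrative; adjust both to your actual target.

import scrapy


class BooksSpider(scrapy.Spider):
    # Uniquely identifies this spider within the project
    name = 'books'
    # Links outside this domain are ignored
    allowed_domains = ['books.toscrape.com']
    # Scrapy builds the initial requests from this list
    start_urls = ['https://books.toscrape.com/']
    # Per-spider overrides of the project settings; raising the
    # log level to WARNING hides the scraped-item debug output
    custom_settings = {
        'LOG_LEVEL': 'WARNING',
    }

    def parse(self, response):
        # Log a custom message through the spider's built-in logger
        self.logger.warning('Scraping %s', response.url)
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
            }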

Extracting books data using Spider class

We can see in the logs that Scrapy identified the overridden settings, and since we changed the log level to WARNING, it will no longer show the scraped items. We can explore other parameters Scrapy provides by visiting the official documentation.

The Request Class

In Scrapy, the Request class is a fundamental component that lets us make HTTP requests for web pages. It provides a wide range of parameters that we can customize to control how our requests are made and how the responses are handled. Let's explore some key parameters we can use with the class.

  • url

    • The URL parameter specifies the address of the web page we want to scrape.

  • callback

    • The callback parameter determines which function will be called to process the response once it's received.

  • cb_kwargs

    • A dictionary of parameters that will be passed as keyword arguments to the Request’s callback.

  • method

    • This determines the HTTP method Scrapy will use for the request; it can be GET, POST, PUT, etc.

  • body

    • When using the POST method, we can pass data to the server using the body parameter.

    • Example: scrapy.Request(url=url, method='POST', body=json.dumps({'username': 'my_username'}), callback=self.parse). Note that body must be a string or bytes, so a dictionary has to be serialized first (e.g., with json.dumps).

  • headers

    • Scrapy allows us to customize headers using the headers parameter.

  • meta

    • The meta parameter is a dictionary where we can store additional information; it becomes available in the response callback via response.meta.

  • dont_filter

    • Scrapy filters duplicate requests by default to prevent infinite loops. If the dont_filter parameter is set to True, this request will not be filtered.

An example that uses these parameters is shown below with a quotes spider that scrapes the quotes site.
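The sketch below is one possible implementation, assuming the quotes site at quotes.toscrape.com; the header value and the cb_kwargs/meta payloads are purely illustrative.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        # Build the first request by hand to show the Request parameters
        yield scrapy.Request(
            url='https://quotes.toscrape.com/',
            method='GET',
            headers={'User-Agent': 'my-scrapy-bot'},  # custom request headers
            cb_kwargs={'page_number': 1},             # keyword arguments for the callback
            meta={'source': 'start_requests'},        # extra data, read back via response.meta
            dont_filter=True,                         # bypass the duplicate-request filter
            callback=self.parse,
        )

    def parse(self, response, page_number):
        # cb_kwargs entries arrive as keyword arguments; meta via response.meta
        self.logger.info('Page %d (source: %s)', page_number, response.meta['source'])
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }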

Extracting quotes using Request class

Many other parameters can be useful in different scenarios and can be explored in the official documentation.

The Response Class

The Response class holds valuable information about the server's response to a request made by a spider. This information includes both the content of the response and various metadata. Let's delve into some of the essential parameters of the Response class:

  • response.url

    • The URL from which the response was obtained.

  • response.status

    • The HTTP status code of the response (e.g., 200 for success, 404 for not found).

  • response.headers

    • The headers sent by the server in the response.

  • response.body

    • The raw content of the response as bytes (for a web page, this is the raw HTML).

  • response.text

    • The decoded text content of the response.

  • response.xpath()

    • A method to extract data using XPath expressions.

    • Example: response.xpath('//h1/text()').get() extracts the text from the first <h1> element.

  • response.css()

    • A method to extract data using CSS selectors.

    • Example: response.css('p::text').get() retrieves the text from the first <p> element; use .getall() to retrieve the text from all of them.

  • response.follow()

    • A method to initiate a new request based on a link in the current response.

    • Example: response.follow(href, callback) follows a link and calls the provided callback.

  • response.follow_all()

    • The same as the follow() method, but it handles multiple links at once, saving us from writing a for loop.

  • response.meta

    • The dictionary we passed using request.meta.

Try it yourself

Let’s test our knowledge and apply these parameters by scraping the Quotes site and utilizing the Response class methods.
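As a starting point, here is a minimal sketch of a spider exercising these attributes and methods on quotes.toscrape.com; the selectors are assumptions based on that site's markup.

import scrapy


class QuotesResponseSpider(scrapy.Spider):
    name = 'quotes_response'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Metadata about the response itself
        self.logger.info('Fetched %s with status %d', response.url, response.status)
        # response.body is raw bytes; response.text is the decoded string
        self.logger.info('Received %d bytes', len(response.body))

        # Extract data with XPath and CSS selectors
        page_title = response.xpath('//title/text()').get()
        first_quote = response.css('span.text::text').get()
        self.logger.info('Title: %s | First quote: %s', page_title, first_quote)

        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

        # follow_all() queues one new request per matching link
        yield from response.follow_all(css='li.next a', callback=self.parse)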

Extracting quotes using Response class

The LinkExtractor class

Scrapy's LinkExtractor class is a powerful tool for extracting links from web pages. It's often used in conjunction with spiders to navigate websites and scrape information. Let's look at some of the key parameters of the LinkExtractor class and their functionalities:

  • allow(str or List[str])

    • This parameter specifies a regex pattern or a list of patterns that URLs must match to be extracted.

    • Example: allow=['/books/'] will extract only links whose URLs contain /books/.

  • deny(str or List[str])

    • Works inversely to allow, preventing URLs matching the specified pattern(s) from being extracted.

    • Example: deny=['/private/'] will exclude links with URLs containing /private/.

  • allow_domains(str or List[str])

    • Extracts links only from the specified domains.

    • Example: allow_domains=['quotes.toscrape.com'] limits extraction to links from the given domain.

  • deny_domains(str or List[str])

    • Prevents links from specified domains from being extracted.

    • Example: deny_domains=['example.com'] avoids extracting links from the example.com domain.

  • restrict_xpaths(str or List[str])

    • Extracts links only from elements matching the provided XPath expression(s).

    • Example: restrict_xpaths=['//div[@class="article"]'] restricts links to those within the specified div element.

  • restrict_css(str or List[str])

    • Similar to restrict_xpaths, but based on CSS selectors.

    • Example: restrict_css=['.main-content'] narrows down links to those within elements with the main-content class.

  • tags(str or List[str])

    • Specifies HTML tag(s) to consider when extracting links.

    • Example: tags=['a', 'img'] will consider both anchor (<a>) and image (<img>) tags when extracting links (note that attrs must also include 'src' for image URLs to be picked up).

  • attrs(str or List[str])

    • Specifies which tag attribute(s) to read link URLs from when scanning the specified tags. It defaults to href.

    • Example: attrs=['href', 'src'] extracts URLs from both the href and src attributes.

  • process_value(Callable)

    • Allows a function to be applied to each extracted link, modifying its value.

    • Example: process_value=lambda x: x.replace(' ', '-') replaces spaces with dashes in the extracted links.

An example of a spider using LinkExtractor parameters would be as follows:
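The following is a minimal sketch of such a spider, assuming the quotes site and a CrawlSpider rule; the selectors in parse_tag() are illustrative.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TagsSpider(CrawlSpider):
    name = 'tags'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    rules = (
        # Follow only links whose URLs contain /tag/, skip any with /login,
        # and hand every matching page to parse_tag()
        Rule(
            LinkExtractor(allow=[r'/tag/'], deny=[r'/login']),
            callback='parse_tag',
            follow=True,
        ),
    )

    def parse_tag(self, response):
        for quote in response.css('div.quote'):
            yield {
                'tag_page': response.url,
                'text': quote.css('span.text::text').get(),
            }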

Extracting quotes using LinkExtractor class

In this example, the spider extracts links containing /tag/ and excludes those with /login. The extracted links are then used to navigate to tag pages for further scraping. The allow_domains, deny_domains, restrict_xpaths, and other parameters can also be integrated as needed based on the scraping requirements.
