Scrapy Settings

The final element of Scrapy that we'll explore is the settings.py file. This is where we fine-tune our web scraping project to our specific needs, from the user agent to middleware configuration. Properly configuring these settings can significantly impact our scraper's performance, politeness, and functionality.

Populating settings in Scrapy

In Scrapy, settings can be populated from various sources, each with a specific precedence. Let's explore the mechanisms for populating settings, starting with the highest precedence.

1. Command line options

Command line options take precedence, allowing us to override any other setting. We can explicitly set a value using the -s or --set command line option. For instance:

scrapy crawl myspider -s LOG_FILE=scrapy.log

This command overrides the LOG_FILE setting for this particular run.
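The -s option can be repeated to override several settings in one run. As a quick sketch, a test run that slows the crawl down and lowers concurrency (the values here are only illustrative) could look like this:

scrapy crawl myspider -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS=8

Values set this way apply only to that run and take precedence over both settings.py and custom_settings.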

2. Settings per spider

Individual spiders can define their own settings, which take precedence over project-wide settings. As before, we can achieve this by using the custom_settings attribute or by implementing the update_settings() method:

import scrapy

class Spider(scrapy.Spider):
    name = "scraper"
    custom_settings = {
        "SETTING": "value",
    }

    # or using update_settings()
    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings.set("SETTING", "value", priority="spider")
Updating custom spider settings using custom_settings or update_settings() method

Additionally, we can modify settings dynamically in the from_crawler() method based on spider arguments or other logic:

import scrapy

class Spider(scrapy.Spider):
    name = "scraper"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # If a USER_AGENT argument was passed on the command line, apply it as a spider-level setting
        if "USER_AGENT" in kwargs:
            spider.settings.set("USER_AGENT", kwargs["USER_AGENT"], priority="spider")
        return spider
Dynamically updating spider settings using from_crawler() method

We then provide arguments using the -a command-line option:

scrapy crawl scraper -a USER_AGENT="Agent"

3. Project settings module

The project settings module, typically residing in the settings.py file, defines most custom settings for our Scrapy project. This serves as the standard configuration file. Adjustments and additions to settings should be made within this module.

The settings.py file

This is the hub for all predefined settings within our project. Scrapy relies on the parameters stored in this file to tailor its internal modules according to our preferences. Let's familiarize ourselves with some key parameters that are crucial in shaping our scraping endeavors.

BOT_NAME = "scraper"

USER_AGENT = ""

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 16

COOKIES_ENABLED = False

SPIDER_MIDDLEWARES = {
    "scraper.middlewares.ScraperSpiderMiddleware": 543,
}

Key parameters in settings.py

  1. BOT_NAME

    1. This parameter specifies the name of the bot implemented by our project. It should be unique to our project.

    2. Scrapy uses this name internally, for example when logging. Note that the name we pass to the running command scrapy crawl name comes from the spider's name attribute, not from BOT_NAME.

  2. USER_AGENT

    1. This is where we define the default USER_AGENT value used by every spider in our project.

    2. We can override it per spider using custom_settings.

  3. ROBOTSTXT_OBEY

    1. Another critical parameter to consider when scraping. It controls whether Scrapy obeys the rules specified in the site's robots.txt file.

    2. robots.txt tells crawlers which URLs they may access on the site.

    3. If the site disallows crawling of a URL, Scrapy will skip it.

  4. DOWNLOAD_DELAY

    1. We can introduce a download delay to avoid overloading the server and getting blocked. This is the number of seconds Scrapy waits between consecutive requests to the same website.

    2. The higher this value, the longer the crawl takes, but the safer it is; we should always avoid overloading a website with crawling requests.

  5. CONCURRENT_REQUESTS

    1. This controls the maximum number of concurrent requests Scrapy performs.

    2. The default is 16, meaning Scrapy keeps at most 16 HTTP requests in flight at any moment, across all domains.

    3. This can be tuned further per domain or per IP with CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP.

  6. DOWNLOADER_MIDDLEWARES

    1. This is where we register custom downloader middlewares project-wide, just as we previously did in custom_settings.

    2. The same applies to SPIDER_MIDDLEWARES and ITEM_PIPELINES (see the sketch after this list).
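For illustration, here is a minimal sketch of how these components might be registered in settings.py. The class names ScraperDownloaderMiddleware and ScraperPipeline are assumptions standing in for whatever classes our project actually defines; the integer values are ordering priorities.

DOWNLOADER_MIDDLEWARES = {
    # Lower numbers run closer to the engine, higher numbers closer to the downloader
    "scraper.middlewares.ScraperDownloaderMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "scraper.middlewares.ScraperSpiderMiddleware": 543,
}

ITEM_PIPELINES = {
    # Pipelines run in ascending order of their value (customarily 0-1000)
    "scraper.pipelines.ScraperPipeline": 300,
}

Registering middlewares and item pipelines in settings.py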

Let's now apply some of these parameters to our cookies example.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
Scraping quotes by utilizing the settings.py file
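The file shown above is the project's auto-generated scrapy.cfg; the cookie-related tuning itself lives in settings.py. A minimal sketch of what that might look like, assuming we want to crawl politely with cookies disabled, is shown below (the user agent string and exact values are illustrative assumptions):

BOT_NAME = "scraper"

# Identify ourselves and respect robots.txt
USER_AGENT = "scraper (+https://example.com)"
ROBOTSTXT_OBEY = True

# Slow down and limit concurrency to avoid overloading the site
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Disable cookies so every request looks like a fresh session
COOKIES_ENABLED = False

Setting COOKIES_ENABLED = False disables Scrapy's cookies middleware entirely: no cookies are sent, and cookies received from the server are ignored.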

Conclusion

In summary, we've explored the essential key points and parameters in settings.py. While we've covered crucial aspects, it's important to note that numerous other parameters influence outputs, logging, and additional modules, which can be further explored in the Scrapy documentation.

With a comprehensive understanding of all Scrapy modules, we are now well equipped to undertake complete scraping projects and leverage any element to suit our needs.
