Scrapy Settings
Discover the settings file and its role in managing Scrapy configurations.
The final element of Scrapy that we'll delve into is the settings.py
file. This is where we fine-tune our web scraping project to our specific needs, from the user agent to middleware settings. Properly configuring the settings can significantly impact our scraper's performance, politeness, and functionality.
Populating settings in Scrapy
In Scrapy, settings can be populated from various sources, each with a specific precedence. Let's explore the mechanisms for populating settings, starting with the highest precedence.
1. Command line options
Command line options take precedence, allowing us to override any other setting. We can explicitly set a value using the -s
or --set
command line option. For instance:
scrapy crawl myspider -s LOG_FILE=scrapy.log
This command will override the LOG_FILE
setting for the specific spider.
2. Settings per spider
Individual spiders can define their settings, taking precedence over project-wide settings. As we did before, we can achieve this by utilizing the custom_settings
attribute or by implementing the update_settings()
method:
import scrapy

class Spider(scrapy.Spider):
    name = "scraper"
    custom_settings = {
        "SETTING": "value",
    }

    # or using update_settings()
    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings.set("SETTING", "value", priority="spider")
Additionally, we can modify settings dynamically in the from_crawler()
method based on spider arguments or other logic:
import scrapy

class Spider(scrapy.Spider):
    name = "scraper"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        if "USER_AGENT" in kwargs:
            # Apply the value passed with -a USER_AGENT=... as a spider-level setting
            spider.settings.set("USER_AGENT", kwargs["USER_AGENT"], priority="spider")
        return spider
And then we provide arguments using -a
command line option:
scrapy crawl scraper -a USER_AGENT="Agent"
3. Project settings module
The project settings module, typically residing in the settings.py
file, defines most custom settings for our Scrapy project. This serves as the standard configuration file. Adjustments and additions to settings should be made within this module.
The settings.py file
This is the hub for all predefined settings within our project. Scrapy relies on the parameters stored in this file to tailor its internal modules according to our preferences. Let's familiarize ourselves with some key parameters that are crucial in shaping our scraping endeavors.
BOT_NAME = "scraper"

USER_AGENT = ""

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 16

COOKIES_ENABLED = False

SPIDER_MIDDLEWARES = {
    "scraper.middlewares.ScraperSpiderMiddleware": 543,
}
Key parameters in settings.py
BOT_NAME = 'Name'
This parameter specifies the name of the bot implemented by our Scrapy project (also known as the project name). It should be unique to our project and is used for logging and, by default, to construct the User-Agent. Note that the name we pass to the scrapy crawl <name> command is the spider's name attribute, not BOT_NAME.
USER_AGENT
This is where we define the default USER_AGENT value used by every spider in our project. We can override it per spider using custom_settings, as shown in the sketch below.
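As a rough sketch, assuming an illustrative user-agent string and spider name (neither is a required value):

# settings.py: default user agent for every spider in the project
USER_AGENT = "Mozilla/5.0 (compatible; scraper/1.0)"

# A single spider can still override it through custom_settings:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"  # illustrative spider name
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (compatible; products-bot/1.0)",
    }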
ROBOTSTXT_OBEY
Another critical parameter to consider when scraping. It controls whether we obey the rules a site specifies in its robots.txt file. robots.txt tells crawlers which URLs they may access on the site; if it disallows a URL and ROBOTSTXT_OBEY is True, Scrapy will skip that request, as illustrated in the sketch below.
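A minimal sketch of how this plays out; the /private/ rule is a made-up example of a robots.txt directive:

# settings.py
# With ROBOTSTXT_OBEY = True, Scrapy's RobotsTxtMiddleware fetches the
# site's robots.txt first. If it contains, for example:
#     User-agent: *
#     Disallow: /private/
# then any request to a /private/... URL is filtered out before download.
ROBOTSTXT_OBEY = True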
DOWNLOAD_DELAY
We can introduce a download delay to avoid overloading the server and getting blocked. This is the number of seconds Scrapy should wait between consecutive requests to the same website.
The higher this number, the longer the crawl takes, but it is also safer; we should always avoid overloading any website with crawling requests. A sketch follows.
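For instance, a sketch of a polite configuration (the 3-second value is just an example):

# settings.py
# Wait about 3 seconds between consecutive requests to the same website.
DOWNLOAD_DELAY = 3
# Enabled by default: the actual wait is randomized between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY so the crawl looks less robotic.
RANDOMIZE_DOWNLOAD_DELAY = True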
CONCURRENT_REQUESTS
This controls the maximum number of concurrent requests Scrapy can run. The default is 16, meaning Scrapy performs at most 16 HTTP requests in total at any moment.
Concurrency can be tuned further per site by defining CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, as in the sketch below.
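A sketch showing how the three limits relate; the values shown are Scrapy's defaults:

# settings.py
# Global cap on simultaneous requests across the whole crawl.
CONCURRENT_REQUESTS = 16
# Cap per domain; whichever limit is hit first applies.
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Cap per IP; when set to a non-zero value it replaces the per-domain cap.
CONCURRENT_REQUESTS_PER_IP = 0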
DOWNLOADER_MIDDLEWARES
This is where we define custom downloader middleware, as we previously did in custom_settings. The same applies to SPIDER_MIDDLEWARES and ITEM_PIPELINES; a sketch follows.
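A sketch of how these dictionaries look in settings.py; the middleware and pipeline classes named here are hypothetical placeholders for our own components:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers sit closer to the engine, higher numbers closer to the downloader.
    "scraper.middlewares.ScraperDownloaderMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "scraper.middlewares.ScraperSpiderMiddleware": 543,
}

ITEM_PIPELINES = {
    # Pipelines run in ascending order of their value (0-1000).
    "scraper.pipelines.ScraperPipeline": 300,
}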
Let's apply some of these parameters to an example that controls cookies; a minimal sketch follows.
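This is a minimal sketch of cookie-related settings; COOKIES_DEBUG is optional and shown only for illustration:

# settings.py
# Disable the cookies middleware entirely: no cookies are sent or stored.
COOKIES_ENABLED = False
# When cookies are enabled, setting this to True logs every cookie
# sent in requests and received in responses.
COOKIES_DEBUG = False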
Separately, the scrapy startproject command also generates a scrapy.cfg file at the project root. It points Scrapy to the settings module and holds the deployment configuration:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
Conclusion
In summary, we've explored the key parameters in settings.py
. While we've covered the crucial aspects, it's important to note that numerous other parameters influence output, logging, and additional modules; these can be explored further in the Scrapy documentation.
With a comprehensive understanding of all Scrapy modules, we are now well equipped to undertake complete scraping projects and leverage any element to suit our needs.