Validation with JSON Schema
Learn about JSON Schema and how to validate the scraped data.
Up to this point, we have been scraping data from various websites without verifying its accuracy. Data is not always complete and error-free; there will often be missing fields or incorrect values. Given that we are developing an automated script to scrape millions of records, it is crucial to implement a validation mechanism to ensure the quality of the scraped data.
The JSON Schema library in Python
jsonschema is a third-party Python library that implements the JSON Schema specification. It allows us to check whether our JSON data is structured correctly. Since we often use dictionaries to organize data in our web scraping scripts, jsonschema can come in handy to ensure our data is in the correct format before we do anything else.
It can be used alongside other scraping libraries such as Selenium or Beautiful Soup, and Scrapy's item pipeline is a particularly good fit. With the item pipeline, we can validate each scraped item before moving on to further processing.
Installation
We can install the jsonschema library in any Python environment by running the following command:
pip install jsonschema
Syntax
JSON Schema uses a JSON-based syntax to define the structure and constraints of JSON data. For instance, if we are scraping product information, our schema might define properties like name, price, and description, along with their respective data types.
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "price": { "type": "number" },
    "description": { "type": "string" }
  },
  "required": ["name", "price"]
}
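As a minimal sketch of how this product schema could be applied in Python, the snippet below wraps jsonschema.validate() in a small helper. The helper name is_valid_product and the sample records are hypothetical and only serve as illustration.

```python
import jsonschema

# The product schema from the text: name and price are required,
# description is optional.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
    "required": ["name", "price"],
}


def is_valid_product(item):
    """Return True if item matches product_schema, False otherwise."""
    try:
        jsonschema.validate(instance=item, schema=product_schema)
        return True
    except jsonschema.ValidationError:
        return False


# Hypothetical scraped records (illustrative values only)
complete_item = {"name": "Desk lamp", "price": 19.99}
missing_price = {"name": "Desk lamp", "description": "Warm white LED"}

print(is_valid_product(complete_item))  # True
print(is_valid_product(missing_price))  # False: "price" is required
```

Returning a boolean keeps the calling code simple. When the reason for a failure matters, the caught exception can be inspected instead.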
Essential components
The essential components of JSON Schema syntax include:
Type constraints: We can specify the expected data type using the type keyword. JSON Schema supports several data types, including:
    string: A string
    number: A numeric value (integer or floating-point)
    integer: A whole number (no decimals)
    boolean: A true or false value
    object: A JSON object
    array: An array of values
Properties: We can describe the structure of JSON objects using the properties keyword. Each property is defined as a key-value pair, where the key is the property name, and the value is a schema that describes the property's constraints.
Required properties: We use the required keyword to specify an array of property names that must be present in the JSON object.
Additional properties: The additionalProperties keyword allows us to specify whether additional properties not explicitly defined in the schema are allowed (true) or not (false).
Array constraints: For arrays or lists, we can use keywords like items to specify the schema that each item must match, minItems to set the minimum number of items, and maxItems to set the maximum number of items.
Validation keywords: JSON Schema provides various validation keywords, such as minimum, maximum, maxLength, minLength, pattern (for regular expressions), and enum (to specify a list of allowed values).
Combining schemas: We can combine multiple schemas using keywords like allOf (all conditions must be met), anyOf (at least one condition must be met), and oneOf (exactly one condition must be met).
References: JSON Schema allows us to reference other parts of the schema using the $ref keyword, enabling schema reuse and modularization.
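To see several of these keywords working together, here is a small sketch. The book_schema fields (title, rating, availability, price) are hypothetical and not taken from the lesson's project. The sketch uses Draft7Validator.iter_errors() rather than validate() so that every violation is reported instead of only the first.

```python
from jsonschema import Draft7Validator

# Hypothetical schema for a scraped book record (illustration only).
book_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "rating": {"type": "integer", "minimum": 1, "maximum": 5},
        "availability": {"enum": ["in stock", "out of stock"]},
        # anyOf: the price is either a non-negative number or null
        "price": {"anyOf": [{"type": "number", "minimum": 0}, {"type": "null"}]},
    },
    "required": ["title", "price"],
    "additionalProperties": False,
}

# Raises jsonschema.SchemaError if the schema itself is malformed.
Draft7Validator.check_schema(book_schema)
validator = Draft7Validator(book_schema)


def error_messages(item):
    """Collect the message of every schema violation found in item."""
    return [error.message for error in validator.iter_errors(item)]


good = {"title": "Dune", "rating": 5, "availability": "in stock", "price": 9.5}
bad = {"title": "", "rating": 7, "availability": "unknown", "price": None, "isbn": "123"}

print(len(error_messages(good)))  # 0
print(len(error_messages(bad)))   # 4: minLength, maximum, enum, additionalProperties
for message in error_messages(bad):
    print("-", message)
```

The exact wording of the messages depends on the installed jsonschema version, so the sketch counts violations rather than matching message text.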
Examples
Let's provide more detailed examples of various JSON Schema structures:
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" },
    "isStudent": { "type": "boolean" }
  },
  "required": ["name", "age"]
}
In this schema, we define an object with properties name, age, and isStudent. name must be a string, age must be an integer, isStudent is a boolean, and both name and age are required properties.
{
  "type": "array",
  "items": { "type": "string" },
  "minItems": 1,
  "uniqueItems": true
}
This schema defines an array of strings with a minimum of one item, and all items must be unique.
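The array constraints can be checked quickly with a validator's is_valid() method, as in the sketch below. The name tags_schema and the sample lists are made up for illustration.

```python
from jsonschema import Draft7Validator

# The array schema from the text, written as a Python dictionary.
tags_schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "uniqueItems": True,
}
validator = Draft7Validator(tags_schema)

print(validator.is_valid(["fiction", "classic"]))  # True
print(validator.is_valid([]))                      # False: fewer than minItems
print(validator.is_valid(["fiction", "fiction"]))  # False: duplicate items
print(validator.is_valid(["fiction", 42]))         # False: 42 is not a string
```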
{
  "type": "object",
  "properties": {
    "price": { "type": "number", "minimum": 0 },
    "email": { "type": "string", "format": "email" }
  }
}
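In this schema, we validate that price is a non-negative number and declare that email should be formatted as an email address. Note that the jsonschema library treats format as an annotation by default and only enforces it when a format checker is passed to the validator.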
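One detail worth checking in practice: in the jsonschema library, the format keyword is only enforced when a format checker is supplied. The sketch below compares a validator created without one to a validator created with FormatChecker(). The variable names and the sample item are illustrative.

```python
from jsonschema import Draft7Validator, FormatChecker

contact_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number", "minimum": 0},
        "email": {"type": "string", "format": "email"},
    },
}

# Without a format checker, "format" is not enforced.
lenient = Draft7Validator(contact_schema)
# With a format checker, "format": "email" is actually checked.
strict = Draft7Validator(contact_schema, format_checker=FormatChecker())

item = {"price": 12.5, "email": "not-an-email"}

print(lenient.is_valid(item))  # True: the email format is ignored
print(strict.is_valid(item))   # False: the email format is checked
print(strict.is_valid({"price": -1, "email": "a@example.com"}))  # False: negative price
```

The built-in email check is quite permissive, so a pattern or a dedicated validation step may still be needed when stricter email validation matters.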
{
  "type": "object",
  "properties": {
    "answer": { "$ref": "#/definitions/non-empty-string" },
    "question": { "$ref": "#/definitions/non-empty-string" }
  },
  "definitions": {
    "non-empty-string": {
      "type": "string",
      "minLength": 2,
      "pattern": "(\\S){2,}"
    }
  }
}
This schema defines a reusable definition, non-empty-string, for strings that are at least two characters long and contain at least two consecutive non-whitespace characters (the regex pattern). The answer and question properties then reference this definition with $ref.
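A short sketch of how the referenced definition behaves when the schema is written as a Python dictionary, where a raw string keeps the regular expression readable. The name qa_schema and the sample values are hypothetical.

```python
from jsonschema import Draft7Validator

qa_schema = {
    "type": "object",
    "properties": {
        "answer": {"$ref": "#/definitions/non-empty-string"},
        "question": {"$ref": "#/definitions/non-empty-string"},
    },
    "definitions": {
        "non-empty-string": {
            "type": "string",
            "minLength": 2,
            # The pattern is searched anywhere in the string: it needs
            # two consecutive non-whitespace characters.
            "pattern": r"(\S){2,}",
        }
    },
}
validator = Draft7Validator(qa_schema)

print(validator.is_valid({"question": "Why?", "answer": "Because."}))  # True
print(validator.is_valid({"question": "Why?", "answer": " "}))         # False: too short
print(validator.is_valid({"question": "a b", "answer": "ok"}))         # False: "a b" fails the pattern
```

Both properties share one definition, so tightening the rule later only requires changing it in one place.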
Usage
Because jsonschema is a standalone library, we can use it independently of our web scraping scripts. The most straightforward method is to call the jsonschema.validate(instance, schema) function:
import jsonschema

# Define a JSON Schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "isStudent": {"type": "boolean"}
    },
    "required": ["name", "age"]
}

# Sample dictionary items to validate
correct_data = {"name": "John Doe", "age": 25, "isStudent": True}
wrong_data = {"name": "John Doe", "age": "25", "isStudent": False}

for data in [correct_data, wrong_data]:
    try:
        # Validate the data against the schema
        jsonschema.validate(instance=data, schema=schema)
        print(f"Validation successful for {data}. Data is valid.")
    except jsonschema.exceptions.ValidationError as e:
        print(f"\nValidation Error: {e}")
As the output shows, the library's error message is detailed enough to pinpoint the value that deviates from the schema and the rule it breaks. In this case, the error reports that the age attribute is a string instead of an integer.
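When a record has several problems, validate() raises only one error. As a sketch of an alternative, the snippet below uses iter_errors() to list every violation together with the field it concerns and the keyword that failed. The helper describe_errors and the sample record are hypothetical.

```python
from jsonschema import Draft7Validator

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "isStudent": {"type": "boolean"},
    },
    "required": ["name", "age"],
}
validator = Draft7Validator(person_schema)


def describe_errors(item):
    """Return (field, failed keyword) pairs for every violation in item."""
    errors = sorted(
        validator.iter_errors(item),
        key=lambda e: ([str(part) for part in e.path], e.validator),
    )
    return [
        ("/".join(str(part) for part in e.path) or "<root>", e.validator)
        for e in errors
    ]


# "name" is missing, and two fields have the wrong type.
record = {"age": "25", "isStudent": "no"}
for field, keyword in describe_errors(record):
    print(f"{field}: failed '{keyword}'")
# <root>: failed 'required'
# age: failed 'type'
# isStudent: failed 'type'
```

Each error object also carries a message attribute with the human-readable explanation shown in the earlier output.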
Using jsonschema with Scrapy
Item pipelines are one of the most effective places to apply JSON Schema in Scrapy. As we know, these pipelines run after the data is scraped and before it is sent to the output destination.
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
Code explanation
Here is a breakdown of the code files:
pipelines.py
We define an item pipeline named ValidatePipeline. Following a similar approach as before, we implement the process_item() function. Within this function, we filter out items that do not conform to the conditions specified in book_schema.
Our schema is straightforward. It requires each element to be a non-empty string with a minimum length of 2 characters.
books.py
Here, we trigger a schema validation failure by removing the price element from each item.
When examining the code's output, we can observe in the logs how the pipeline logs a warning for the dropped items and provides information about the reasons behind their exclusion.
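For orientation, here is a minimal sketch of what such a validation pipeline could look like. It is an assumption-based reconstruction, not the lesson's actual pipelines.py: the fields in book_schema (title, price) are hypothetical, and the DropItem fallback exists only so the sketch can run without Scrapy installed.

```python
from jsonschema import Draft7Validator

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback for running this sketch without Scrapy installed.
    class DropItem(Exception):
        pass

# Assumed schema: every field is a string of at least two characters,
# and both fields are required. The field names are hypothetical.
book_schema = {
    "type": "object",
    "properties": {
        "title": {"$ref": "#/definitions/non-empty-string"},
        "price": {"$ref": "#/definitions/non-empty-string"},
    },
    "required": ["title", "price"],
    "definitions": {
        "non-empty-string": {"type": "string", "minLength": 2},
    },
}


class ValidatePipeline:
    """Drop every item that does not conform to book_schema."""

    def __init__(self):
        self.validator = Draft7Validator(book_schema)

    def process_item(self, item, spider):
        errors = [e.message for e in self.validator.iter_errors(dict(item))]
        if errors:
            reasons = "; ".join(errors)
            # Scrapy logs dropped items along with this reason.
            raise DropItem(f"Invalid item: {reasons}")
        return item


# Quick check outside of Scrapy (spider is unused in this sketch)
pipeline = ValidatePipeline()
print(pipeline.process_item({"title": "Dune", "price": "9.50"}, spider=None))
try:
    pipeline.process_item({"title": "Dune"}, spider=None)
except DropItem as exc:
    print(f"Dropped: {exc}")
```

In a real project, the pipeline would also need to be enabled through the ITEM_PIPELINES setting in settings.py, for example with the key scraper.pipelines.ValidatePipeline, assuming the project module is named scraper as in the configuration file above.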
Conclusion
In this lesson, we looked into a crucial aspect of web scraping: validation. We explored how to use the jsonschema library in conjunction with Scrapy to implement validation mechanisms. While the library's official documentation offers comprehensive insights into this concept, it often proves more practical to seek out and study similar implementations from online examples.