Up to this point, we have been scraping data from various websites without verifying its accuracy. Data is not always complete and error-free; there will often be missing fields or incorrect values. Given that we are developing an automated script to scrape millions of records, it is crucial to implement a validation mechanism to ensure the quality of the scraped data.

The JSON Schema library in Python

jsonschema is a third-party Python library that implements the JSON Schema specification. It lets us check whether our JSON data is structured correctly. Since we often use dictionaries to organize data in our web scraping scripts, jsonschema comes in handy for confirming that the data is in the correct format before we do anything else with it.

It can be used alongside other libraries such as Selenium or Beautiful Soup, and Scrapy's item pipeline is a particularly good fit. With an item pipeline, we can validate each scraped item before passing it on for further processing.

JSON Schema with Python

Installation

We can install the jsonschema library in any Python environment by running the following command:

pip install jsonschema

Syntax

The jsonschema library uses a JSON-based syntax to define the structure and constraints of JSON data. For instance, if we are scraping product information, our schema might define properties such as name, price, and description, along with their respective data types.

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "price": { "type": "number" },
    "description": { "type": "string" }
  },
  "required": ["name", "price"]
}
A sample schema

Essential components

The essential components of JSON Schema syntax include:

  1. Type constraints: We can specify the expected data type using the type keyword. JSON Schema supports several data types, including:

    1. string: A string

    2. number: A numeric value (integer or floating-point)

    3. integer: A whole number (no decimals)

    4. boolean: A true or false value

    5. object: A JSON object

    6. array: An array of values

  2. Properties: We can describe the structure of JSON objects using the properties keyword. Each property is defined as a key-value pair, where the key is the property name, and the value is a schema that describes the property's constraints.

  3. Required properties: We use the required keyword to specify an array of property names that must be present in the JSON object.

  4. Additional properties: The additionalProperties keyword allows us to specify whether additional properties not explicitly defined in the schema are allowed (true) or not (false).

  5. Array constraints: For arrays or lists, we can use keywords like items to specify the schema for a list of items, minItems to set the minimum number of items and maxItems to set the maximum number of items.

  6. Validation keywords: JSON Schema provides various validation keywords, such as minimum, maximum, maxLength, minLength, pattern (for regular expressions), and enum (to specify a list of allowed values).

  7. Combining schemas: We can combine multiple schemas using keywords like allOf (all conditions must be met), anyOf (at least one condition must be met), and oneOf (exactly one condition must be met).

  8. References: JSON Schema allows us to reference other parts of the schema using the $ref keyword, enabling schema reuse and modularization.
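Several of the keywords above can be combined in one schema. The following is a minimal sketch, assuming a hypothetical product record; the field names (currency, tags, sku) and the sample values are illustrative and do not come from the lesson's project. It uses Draft7Validator from jsonschema to collect every violation rather than stopping at the first one.

```python
from jsonschema import Draft7Validator

# Hypothetical product schema exercising several of the keywords listed above.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "currency": {"enum": ["USD", "EUR", "GBP"]},
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1,
            "maxItems": 5,
        },
        # sku may be either a string or an integer.
        "sku": {"anyOf": [{"type": "string"}, {"type": "integer"}]},
    },
    "required": ["name", "price"],
    "additionalProperties": False,
}

validator = Draft7Validator(product_schema)

good = {
    "name": "Desk lamp",
    "price": 19.99,
    "currency": "EUR",
    "tags": ["home"],
    "sku": 1042,
}
# Empty name, negative price, unknown currency, empty tags, unexpected key.
bad = {"name": "", "price": -1, "currency": "JPY", "tags": [], "color": "red"}

print(validator.is_valid(good))
print(validator.is_valid(bad))

# iter_errors yields one ValidationError per violated constraint.
for error in validator.iter_errors(bad):
    print(list(error.absolute_path), error.message)
```

Each constraint in the bad record is reported separately, which is convenient when deciding whether a scraped item can be repaired or should be discarded.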

Examples

Let's look at more detailed examples of various JSON Schema structures:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" },
    "isStudent": { "type": "boolean" }
  },
  "required": ["name", "age"]
}

In this schema, we define an object with properties name, age, and isStudent. name must be a string, age must be an integer, isStudent is a boolean, and both name and age are required properties.

{
  "type": "array",
  "items": { "type": "string" },
  "minItems": 1,
  "uniqueItems": true
}

This schema defines an array of strings with a minimum of one item, and all items must be unique.
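To see how these array constraints behave, here is a small sketch that runs the same schema against a few hand-written lists. The variable names and sample values are assumptions made for illustration.

```python
from jsonschema import Draft7Validator

tags_schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "uniqueItems": True,
}
validator = Draft7Validator(tags_schema)

samples = [
    ["fiction", "poetry"],   # satisfies every constraint
    [],                      # violates minItems
    ["fiction", "fiction"],  # violates uniqueItems
    ["fiction", 3],          # violates the items type
]

# One boolean per sample: True when the list conforms to the schema.
results = [validator.is_valid(sample) for sample in samples]
print(results)
```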

{
  "type": "object",
  "properties": {
    "price": { "type": "number", "minimum": 0 },
    "email": { "type": "string", "format": "email" }
  }
}

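In this schema, we require price to be a non-negative number and declare that email should follow the email format. Note that jsonschema checks format keywords only when a format checker is supplied; by default, they are treated as annotations and are not enforced.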
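The sketch below illustrates how the format keyword behaves with and without jsonschema's FormatChecker. The record is made up for illustration. Keep in mind that the built-in email check is very lenient, so a stricter pattern may be worth adding if email quality matters.

```python
from jsonschema import Draft7Validator, FormatChecker

contact_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number", "minimum": 0},
        "email": {"type": "string", "format": "email"},
    },
}

# Made-up record: the price is fine, but the email has no "@".
record = {"price": 12.5, "email": "not-an-email"}

# Without a format checker, "format" is not enforced.
lenient = Draft7Validator(contact_schema)

# With a format checker, the built-in email check is applied.
strict = Draft7Validator(contact_schema, format_checker=FormatChecker())

print(lenient.is_valid(record))
print(strict.is_valid(record))
```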

{
  "type": "object",
  "properties": {
    "answer": { "$ref": "#/definitions/non-empty-string" },
    "question": { "$ref": "#/definitions/non-empty-string" }
  },
  "definitions": {
    "non-empty-string": {
      "type": "string",
      "minLength": 2,
      "pattern": "(\\S){2,}"
    }
  }
}

This schema declares a reusable definition named non-empty-string: a string of at least two characters that matches the given regular expression (two or more consecutive non-whitespace characters). The answer and question properties then reference this definition through $ref.
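As a quick check of the reference mechanism, the following sketch writes the same schema as a Python dictionary and validates two hand-made records against it. The sample questions and answers are assumptions for illustration only.

```python
from jsonschema import Draft7Validator

qa_schema = {
    "type": "object",
    "properties": {
        "question": {"$ref": "#/definitions/non-empty-string"},
        "answer": {"$ref": "#/definitions/non-empty-string"},
    },
    "definitions": {
        "non-empty-string": {
            "type": "string",
            "minLength": 2,
            "pattern": r"(\S){2,}",
        }
    },
}
validator = Draft7Validator(qa_schema)

ok = {"question": "Why?", "answer": "Because."}
# "?" is too short; "  " is long enough but contains only whitespace.
too_short = {"question": "?", "answer": "  "}

print(validator.is_valid(ok))

# Collect the names of the fields that failed the shared definition.
failed_fields = {error.absolute_path[0] for error in validator.iter_errors(too_short)}
print(sorted(failed_fields))
```

Because both properties point at the same definition, tightening the rule in one place changes the behavior for every field that references it.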

Usage

Because jsonschema is a standalone library, we can use it independently of our web scraping scripts. The most straightforward method is the jsonschema.validate(instance, schema) function, which raises a ValidationError when the data does not conform to the schema:

import jsonschema

# Define a JSON Schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "isStudent": {"type": "boolean"}
    },
    "required": ["name", "age"]
}

# Sample dictionary items to validate
correct_data = {
    "name": "John Doe",
    "age": 25,
    "isStudent": True
}
wrong_data = {
    "name": "John Doe",
    "age": "25",
    "isStudent": False
}

for data in [correct_data, wrong_data]:
    try:
        # Validate the data against the schema
        jsonschema.validate(instance=data, schema=schema)
        print(f"Validation successful for {data}. Data is valid.")
    except jsonschema.exceptions.ValidationError as e:
        print(f"\nValidation Error: {e}")

As the output shows, the library's error message is detailed enough to pinpoint the value that deviates from the schema and to explain the underlying issue. In our case, the error reports that the age attribute was supplied as a string ("25") instead of an integer.
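jsonschema.validate() raises on a single error. When a record may break several rules at once, a validator object's iter_errors() method can list all of them. The sketch below reuses the person schema with a deliberately broken, made-up record and prints the field, message, and failing keyword for each error.

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "isStudent": {"type": "boolean"},
    },
    "required": ["name", "age"],
}

# Made-up record: name is missing and two fields have the wrong type.
record = {"age": "25", "isStudent": "no"}

validator = Draft7Validator(schema)

# Sort by location so the report order is stable.
errors = sorted(validator.iter_errors(record), key=lambda e: list(e.absolute_path))

for error in errors:
    # absolute_path is empty for object-level errors such as "required".
    field = ".".join(str(part) for part in error.absolute_path) or "<root>"
    print(f"{field}: {error.message} (keyword: {error.validator})")
```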

Using jsonschema with Scrapy

Item pipelines are one of the most effective places to apply JSON Schema validation in Scrapy. As we know, these pipelines run after the data is scraped and before it is sent to the output destination.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
Validating books website spider with JSON Schema

Code explanation

Here is the breakdown of the above code files:

  1. pipelines.py

    1. We define an item pipeline named ValidatePipeline. Following a similar approach as before, we implement the process_item() function. Within this function, we aim to filter out items that do not conform to the specified book_schema conditions.

    2. Our schema is straightforward. It enforces that elements must contain a non-empty string with a minimum length of 2 characters.

  2. books.py

    1. Here, we deliberately trigger a validation failure by removing the price element from each item.

In the output logs, we can see that the pipeline logs a warning for each dropped item, along with the reason it was excluded.

Conclusion

In this lesson, we looked into a crucial aspect of web scraping: validation. We explored how we can harness the power of the jsonschema library in conjunction with Scrapy to implement validation mechanisms. While the library’s official documentation offers comprehensive insights into this concept, it often proves more practical to seek out and study similar implementations from online examples.
