JSON Structure of Scraped Data

Learn how to export scraped data as JSON.

We have nearly completed the basics, so let's conclude the fundamentals by arranging the scraped data and storing it in a well-structured format.

Writing to JSON

After extracting the data from the website, we store each record in a dictionary, which lets us export everything as a JSON file, as in the sample code below:

import json
import requests
from bs4 import BeautifulSoup

# Keep the base URL to use when joining relative page URLs
base_url = "https://quotes.toscrape.com"
data = []

def scrape(url):
    """
    Request the URL, extract the quotes, find the next page, and recurse.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = [x.string for x in soup.select("div.quote span.text")]
    authors = [x.string for x in soup.select("small.author")]
    tags = soup.select("div.tags")
    for quote, author, tag in zip(quotes, authors, tags):
        d = dict()
        d["quote"] = quote.strip()
        d["author"] = author.strip()
        d["tags"] = [x.string.strip() for x in tag.select("a.tag")]
        data.append(d)
    next_page = soup.select_one("ul.pager > li.next")
    # Check whether we have reached the last page
    if next_page:
        # Join the base URL with the relative page URL,
        # e.g., "https://quotes.toscrape.com" + "/page/2/"
        next_page_url = requests.compat.urljoin(base_url, next_page.a['href'])
        scrape(next_page_url)
    return

scrape(base_url)
if len(data):
    print(len(data), "quotes scraped from https://quotes.toscrape.com/")
    with open('output/data.json', 'w') as f:
        json.dump(data, f)
  • Inside the for loop, we initialize a dictionary for each quote, fill it with the scraped fields, and append it to the data list.

  • After all pages are scraped, we write the collected data to a JSON file with json.dump.
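
For reference, each entry in output/data.json follows this structure. The values below come from the site's first quote and are shown with unescaped characters for readability (json.dump escapes non-ASCII characters by default):

{
    "quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
}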

Try it yourself

Scrape all 1,000 books, page by page, from the Books to Scrape website. Save the scraped data as a JSON file named "output/data.json"; use this exact path to pass the test cases.

Tasks:

  1. Write the CSS selector pattern for each requested item.

  2. Extract each item's text from the matched elements using list comprehensions.

  3. Write the next page pattern and fix the error in the next page URL so that all pages are reachable (see the urljoin example after this list).

  4. Save the scraped data to the JSON file.
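
A hint for task 3: urljoin resolves a relative link against whichever base URL you pass it, so the choice of base matters. Below is a minimal sketch of the difference; the example paths are illustrative of the site's layout:

from requests.compat import urljoin

# Joined against the site root, the "catalogue/" prefix is lost:
print(urljoin("https://books.toscrape.com/", "page-3.html"))
# -> https://books.toscrape.com/page-3.html

# Joined against the current page URL, the path context is kept:
print(urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# -> https://books.toscrape.com/catalogue/page-3.html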

import json
import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/"
data = []

def scrape(url):
    """
    Request the URL, extract the books, find the next page info, and recurse.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # ToDo
    # 1. Write the correct CSS patterns here to retrieve the information
    titles_pattern = ""
    images_pattern = ""
    rates_pattern = ""
    prices_pattern = ""
    titles = []  # [x['title'] for x in soup.select(titles_pattern)]
    # ToDo
    # 2. Write the list comprehension expressions to get the needed elements after applying the patterns
    images = []
    rates = []
    prices = []
    for title, image, rate, price in zip(titles, images, rates, prices):
        d = dict()
        d["title"] = title.strip()
        d["image"] = image
        d["rate"] = rate.strip()
        d["price"] = price.strip()
        data.append(d)
    # ToDo
    # 3. Write the CSS pattern here to get the next page element
    next_page = None  # soup.select_one("")
    if next_page:
        # Parse the next page URL from the element correctly so that,
        # when joined with the base URL, it moves to the next page.
        page_part = ''
        next_page_url = urljoin(base_url, page_part)
        scrape(next_page_url)
    return

scrape(base_url)
if len(data):
    print(len(data), "books scraped from https://books.toscrape.com/")
    # ToDo
    # 4. Write the code to save the data to "output/data.json"

If the data conforms to the required structure, we can export it to various formats, such as CSV.
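
For instance, here is a minimal sketch that converts the exported quotes data into a CSV file. It assumes output/data.json already exists, and it flattens the tags list into a single comma-separated column because CSV has no nested structure:

import csv
import json

# Load the previously exported JSON data
with open('output/data.json') as f:
    data = json.load(f)

# Write one row per quote, flattening the tags list into one column
with open('output/data.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["quote", "author", "tags"])
    writer.writeheader()
    for d in data:
        writer.writerow({**d, "tags": ",".join(d["tags"])})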

Conclusion

We should now be able to scrape any static website and export the output to any data format using Beautiful Soup and CSS Selectors. In the next section, we will learn about dynamic websites and how to adapt our scripts to suit these sites.
