JSON Structure of Scraped Data
Learn how to export the scraped data using JSON.
We have nearly completed the basics. To wrap up the fundamentals, let's organize the scraped data and store it in a well-structured format.
Writing to JSON
After extracting the data from the website, we store each record in a dictionary and collect the dictionaries in a list. This structure can be exported directly as a JSON file, as in the sample code below:
import json
import requests
from bs4 import BeautifulSoup

# maintain the main URL to use when joining page url
base_url = "https://quotes.toscrape.com"

data = []


def scrape(url):
    """request on the URL, get the quotes, find the next page info, recurse."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    quotes = [x.string for x in soup.select("div.quote span.text")]
    authors = [x.string for x in soup.select("small.author")]
    tags = soup.select("div.tags")

    for quote, author, tag in zip(quotes, authors, tags):
        d = dict()
        d["quote"] = quote.strip()
        d["author"] = author.strip()
        d["tags"] = [x.string.strip() for x in tag.select("a.tag")]
        data.append(d)

    next_page = soup.select_one("ul.pager > li.next")
    # check if we reached the last page or not.
    if next_page:
        # join the main url with the page sub url
        # ex: "https://quotes.toscrape.com" + "/page/2/"
        next_page_url = requests.compat.urljoin(base_url, next_page.a['href'])
        scrape(next_page_url)
    return


scrape(base_url)

if len(data):
    print(len(data), " scraped from https://quotes.toscrape.com/")
    with open('output/data.json', 'w') as f:
        json.dump(data, f)
Lines 21–25: Initialize a dictionary, add the scraped data, and append it to the final list.
Lines 39–42: Write the scraped data to a JSON file.
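As a small aside to the sample above: by default, `json.dump` escapes non-ASCII characters (such as the curly quotation marks found in many quotes) and writes everything on a single line. The following is a minimal sketch, not part of the lesson's solution, showing the optional `ensure_ascii=False` and `indent` arguments. The record, the helper names `write_json` and `read_json`, and the temporary directory are all made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical record shaped like the dictionaries built in scrape().
records = [
    {"quote": "“A sample quote.”", "author": "Jane Doe", "tags": ["sample", "demo"]},
]


def write_json(rows, path):
    """Write rows to path as human-readable UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters like “ as-is instead of \u201c
        # indent=2 pretty-prints the file, one key per line
        json.dump(rows, f, ensure_ascii=False, indent=2)


def read_json(path):
    """Load the JSON file back into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Use a temporary directory so the sketch does not touch output/data.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    write_json(records, path)
    loaded = read_json(path)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

print(loaded == records)  # the round trip preserves the data
```

Reading the file back with `json.load` is a quick way to confirm that the export is valid JSON before handing it to another tool.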
Try it yourself
Scrape all 1,000 books from the Books to Scrape website by visiting every page. Save the scraped data as a JSON file at "output/data.json"; use this exact path so that the test cases pass.
Tasks:
Write the CSS selector pattern for each requested item.
Extract the item text from each pattern using list comprehensions.
Write the next page pattern and fix the error in the next page URL to ensure access to all pages.
Save the scraped data to the JSON file.
import json
import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/"

data = []


def scrape(url):
    """request on the URL, get the books, find the next page info, recurse."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # ToDo
    # 1. write the correct CSS patterns here to retrieve the information
    titles_pattern = ""
    images_pattern = ""
    rates_pattern = ""
    prices_pattern = ""

    titles = []  # [x['title'] for x in soup.select(titles_pattern)]
    # ToDo
    # 2. write the list comprehension expression to get the needed element after using the patterns
    images = []
    rates = []
    prices = []

    for title, image, rate, price in zip(titles, images, rates, prices):
        d = dict()
        d["title"] = title.strip()
        d["image"] = image
        d["rate"] = rate.strip()
        d["price"] = price.strip()
        data.append(d)

    # ToDo
    # 3. write the CSS pattern here to get the next page element
    next_page = None  # soup.select_one("")
    if next_page:
        # parse the next page URL from the element correctly so that when it is joined
        # with the base url it moves to the next page.
        page_part = ''
        next_page_url = urljoin(base_url, page_part)
        scrape(next_page_url)
    return


scrape(base_url)

if len(data):
    print(len(data), " scraped from https://books.toscrape.com/")
    # ToDo
    # 4. write the code to save the data to "output/data.json"
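As a hint for the next-page task, it may help to see how `urljoin` resolves relative links. The sketch below is one possible approach, not necessarily the intended solution. It uses `urllib.parse.urljoin`, which is the function that `requests.compat` re-exports on Python 3. The two `href` values are assumptions about how the site links its pages; inspect the real pages to confirm them.

```python
from urllib.parse import urljoin  # same function as requests.compat.urljoin on Python 3

base_url = "https://books.toscrape.com/"

# Assumed hrefs, for illustration only: the first page links to
# "catalogue/page-2.html", while later pages link to "page-3.html".
first_href = "catalogue/page-2.html"
later_href = "page-3.html"

# Joining the first link with the base URL gives the expected address.
page_2 = urljoin(base_url, first_href)

# Joining a later link with the base URL drops the "catalogue/" folder.
broken = urljoin(base_url, later_href)

# Joining it with the URL of the page it came from keeps the folder.
page_3 = urljoin(page_2, later_href)

print(page_2)  # https://books.toscrape.com/catalogue/page-2.html
print(broken)  # https://books.toscrape.com/page-3.html
print(page_3)  # https://books.toscrape.com/catalogue/page-3.html
```

Under these assumptions, resolving each link against the URL passed to `scrape` rather than against `base_url` is one way to keep the path intact. Normalizing `page_part` so that it always starts with `catalogue/` is another.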
Because the data is a list of dictionaries that all share the same keys, we can also export it to other formats, such as CSV.
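To illustrate the CSV case, here is a minimal sketch using the standard library's `csv.DictWriter`. The sample rows, the helper name `to_csv`, and the in-memory buffer are assumptions made for the example; the keys mirror the ones used in the exercise above.

```python
import csv
import io

# Hypothetical rows with the same keys as the exercise's dictionaries.
rows = [
    {"title": "Sample Book", "image": "media/sample.jpg", "rate": "Three", "price": "£10.00"},
    {"title": "Another Book", "image": "media/another.jpg", "rate": "Five", "price": "£20.50"},
]


def to_csv(records, stream):
    """Write a list of same-keyed dictionaries to a text stream as CSV."""
    # Column order follows the keys of the first record.
    writer = csv.DictWriter(stream, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)


# An in-memory buffer stands in for a file so the sketch has no side effects.
buffer = io.StringIO()
to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead, open it with `newline=""` and an explicit encoding, for example `open(path, "w", newline="", encoding="utf-8")`, as the `csv` module documentation recommends. A list-valued field, such as the `tags` list in the quotes example, would need to be joined into a single string first.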
Conclusion
We should now be able to scrape static websites and export the results to common data formats using Beautiful Soup and CSS selectors. In the next section, we will learn about dynamic websites and how to adapt our scripts to them.