Solution Review: Scrape the Web Page Using Beautiful Soup

Review the solution for the book information scraping task.

Solution

We start by inspecting the web page and finding the elements we want.

Inspecting the DOM of the first page
import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/"

titles = []
images = []
rates = []
prices = []

# Solution
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')

articles = soup.find_all("article", {"class": "product_pod"})  # one <article> per book

for article in articles:
    image = urljoin(base_url,
                    article.find("div", {"class": "image_container"}).a.img['src'])
    rate = article.find("p", {"class": "star-rating"})['class'][1]  # second class name is the rating
    title = article.find("h3").a['title']
    price = article.find("div", {"class": "product_price"}).p.string

    titles.append(title)
    images.append(image)
    rates.append(rate)
    prices.append(price)

print("Length of scraped titles: ", len(titles))
print("Length of scraped images: ", len(images))
print("Length of scraped rates: ", len(rates))
print("Length of scraped prices: ", len(prices))
print(titles)

Code explanation

  • Lines 13–14: We request the site URL using requests.get() and pass response.content to BeautifulSoup() to parse the HTML.

  • Line 16: Then, we find all the parent elements that hold each book's information, which in our case are the <article class='product_pod'> elements.

  • Lines 18–23: Now we loop through each article and extract the elements we want from each one as follows:

    • Lines 19–20: We get the image link by following the path <div class='image_container'> <a> <img>. The src attribute holds a relative URL, so we join it with base_url using urljoin() to get an absolute link.

    • Line 21: The rating is a bit tricky. The <i> elements under the <p> tag don't hold it; each one only draws a star shape, and the CSS colors the stars according to a class name on the <p> tag that spells out the rating as a word.

    • To get that class name, we find the <p class='star-rating'> element and read its class attribute, which returns a list of all the classes assigned to the element. The rating is the second item in that list, at index 1.

    • Line 22: Finding the title is straightforward; we find the <h3> tag and access the <a> tag under it and get its title attribute.

    • Line 23: Lastly, the price is the text of the first <p> tag under the <div class='product_price'> tag, which we read with .string.

    • Lines 25–28: Still inside the loop, we append each scraped value to its list, so the four lists stay aligned by book.
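
The extraction steps above can be tried without a network connection. The sketch below is a minimal illustration, assuming a hand-written HTML fragment shaped like one book entry; SAMPLE_HTML, its file names and price, and the parse_article helper are all hypothetical and not part of the solution.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/"

# Hypothetical fragment that mimics a single book entry (assumed markup).
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/sample-book_1/index.html">
      <img src="media/cache/sample.jpg" alt="Sample Book" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Three">
    <i class="icon-star"></i><i class="icon-star"></i><i class="icon-star"></i>
    <i class="icon-star"></i><i class="icon-star"></i>
  </p>
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book: The Full Title">Sample Book: The ...</a></h3>
  <div class="product_price">
    <p class="price_color">£12.34</p>
    <p class="instock availability">In stock</p>
  </div>
</article>
"""


def parse_article(article, base_url=BASE_URL):
    """Extract image URL, rating word, title, and price from one <article> tag."""
    src = article.find("div", {"class": "image_container"}).a.img["src"]
    # "class" is a multi-valued attribute, so Beautiful Soup returns a list.
    rating_classes = article.find("p", {"class": "star-rating"})["class"]
    return {
        "image": urljoin(base_url, src),
        "rating": rating_classes[1],
        "title": article.find("h3").a["title"],
        "price": str(article.find("div", {"class": "product_price"}).p.string),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rating_classes = soup.find("p", {"class": "star-rating"})["class"]
book = parse_article(soup.find("article", {"class": "product_pod"}))

print(rating_classes)
print(book)
```

Wrapping the four lookups in one function keeps each book's values together, which can be easier to test than four parallel lists.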

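To see why the solution joins the src value with base_url, here is a small sketch of how urljoin() resolves relative paths. It imports the function from the standard library (on Python 3, requests.compat re-exports this same function); the image paths and the deeper listing URL are assumed examples, not values taken from the page.

```python
from urllib.parse import urljoin

home = "https://books.toscrape.com/"
# Assumed example of a page that lives one directory below the home page.
listing = "https://books.toscrape.com/catalogue/page-2.html"

# A relative path is resolved against the directory of the base URL.
from_home = urljoin(home, "media/cache/sample.jpg")

# "../" climbs one directory before the rest of the path is appended.
from_listing = urljoin(listing, "../media/cache/sample.jpg")

# An already absolute URL is returned unchanged.
absolute = urljoin(home, "https://example.com/cover.jpg")

print(from_home)
print(from_listing)
print(absolute)
```

Because the result depends on the base, it is safest to pass the URL of the page the src value was actually scraped from.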