Solution Review: Scrape the Web Page Using Beautiful Soup
Review the solution for the book information scraping task.
Solution
We start by inspecting the web page and finding the elements we want.
import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/"

titles = []
images = []
rates = []
prices = []

# Solution
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')

articles = soup.find_all("article", {"class": "product_pod"})

for article in articles:
    image = urljoin(base_url,
                    article.find("div", {"class": "image_container"}).a.img['src'])
    rate = article.find("p", {"class": "star-rating"})['class'][1]
    title = article.find("h3").a['title']
    price = article.find("div", {"class": "product_price"}).p.string

    titles.append(title)
    images.append(image)
    rates.append(rate)
    prices.append(price)

print("Length of scraped titles: ", len(titles))
print("Length of scraped images: ", len(images))
print("Length of scraped rates: ", len(rates))
print("Length of scraped prices: ", len(prices))
print(titles)
Code explanation
Lines 13–14: We request the site URL using requests.get() and pass response.content to BeautifulSoup().
Line 16: Then, we find all the parent elements that hold each book's information, which in our case are <article class='product_pod'>.
Lines 18–23: Now we loop through each article and extract the elements we want from each one as follows:
Lines 19–20: We get the image link by following the path <div class='image_container'> <a> <img>. Then we join the URL from the src attribute with base_url using urljoin().
Line 21: The rating is a bit tricky; the <i> elements under <p> don't hold the information. Each one only draws a star shape, and the CSS colors the stars according to a class name on the <p> tag that spells out the number (for example, Three). To get that class name, we find the element that holds it and read its class attribute, which returns a list of all the classes assigned to the element. The rating is the second item in that list, at index 1.
Line 22: Finding the title is straightforward; we find the <h3> tag, access the <a> tag under it, and get its title attribute.
Line 23: Lastly, the price is the text of the first <p> tag under the <div class='product_price'> tag.
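The extraction steps in lines 19–23 can be checked offline against a small HTML fragment. The sketch below is only an illustration: SAMPLE_HTML is a hand-written fragment modeled on the page's markup, and the book title, image path, and price in it are made-up placeholders rather than scraped data. It imports urljoin from the standard library's urllib.parse, which behaves the same way for this purpose as the one imported from requests.compat in the solution.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/"

# Hypothetical fragment that mimics one <article class="product_pod"> block.
# All values inside it are placeholders, not real data from the site.
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/sample-book_1/index.html">
      <img src="media/cache/sample.jpg" alt="Sample Book" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Three">
    <i class="icon-star"></i><i class="icon-star"></i><i class="icon-star"></i>
    <i class="icon-star"></i><i class="icon-star"></i>
  </p>
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book">Sample ...</a></h3>
  <div class="product_price">
    <p class="price_color">£10.00</p>
    <p class="instock availability">In stock</p>
  </div>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find("article", {"class": "product_pod"})

# The class attribute is multi-valued, so Beautiful Soup returns a list.
rating_classes = article.find("p", {"class": "star-rating"})["class"]
rate = rating_classes[1]

# The src attribute is relative, so it is resolved against the base URL.
image = urljoin(BASE_URL, article.find("div", {"class": "image_container"}).a.img["src"])

# The full title lives in the title attribute of the <a> tag under <h3>.
title = article.find("h3").a["title"]

# .p picks the first <p> under the price container, which holds the price text.
price = article.find("div", {"class": "product_price"}).p.string

print(rating_classes)
print(rate, image, title, price)
```

Printing rating_classes shows why the solution indexes with [1]: the first entry is the generic star-rating class, and the second is the word for the rating.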
Lines 25–28: We append the values scraped for each book to the corresponding lists.
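Because the four lists are appended to in the same loop iteration, entries at the same index describe the same book. As an optional follow-up that is not part of the original solution, the sketch below pairs the lists with zip() and converts the rating word into an integer. The RATING_WORDS mapping and the to_records() helper are hypothetical names introduced for illustration, and the sample lists hold placeholder values instead of scraped results.

```python
# Hypothetical mapping from the rating class name to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Placeholder values standing in for the lists filled by the scraper.
titles = ["Sample Book A", "Sample Book B"]
rates = ["Three", "One"]
prices = ["£10.00", "£20.50"]


def to_records(titles, rates, prices):
    """Combine the parallel lists into one dictionary per book."""
    records = []
    for title, rate, price in zip(titles, rates, prices):
        records.append({
            "title": title,
            "rating": RATING_WORDS[rate],  # e.g. "Three" becomes 3
            "price": price,                # kept as the scraped text
        })
    return records


records = to_records(titles, rates, prices)
for record in records:
    print(record)
```

Keeping one dictionary per book makes it harder for the fields to drift out of step than maintaining four separate lists, though either layout works for a single page of results.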