Web Scraping with Beautiful Soup
Discover the key features and applications of the Beautiful Soup library.
Up to this point, we have acquired the necessary skills to make HTTP requests and retrieve the HTML document from a website. It's time to delve deeper and extract the relevant information from the DOM.
Introduction
Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents. It offers a straightforward and flexible way to navigate and extract data from web pages, making it an indispensable tool for anyone who needs to gather and analyze data from the internet. Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents. Thanks to its user-friendly syntax and robust functionality, it has become a preferred choice for developers and data scientists seeking to extract and process web data efficiently. In this lesson, we will explore the key features and applications of the Beautiful Soup library.
Note: It is recommended to inspect the URLs we will use in this lesson in a separate tab to gain a better understanding of the code paths.
Installation
We can install the Beautiful Soup library in any Python environment by running the command pip install beautifulsoup4.
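To verify the installation, we can parse a tiny document (note that the package installs under the import name bs4):

# bs4 is the import name for the beautifulsoup4 package
from bs4 import BeautifulSoup

# parse a one-tag document to confirm the install works
print(BeautifulSoup("<p>hello</p>", "html.parser").p.string)  # prints: hello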
Usage
Let's briefly look at using it. The prettify() method produces a nicely formatted string of the parse tree, with each tag and each string on its own line:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
Note: To handle the decoding process effectively, it is always better to use .content instead of .text when using Beautiful Soup. .content returns the raw response bytes, letting Beautiful Soup detect the document's encoding itself, while .text has already been decoded using the encoding guessed by requests.
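For illustration, here is a quick look at the types of the two attributes:

import requests

response = requests.get("https://quotes.toscrape.com/")

# .content is raw bytes; Beautiful Soup detects the encoding itself
print(type(response.content))  # <class 'bytes'>

# .text is already a str, decoded with the encoding requests guessed
print(type(response.text))     # <class 'str'>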
Once the document is parsed, the output can be handled as a data structure (tree), and we can access its elements like any other Python object attribute.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

print("Head tag's children: ", list(soup.head.children), "\n")
print("Page title: ", soup.title.string, "\n")
print("Sample quote: ", soup.find_all("span", {"class": "text"})[0].text, "\n")
Attributes
Since the parsed document behaves like a tree of Python objects, it is worth reviewing several significant attributes and the outputs they produce.
.tag: Returns the element with the selected tag. It can be used consecutively (e.g., soup.body.div) to reach a specific tag by following its children.

.contents vs. .children: Children of a tag can be found in the .contents list. Instead of retrieving the list, we may use the .children generator to iterate through a tag's children.

.descendants: Recursively returns all the children and their children (all the sub-HTML trees) of the tag.

.strings vs. .stripped_strings: .strings returns all strings in the HTML document, including whitespace characters and strings nested within tags, while .stripped_strings returns only non-empty strings that contain visible text and strips leading and trailing whitespace from each string.

.parent vs. .parents: .parent returns the immediate parent of the current tag, while .parents returns an iterator that allows iterating over all the parents of the current tag.

.next_sibling vs. .previous_sibling: .next_sibling returns the following sibling tag of the current tag, while .previous_sibling returns the previous sibling tag of the current tag.

.next_element vs. .previous_element: .next_element returns the next element in the parse tree after the current element, while .previous_element returns the previous element in the parse tree before the current element.
# .tag
soup.body.div.div.span = <span class='text'>"The world as.."</span>

# .contents
<div class='quote'>.contents = [<span class='text'>, <span class='tags'>, ...]

# .descendants
<div class='quote'>.descendants = [<span class='text'>, "The world we have created...", <span class='tags'>, <a href=...>, ...]

# .strings
<span class='text'>.strings = [" The world we have created... "]

# .stripped_strings
<span class='text'>.stripped_strings = ["The world we have created..."]

# .parent
<a href='/tag/deep-thoughts/'>.parent = <span class='tags'>...</span>

# .next_sibling
<span class='text'>.next_sibling = <span class='tags'>...</span>

# .previous_sibling
<span class='tags'>.previous_sibling = <span class='text'>...</span>

# .next_element
<a href='/tag/deep-thoughts/'>.next_element = "deep-thoughts"

# .previous_element
<a href='/tag/deep-thoughts/'>.previous_element = <span class='tags'>...</span>
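To make these attributes concrete, here is a small runnable sketch against the same page (the selectors rely on the div.quote structure shown above):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

first_quote = soup.find("div", {"class": "quote"})

# .contents is a list; .children is a generator over the same nodes
print(len(first_quote.contents), len(list(first_quote.children)))

# .stripped_strings drops whitespace-only strings and trims the rest
print(list(first_quote.stripped_strings)[:3])

# .parent goes one level up; .parents iterates up to the document root
print(first_quote.parent.name)
print([p.name for p in first_quote.parents])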
Try it yourself
Explore some of the above attributes using the editor below and Quotes to Scrape:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

first_quote_element = soup.find("div", {"class": "quote"})
print(type(first_quote_element))
Searching the DOM
In Beautiful Soup, find_all() is a method that searches the entire parse tree of an HTML or XML document and returns a list of all the matching elements. It is a powerful method that can be used to search for any element in the document based on its tag name, attributes, values, and other criteria. The find() method returns the first element that matches, while the find_all() method returns all matching elements.
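Beyond a tag name plus an attribute dictionary, find_all() also accepts lists and regular expressions as search criteria. A brief sketch (the heading tags are simply what Quotes to Scrape happens to contain):

import re
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# tag name + attributes: all quote text spans
print(len(soup.find_all("span", {"class": "text"})))

# a list of tag names: all <h1> and <h2> headings
print(len(soup.find_all(["h1", "h2"])))

# a regular expression matched against tag names
print(len(soup.find_all(re.compile("^h[1-6]$"))))

# find() returns only the first match (or None if nothing matches)
print(soup.find("span", {"class": "text"}).text)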
Let's scrape the data from the Quotes to Scrape website using the find_all() method:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# returns all the elements with class="quote"
all_quotes_div_elements = soup.find_all("div", {"class": "quote"})

quotes = []
for div in all_quotes_div_elements:
    # find() will always return the first match
    text_span = div.find("span", {"class": "text"})
    quotes.append(text_span.string)

print(quotes[:5])
Line 8: We first search for all the <div> elements that hold each quote's information.

Lines 11–14: Then we iterate through all of them and, for each one, search for the <span> tag that holds the quote's text. We then extract it using the .string attribute.
Try it yourself
Try doing the same in the code below. Scrape all the authors' names from the first page.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

all_quotes_div_elements = soup.find_all("div", {"class": "quote"})

# don't remove the list, just append to it
authors = []

# TODO
# Ignore the "by" word, just append the string of the tag

print(set(authors))
We have successfully retrieved information from the first page, but our goal is to scrape the entire site. To accomplish this, we need to iterate through all the page URLs and retrieve the quotes from each one.
import requests
from bs4 import BeautifulSoup

# maintain the main URL to use when joining page URLs
base_url = "https://quotes.toscrape.com"
all_quotes = []

def get_quotes(soup):
    """Retrieve the quotes from the soup of the current page."""
    all_quotes_div_elements = soup.find_all("div", {"class": "quote"})
    quotes = []
    for div in all_quotes_div_elements:
        text_span = div.find("span", {"class": "text"})
        quotes.append(text_span.string)
    return quotes

def scrape(url):
    """Request the URL, get the quotes, find the next page info, recurse."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    all_quotes.extend(get_quotes(soup))

    # we got this info after inspecting the next button
    next_page = soup.find("ul", {"class": "pager"}).find("li", {"class": "next"})

    # check if we reached the last page or not
    if next_page:
        # join the main URL with the page sub-URL
        # ex: "https://quotes.toscrape.com" + "/page/2/"
        next_page_url = requests.compat.urljoin(base_url, next_page.a['href'])
        scrape(next_page_url)
    return

scrape(base_url)
print("Total quotes scraped: ", len(all_quotes))
print(all_quotes[:5])
Lines 8–15: In these lines, we define a function get_quotes() that takes the soup object and scrapes all the quotes' text using the code we built above.

Line 24: We then inspect the next page button and get that element by specifying its path in the DOM.

Line 27: The last page won't have a next page element, so we check whether the next_page variable holds an element or a None value.

Line 30: We extract the next page URL from the element. However, the URL doesn't contain the domain name, so we use the requests.compat.urljoin() function, which joins two URLs together.

Line 31: Lastly, we call the scrape() function with the next page URL and repeat the whole process until we reach the last page of the site.
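To see what urljoin() does in isolation:

from requests.compat import urljoin

# joins a base URL with a relative path, normalizing the slashes
print(urljoin("https://quotes.toscrape.com", "/page/2/"))
# https://quotes.toscrape.com/page/2/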
There is an easier way to do the task above: using a simple loop, we can build the list of page URLs and request each one in turn, as sketched below. However, implementing the recursive method helps us understand different approaches that can be useful in more complex scenarios.
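Here is a minimal sketch of that loop-based approach. It assumes the /page/N/ URL pattern used by Quotes to Scrape and stops at the first page that contains no quotes:

import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"
all_quotes = []

page = 1
while True:
    response = requests.get(f"{base_url}/page/{page}/")
    soup = BeautifulSoup(response.content, 'html.parser')
    quote_spans = soup.find_all("span", {"class": "text"})
    # an empty result means we have moved past the last page
    if not quote_spans:
        break
    all_quotes.extend(span.string for span in quote_spans)
    page += 1

print("Total quotes scraped:", len(all_quotes))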
Try it yourself
The Quotes to Scrape website displays the top ten tags on the right side. Can you scrape all the URLs for these tags?
import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')

# don't remove the list, just append to it
top_ten_tags_URLs = []

# TODO
# don't forget to join the urls with the base_url

print(top_ten_tags_URLs)
Other useful functions
Some other functions can be used in more complex scenarios as follows:
find_parent() / find_parents()

find_next_sibling() / find_next_siblings()

find_previous_sibling() / find_previous_siblings()

find_next() / find_all_next()

find_previous() / find_all_previous()
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# retrieve the "Top ten tags" text
first_tag = soup.find("span", {"class": "tag-item"})
print(first_tag.find_previous_sibling().string)

# extract the "by" word + author name in first quote
first_quote_span = soup.find("div", {"class": "quote"}).find("span", {"class": "text"})
by_word = first_quote_span.find_next_sibling().find_next(string=True)
author_name = soup.find("small", {"class": "author"}).string
print(by_word + author_name)
Lines 8–9: We want to extract the "Top ten tags" text. One way to do it is to get the first tag item, Love, and then find its previous sibling using the function find_previous_sibling(), which will return the <h2> tag that holds the text.

Lines 12–15: We want to extract the author's name together with the "by" word.

Line 12: First, we get the quote's <span class='text'> element by following its path starting from the <div class='quote'>.

Line 13: The "by" word is the first string inside the <span> element that is the next sibling of <span class='text'>. Thus, we get that sibling using find_next_sibling() and then call find_next() to return the next element, passing string=True so that strings count as the following elements.
The above example may be more elaborate than the use case requires, but it demonstrates how these functions can be combined to extract any desired information from the page.
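The example above leaves out find_parent()/find_parents() and find_all_next(). The following sketch exercises them on the same page (again relying on the structure we inspected earlier):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# find_parent(): climb from an author's <small> tag back up to its quote <div>
author_tag = soup.find("small", {"class": "author"})
quote_div = author_tag.find_parent("div", {"class": "quote"})
print(quote_div["class"])  # ['quote']

# find_all_next(): every <a> tag that appears after the first quote's text
first_quote_span = soup.find("span", {"class": "text"})
print(len(first_quote_span.find_all_next("a")))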
Conclusion
This lesson covered searching and navigating the DOM structure and scraping website information. With this knowledge, it is possible to retrieve the desired data from any website by making appropriate requests and utilizing the functions provided by the Beautiful Soup library.