Web Scraping with Beautiful Soup
Discover the key features and applications of the Beautiful Soup library.
Up to this point, we have acquired the necessary skills to make HTTP requests and retrieve the HTML document from a website. It's time to delve deeper and extract the relevant information from the DOM.
Introduction
Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents. It offers a straightforward and flexible way to navigate and extract data from web pages, making it an indispensable tool for anyone who needs to gather and analyze data from the internet. Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents. Thanks to its user-friendly syntax and robust functionality, it has become a preferred choice for developers and data scientists seeking to extract and process web data efficiently. In this lesson, we will explore the key features and applications of the Beautiful Soup library.
Note: It is recommended to inspect the URLs we will use in this lesson in a separate tab to gain a better understanding of the code paths.
Installation
We can install the Beautiful Soup library in any Python environment by running the command pip install beautifulsoup4.
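To verify the installation, we can parse a tiny document (note that the package installs under the import name bs4):

# bs4 is the import name for the beautifulsoup4 package
from bs4 import BeautifulSoup

# parse a one-tag document to confirm the install works
print(BeautifulSoup("<p>hello</p>", "html.parser").p.string)  # prints: hello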
Usage
Let's briefly look at using it. The prettify() method produces a nicely formatted string of the parse tree, with each tag and each string on its own line:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
Note: To handle the decoding process effectively, it is always better to use .content instead of .text when using Beautiful Soup. .content returns the raw response bytes, letting Beautiful Soup detect the document's encoding itself, while .text has already been decoded using the encoding guessed by requests.
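For illustration, here is a quick look at the types of the two attributes:

import requests

response = requests.get("https://quotes.toscrape.com/")

# .content is raw bytes; Beautiful Soup detects the encoding itself
print(type(response.content))  # <class 'bytes'>

# .text is already a str, decoded with the encoding requests guessed
print(type(response.text))     # <class 'str'>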
Once the document is parsed, the output can be handled as a data structure (tree), and we can access its elements like any other Python object attribute.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

print("Head tag's children: ", list(soup.head.children), "\n")
print("Page title: ", soup.title.string, "\n")
print("Sample quote: ", soup.find_all("span", {"class": "text"})[0].text, "\n")
Attributes
Since the parsed document behaves like a tree of Python objects, it is worth reviewing several significant attributes and the outputs they produce.
.tag: Returns the element with the selected tag. It can be used consecutively (e.g., soup.body.div) to reach a specific tag by following its children.

.contents vs. .children: Children of a tag can be found in the .contents list. Instead of retrieving the list, we may use the .children generator to iterate through a tag's children.

.descendants: Recursively returns all the children and their children (all the sub-HTML trees) of the tag.

.strings vs. .stripped_strings: .strings returns all strings in the HTML document, including whitespace characters and strings nested within tags, while .stripped_strings returns only non-empty strings that contain visible text and strips leading and trailing whitespace from each string.

.parent vs. .parents: .parent returns the immediate parent of the current tag, while .parents returns an iterator that allows iterating over all the parents of the current tag.

.next_sibling vs. .previous_sibling: .next_sibling returns the following sibling tag of the current tag, while .previous_sibling returns the previous sibling tag of the current tag.

.next_element vs. .previous_element: .next_element returns the next element in the parse tree after the current element, while .previous_element returns the previous element in the parse tree before the current element.
# .tag
soup.body.div.div.span = <span class='text'>"The world as.."</span>

# .contents
<div class='quote'>.contents = [<span class='text'>, <span class='tags'>, ...]

# .descendants
<div class='quote'>.descendants = [<span class='text'>, "The world we have created...", <span class='tags'>, <a href=...>, ...]

# .strings
<span class='text'>.strings = [" The world we have created... "]

# .stripped_strings
<span class='text'>.stripped_strings = ["The world we have created..."]

# .parent
<a href='/tag/deep-thoughts/'>.parent = <span class='tags'>...</span>

# .next_sibling
<span class='text'>.next_sibling = <span class='tags'>...</span>

# .previous_sibling
<span class='tags'>.previous_sibling = <span class='text'>...</span>

# .next_element
<a href='/tag/deep-thoughts/'>.next_element = "deep-thoughts"

# .previous_element
<a href='/tag/deep-thoughts/'>.previous_element = <span class='tags'>...</span>
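To make these attributes concrete, here is a small runnable sketch against the same page (the selectors rely on the div.quote structure shown above):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

first_quote = soup.find("div", {"class": "quote"})

# .contents is a list; .children is a generator over the same nodes
print(len(first_quote.contents), len(list(first_quote.children)))

# .stripped_strings drops whitespace-only strings and trims the rest
print(list(first_quote.stripped_strings)[:3])

# .parent goes one level up; .parents iterates up to the document root
print(first_quote.parent.name)
print([p.name for p in first_quote.parents])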
Try it yourself
Explore some of the above attributes using the editor below and Quotes to Scrape:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

first_quote_element = soup.find("div", {"class": "quote"})
print(type(first_quote_element))
Searching the DOM
In Beautiful Soup, find_all() is a method that searches the entire parse tree of an HTML or XML document and returns a list of all the matching elements. It is a powerful method that can be used to search for any element in the document based on its tag name, attributes, values, and other criteria. The find() method returns the first element that matches, while the find_all() method returns all matching elements.
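Beyond a tag name plus an attribute dictionary, find_all() also accepts lists and regular expressions as search criteria. A brief sketch (the heading tags are simply what Quotes to Scrape happens to contain):

import re
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# tag name + attributes: all quote text spans
print(len(soup.find_all("span", {"class": "text"})))

# a list of tag names: all <h1> and <h2> headings
print(len(soup.find_all(["h1", "h2"])))

# a regular expression matched against tag names
print(len(soup.find_all(re.compile("^h[1-6]$"))))

# find() returns only the first match (or None if nothing matches)
print(soup.find("span", {"class": "text"}).text)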
Let's scrape the data from the Quotes to Scrape website using the find_all() method:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# returns all the elements with class="quote"
all_quotes_div_elements = soup.find_all("div", {"class": "quote"})

quotes = []
for div in all_quotes_div_elements:
    # find() will always return the first match
    text_span = div.find("span", {"class": "text"})
    quotes.append(text_span.string)

print(quotes[:5])
Line 8: We first search for all the <div> elements that hold each quote's information.

Lines 11–14: Then we iterate through all of them and, for each one, search for the <span> tag that holds the quote's text. We then extract it using the .string attribute.
Try it yourself
Try doing the same in the code below. Scrape all the authors' names from the first page.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

all_quotes_div_elements = soup.find_all("div", {"class": "quote"})

# don't remove the list, just append to it
authors = []

# TODO
# Ignore the "by" word, just append the string of the tag

print(set(authors))
We have successfully retrieved information from the first page, but our goal is to scrape the entire site. To accomplish this, we need to iterate through all the page URLs and retrieve the quotes from each one.
import requests
from bs4 import BeautifulSoup

# maintain the main URL to use when joining page URLs
base_url = "https://quotes.toscrape.com"
all_quotes = []

def get_quotes(soup):
    """Retrieve the quotes from the soup of the current page."""
    all_quotes_div_elements = soup.find_all("div", {"class": "quote"})
    quotes = []
    for div in all_quotes_div_elements:
        text_span = div.find("span", {"class": "text"})
        quotes.append(text_span.string)
    return quotes

def scrape(url):
    """Request the URL, get the quotes, find the next page info, recurse."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    all_quotes.extend(get_quotes(soup))

    # we got this info after inspecting the next button
    next_page = soup.find("ul", {"class": "pager"}).find("li", {"class": "next"})

    # check if we reached the last page or not
    if next_page:
        # join the main URL with the page sub-URL
        # ex: "https://quotes.toscrape.com" + "/page/2/"
        next_page_url = requests.compat.urljoin(base_url, next_page.a['href'])
        scrape(next_page_url)
    return

scrape(base_url)
print("Total quotes scraped: ", len(all_quotes))
print(all_quotes[:5])
Lines 8–15: In these lines, we define a function get_quotes() that takes the soup object and scrapes all the quotes' text using the code we built above.

Line 24: We then inspect the next page button and get that element by specifying its path in the DOM.

Line 27: The last page won't have a next page element, so we check whether the next_page variable holds an element or a None value.

Line 30: We extract the next page URL from the element. However, the URL doesn't contain the domain name, so we use the requests.compat.urljoin() function, which joins two URLs together.

Line 31: Lastly, we call the scrape() function with the next page URL and repeat the whole process until we reach the last page of the site.
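To see what urljoin() does in isolation:

from requests.compat import urljoin

# joins a base URL with a relative path, normalizing the slashes
print(urljoin("https://quotes.toscrape.com", "/page/2/"))
# https://quotes.toscrape.com/page/2/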
There is an easier way to do the task above: using a simple loop, we can build the list of page URLs and request each one in turn, as sketched below. However, implementing the recursive method helps us understand different approaches that can be useful in more complex scenarios.
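Here is a minimal sketch of that loop-based approach. It assumes the /page/N/ URL pattern used by Quotes to Scrape and stops at the first page that contains no quotes:

import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"
all_quotes = []

page = 1
while True:
    response = requests.get(f"{base_url}/page/{page}/")
    soup = BeautifulSoup(response.content, 'html.parser')
    quote_spans = soup.find_all("span", {"class": "text"})
    # an empty result means we have moved past the last page
    if not quote_spans:
        break
    all_quotes.extend(span.string for span in quote_spans)
    page += 1

print("Total quotes scraped:", len(all_quotes))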
Try it yourself
The Quotes to Scrape website displays the top ten tags on the right side. Can you scrape all the URLs for these tags?
import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')

# don't remove the list, just append to it
top_ten_tags_URLs = []

# TODO
# don't forget to join the urls with the base_url

print(top_ten_tags_URLs)
Other useful functions
Some other functions can be used in more complex scenarios as follows:
find_parent() / find_parents()

find_next_sibling() / find_next_siblings()

find_previous_sibling() / find_previous_siblings()

find_next() / find_all_next()

find_previous() / find_all_previous()
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# retrieve the "Top ten tags" text
first_tag = soup.find("span", {"class": "tag-item"})
print(first_tag.find_previous_sibling().string)

# extract the "by" word + author name in first quote
first_quote_span = soup.find("div", {"class": "quote"}).find("span", {"class": "text"})
by_word = first_quote_span.find_next_sibling().find_next(string=True)
author_name = soup.find("small", {"class": "author"}).string
print(by_word + author_name)
Lines 8–9: We want to extract the "Top ten tags" text. One way to do it is to get the first tag item, Love, and then find its previous sibling using the function find_previous_sibling(), which will return the <h2> tag that holds the text.

Lines 12–15: We want to extract the author's name together with the "by" word.

Line 12: First, we get the quote's <span class='text'> element by following its path starting from the <div class='quote'>.

Line 13: The "by" word is the first string inside the <span> element that is the next sibling of <span class='text'>. Thus, we get that sibling using find_next_sibling() and then call find_next() to return the next element, passing string=True so that strings count as the following elements.
The above example may be more elaborate than the use case requires, but it demonstrates how these functions can be combined to extract any desired information from the page.
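The example above leaves out find_parent()/find_parents() and find_all_next(). The following sketch exercises them on the same page (again relying on the structure we inspected earlier):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

# find_parent(): climb from an author's <small> tag back up to its quote <div>
author_tag = soup.find("small", {"class": "author"})
quote_div = author_tag.find_parent("div", {"class": "quote"})
print(quote_div["class"])  # ['quote']

# find_all_next(): every <a> tag that appears after the first quote's text
first_quote_span = soup.find("span", {"class": "text"})
print(len(first_quote_span.find_all_next("a")))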
Conclusion
This lesson covered searching and navigating the DOM structure and scraping website information. With this knowledge, it is possible to retrieve the desired data from any website by making appropriate requests and utilizing the functions provided by the Beautiful Soup library.