Introduction to lxml

Learn how to scrape and navigate the HTML DOM using XPath.

Now that we have covered XPath, it's time to put it into practice by extracting data from static and dynamic websites.

lxml

Beautiful Soup has no built-in support for XPath, but we can use another library to get it. lxml is a Python library for parsing XML that also supports HTML. It lets us use both XPath and CSS selectors (the latter through the separate cssselect package), which makes it a versatile tool for data extraction and a solid alternative to Beautiful Soup.

Usage

Let's take a look at how we can use it.

import requests
from lxml import html
response = requests.get("https://books.toscrape.com/")
DOM = html.fromstring(response.content)
print(DOM.cssselect("title")[0].text)
print(DOM.xpath("//h3/a/@title")[:10])

The code is quite similar to Beautiful Soup parsing code:

  • Lines 3–4: We begin by requesting the URL, and then we pass the content to the parser, in this case, lxml.html.

    • This step constructs the familiar DOM tree, allowing us to navigate it as we normally would.

  • Line 5: One advantage is the ability to use CSS selectors through the .cssselect() method, which requires the cssselect package to be installed.

    • This method returns a list of matching elements, and we can access the text of the elements using the .text attribute.

  • Line 6: However, the main advantage, and the primary reason for using this library, is the support for XPath.

    • We can now use the .xpath() method, passing the path of the elements or values we wish to retrieve.
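
To make the return types of .xpath() concrete, here is a minimal offline sketch. The HTML snippet is made up for illustration (it only mimics the structure of a book listing), so no network request is needed.

```python
from lxml import html

# Hypothetical snippet that mimics a book listing page.
SAMPLE = """
<html><head><title>Demo shop</title></head>
<body>
  <article class="product_pod"><h3><a href="a.html" title="Book A">Book A</a></h3></article>
  <article class="product_pod"><h3><a href="b.html" title="Book B">Book B</a></h3></article>
</body></html>
"""

dom = html.fromstring(SAMPLE)

# A path that ends on an element returns a list of element objects.
links = dom.xpath("//h3/a")
print([a.tag for a in links])

# A path that ends on an attribute returns a list of strings.
titles = dom.xpath("//h3/a/@title")
print(titles)

# A path that ends on text() also returns a list of strings.
page_title = dom.xpath("//title/text()")
print(page_title)
```

Whether we get elements or strings back depends only on the last step of the path, which is why the same method can serve both navigation and value extraction.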

Scraping with XPath

Let's scrape all the information from Books to Scrape using XPath and lxml.

import json
import requests
from requests.compat import urljoin
from lxml import html

base_url = "https://books.toscrape.com/"
data = []


def scrape(url):
    response = requests.get(url)
    DOM = html.fromstring(response.content)

    titles_xpath = "//article[@class='product_pod']/h3/a/@title"
    images_xpath = "//article[@class='product_pod']/div[@class='image_container']/a/img/@src"
    rates_xpath = "//article[@class='product_pod']/p/@class"
    prices_xpath = "//article[@class='product_pod']/div[@class='product_price']/p[1]/text()"

    titles = DOM.xpath(titles_xpath)
    images = [urljoin(base_url, x) for x in DOM.xpath(images_xpath)]
    rates = [x.split()[1] for x in DOM.xpath(rates_xpath)]
    prices = DOM.xpath(prices_xpath)

    for title, image, rate, price in zip(titles, images, rates, prices):
        d = dict()
        d["title"] = title.strip()
        d["image"] = image
        d["rate"] = rate.strip()
        d["price"] = price.strip()
        data.append(d)
    next_page = DOM.xpath("//ul[@class='pager']/li[@class='next']/a/@href")
    if next_page:
        # Keep only the file name of the link and rebuild the /catalogue/ path.
        page_part = '/catalogue/' + next_page[0].split('/')[-1]
        next_page_url = urljoin(base_url, page_part)
        scrape(next_page_url)


scrape(base_url)
if data:
    print("Total number of scraped books:", len(data))
    # Assumes the output/ directory already exists.
    with open('output/data.json', 'w') as f:
        json.dump(data, f)
Sample output of the data.json file

This is the same as what we did using CSS selectors; the only difference is in the patterns, or paths, we are using.

  • Lines 11–12: We first request the URL, then pass the response content to the html.fromstring() function to construct the DOM tree.

  • Line 14: To retrieve the titles, we look for <article> elements with the product_pod class anywhere in the DOM, follow their direct <h3> child, and get the @title attribute of the <a> tag inside it.

  • Line 15: We do the same with images by following the path to the <img> tag and getting its @src attribute.

  • Line 16: For the ratings, we get the whole @class attribute, which holds the rating as its second word. Since we can't split every value inside a single XPath query, we extract the rating word later in Python.

  • Line 17: Lastly, we get the price by following the path, selecting the first <p> tag under the product_price <div> with a positional predicate ([1]), and taking its text().

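  • Lines 19–22: We evaluate the paths and post-process the results: we fix the image URLs by joining them with the base URL, and we take the second word of each class attribute as the rating. The text values are stripped when we build each record.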

  • Line 31: We get the next page URL by following the path to the next link and getting its @href attribute.
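
As a side note on the pagination step, the relative @href can also be resolved against the URL of the page it was found on, which avoids rebuilding the /catalogue/ prefix by hand. The sketch below uses urljoin from the standard library (requests.compat.urljoin is the same function on Python 3). The helper name next_page_url and the two href values are assumptions for illustration, modeled on the two link forms the code above has to handle.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a relative next-page link against the page it came from."""
    return urljoin(current_url, href)


# Assumed href values: one as found on the home page, one as found on a catalogue page.
first = next_page_url("https://books.toscrape.com/", "catalogue/page-2.html")
second = next_page_url(first, "page-3.html")

print(first)
print(second)
```

With this approach, scrape() would pass its own url argument as the base instead of base_url.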

We can see that using the attribute axis and predicates helps reduce the amount of post-processing we need to do for each element, since the selection is applied to all the matched elements before the result is returned.
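
We can push even more of the work into XPath by evaluating relative paths on each <article> element and using the XPath 1.0 string functions string(), substring-after(), and normalize-space(). These functions only look at the first node of a node set, so they are applied per element here rather than across all books at once. This is a sketch of an alternative to the zip()-based approach, not a replacement for it, and the HTML is a made-up snippet with assumed values.

```python
from lxml import html

# Hypothetical snippet that mimics two book entries.
SAMPLE = """
<html><body>
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a title="Book A">Book A</a></h3>
  <div class="product_price"><p class="price_color"> 51.77 </p></div>
</article>
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a title="Book B">Book B</a></h3>
  <div class="product_price"><p class="price_color"> 53.74 </p></div>
</article>
</body></html>
"""

dom = html.fromstring(SAMPLE)
books = []
for article in dom.xpath("//article[@class='product_pod']"):
    books.append({
        # string() converts the attribute node to a plain string.
        "title": article.xpath("string(h3/a/@title)"),
        # substring-after() keeps what follows the first space in the class value.
        "rate": article.xpath("substring-after(p/@class, ' ')"),
        # normalize-space() trims the surrounding whitespace of the price text.
        "price": article.xpath("normalize-space(div[@class='product_price']/p[1])"),
    })

print(books)
```

Because every value is read from the same <article>, the fields of a record cannot drift out of alignment if one book is missing a value.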

Using XPath with Selenium

We don't need any additional dependencies to use XPath with Selenium since it supports XPath out of the box. The only change we need to make is to pass By.XPATH as the locator strategy instead of By.CSS_SELECTOR.

Scraping Yahoo using XPath

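Let's rework our Yahoo example by scraping the first rendered stock market news from Yahoo Finance. The snippet assumes the Selenium imports and the options object are already set up as in that example.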

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://finance.yahoo.com/topic/stock-market-news/")


try:
    news = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@class="Py(14px) Pos(r)"]')))
except TimeoutException:
    raise TimeoutException("Elements are not loaded")

print("len of news: ", len(news))

data = []
# Extract the title and link from each news element.
for n in news:
    title = n.find_element(By.XPATH,
                           ".//div//h3").text

    link = n.find_element(By.XPATH,
                          ".//div//h3/a").get_attribute("href")

    d = {'title': title, 'link': link}
    data.append(d)

print("len of scraped data: ", len(data))
print("sample: ", data[0])

# We are using this only for demonstration purposes.
time.sleep(2)

# Close the browser and end the WebDriver session.
driver.quit()
Scraping first rendered posts from Yahoo stock market news

Note: The output tab will only show the website for the 2 seconds we defined with time.sleep(), then the result will be printed in the terminal tab.

Again, the code is not very different, except that we now use XPath rather than CSS selectors.

Note: We don't end the paths with an attribute or text() step because find_element() expects the path to resolve to an element, not an attribute value or text. We read those afterwards with .text and .get_attribute().

  • Line 7: We wait for the news elements by looking for <div> elements whose class attribute is exactly Py(14px) Pos(r).

  • Line 16: We get the news title from the <h3> element nested under a <div> inside the current news element, using the relative XPath .//div//h3.

    • The leading . is necessary to anchor the path at the current element. Unlike CSS selectors, an XPath that starts with // searches the whole document again, not just the element it is called on.

  • Line 20: Lastly, we get the link by extending the same XPath to the <a> tag and reading its href attribute.
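
The role of the leading dot is not specific to Selenium. The minimal offline sketch below uses lxml, chosen here only because it needs no browser, to show the same behavior on a made-up HTML snippet.

```python
from lxml import html

# Hypothetical snippet with two news items.
SAMPLE = """
<html><body>
  <div class="news"><h3>First headline</h3></div>
  <div class="news"><h3>Second headline</h3></div>
</body></html>
"""

dom = html.fromstring(SAMPLE)
second_item = dom.xpath("//div[@class='news']")[1]

# Without the dot, the search starts again at the document root.
from_root = second_item.xpath("//h3/text()")

# With the dot, the search is limited to descendants of this element.
from_element = second_item.xpath(".//h3/text()")

print(from_root)
print(from_element)
```

Even though both queries are called on the second news item, only the one starting with . stays inside it.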

Conclusion

We have covered using XPath to scrape static websites with lxml, as an alternative to Beautiful Soup, and dynamic websites with Selenium. From now on, we will rely more on XPath, whose syntax is often more concise and expressive, especially in complex scenarios.
