Introduction to lxml
Learn how to scrape and navigate the HTML DOM using XPath.
Now that we have covered XPath, it's time to put that knowledge into practice and use it to extract data from static and dynamic websites.
lxml
Although Beautiful Soup alone does not have built-in support for XPath, we can leverage another library to harness the power of XPath. lxml is a highly valuable Python library for web scraping. While its primary focus is parsing XML, it also offers support for HTML. Notably, lxml allows us to utilize both XPath and CSS selectors, making it a versatile tool for data extraction. As a result, it serves as an excellent alternative to Beautiful Soup.
Usage
Let's take a look at how we can use it.
import requests
from lxml import html

response = requests.get("https://books.toscrape.com/")
DOM = html.fromstring(response.content)

print(DOM.cssselect("title")[0].text)

print(DOM.xpath("//h3/a/@title")[:10])
The code is quite similar to the Beautiful Soup parsing code:

Lines 4–5: We begin by requesting the URL, and then we pass the content to the parser, in this case, lxml.html. This step constructs the familiar DOM tree, allowing us to navigate it as we normally would.

Line 7: One advantage is the ability to employ CSS selectors through the .cssselect() method. This method returns a list of matching elements, and we can access the text of an element using its .text attribute.

Line 9: However, the main advantage, and the primary reason for using this library, is the support for XPath. We can now call the .xpath() method with the path of the elements we wish to retrieve (see the short comparison below).
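To make the comparison concrete, here is a minimal sketch that reaches the same elements on the Books to Scrape page with a CSS selector and with an XPath. Note that the .cssselect() method relies on the separate cssselect package being installed alongside lxml; the selectors themselves are only illustrative.

import requests
from lxml import html

response = requests.get("https://books.toscrape.com/")
DOM = html.fromstring(response.content)

# Both calls return the same <a> elements inside <h3> headings.
links_css = DOM.cssselect("h3 > a")
links_xpath = DOM.xpath("//h3/a")

print(links_css[0].get("title"))      # attribute read from the element in Python
print(DOM.xpath("//h3/a/@title")[0])  # the same attribute selected directly by the XPath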
Scraping with XPath
Let's scrape all the information from Books to Scrape using XPath and lxml.
import json
import requests
from requests.compat import urljoin
from lxml import html

base_url = "https://books.toscrape.com/"
data = []


def scrape(url):
    response = requests.get(url)
    DOM = html.fromstring(response.content)

    titles_xpath = "//article[@class='product_pod']/h3/a/@title"
    images_xpath = "//article[@class='product_pod']/div[@class='image_container']/a/img/@src"
    rates_xpath = "//article[@class='product_pod']/p/@class"
    prices_xpath = "//article[@class='product_pod']/div[@class='product_price']/p[1]/text()"

    titles = DOM.xpath(titles_xpath)
    images = [urljoin(base_url, x) for x in DOM.xpath(images_xpath)]
    rates = [x.split()[1] for x in DOM.xpath(rates_xpath)]
    prices = DOM.xpath(prices_xpath)

    for title, image, rate, price in zip(titles, images, rates, prices):
        d = dict()
        d["title"] = title.strip()
        d["image"] = image
        d["rate"] = rate.strip()
        d["price"] = price.strip()
        data.append(d)
    next_page = DOM.xpath("//ul[@class='pager']/li[@class='next']/a/@href")
    if next_page:
        page_part = '/catalogue/' + next_page[0].split('/')[-1]
        next_page_url = requests.compat.urljoin(base_url, page_part)
        scrape(next_page_url)
    return


scrape(base_url)
if len(data):
    print("Total number of scraped books: ", len(data))
    with open('output/data.json', 'w') as f:
        json.dump(data, f)
This is the same as what we did using CSS selectors; the only difference is in the patterns, or paths, that we use.
Lines 11–12: We first request the URL and then pass the response to html.fromstring to construct the DOM tree.

Line 14: To retrieve the titles, we look for <article> elements anywhere in the DOM, followed by <h3> as a direct child, and then take the @title attribute of the <a> tag.

Line 15: We do the same with the images by following the path down to the <img> tag and taking its @src attribute.

Line 16: For the rates, we take the @class attribute that holds the rating. Since we can't split it directly in the XPath, we post-process it later (see the short example after this list).

Line 17: Lastly, we get the price by following the path and selecting the first <p> tag, which holds the required information, using an order predicate.

Lines 19–22: We run the XPath queries and post-process the extracted values: the image URLs are joined with the base URL and the rating word is split out of the rate class; the remaining text is stripped inside the loop.

Line 31: We get the next page URL by following the path and taking the @href attribute.
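As noted for line 16, the split happens in Python because XPath 1.0, which lxml implements, cannot easily apply a string function to every matched node in a single expression. A tiny illustration, using a made-up class value of the kind the rates XPath returns:

raw_class = "star-rating Three"  # example value of the @class attribute on a rating <p>
rate = raw_class.split()[1]      # drop the "star-rating" part, keep the rating word
print(rate)                      # -> Three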
We can see that using the attribute axis and predicates reduces the amount of post-processing we need to do for each element, since these operations are applied to all the retrieved elements before the result is returned.
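As a small sketch of that difference, here are two equivalent ways to collect the image URLs from the same page; the first pushes the attribute extraction into the XPath itself, while the second does it in Python (the paths are illustrative):

import requests
from lxml import html

DOM = html.fromstring(requests.get("https://books.toscrape.com/").content)

# With the attribute axis, the query already returns plain strings.
srcs = DOM.xpath("//article[@class='product_pod']//img/@src")

# Without it, we get elements back and must read the attribute ourselves.
imgs = DOM.xpath("//article[@class='product_pod']//img")
srcs_manual = [img.get("src") for img in imgs]

print(srcs == srcs_manual)  # True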
Using XPath with Selenium
We won't need to add any external or additional requirements to use XPath with Selenium, since it supports XPath by default. The only change we need to make is to use the By.XPATH locator strategy instead of By.CSS_SELECTOR.
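Here is a minimal sketch of that change, reusing the Books to Scrape page and illustrative selectors rather than the Yahoo ones; both locator strategies find the same element:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://books.toscrape.com/")

# The same element located once with a CSS selector and once with an XPath.
by_css = driver.find_element(By.CSS_SELECTOR, "article.product_pod h3 > a")
by_xpath = driver.find_element(By.XPATH, "//article[@class='product_pod']/h3/a")

print(by_css.get_attribute("title"))
print(by_xpath.get_attribute("title"))

driver.quit()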
Scraping Yahoo using XPath
Let's rework our Yahoo example by scraping the first rendered stock market news items from Yahoo Finance.
# Assumes the Selenium setup from the earlier lessons is already in place:
# webdriver, Service, ChromeDriverManager, WebDriverWait, EC, By, TimeoutException, time, and options.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://finance.yahoo.com/topic/stock-market-news/")

try:
    news = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@class="Py(14px) Pos(r)"]')))
except TimeoutException:
    raise TimeoutException("Elements are not loaded")

print("len of news: ", len(news))

data = []
for n in news:
    title = n.find_element(By.XPATH, ".//div//h3").text
    link = n.find_element(By.XPATH, ".//div//h3/a").get_attribute("href")
    d = {'title': title, 'link': link}
    data.append(d)

print("len of scraped data: ", len(data))
print("sample: ", data[0])

# We are using this only for demonstration purposes.
time.sleep(2)
driver.close()
Note: The output tab will only show the website for the short delay we defined with time.sleep; after that, the result is printed in the terminal tab.
Again, the code is not very different, except that we now use XPath rather than CSS selectors.
Note: We don't use any attribute at the end of the paths because find_element() expects the result of the path to be an element, not a value or text.
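To illustrate the difference, here is a small sketch contrasting the two libraries; the Books to Scrape URL and paths are only examples. In lxml, the XPath itself can end in @href or text() and return strings, while with Selenium's find_element() the path must stop at an element whose value we then read.

import requests
from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

url = "https://books.toscrape.com/"

# lxml: the XPath can return attribute values or text directly.
DOM = html.fromstring(requests.get(url).content)
hrefs = DOM.xpath("//h3/a/@href")  # a list of strings

# Selenium: the XPath must select an element; values are read from it afterwards.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
link = driver.find_element(By.XPATH, "//h3/a")
print(hrefs[0], link.text, link.get_attribute("href"))
driver.quit()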
Line 7: We wait for the news elements by looking for div elements with the Py(14px) Pos(r) class.

Line 16: We get the news titles with the .//div//h3 XPath, which selects the <h3> heading nested inside a div within the current news element. The leading . is necessary to make the search relative to that element; this differs from CSS selectors, and without it the XPath would search the whole DOM again rather than just the element (see the example after this list).
Line 17: Lastly, we get the link by following the same XPath, selecting the <a> tag, and reading its href attribute.
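A short sketch of that relative-search behaviour, using the Books to Scrape homepage only because its markup is stable; the behaviour is the same on the Yahoo page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://books.toscrape.com/")

card = driver.find_elements(By.XPATH, "//article[@class='product_pod']")[0]

# ".//h3/a" starts the search at `card`, so only its own link is matched.
scoped = card.find_elements(By.XPATH, ".//h3/a")

# "//h3/a" ignores `card` and searches the whole document again.
unscoped = card.find_elements(By.XPATH, "//h3/a")

print(len(scoped), len(unscoped))  # typically 1 vs. 20 on this page

driver.quit()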
Conclusion
We have covered using XPath to scrape static and dynamic websites using lxml and Selenium as an alternative to Beautiful Soup. Now, we will rely more on XPath due to its simpler and more efficient syntax, especially when dealing with complex scenarios.