Introduction to XML and XPath
Delve into the realm of XML and XPath.
In this lesson, we'll look at XML, another document format with a tree (DOM) structure, and introduce XPath, a different and often more precise approach to traversing XML and HTML documents.
Introduction to XML
XML (Extensible Markup Language) is another markup language used to organize and structure data in a hierarchical format. It is a versatile tool for storing and transmitting information between different systems, platforms, and applications. In contrast to HTML, XML doesn't define how the data is displayed or interacted with. Layout also matters less in XML: the parser understands the syntax whether the document is written on one line or spread across many indented lines.
<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie category="Drama">
    <title lang="en"> Birdman </title>
    <year> 2014 </year>
    <awards> 4 </awards>
    <nominations> 9 </nominations>
  </movie>
  <movie category="Family">
    <title lang="en"> Inside Out </title>
    <year> 2015 </year>
    <awards> 2 </awards>
    <nominations> 1 </nominations>
  </movie>
</movies>
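As a minimal sketch of how such a document is read programmatically, the snippet below parses the same two movies with Python's standard-library `xml.etree.ElementTree` module. The helper name `movie_to_dict` is hypothetical and only used for illustration; the lesson itself may rely on a different parser.

```python
import xml.etree.ElementTree as ET

# The two-movie document from the lesson, kept as a single string.
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie category="Drama">
    <title lang="en"> Birdman </title>
    <year> 2014 </year>
    <awards> 4 </awards>
    <nominations> 9 </nominations>
  </movie>
  <movie category="Family">
    <title lang="en"> Inside Out </title>
    <year> 2015 </year>
    <awards> 2 </awards>
    <nominations> 1 </nominations>
  </movie>
</movies>"""

root = ET.fromstring(xml_doc)  # root is the <movies> element


def movie_to_dict(movie):
    """Hypothetical helper: flatten one <movie> element into a dict."""
    return {
        "category": movie.get("category"),         # attribute value
        "title": movie.findtext("title").strip(),  # text with padding removed
        "year": int(movie.findtext("year")),       # int() ignores the padding
        "awards": int(movie.findtext("awards")),
        "nominations": int(movie.findtext("nominations")),
    }


# Iterating over an element yields its child elements in document order.
movies = [movie_to_dict(movie) for movie in root]
for movie in movies:
    print(movie)
```

The tags carry the meaning here: nothing in the document says how a movie should be displayed, only what each value is.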
XML vs. HTML
Here are some differences between XML and HTML:
XML | HTML |
XML is primarily used for storing and transporting data. | HTML is primarily used for displaying data and defining how it should look in a browser. |
It doesn't have any predefined tags. Tags can be custom-made to fit specific needs. | The tags are predefined and must follow the correct syntax in HTML. |
The closing tags are mandatory in XML. | The closing tags are recommended but, for some elements, not mandatory in HTML. |
Navigating XML DOM
Navigating XML documents is similar to navigating HTML documents with CSS selectors and Beautiful Soup.
The key difference is that we pass an XML parser (for example, "xml", which relies on the lxml library) to the Beautiful Soup object before we start navigating.
<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie category="Drama">
    <title lang="en"> Birdman </title>
    <year> 2014 </year>
    <awards> 4 </awards>
    <nominations> 9 </nominations>
  </movie>
  <movie category="Drama">
    <title lang="en"> The Imitation Game </title>
    <year> 2014 </year>
    <awards> 8 </awards>
    <nominations> 1 </nominations>
  </movie>
  <movie category="Family">
    <title lang="en"> Inside Out </title>
    <year> 2015 </year>
    <awards> 2 </awards>
    <nominations> 1 </nominations>
  </movie>
</movies>
Note: The XML declaration <?xml version="1.0" encoding="UTF-8"?> in the page.xml file above is also called a prologue. It is optional, but when present it must come first in the document, and it doesn't have a closing tag. The prologue defines the XML version and the encoding method.
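The placement rule for the prologue can be checked with a small experiment. This is a sketch using Python's standard-library parser, and the helper name `parses` is hypothetical; other parsers may report the problem differently.

```python
import xml.etree.ElementTree as ET

declaration = '<?xml version="1.0" encoding="UTF-8"?>'
body = '<movies><movie category="Drama"/></movies>'


def parses(text):
    """Hypothetical helper: True if the standard-library parser accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


with_prologue = parses(declaration + body)              # prologue comes first
without_prologue = parses(body)                         # prologue is optional
prologue_not_first = parses("\n" + declaration + body)  # a newline precedes it

print(with_prologue, without_prologue, prologue_not_first)
```

The third case is rejected because even a single newline before the declaration means it is no longer the first thing in the document.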
Although CSS selectors are commonly associated with styling HTML documents, they can also be used with XML. However, it's important to note that the class attribute, frequently used in HTML for applying styles, is less common in XML.
Introduction to XPath
XPath (XML Path Language) provides a powerful and flexible means of navigating and querying XML documents. With XPath, we can traverse the hierarchical structure of XML documents, accessing elements based on their element names, attributes, relationships, or positions within the document. XPath is not limited to XML alone; it can also be applied to HTML documents through tools such as lxml and Selenium as an alternative to CSS selectors, our primary focus in this learning module. (Beautiful Soup itself does not evaluate XPath expressions.)
XPath syntax
XPath expressions work much like the paths we use in computer file systems. By following a sequence of steps, or a path, we can precisely select the desired node or nodes.
The main structure of XPath is as follows:
Axis | Step | Axis | Step |
ex: |
|
|
|
Let's explore each part in detail by applying it to this example.
<html>
  <body>
    <h1 id="title"> title </h1>
    <div>
      <p class="paragraph"> Paragraph 1 </p>
      <label for="username"> user name </label>
      <input name="username">
      <a href="https://www.uni.com/lecture.pdf"> link </a>
      <p class="paragraph"> Paragraph 2</p>
    </div>
    <ul>
      <li> Item 1 </li>
      <li> Item 2 </li>
      <li> Item 3 </li>
    </ul>
  </body>
</html>
Axes (Selectors)
XPath employs path expressions for node selection within an XML document, achieved by traversing a sequence of steps or paths. The following is a compilation of some of the most valuable path expressions for this purpose:
Expression | Description | Output |
ex: | Selects all the elements with this tag. | |
ex: | Similar to absolute paths. Selects all the elements with this tagname starting from the root. What follows the single slash | |
ex: | Similar to relative paths. Selects all the elements with this tagname that exist anywhere in the DOM. What follows the double slash | |
ex: | Matches any tag, or all the tags if it is used alone. | |
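The path expressions described above can be tried out with a short sketch. It uses Python's standard-library `xml.etree.ElementTree`, which implements only a subset of XPath and expects paths relative to an element (hence the leading `.`); a full XPath engine would also accept absolute paths such as `/html/body/h1`. As an assumption for this sketch, the sample's `<input>` is written as a self-closing tag so that an XML parser accepts it, and the helper name `tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The lesson's sample page; <input> is self-closed for the XML parser.
page = """<html><body>
<h1 id="title"> title </h1>
<div>
  <p class="paragraph"> Paragraph 1 </p>
  <label for="username"> user name </label>
  <input name="username"/>
  <a href="https://www.uni.com/lecture.pdf"> link </a>
  <p class="paragraph"> Paragraph 2 </p>
</div>
<ul><li> Item 1 </li><li> Item 2 </li><li> Item 3 </li></ul>
</body></html>"""

root = ET.fromstring(page)  # root is the <html> element


def tags(elements):
    """Hypothetical helper: list the tag names of the matched elements."""
    return [element.tag for element in elements]


by_tag = tags(root.findall("body"))               # children of <html> named body
single_slash = tags(root.findall("./body/h1"))    # each "/" descends one level
double_slash = tags(root.findall(".//p"))         # "//" searches at any depth
wildcard = tags(root.findall("./body/*"))         # "*" matches any tag

print(by_tag)        # ['body']
print(single_slash)  # ['h1']
print(double_slash)  # ['p', 'p']
print(wildcard)      # ['h1', 'div', 'ul']
```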
Steps (Predicates)
Predicates are conditions, written in square brackets, that we use to find a tag with a specific value or position. We can divide them into four different types:
Order predicates
Attribute predicates
Sibling predicates
Functional predicates
Order predicates
Syntax | Description | Output |
ex: | Selects the element with the specified index. | |
ex: | Selects the last element of this tag. | |
ex: | Selects all the elements that match the required indices. | |
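Here is a brief sketch of order predicates, again using the subset of XPath supported by Python's standard-library `xml.etree.ElementTree`. Note that XPath positions start at 1, not 0. The sample is a trimmed copy of the lesson's page, and the helper name `texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the lesson's sample page.
page = """<html><body>
<div>
  <p class="paragraph"> Paragraph 1 </p>
  <label for="username"> user name </label>
  <p class="paragraph"> Paragraph 2 </p>
</div>
<ul><li> Item 1 </li><li> Item 2 </li><li> Item 3 </li></ul>
</body></html>"""

root = ET.fromstring(page)


def texts(elements):
    """Hypothetical helper: stripped text of each matched element."""
    return [element.text.strip() for element in elements]


first_item = texts(root.findall(".//li[1]"))       # first <li> of its parent
last_item = texts(root.findall(".//li[last()]"))   # last <li> of its parent
second_paragraph = texts(root.findall(".//p[2]"))  # second <p> of its parent

print(first_item)        # ['Item 1']
print(last_item)         # ['Item 3']
print(second_paragraph)  # ['Paragraph 2']
```

The index counts only siblings with the same tag, so `p[2]` is the second `<p>` inside the `<div>` even though a `<label>` sits between the two paragraphs.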
Attribute predicates
Syntax | Description | Output |
ex: | Selects any element with the specified ID. | |
ex: | Selects any element with the specified class. | |
ex: | Selects any element with the attribute value. | |
ex: | Selects any element with an attribute that starts with a specific value. | |
ex: | Selects any element with an attribute that ends with a specific value. | |
ex: | Selects any element with an attribute that contains a specific value. | |
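The following sketch illustrates attribute predicates with Python's standard-library `xml.etree.ElementTree`. That module supports exact attribute matches such as `[@id='title']`, but not XPath functions like `starts-with()` or `contains()`, so the partial matches are emulated here with ordinary Python string methods. This is a swapped-in technique for illustration, not XPath itself. The sample is a trimmed copy of the lesson's page with `<input>` self-closed.

```python
import xml.etree.ElementTree as ET

page = """<html><body>
<h1 id="title"> title </h1>
<div>
  <p class="paragraph"> Paragraph 1 </p>
  <input name="username"/>
  <a href="https://www.uni.com/lecture.pdf"> link </a>
  <p class="paragraph"> Paragraph 2 </p>
</div>
</body></html>"""

root = ET.fromstring(page)

# Exact matches: supported directly by ElementTree's XPath subset.
by_id = root.findall(".//*[@id='title']")            # any tag with this id
by_class = root.findall(".//p[@class='paragraph']")  # <p> with this class
by_name = root.findall(".//input[@name='username']")
has_href = root.findall(".//a[@href]")               # attribute merely present

# Partial matches: emulated in Python instead of XPath functions.
links = root.findall(".//a[@href]")
starts_with = [a for a in links if a.get("href").startswith("https://")]
ends_with = [a for a in links if a.get("href").endswith(".pdf")]
contains = [a for a in links if "uni.com" in a.get("href")]

print(by_id[0].tag, len(by_class), len(by_name), len(has_href))
print(len(starts_with), len(ends_with), len(contains))
```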
Sibling predicates
Syntax | Description | Output |
ex: | Selects the elements that are direct children of the specified tag. | |
ex: | Selects the parent of the current tag, as we do with relative paths. | |
ex: | Selects all the elements that are descendants with the specified tag. Similarly, | |
ex: | Returns the same result as Similarly, | |
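The child, parent, and descendant relationships described above can be sketched with the path syntax that Python's standard-library `xml.etree.ElementTree` supports. The sample is a trimmed copy of the lesson's page; a full XPath engine offers further named axes that this module does not implement.

```python
import xml.etree.ElementTree as ET

page = """<html><body>
<div>
  <p class="paragraph"> Paragraph 1 </p>
  <label for="username"> user name </label>
  <p class="paragraph"> Paragraph 2 </p>
</div>
<ul><li> Item 1 </li><li> Item 2 </li><li> Item 3 </li></ul>
</body></html>"""

root = ET.fromstring(page)

# Direct children: every "/" moves exactly one level down.
direct_children = root.findall("./body/div/p")

# Parent: ".." steps back up, as in a relative file system path.
parent_of_label = root.findall(".//label/..")

# Descendants: "//" matches at any depth below <body>.
descendants = root.findall("./body//li")

print([p.text.strip() for p in direct_children])  # ['Paragraph 1', 'Paragraph 2']
print([e.tag for e in parent_of_label])           # ['div']
print(len(descendants))                           # 3
```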
Functional predicates
Syntax | Description | Output |
ex: | Returns the text of the specified elements. | |
ex: | Returns the element that matches the number specified. | |
ex: | Returns the string value of the element by concatenating the text of all of its descendants. | |
There are about 37 functions we can utilize with XPath.
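Python's standard-library `xml.etree.ElementTree` does not evaluate XPath functions, so the sketch below only mirrors the behaviors described in the table with plain Python: an element's own text, selection by position, and a string value built by concatenating descendant text. With a full XPath engine these would be written as functions inside the expression itself.

```python
import xml.etree.ElementTree as ET

page = """<html><body>
<h1 id="title"> title </h1>
<ul><li> Item 1 </li><li> Item 2 </li><li> Item 3 </li></ul>
</body></html>"""

root = ET.fromstring(page)

# Text of an element: the .text attribute holds the text directly inside it.
heading_text = root.find(".//h1").text.strip()

# Element at a given position (1-based, as in XPath).
items = root.findall(".//li")
second_item = [li for index, li in enumerate(items, start=1) if index == 2]

# String value: concatenate the text of all descendants, then tidy whitespace.
ul = root.find(".//ul")
string_value = " ".join("".join(ul.itertext()).split())

print(heading_text)                 # title
print(second_item[0].text.strip())  # Item 2
print(string_value)                 # Item 1 Item 2 Item 3
```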
Advanced predicates
An important advantage of using XPath predicates is the ability to chain and combine multiple predicates to achieve the desired result. With XPath, a predicate can return another element, which allows us to apply additional predicates to narrow down our selection. This flexibility enables us to construct complex and precise queries to obtain our target data.
Suppose we want to select smartphones with a price greater than $800 and a rating equal to or higher than 4.5 from this HTML sample.
<div>
  <span class="product">Apple iPhone 12</span>
  <span class="price">$999</span>
  <span class="rating">4.5</span>
</div>
<div>
  <span class="product">Samsung Galaxy S21</span>
  <span class="price">$899</span>
  <span class="rating">4.8</span>
</div>
<div>
  <span class="product">Google Pixel 5</span>
  <span class="price">$799</span>
  <span class="rating">4.2</span>
</div>
<div>
  <span class="product">OnePlus 9 Pro</span>
  <span class="price">$899</span>
  <span class="rating">4.7</span>
</div>
XPath solution:
//div[span[@class='price' and number(substring-after(text(), '$')) > 800] and span[@class='rating' and number(text()) >= 4.5]]
//div: This selects all <div> elements in the HTML document, regardless of their position or hierarchy.
span[@class='price' and number(substring-after(text(), '$')) > 800]: This is the first predicate that filters the selected <div> elements based on the price condition.
  span[@class='price']: This selects the <span> element with the class attribute "price" that is a direct child of the current <div> element.
  number(substring-after(text(), '$')) > 800: This condition checks if the numeric value of the price (obtained by removing the '$' symbol) is greater than 800. It does this by using the substring-after() function to extract the text after the '$' symbol and then converting it to a number using the number() function.
and: This is the logical operator that combines the two predicates.
span[@class='rating' and number(text()) >= 4.5]: This is the second predicate that filters the <div> elements further based on the rating condition.
  span[@class='rating']: This selects the <span> element with the class attribute "rating" that is a direct child of the current <div> element.
  number(text()) >= 4.5: This condition checks if the rating value, obtained as text, is equal to or higher than 4.5. It uses the number() function to convert the rating text to a number for comparison.
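To see which products the query keeps, the same logic can be traced step by step in Python. This sketch uses the standard-library `xml.etree.ElementTree`, which does not implement number() or substring-after(), so those two functions are mirrored with plain Python; a full XPath 1.0 engine should be able to evaluate the expression directly. The `<section>` wrapper and the helper name `matches` are assumptions added for this sketch, since an XML parser needs a single root element.

```python
import xml.etree.ElementTree as ET

# The lesson's product sample, wrapped in a hypothetical <section> root.
products = """<section>
<div><span class="product">Apple iPhone 12</span><span class="price">$999</span><span class="rating">4.5</span></div>
<div><span class="product">Samsung Galaxy S21</span><span class="price">$899</span><span class="rating">4.8</span></div>
<div><span class="product">Google Pixel 5</span><span class="price">$799</span><span class="rating">4.2</span></div>
<div><span class="product">OnePlus 9 Pro</span><span class="price">$899</span><span class="rating">4.7</span></div>
</section>"""

root = ET.fromstring(products)


def matches(div):
    """Hypothetical helper mirroring the two predicates joined by 'and'."""
    price_text = div.findtext("span[@class='price']")    # span[@class='price']
    rating_text = div.findtext("span[@class='rating']")  # span[@class='rating']
    if price_text is None or rating_text is None:
        return False
    # substring-after(text(), '$') followed by number()
    price = float(price_text.partition("$")[2])
    # number(text())
    rating = float(rating_text)
    return price > 800 and rating >= 4.5


selected = [
    div.findtext("span[@class='product']")
    for div in root.iter("div")  # plays the role of //div
    if matches(div)
]
print(selected)  # ['Apple iPhone 12', 'Samsung Galaxy S21', 'OnePlus 9 Pro']
```

The Google Pixel 5 is dropped because it fails both conditions: its price is not above 800 and its rating is below 4.5.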
XPath vs. CSS selectors
CSS selectors primarily rely on the syntax and patterns defined by CSS specifications. They are commonly used to select elements based on their attributes, classes, IDs, or hierarchical relationships within the document's structure. On the other hand, XPath (XML Path Language) provides a more extensive and flexible querying capability. XPath is not limited to HTML documents but can also be used with XML and other structured data formats. With XPath, we can navigate the DOM tree using a combination of element names, attributes, relationships, and predicates. This allows for highly specific and granular element selection, even when dealing with complex document structures.
Here are some common differences between XPath and CSS selectors:
Syntax | CSS | XPath |
All elements | | |
All | | |
All child | | |
All direct | | |
Element by | | |
Element by | | |
Element by | | |
Attribute contains specific value | | |
Elements with specific order | | |
Next element | | |
Previous element | Not possible | |
Note: The choice between XPath and CSS selectors depends on the specific requirements of the web scraping task, the complexity of the target document, and the precision needed to extract the desired data.