In this lesson, we'll look at another document format, XML, and introduce XPath, a different and more expressive approach to traversing XML and HTML documents.

Introduction to XML

XML (Extensible Markup Language) is another markup language used to organize and structure data in a hierarchical format. It is a versatile tool for storing and transmitting information between different systems, platforms, and applications. Unlike HTML, XML says nothing about how data is displayed or interacted with; it only describes the data. Layout such as indentation and line breaks between elements is also less significant in XML, because the parser reads the markup regardless of how it is laid out.

<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie category="Drama">
<title lang="en"> Birdman </title>
<year> 2014 </year>
<awards> 4 </awards>
<nominations> 9 </nominations>
</movie>
<movie category="Family">
<title lang="en"> Inside Out </title>
<year> 2015 </year>
<awards> 2 </awards>
<nominations> 1 </nominations>
</movie>
</movies>

XML vs. HTML

Here are some differences between XML and HTML:

| XML | HTML |
|-----|------|
| XML is primarily used for storing and transporting data. | HTML is primarily used for displaying data and defining how it should look in a browser. |
| It has no predefined tags. Tags can be custom-made to fit specific needs. | The tags are predefined and must follow HTML syntax. |
| Closing tags are mandatory in XML. | Closing tags are recommended but not always mandatory in HTML. |

Navigating XML DOM

Navigating XML documents is similar to navigating HTML documents with CSS selectors and Beautiful Soup.

Figure: XML DOM

The key difference lies in passing the appropriate XML parser to the Beautiful Soup object before initiating navigation.

page.xml
<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie category="Drama">
<title lang="en"> Birdman </title>
<year> 2014 </year>
<awards> 4 </awards>
<nominations> 9 </nominations>
</movie>
<movie category="Drama">
<title lang="en"> The Imitation Game </title>
<year> 2014 </year>
<awards> 8 </awards>
<nominations> 1 </nominations>
</movie>
<movie category="Family">
<title lang="en"> Inside Out </title>
<year> 2015 </year>
<awards> 2 </awards>
<nominations> 1 </nominations>
</movie>
</movies>

Note: The XML declaration <?xml version="1.0" encoding="UTF-8"?> in the above page.xml file is also called a prolog. It is optional, but if present it must come first in the document, and it doesn't have a closing tag. The prolog defines the XML version and the character encoding.
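The contents of main.py are not shown above. As a rough sketch of the same navigation, the example below swaps in Python's standard-library xml.etree.ElementTree instead of Beautiful Soup so that it runs without extra packages. With Beautiful Soup, the equivalent step would be passing an XML parser name such as "xml" (which relies on lxml) to the constructor. The helper name titles_by_category is hypothetical and only used for illustration.

```python
import xml.etree.ElementTree as ET

# The page.xml content from above, inlined so the sketch is self-contained.
PAGE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<movies>
<movie category="Drama">
<title lang="en"> Birdman </title>
<year> 2014 </year>
<awards> 4 </awards>
<nominations> 9 </nominations>
</movie>
<movie category="Drama">
<title lang="en"> The Imitation Game </title>
<year> 2014 </year>
<awards> 8 </awards>
<nominations> 1 </nominations>
</movie>
<movie category="Family">
<title lang="en"> Inside Out </title>
<year> 2015 </year>
<awards> 2 </awards>
<nominations> 1 </nominations>
</movie>
</movies>
"""


def titles_by_category(xml_bytes, category):
    """Return the stripped titles of all movies in the given category.

    Hypothetical helper, not part of the lesson's main.py.
    """
    root = ET.fromstring(xml_bytes)  # root is the <movies> element
    return [
        movie.findtext("title").strip()
        for movie in root.findall("movie")      # direct <movie> children
        if movie.get("category") == category    # read the category attribute
    ]


drama_titles = titles_by_category(PAGE_XML, "Drama")
print(drama_titles)
```

The text inside each tag keeps its surrounding spaces, so the sketch strips it before returning.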

Although CSS selectors are commonly associated with styling HTML documents, they can also be utilized with XML. However, it's important to note that the class attribute, frequently used in HTML for applying styles, may not be as prevalent in XML.

Introduction to XPath

XPath (XML Path Language) provides a powerful and flexible means of navigating and querying XML documents. With XPath, we can traverse the hierarchical structure of a document and select elements by name, attribute, relationship, or position. XPath is not limited to XML; it can also be applied to HTML, for example with lxml or Selenium, as an alternative to CSS selectors, our primary focus in this learning module. Note that Beautiful Soup itself does not evaluate XPath expressions; a library such as lxml is needed for that.

XPath syntax

XPath expressions work much like the paths we use in a computer's file system. By following a sequence of steps, or a path, we can precisely select the desired node or nodes.

Figure: XML as a system file path

The main structure of XPath is as follows:

| Axis | Step | Axis | Step |
|------|------|------|------|
| ex: // | li | / | [1] |

Let's explore each part in detail using this example.

<html>
<body>
<h1 id="title"> title </h1>
<div>
<p class="paragraph"> Paragraph 1 </p>
<label for="username"> user name </label>
<input name="username">
<a href="https://www.uni.com/lecture.pdf"> link </a>
<p class="paragraph"> Paragraph 2</p>
</div>
<ul>
<li> Item 1 </li>
<li> Item 2 </li>
<li> Item 3 </li>
</ul>
</body>
</html>

Axes (Selectors)

XPath uses path expressions to select nodes in a document by following a sequence of steps. The most useful path expressions are listed below:

| Expression | Description | Output |
|------------|-------------|--------|
| tagname (ex: li) | Selects all the elements with this tag. | <li>Item 1</li> <li>Item 2</li> <li>Item 3</li> |
| /tagname (ex: /html/body/div/p) | Works like an absolute path: selection starts from the root, and whatever follows a single slash / must be a direct child of the preceding tag. | <p class='paragraph'> Paragraph 1</p> <p class='paragraph'> Paragraph 2</p> |
| //tagname (ex: //p) | Works like a relative path: selects all the elements with this tag name anywhere in the DOM. Whatever follows a double slash // can be at any depth below the preceding tag. | <p class='paragraph'> Paragraph 1</p> <p class='paragraph'> Paragraph 2</p> |
| * (ex: //ul/*) | Matches any tag, or all the tags if it is used alone. | <li>Item 1</li> <li>Item 2</li> <li>Item 3</li> |
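The path expressions above can be tried out with a small sketch. It assumes Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPath and always evaluates paths relative to the element it is called on, so /html/body/div/p becomes ./body/div/p and //p becomes .//p. The sample page is adjusted slightly (the <input> tag is self-closed) because ElementTree only parses well-formed XML. The helper name texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample page. <input> is self-closed here, which is
# an adjustment to the original HTML so that an XML parser accepts it.
SAMPLE = """<html>
<body>
<h1 id="title"> title </h1>
<div>
<p class="paragraph"> Paragraph 1 </p>
<label for="username"> user name </label>
<input name="username"/>
<a href="https://www.uni.com/lecture.pdf"> link </a>
<p class="paragraph"> Paragraph 2</p>
</div>
<ul>
<li> Item 1 </li>
<li> Item 2 </li>
<li> Item 3 </li>
</ul>
</body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element


def texts(path):
    """Return the stripped text of every element matched by the path."""
    return [el.text.strip() for el in root.findall(path)]


# /html/body/div/p: every step must be a direct child of the previous one.
absolute_like = texts("./body/div/p")

# //p: <p> elements at any depth.
anywhere = texts(".//p")

# //ul/*: every direct child of <ul>, whatever its tag.
ul_children = texts(".//ul/*")

print(absolute_like, anywhere, ul_children)
```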

Steps (Predicates)

Predicates are conditions we use to find a tag with a specific value or order. We can divide them into four types:

  • Order predicates

  • Attribute predicates

  • Sibling predicates

  • Functional predicates

Order predicates

| Syntax | Description | Output |
|--------|-------------|--------|
| [order] (ex: //ul/li[2]) | Selects the element at the specified position. Positions start at 1, not 0. | <li>Item 2</li> |
| [last()] (ex: //ul/li[last()]) | Selects the last element with this tag. | <li>Item 3</li> |
| [position(){=, >, <}index] (ex: //ul/li[position()<3]) | Selects all the elements whose position satisfies the comparison. | <li>Item 1</li> <li>Item 2</li> |
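As a quick check of the order predicates, here is a sketch that again assumes the standard-library xml.etree.ElementTree. Its XPath subset supports [2] and [last()] but not position() comparisons, so the last case is emulated with a Python slice rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Only the <ul> fragment of the sample page is needed here.
ul = ET.fromstring(
    "<ul><li> Item 1 </li><li> Item 2 </li><li> Item 3 </li></ul>"
)

# //ul/li[2]: positions are 1-based, so this is the second <li>.
second = [li.text.strip() for li in ul.findall("./li[2]")]

# //ul/li[last()]: the last <li> child.
last = [li.text.strip() for li in ul.findall("./li[last()]")]

# //ul/li[position()<3]: not supported by ElementTree, so the first two
# matches are taken with a slice instead (an emulation, not XPath).
first_two = [li.text.strip() for li in ul.findall("./li")[:2]]

print(second, last, first_two)
```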

Attribute predicates

| Syntax | Description | Output |
|--------|-------------|--------|
| [@id='id'] (ex: //*[@id='title']) | Selects any element with the specified ID. | <h1 id="title">title</h1> |
| [@class='class'] (ex: //p[@class='paragraph']) | Selects any element with the specified class. | <p class="paragraph">Paragraph 1</p> <p class="paragraph">Paragraph 2</p> |
| [@attribute='value'] (ex: //input[@name='username']) | Selects any element whose attribute has the specified value. | <input name="username"> |
| [starts-with(@attribute, 'value')] (ex: //a[starts-with(@href, 'https')]) | Selects any element with an attribute that starts with a specific value. | <a href="https://www.uni.com/lecture.pdf">link</a> |
| [ends-with(@attribute, 'value')] (ex: //a[ends-with(@href, '.pdf')]) | Selects any element with an attribute that ends with a specific value. This function was introduced in XPath 2.0 and is not available in XPath 1.0 engines. | <a href="https://www.uni.com/lecture.pdf">link</a> |
| [contains(@attribute, 'value')] (ex: //*[contains(@for, 'username')]) | Selects any element with an attribute that contains a specific value. | <label for="username">user name</label> |
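The attribute predicates can be sketched the same way, assuming the standard-library xml.etree.ElementTree. It handles exact attribute matches such as [@id='title'], but it does not implement starts-with(), ends-with(), or contains(), so those are emulated below with Python string methods. A full XPath engine would evaluate them directly.

```python
import xml.etree.ElementTree as ET

# Fragment of the sample page; <input> is self-closed to keep it well-formed.
BODY = """<body>
<h1 id="title"> title </h1>
<div>
<p class="paragraph"> Paragraph 1 </p>
<label for="username"> user name </label>
<input name="username"/>
<a href="https://www.uni.com/lecture.pdf"> link </a>
<p class="paragraph"> Paragraph 2</p>
</div>
</body>"""
body = ET.fromstring(BODY)

# //*[@id='title']: any element whose id is exactly 'title'.
by_id = [el.tag for el in body.findall(".//*[@id='title']")]

# //p[@class='paragraph']: <p> elements with that exact class value.
by_class = [el.text.strip() for el in body.findall(".//p[@class='paragraph']")]

# //input[@name='username']
by_name = [el.tag for el in body.findall(".//input[@name='username']")]

# starts-with(@href, 'https') and ends-with(@href, '.pdf'), emulated in
# Python because ElementTree has no XPath string functions.
links = body.findall(".//a[@href]")
https_links = [a.get("href") for a in links if a.get("href").startswith("https")]
pdf_links = [a.get("href") for a in links if a.get("href").endswith(".pdf")]

print(by_id, by_class, by_name, https_links, pdf_links)
```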

Sibling predicates

| Syntax | Description | Output |
|--------|-------------|--------|
| child::tag (ex: //div/child::p) | Selects the elements with this tag that are direct children of the current tag. | <p class="paragraph">Paragraph 1</p> <p class="paragraph">Paragraph 2</p> |
| .. (ex: //div//p/../input) | Selects the parent of the current tag, as we do with relative file paths. | <input name="username"> |
| descendant::tag (ex: //div/descendant::*) | Selects all the descendants of the current tag that have the specified tag. The ancestor, following, following-sibling, preceding, and preceding-sibling axes work similarly. | <p class="paragraph"> Paragraph 1 </p> <label for="username"> user name </label> <input name="username"> <a href="https://www.uni.com/lecture.pdf"> link </a> <p class="paragraph"> Paragraph 2</p> |
| descendant-or-self::tag (ex: //div/descendant-or-self::a) | Works like descendant, but the current tag itself is also considered. The ancestor-or-self axis works similarly. | <a href="https://www.uni.com/lecture.pdf"> link </a> |
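Here is a small sketch of these relationships, assuming the standard-library xml.etree.ElementTree. It does not accept named axes such as child:: or descendant::, so the sketch uses the abbreviated forms it does support: / for children, // for descendants, and .. for the parent.

```python
import xml.etree.ElementTree as ET

# Fragment of the sample page; <input> is self-closed to keep it well-formed.
BODY = """<body>
<div>
<p class="paragraph"> Paragraph 1 </p>
<label for="username"> user name </label>
<input name="username"/>
<a href="https://www.uni.com/lecture.pdf"> link </a>
<p class="paragraph"> Paragraph 2</p>
</div>
</body>"""
body = ET.fromstring(BODY)

# //div/child::p, written in abbreviated form as //div/p.
child_paragraphs = [p.text.strip() for p in body.findall(".//div/p")]

# //div//p/../input: step up from each <p> to its parent, then down to <input>.
via_parent = [el.tag for el in body.findall(".//div//p/../input")]

# //div/descendant::*, written in abbreviated form as //div//*.
descendants = [el.tag for el in body.findall(".//div//*")]

print(child_paragraphs, via_parent, descendants)
```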

Functional predicates

| Syntax | Description | Output |
|--------|-------------|--------|
| text() (ex: //p/text()) | Returns the text of the specified elements. | [Paragraph 1, Paragraph 2] |
| count() (ex: //ul[count(li)> | Returns the elements whose count satisfies the comparison. | <ul> <li>Item 1</li> <li>Item 2</li> <li>Item 3</li> </ul> |
| string() (ex: string(//div)) | Returns the string value of the element by concatenating the text of all its descendants. | Paragraph 1 user name link Paragraph 2 |

XPath 1.0 defines 27 core functions, and later versions of the language add many more.
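To make the three functions above concrete, the sketch below reproduces what they compute using plain Python on top of the standard-library xml.etree.ElementTree, which does not implement XPath functions itself. The threshold MIN_ITEMS is an arbitrary value picked for illustration, since the count() example above is cut off.

```python
import xml.etree.ElementTree as ET

# Fragment of the sample page; <input> is self-closed to keep it well-formed.
BODY = """<body>
<div>
<p class="paragraph"> Paragraph 1 </p>
<label for="username"> user name </label>
<input name="username"/>
<a href="https://www.uni.com/lecture.pdf"> link </a>
<p class="paragraph"> Paragraph 2</p>
</div>
<ul>
<li> Item 1 </li>
<li> Item 2 </li>
<li> Item 3 </li>
</ul>
</body>"""
body = ET.fromstring(BODY)

# //p/text(): the text of each <p>.
p_texts = [p.text.strip() for p in body.findall(".//p")]

# //ul[count(li) > N]: keep lists with more than N items.
MIN_ITEMS = 2  # hypothetical threshold, not taken from the lesson
long_lists = [
    ul for ul in body.iter("ul") if len(ul.findall("li")) > MIN_ITEMS
]

# string(//div): concatenate every text node under the first <div>.
# Whitespace is collapsed afterwards only to make the result easier to read.
div = body.find(".//div")
div_string = " ".join("".join(div.itertext()).split())

print(p_texts, len(long_lists), div_string)
```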

Advanced predicates

An important advantage of using XPath predicates is the ability to chain and combine multiple predicates to achieve the desired result. With XPath, a predicate can return another element, which allows us to apply additional predicates to narrow down our selection. This flexibility enables us to construct complex and precise queries to obtain our target data.

Suppose we want to select smartphones with a price greater than $800 and a rating equal to or higher than 4.5 from this HTML sample.

<div>
<span class="product">Apple iPhone 12</span>
<span class="price">$999</span>
<span class="rating">4.5</span>
</div>
<div>
<span class="product">Samsung Galaxy S21</span>
<span class="price">$899</span>
<span class="rating">4.8</span>
</div>
<div>
<span class="product">Google Pixel 5</span>
<span class="price">$799</span>
<span class="rating">4.2</span>
</div>
<div>
<span class="product">OnePlus 9 Pro</span>
<span class="price">$899</span>
<span class="rating">4.7</span>
</div>

XPath solution:

//div[span[@class='price' and number(substring-after(text(), '$')) > 800] and span[@class='rating' and number(text()) >= 4.5]]
Here is how the expression breaks down:
  • //div: This selects all <div> elements in the HTML document, regardless of their position or hierarchy.

  • span[@class='price' and number(substring-after(text(), '$')) > 800]: This is the first predicate that filters the selected <div> elements based on the price condition.

    • span[@class='price']: This selects the <span> element with the class attribute "price" that is a direct child of the current <div> element.

    • number(substring-after(text(), '$')) > 800: This condition checks if the numeric value of the price (obtained by removing the '$' symbol) is greater than 800. It does this by using the substring-after() function to extract the text after the '$' symbol and then converting it to a number using the number() function.

  • and: This is the logical operator that combines the two predicates.

  • span[@class='rating' and number(text()) >= 4.5]: This is the second predicate that filters the <div> elements further based on the rating condition.

    • span[@class='rating']: This selects the <span> element with the class attribute "rating" that is a direct child of the current <div> element.

    • number(text()) >= 4.5: This condition checks if the rating value, obtained as text, is equal to or higher than 4.5. It uses the number() function to convert the rating text to a number for comparison.
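As a cross-check on what the XPath should select, the same filter can be written in plain Python. This is only a sketch: it uses the standard-library xml.etree.ElementTree, which cannot evaluate number() or substring-after(), so the conversions are done in Python instead. The sample is wrapped in a <root> element (an addition) because an XML parser needs a single root, and the function name select_products is hypothetical.

```python
import xml.etree.ElementTree as ET

# The HTML sample from above, wrapped in <root> so it parses as one document.
PRODUCTS = """<root>
<div>
<span class="product">Apple iPhone 12</span>
<span class="price">$999</span>
<span class="rating">4.5</span>
</div>
<div>
<span class="product">Samsung Galaxy S21</span>
<span class="price">$899</span>
<span class="rating">4.8</span>
</div>
<div>
<span class="product">Google Pixel 5</span>
<span class="price">$799</span>
<span class="rating">4.2</span>
</div>
<div>
<span class="product">OnePlus 9 Pro</span>
<span class="price">$899</span>
<span class="rating">4.7</span>
</div>
</root>"""


def select_products(xml_text, min_price=800, min_rating=4.5):
    """Return product names with price > min_price and rating >= min_rating."""
    root = ET.fromstring(xml_text)
    selected = []
    for div in root.findall("div"):
        price_text = div.findtext("span[@class='price']")
        rating_text = div.findtext("span[@class='rating']")
        # Mirrors number(substring-after(text(), '$')): keep what follows '$'.
        price = float(price_text.partition("$")[2])
        # Mirrors number(text()) for the rating.
        rating = float(rating_text)
        if price > min_price and rating >= min_rating:
            selected.append(div.findtext("span[@class='product']"))
    return selected


matching = select_products(PRODUCTS)
print(matching)
```

The Google Pixel 5 is the only product that fails, since both its price ($799) and its rating (4.2) fall below the thresholds.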

XPath vs. CSS selectors

CSS selectors primarily rely on the syntax and patterns defined by CSS specifications. They are commonly used to select elements based on their attributes, classes, IDs, or hierarchical relationships within the document's structure. On the other hand, XPath (XML Path Language) provides a more extensive and flexible querying capability. XPath is not limited to HTML documents but can also be used with XML and other structured data formats. With XPath, we can navigate the DOM tree using a combination of element names, attributes, relationships, and predicates. This allows for highly specific and granular element selection, even when dealing with complex document structures.

Here are some common differences between XPath and CSS selectors:

| Syntax | CSS | XPath |
|--------|-----|-------|
| All elements | * | //* |
| All div elements | div | //div |
| All p descendants of div | div p | //div//p |
| All li direct children of ul | ul > li | //ul/li |
| Element by id | #id | //*[@id='id'] |
| Element by class | .class | //*[@class='class'] |
| Element by attribute | input[type="submit"] | //input[@type="submit"] |
| Attribute starts with a specific value | a[href^='/'] | //a[starts-with(@href, '/')] |
| Attribute ends with a specific value | a[href$='.pdf'] | //a[ends-with(@href, '.pdf')] |
| Attribute contains a specific value | a[href*='://'] | //a[contains(@href, '://')] |
| First element | ul > li:first-of-type | //ul/li[1] |
| Second element | ul > li:nth-of-type(2) | //ul/li[2] |
| Last element | ul > li:last-of-type | //ul/li[last()] |
| Next element | h1 + ul | //h1/following-sibling::ul[1] |
| Previous element | Not possible | //h1/preceding-sibling::ul[1] |

Note: The choice between XPath and CSS selectors depends on the requirements of the web scraping task, in particular the complexity of the target document and the precision needed to extract the desired data.
