Introduction to XML and XPath

<?xml version="1.0" encoding="UTF-8"?> 
<movies>
  <movie category="Drama">
    <title lang="en"> Birdman </title>
    <year> 2014 </year>
    <awards> 4 </awards>
    <nominations> 9 </nominations>
  </movie>
  <movie category="Family">
    <title lang="en"> Inside Out </title>
    <year> 2015 </year>
    <awards> 2 </awards>
    <nominations> 1 </nominations>
  </movie>
  
</movies>

page.xml

<?xml version="1.0" encoding="UTF-8"?> 
<movies>

  <movie category="Drama">
    <title lang="en"> Birdman </title>
    <year> 2014 </year>
    <awards> 4 </awards>
    <nominations> 9 </nominations>
  </movie>

  <movie category="Drama">
    <title lang="en"> The Imitation Game </title>
    <year> 2014 </year>
    <awards> 8 </awards>
    <nominations> 1 </nominations>
  </movie>

  <movie category="Family">
    <title lang="en"> Inside Out </title>
    <year> 2015 </year>
    <awards> 2 </awards>
    <nominations> 1 </nominations>
  </movie>
  
</movies>

Note: The XML declaration <?xml version="1.0" encoding="UTF-8"?> in the page.xml file above is also called the prolog. It is optional, but if present, it must come first in the document. It has no closing tag. The prolog specifies the XML version and the character encoding.
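
As a rough illustration of how such a file can be read from Python, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree`. The XML is inlined so the snippet is self-contained, and the helper name `movies_in_category` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Contents of page.xml, inlined so the sketch runs on its own.
PAGE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie category="Drama">
    <title lang="en"> Birdman </title>
    <year> 2014 </year>
  </movie>
  <movie category="Drama">
    <title lang="en"> The Imitation Game </title>
    <year> 2014 </year>
  </movie>
  <movie category="Family">
    <title lang="en"> Inside Out </title>
    <year> 2015 </year>
  </movie>
</movies>"""

root = ET.fromstring(PAGE_XML)  # root is the <movies> element


def movies_in_category(root, category):
    """Return the stripped titles of all movies in the given category (hypothetical helper)."""
    titles = []
    for movie in root.findall(f"./movie[@category='{category}']"):
        # The sample pads its text with spaces, so strip them.
        titles.append(movie.findtext("title").strip())
    return titles


print(movies_in_category(root, "Drama"))
```

The prolog needs no special handling here: the parser consumes it and hands back the `<movies>` root element.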

Although CSS selectors are most often associated with styling HTML documents, they also work on XML. Note, however, that the class attribute, which HTML relies on heavily for styling, is far less common in XML, so class-based selectors are often of little use there.

Introduction to XPath

XPath (XML Path Language) provides a powerful and flexible means of navigating and querying XML documents. With XPath, we can traverse the hierarchical structure of XML documents, accessing elements based on their names, attributes, relationships, or positions within the document. XPath is not limited to XML; it also works on HTML as an alternative to CSS selectors, which have been our primary focus in this learning module. Selenium supports XPath directly, while Beautiful Soup has no XPath engine of its own and is usually paired with lxml for this purpose.

XPath syntax

XPath expressions resemble the paths we use in computer file systems. By following a path, that is, a sequence of steps, we can precisely select the desired node or nodes.

Expression	Description	Output
`tagname` ex: `li`	Selects all the elements with this tag.	`<li>Item 1</li>` `<li>Item 2</li>` `<li>Item 3</li>`
`/tagname` ex: `/html/body/div/p`	Similar to absolute paths. Selects the elements with this tag name by following the path from the root. Whatever follows a single slash `/` must be a direct child of the preceding tag.	`<p class='paragraph'> Paragraph 1</p>` `<p class='paragraph'> Paragraph 2</p>`
`//tagname` ex: `//p`	Similar to relative paths. Selects all the elements with this tag name anywhere in the document. Whatever follows a double slash `//` can be at any depth below the preceding tag.	`<p class='paragraph'> Paragraph 1</p>` `<p class='paragraph'> Paragraph 2</p>`
`*` ex: `//ul/*`	Matches any tag, or all the tags if it is used alone.	`<li>Item 1</li>` `<li>Item 2</li>` `<li>Item 3</li>`
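
The path expressions in this table can be tried out with a short sketch. It assumes a hypothetical sample page written as well-formed XML, and it uses the standard library's `xml.etree.ElementTree`, which implements only a subset of XPath. Its paths are relative to the element they are called on, so a leading `.` stands in for the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML so the stdlib parser accepts it.
SAMPLE = """<html>
  <body>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
    <div>
      <p class="paragraph">Paragraph 1</p>
      <p class="paragraph">Paragraph 2</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element


def texts(elements):
    """Return the text of each element, for easy printing."""
    return [element.text for element in elements]


# "./body/div/p" plays the role of the absolute path /html/body/div/p.
absolute = texts(root.findall("./body/div/p"))

# ".//p" plays the role of //p: every <p> at any depth.
anywhere = texts(root.findall(".//p"))

# ".//ul/*" plays the role of //ul/*: every direct child of a <ul>, whatever its tag.
wildcard = texts(root.findall(".//ul/*"))

print(absolute)
print(anywhere)
print(wildcard)
```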

Syntax	Description	Output
`[@id='id']` ex: `//*[@id='title']`	Selects any element with the specified ID.	`<h1 id="title">title</h1>`
`[@class='class']` ex: `//p[@class='paragraph']`	Selects any element with the specified class.	`<p class="paragraph">Paragraph 1</p>` `<p class="paragraph">Paragraph 2</p>`
`[@attribute='value']` ex: `//input[@name='username']`	Selects any element whose attribute has the specified value.	`<input name="username">`
`[starts-with(@attribute, 'value')]` ex: `//a[starts-with(@href, 'https')]`	Selects any element with an attribute that starts with a specific value.	`<a href="https://www.uni.com/lecture.pdf">Lecture</a>`
`[ends-with(@attribute, 'value')]` ex: `//a[ends-with(@href, '.pdf')]`	Selects any element with an attribute that ends with a specific value. `ends-with()` requires XPath 2.0 or later and is not available in XPath 1.0 engines such as browsers and lxml.	`<a href="https://www.uni.com/lecture.pdf">Lecture</a>`
`[contains(@attribute, 'value')]` ex: `//*[contains(@for, 'username')]`	Selects any element with an attribute that contains a specific value.	`<label for="username">user name</label>`
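
The attribute predicates above can be sketched as follows, again on a hypothetical sample page. The standard library's `xml.etree.ElementTree` supports exact attribute matches but not XPath functions such as `starts-with()` or `contains()`, so those two cases are mirrored with plain Python string checks instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML.
SAMPLE = """<html>
  <body>
    <h1 id="title">title</h1>
    <div>
      <p class="paragraph">Paragraph 1</p>
      <label for="username">user name</label>
      <input name="username" />
      <a href="https://www.uni.com/lecture.pdf">Lecture</a>
      <a href="/about">About</a>
      <p class="paragraph">Paragraph 2</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Exact attribute matches, supported directly by ElementTree.
by_id = root.findall(".//*[@id='title']")            # //*[@id='title']
by_class = root.findall(".//p[@class='paragraph']")  # //p[@class='paragraph']
by_name = root.findall(".//input[@name='username']") # //input[@name='username']

# Python stand-in for //a[starts-with(@href, 'https')].
https_links = [a for a in root.iter("a") if a.get("href", "").startswith("https")]

# Python stand-in for //*[contains(@for, 'username')].
for_username = [el for el in root.iter() if "username" in el.get("for", "")]

print([el.tag for el in by_id])
print([el.text for el in by_class])
print([el.text for el in https_links])
```

A full XPath 1.0 engine would evaluate the function-based predicates directly. The list comprehensions only imitate their effect.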

Syntax	Description	Output
`child::tag` ex: `//div/child::p`	Selects the elements with the specified tag that are direct children of the current element.	`<p class="paragraph">Paragraph 1</p>` `<p class="paragraph">Paragraph 2</p>`
`..` ex: `//div//p/../input`	Selects the parent of the current element, as we do with relative paths.	`<input name="username">`
`descendant::tag` ex: `//div/descendant::*`	Selects all the descendants of the current element that have the specified tag. The axes `ancestor`, `following`, `following-sibling`, `preceding`, and `preceding-sibling` work similarly.	`<p class="paragraph"> Paragraph 1 </p>` `<label for="username"> user name </label>` `<a href="https://www.uni.com/lecture.pdf"> link </a>` `<p class="paragraph"> Paragraph 2</p>`
`descendant-or-self::tag` ex: `//div/descendant-or-self::a`	Returns the same result as `descendant`, but also includes the current element itself if it matches. `ancestor-or-self` works similarly.	`<a href="https://www.uni.com/lecture.pdf"> link </a>`
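
A small sketch of these relationships, on a hypothetical sample page. The standard library's `xml.etree.ElementTree` does not accept named axes such as `child::` or `descendant::`, so the sketch uses the abbreviated forms `/`, `//`, and `..`, with the axis version noted in each comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML.
SAMPLE = """<html>
  <body>
    <div>
      <p class="paragraph">Paragraph 1</p>
      <label for="username">user name</label>
      <input name="username" />
      <a href="https://www.uni.com/lecture.pdf">link</a>
      <p class="paragraph">Paragraph 2</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Abbreviated form of //div/child::p: <p> elements directly under a <div>.
children = [el.text for el in root.findall(".//div/p")]

# Abbreviated form of //div/descendant::*: every element below a <div>.
descendants = [el.tag for el in root.findall(".//div//*")]

# //div//p/../input: step up from each <p> to its parent, then down to <input>.
via_parent = root.findall(".//div//p/../input")

print(children)
print(descendants)
print([el.get("name") for el in via_parent])
```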

XPath 1.0 defines 27 core functions, and later versions add many more.

Advanced predicates

An important advantage of XPath predicates is that they can be chained and combined. A predicate can itself contain a path to another element, which can in turn carry its own predicates, so we can narrow down a selection step by step. This flexibility lets us build complex and precise queries to obtain our target data.

Suppose we want to select smartphones with a price greater than $800 and a rating equal to or higher than 4.5 from this HTML sample.

//div: This selects all <div> elements in the HTML document, regardless of their position or hierarchy.
[span[@class='price' and number(substring-after(text(), '$')) > 800]: This is the first predicate that filters the selected <div> elements based on the price condition.
- span[@class='price']: This selects the <span> element with the class attribute "price" that is a direct child of the current <div> element.
- number(substring-after(text(), '$')) > 800: This condition checks if the numeric value of the price (obtained by removing the '$' symbol) is greater than 800. It does this by using the substring-after() function to extract the text after the '$' symbol and then converting it to a number using the number() function.
and: This is the logical operator that combines the two predicates.
span[@class='rating' and number(text()) >= 4.5]: This is the second predicate that filters the <div> elements further based on the rating condition.
- span[@class='rating']: This selects the <span> element with the class attribute "rating" that is a direct child of the current <div> element.
- number(text()) >= 4.5: This condition checks if the rating value, obtained as text, is equal to or higher than 4.5. It uses the number() function to convert the rating text to a number for comparison.
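
Putting the pieces above together gives one expression with a nested predicate per condition. The sketch below assembles that expression from the parts described above, which is an assumption about how it was originally written. It also mirrors the expression's logic in plain Python, because the standard library's `xml.etree.ElementTree` cannot evaluate XPath functions such as `number()` or `substring-after()`. The product markup and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical product listing; the original HTML sample is not reproduced here.
SAMPLE = """<html>
  <body>
    <div><h2>Phone A</h2><span class="price">$999</span><span class="rating">4.7</span></div>
    <div><h2>Phone B</h2><span class="price">$750</span><span class="rating">4.8</span></div>
    <div><h2>Phone C</h2><span class="price">$1100</span><span class="rating">4.2</span></div>
    <div><h2>Phone D</h2><span class="price">$850</span><span class="rating">4.5</span></div>
  </body>
</html>"""

# The expression assembled from the parts explained above (assumed form).
FULL_XPATH = (
    "//div[span[@class='price' and number(substring-after(text(), '$')) > 800]"
    " and span[@class='rating' and number(text()) >= 4.5]]"
)


def to_number(text):
    """Imitate XPath number(): unparsable input becomes NaN instead of raising."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return float("nan")


def matches(div):
    """Simplified Python mirror of the two predicates combined with 'and'."""
    price = div.find("span[@class='price']")
    rating = div.find("span[@class='rating']")
    if price is None or rating is None:
        return False
    # str.partition('$')[2] plays the role of substring-after(text(), '$').
    price_value = to_number((price.text or "").partition("$")[2])
    rating_value = to_number(rating.text)
    return price_value > 800 and rating_value >= 4.5


root = ET.fromstring(SAMPLE)
selected = [div.findtext("h2") for div in root.iter("div") if matches(div)]
print(selected)
```

With a full XPath 1.0 engine, `FULL_XPATH` could be evaluated directly. The Python mirror is only there to make each step of the predicate visible.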

XPath vs. CSS selectors

CSS selectors primarily rely on the syntax and patterns defined by CSS specifications. They are commonly used to select elements based on their attributes, classes, IDs, or hierarchical relationships within the document's structure. On the other hand, XPath (XML Path Language) provides a more extensive and flexible querying capability. XPath is not limited to HTML documents but can also be used with XML and other structured data formats. With XPath, we can navigate the DOM tree using a combination of element names, attributes, relationships, and predicates. This allows for highly specific and granular element selection, even when dealing with complex document structures.

Here are some common differences between XPath and CSS selectors:

Selection	CSS	XPath
All elements	`*`	`//*`
All `div` elements	`div`	`//div`
All `p` elements at any depth inside `div`	`div p`	`//div//p`
All `li` elements that are direct children of `ul`	`ul > li`	`//ul/li`
Element by `id`	`#id`	`//*[@id='id']`
Element by `class`	`.class`	`//*[@class="class"]`
Element by `attribute`	`input[type="submit"]`	`//input[@type="submit"]`
Attribute starts with, ends with, or contains a value	`a[href^='/']` `a[href$='pdf']` `a[href*='://']`	`//a[starts-with(@href, '/')]` `//a[ends-with(@href, '.pdf')]` `//a[contains(@href, '://')]`
Elements by position	`ul > li:first-of-type` `ul > li:nth-of-type(2)` `ul > li:last-of-type`	`//ul/li[1]` `//ul/li[2]` `//ul/li[last()]`
Next element	`h1 + ul`	`//h1/following-sibling::ul[1]`
Previous element	Not possible	`//h1/preceding-sibling::ul[1]`

XML	HTML
XML is primarily used for storing and transporting data.	HTML is primarily used for displaying data and describing how it should look in a browser.
XML has no predefined tags. Tags can be custom-made to fit specific needs.	HTML tags are predefined and must follow the correct HTML syntax.
Closing tags are mandatory in XML.	Some closing tags are optional in HTML, although including them is recommended.

Axis	Step	Axis	Step
ex: `//`	`li`	`/`	`[1]`

Syntax	Description	Output
`[index]` ex: `//ul/li[2]`	Selects the element at the specified position. Positions start at 1.	`<li>Item 2</li>`
`[last()]` ex: `//ul/li[last()]`	Selects the last element with this tag.	`<li>Item 3</li>`
`[position(){=, >, <}index]` ex: `//ul/li[position()<3]`	Selects all the elements whose positions satisfy the comparison.	`<li>Item 1</li>` `<li>Item 2</li>`
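
The positional predicates can be sketched like this, on a hypothetical list. The standard library's `xml.etree.ElementTree` understands integer positions and `last()`, but not comparisons on `position()`, so that case is imitated with a slice per parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML.
SAMPLE = """<html>
  <body>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# //ul/li[2]: positions are 1-based, so this is the second <li>.
second = [li.text for li in root.findall(".//ul/li[2]")]

# //ul/li[last()]: the last <li> under each <ul>.
last = [li.text for li in root.findall(".//ul/li[last()]")]

# Stand-in for //ul/li[position()<3]: the first two <li> children of each <ul>.
first_two = [li.text for ul in root.iter("ul") for li in ul.findall("li")[:2]]

print(second)
print(last)
print(first_two)
```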

Syntax	Description	Output
`text()` ex: `//p/text()`	Returns the text of the specified elements.	`[Paragraph 1, Paragraph 2]`
`count()` ex: `//ul[count(li)>`	Returns the number of nodes in a node-set. Inside a predicate, it selects the elements that have the required number of matching children.	`<ul>` `<li>Item 1</li>` `<li>Item 2</li>` `<li>Item 3</li>` `</ul>`
`string()` ex: `string(//div)`	Returns the string value of the element by concatenating the text of all of its descendants.	`Paragraph 1` `user name` `link` `Paragraph 2`
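
To make the behavior of these three functions concrete, here is a sketch of rough Python equivalents built on the standard library's `xml.etree.ElementTree`, which cannot call XPath functions itself. The sample markup is hypothetical, and the threshold of 2 in the `count()` stand-in is chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML.
SAMPLE = """<html>
  <body>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
    <div>
      <p class="paragraph">Paragraph 1</p>
      <label for="username">user name</label>
      <a href="https://www.uni.com/lecture.pdf">link</a>
      <p class="paragraph">Paragraph 2</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of //p/text(): the text directly inside each <p>.
paragraph_texts = [p.text for p in root.findall(".//p")]

# Rough equivalent of a count() predicate: keep <ul> elements with more than
# 2 <li> children (the threshold is an illustrative assumption).
long_lists = [ul for ul in root.iter("ul") if len(ul.findall("li")) > 2]

# Rough equivalent of string(//div): join the text of all descendants.
# Whitespace is collapsed as well, which is closer to normalize-space(string(//div)).
div = root.find(".//div")
div_string = " ".join("".join(div.itertext()).split())

print(paragraph_texts)
print(len(long_lists))
print(div_string)
```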
