Essentials of HTML

Mastering HTML is a requirement for a career in web development. For our goal of web scraping, however, we only need the basics covered in this lesson.

Introduction

HyperText Markup Language (HTML) is the standard markup language used to create web pages. It describes the structure of a web page using tags. These tags tell the browser how to display each piece of content, such as a title, heading, image, or link.

Markup languages vs. programming languages

Here are some common differences between markup and programming languages:

| Markup Language | Programming Language |
|---|---|
| Primarily used for defining and describing content | Primarily used for writing executable programs |
| Does not require compilation or execution | Requires compilation and/or execution to produce output |
| Used to format text, images, and multimedia content | Used to build applications, software, and systems |
| Examples: HTML, XML, Markdown | Examples: Python, Java, C++, JavaScript |

Structure

Each HTML document can be considered a document tree. We define the tree's components in the same way that we would describe a family tree, with each node in the tree being an HTML tag that can contain other tags as children.

Figure: HTML document tree

The browser’s job is to understand each element’s purpose and display it correctly. This tree structure is called the Document Object Model (DOM), which treats an HTML document as a tree. Once we have the DOM, we can easily search for and retrieve any element (node) we want. We will not do the conversion or write the search algorithms ourselves; Python libraries and tools will do the job for us. All we need is the path to the tag.

from bs4 import BeautifulSoup

# Read the page content
with open('page.html', 'r') as f:
    html_doc = f.read()

# Parse the HTML into a tree using BeautifulSoup
tree = BeautifulSoup(html_doc, 'html.parser')

# Print the text of the <p> element
print(tree.body.div.p.text)
  • First, we read the HTML document from page.html.

  • Then, we parse the document and convert it to a tree structure using a Python library called BeautifulSoup, which we will discuss later. Once we have the tree, we can navigate it and reach any element using its path from the root. Here, tree.body.div.p.text follows the path body → div → p and prints the paragraph's text.
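
Because the contents of page.html are not shown here, the following is a minimal, self-contained sketch of the same idea. The HTML string is an assumed stand-in for page.html, not the actual file, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for page.html; the real file is not shown in this lesson.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div>
      <p>Hello, DOM!</p>
    </div>
  </body>
</html>
"""

tree = BeautifulSoup(html_doc, "html.parser")

# Walk down the tree from the root: html -> body -> div -> p
paragraph = tree.html.body.div.p
print(paragraph.text)

# Walk back up the tree to list the ancestors of the <p> tag
ancestors = [parent.name for parent in paragraph.parents]
print(ancestors)
```

The second print shows the path in reverse, starting from the tag's closest parent, which is a handy way to discover the path to an element you have already found.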

Tags

Most of the HTML tags have the same structure:

<tagname> Content... </tagname>

Let's go through some of the essential tags we should be aware of:

  • <!DOCTYPE html>

    • It is a document-type declaration. It helps the browser display the web pages correctly.

  • <html>...</html>

    • It is the root tag of the tree. Every HTML document starts with it (right after the document-type declaration), and all other tags are nested inside it.

  • <head>...</head>

    • This tag is the container for all the metadata tags, such as <title>, <meta>, and <style>.

    • The browser doesn't display the contents of <head> in the page itself. It only holds information about the page, such as the title, external style sheets, and the character set.

  • <title>...</title>

    • It is a child of the <head> tag. It is the page's title, which is shown on the browser tab.

  • <body>...</body>

    • It is the main container for all the content displayed in the browser. All the other visible elements are nested inside it.

  • <div>...</div>

    • Think of it as a subsection of the document. If <body> represents a heading, then <div> is a subheading under it.

    • The <div> element is typically used as a container for other HTML elements rather than for text content. It can still hold text directly, as in <div> text … </div>.

  • <ul>...</ul>

    • It represents an unordered list. Each item inside it is wrapped in an <li> tag and displayed as a bullet point.

  • <ol>...</ol>

    • It represents the data as an ordered list with a numbering style.

  • <table>...</table>

    • It defines a table structure consisting of table cells inside rows and columns.

    • <tr>...</tr> represents a table row

    • <td>...</td> represents a data cell within a row

    • <th>...</th> represents a header cell

  • <h1>...</h1>

    • HTML heading tags <h1> to <h6> are used to create headings and subheadings on a web page. <h1> is the most important and is used for the main title, while <h2> to <h6> are used for section headings and subheadings.

  • <p>...</p>

    • It is used to create a paragraph of text on a web page and is one of the most commonly used HTML tags. It is a simple and essential element for structuring the content of a web page.

  • <a href="">...</a>

    • The <a> tag creates a hyperlink between web pages, allowing users to click on a link and be directed to another page. The most crucial aspect of the <a> tag is its href attribute, which specifies the URL of the destination page.

  • <img src="" alt="">

    • The <img> tag is used to show images on a web page. The image is not embedded in the HTML itself; the tag links to the image file. There are two mandatory attributes associated with the <img> tag:

      • src determines the image’s path

      • alt specifies an alternative text for the image in case it cannot be displayed

    • The <img> tag is called an empty tag. It doesn’t have a closing tag or content.

  • <iframe>...</iframe>

    • This tag displays a web page within another web page. It is commonly used on streaming sites, where it encapsulates the video window inside the page.

There are other empty tags, such as <br>, which inserts a line break. Most of the time, these tags are not an issue, as we will only deal with elements that hold information inside them and discard all the formatting.
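
To see how these tags show up when scraping, here is a minimal sketch that pulls the title, list items, table cells, and link targets out of a small page. The HTML string, its contents, and the example.com URL are hypothetical and exist only for illustration; the BeautifulSoup calls used (find_all, .text, and dictionary-style attribute access) are standard.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html_doc = """
<html>
  <head><title>Fruit prices</title></head>
  <body>
    <h1>Market</h1>
    <ul><li>Apple</li><li>Banana</li></ul>
    <table>
      <tr><th>Fruit</th><th>Price</th></tr>
      <tr><td>Apple</td><td>1.20</td></tr>
      <tr><td>Banana</td><td>0.50</td></tr>
    </table>
    <a href="https://example.com/more">More</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# <title> lives inside <head>
page_title = soup.title.text

# Every <li> is one list item
items = [li.text for li in soup.find_all("li")]

# Each <tr> is a row; its <th>/<td> children are the cells
rows = [
    [cell.text for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

# The destination of a link is stored in its href attribute
links = [a["href"] for a in soup.find_all("a")]

print(page_title)
print(items)
print(rows)
print(links)
```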

Attributes

As we have seen above, some tags require attributes, such as href in <a> or src in <img>. These are functional attributes: they hold information the tag needs, and the tag won’t work or be displayed correctly without them. For example, if we want to scrape the URL of an image, we can call tree.body.div.img.get('src').
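
As a small sketch of reading an attribute, the snippet below parses a made-up <img> tag (the file path and alt text are hypothetical) and reads its attributes. Note that the attribute name is passed as a string.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the image path and alt text are made up.
html_doc = '<body><div><img src="images/logo.png" alt="Site logo"></div></body>'
tree = BeautifulSoup(html_doc, "html.parser")

img = tree.body.div.img
image_url = img.get("src")   # the attribute name is passed as a string
alt_text = img["alt"]        # dictionary-style access also works
missing = img.get("width")   # .get returns None when the attribute is absent

print(image_url, alt_text, missing)
```

Using .get is usually the safer choice when an attribute may be missing, because dictionary-style access raises a KeyError in that case.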

Besides these functional attributes, styling and identifying attributes matter for scraping because they help us pinpoint the desired element.

Some of the most common and well-known attributes are:

  • <tagname id="unique id">...</tagname>

    • The id attribute specifies a unique ID for an element; no two elements in the same HTML document should share an id. It is used as an identifier whenever we want to apply a specific style to this element alone.

  • <tagname class="class name">...</tagname>

    • The class attribute specifies the style class applied to this element. Unlike an id, the same class can be shared by many elements.

    • A style class is a named group of styling rules, such as font, color, width, and height.

  • <tagname data-*="custom data">...</tagname>

    • The data-* attribute allows for the storage of extra information related to this specific element. This data can then be accessed by simply querying the tag.

  • <tagname hidden>...</tagname>

    • The hidden attribute is a boolean attribute and is normally written without a value. When it is present, the browser does not display the element, although the element is still part of the HTML document.
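
The sketch below shows one way these attributes can be read and used as filters with BeautifulSoup. The HTML, the id and class names, and the data-price attribute are all hypothetical examples, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; ids, classes, and data-price are made up.
html_doc = """
<div id="main" class="card highlighted" data-price="9.99">Visible</div>
<div id="promo" class="card" hidden>Not displayed by the browser</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# id: locate one specific element
main = soup.find(id="main")

# class: an element may carry several classes, so a list is returned
main_classes = main["class"]

# data-*: custom data is read like any other attribute
price = main["data-price"]

# class_ (with a trailing underscore) filters by a single class name
card_ids = [card["id"] for card in soup.find_all("div", class_="card")]

# hidden: the element is still in the parsed tree, so a scraper can see it
hidden_ids = [tag["id"] for tag in soup.find_all("div") if tag.has_attr("hidden")]

print(main_classes, price, card_ids, hidden_ids)
```

The keyword is spelled class_ because class is a reserved word in Python.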

Importance of attributes

Consider a case where there are multiple <li> elements, and we want to scrape only Element 3 as shown in the code snippet below:

<!DOCTYPE html>
<html>
  <head>
    <title>Web Scraping Course</title>
  </head>
  <body>
    <div>
      A list of elements
      <ul>
        <li id="1">Element 1</li>
        <li id="2">Element 2</li>
        <li id="3">Element 3</li>
        <li id="4">Element 4</li>
      </ul>
    </div>
  </body>
</html>

The importance of attributes becomes evident in this scenario, where we can leverage the id attribute to identify an element using its id value.
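
As a minimal sketch of this idea, the code below parses a trimmed version of the list from the snippet above and retrieves only the item whose id is 3. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed version of the list from the snippet above.
html_doc = """
<ul>
  <li id="1">Element 1</li>
  <li id="2">Element 2</li>
  <li id="3">Element 3</li>
  <li id="4">Element 4</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Select the <li> by its id value instead of its position in the list
target = soup.find("li", id="3")
print(target.text.strip())

# The same lookup written as a CSS attribute selector
same = soup.select_one('li[id="3"]')
print(same.text.strip())
```

Selecting by id keeps working even if items are added or reordered, whereas a positional lookup such as soup.find_all("li")[2] would silently return a different element.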

Conclusion

This lesson covered the essential HTML concepts and elements required for retrieving data from a web page. While there are more elements to know, we don't need to understand each element's function, because our focus is solely on the data in the web page, not its formatting. Any new elements we encounter in future lessons will be discussed briefly.
