Introduction to Requests

Discover the Requests library and header spoofing.

We have covered how a browser communicates with a web server: it sends an HTTP request and receives an HTML response, which the browser parses into the Document Object Model (DOM). Now we will implement the same procedure in our script so that it emulates what a browser does.

The requests library

Requests is a Python library that lets us send HTTP requests to web servers and work with the response objects they return.

import requests
r = requests.get('https://books.toscrape.com/')
print("Request URL: ", r.url)
print("Request status code: ", r.status_code)
print("Response headers: ", r.headers)
# Prints the chunk of text that holds the <title> tag in the returned HTML.
print("Page's title: ", r.text[360:425])

The above code sends an HTTP request to the Books to Scrape website and retrieves the response object.

The response object has several attributes, such as:

  • object.url: The final URL of the response.

  • object.status_code: The HTTP status code the server returned, such as 200 or 403.

  • object.history: A list of the earlier response objects produced by redirects, oldest first.

  • object.headers: Information about the server response that does not relate to the content, such as the date.

  • object.text: The content of the response as a string.

  • object.content: The content of the response in bytes.
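
The attributes above can be tried out without depending on an external site. The sketch below is only an illustration: it starts a tiny local server (the DemoHandler class, the /old and /new paths, and the page body are all made up for this example) and requests a path that redirects, so that the history list is not empty.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for a website, for illustration only."""

    def do_GET(self):
        if self.path == "/old":
            # Redirect, so the final response has a non-empty history.
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"<html><head><title>Demo</title></head></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the demo output free of server logs.


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

# session.get() works like requests.get(); trust_env=False only tells it
# to ignore proxy environment variables so the request stays local.
session = requests.Session()
session.trust_env = False
r = session.get(f"{base_url}/old")

print("Final URL:      ", r.url)
print("Status code:    ", r.status_code)
print("History:        ", [resp.status_code for resp in r.history])
print("Content-Type:   ", r.headers["Content-Type"])
print("Text (str):     ", r.text)
print("Content (bytes):", r.content[:15])

server.shutdown()
server.server_close()
```

Here, url holds the address we were redirected to, while history keeps the intermediate 302 response. The text and content attributes carry the same page, once as a string and once as raw bytes.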

Header spoofing

When requesting a website, we can include additional information, called headers, that helps the server understand the request and customize its response. However, websites often use this header information to block requests that don't match what a typical browser would send. Spoofing refers to sending headers whose values imitate those of a typical browser.

Let’s explore the ShellHacks website using the network tool to see these headers.

Example of header spoofing

There are many headers here, and we don't need to learn about all of them. The two critical ones we will cover are "Cookie" and "User-Agent."

User-Agent

The User-Agent header identifies the client making the request. If the server fails to recognize the request as coming from a browser, it may block the request.

import requests
# We provide headers as a dictionary of key: value pairs
headers = {'User-Agent': 'python-requests'}
r = requests.get('https://www.shellhacks.com/', headers=headers)
print(r.text)

We encounter a 403 Forbidden error because the value we send as the User-Agent does not identify our request as coming from a browser. However, by changing the value to one commonly sent by a browser, we can bypass this restriction.

import requests
# This time we send a value that resembles a real browser's
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
r = requests.get('https://www.shellhacks.com/', headers=headers)
print(r.status_code)
Note: Other headers can matter as well. Sometimes we will need to experiment to find out which ones a site expects before it stops blocking our requests.
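
To see what spoofing changes, we can build a request without sending it and inspect its headers. The following is a minimal sketch: it makes no assumption about how any particular site filters requests and only shows the library's default User-Agent next to the overridden one.

```python
import requests

# The headers Requests sends when we do not pass any of our own.
default_headers = requests.utils.default_headers()
print("Default User-Agent:", default_headers["User-Agent"])

# A browser-like value (the same one used in the example above).
browser_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)

# Build the request without sending it, so we can inspect the final headers.
session = requests.Session()
request = requests.Request(
    "GET", "https://www.shellhacks.com/", headers={"User-Agent": browser_ua}
)
prepared = session.prepare_request(request)

print("Spoofed User-Agent:", prepared.headers["User-Agent"])
print("All headers to be sent:", dict(prepared.headers))
```

The default value starts with python-requests, which is easy for a server to spot. Our own header replaces it, while the other default headers are still sent.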

Try it yourself

Run the code below and check the result. The server doesn't recognize the request. To make it work, add the correct headers and send them with the request.

import requests
# We provide headers as dictionary with key:value pairs
headers = {}
r = requests.get('https://www.scrapethissite.com/pages/advanced/?gotcha=headers', headers=headers)
if not headers:
    print(r.text)
status = r.status_code
print(status)

HTTP cookies

A cookie is a small piece of data that the server sends to the browser. The browser can store it and return it with subsequent requests to the same server. The values in the cookies can be used to identify the sender, for example to recognize a logged-in session.

Note: Check out this blog for additional knowledge about HTTP cookies.

We need to carefully set the cookie values to match those the browser sends so that our script can pass any cookie checks.
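
As a rough illustration of how cookie values travel with a request, the sketch below prepares a request without sending it and prints the resulting Cookie header. The value abc123 is a made-up placeholder, not a real session value.

```python
import requests

# "abc123" is a made-up placeholder, not a real session value.
cookies = {"session": "abc123"}

# Prepare the request without sending it to see what would go over the wire.
request = requests.Request("GET", "http://quotes.toscrape.com", cookies=cookies)
prepared = request.prepare()

# The dictionary is turned into a single Cookie header.
print(prepared.headers["Cookie"])
```

The name=value pair in the header is what the server reads to decide who is making the request.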

Try it yourself

After logging into Quotes to Scrape, inspect the network tool and copy the value of the session cookie. Then, paste it into the code widget below. We should see the logout label as if we had previously logged in.

import requests
cookies = {'session':''}
r = requests.get('http://quotes.toscrape.com', cookies=cookies)
print(r.text[520:2000])
# Prints the URLs of any redirects that were followed, then the final URL
print([obj.url for obj in r.history]+[r.url])

POST request

We can use POST requests to send data, such as the fields of a login form, to the server in the body of the HTTP request.

import requests
# We provide data as a dictionary of key: value pairs.
# The keys must match the field names the server expects.
# We find these names by inspecting the login form on the page.
data = {
'username': 'test',
'password': 'test'
}
r = requests.post('http://quotes.toscrape.com/login', data=data)
print(r.status_code)
# Prints the history of URLs visited up to this point
print([obj.url for obj in r.history]+[r.url])
Note: The <input> tag specifies an input field where the user can enter data, and its name attribute is the key under which that data is submitted.
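
To make the link between the dictionary and the submitted form more concrete, here is a small sketch that prepares the same POST request without sending it and prints what the body would look like. It is meant only as an illustration of the encoding, not of what the server does with it.

```python
import requests

data = {"username": "test", "password": "test"}

# Prepare the request without sending it, so we can inspect the encoded form.
request = requests.Request("POST", "http://quotes.toscrape.com/login", data=data)
prepared = request.prepare()

print(prepared.method)                    # POST
print(prepared.headers["Content-Type"])   # application/x-www-form-urlencoded
print(prepared.body)                      # username=test&password=test
```

Each dictionary key becomes a field name in the body, which is why the keys have to match the name attributes of the form's inputs.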

Conclusion

We have covered the basics of the requests library: sending HTTP requests to websites, setting headers and cookies, and retrieving the HTML response. While the library has additional features that can be explored in its documentation, these are the essential components for our web scraping journey.
