Introduction to Requests
Discover the Requests library and header spoofing.
We have covered how a browser communicates with a website server by sending an HTTP request and receiving an HTML response that includes the Document Object Model (DOM) structure. Now we will implement the same procedure in a script; our primary objective is to replicate the behavior of a browser closely enough to receive the same responses it would.
The requests library
requests is a Python library that enables us to send HTTP requests to website servers and work with the response objects they return.
import requests

r = requests.get('https://books.toscrape.com/')

print("Request URL: ", r.url)
print("Request status code: ", r.status_code)
print("Response headers: ", r.headers)

# Prints the text chunk that holds the <title> tag in the HTML DOM returned.
print("Page's title: ", r.text[360:425])
The above code sends an HTTP request to the Books to Scrape website and retrieves the response object.
The response object has several attributes; a short demonstration follows this list:

- object.url: The address of the site being requested.
- object.status_code: The status code of the server's response.
- object.history: A list of the response objects generated during any redirections.
- object.headers: Information about the server's response that does not relate to the content, such as the date.
- object.text: The content of the response as a string.
- object.content: The content of the response in bytes.
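A minimal sketch that prints a few of these attributes, reusing the Books to Scrape request from above:

import requests

r = requests.get('https://books.toscrape.com/')

print(r.url)                      # final URL after any redirects
print(r.status_code)              # e.g., 200
print(r.history)                  # responses from redirects; an empty list if there were none
print(r.headers['Content-Type'])  # one entry from the response headers
print(type(r.text), type(r.content))  # <class 'str'> and <class 'bytes'>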
Header spoofing
Spoofing refers to sending forged headers whose values match those of a typical browser. Headers let us include additional information with a request so the server can understand it and customize its response; however, websites often use this header information to block requests that don't match what a typical browser would send.
Let’s explore the ShellHacks website using the network tool to see these headers.
There are a bunch of headers here; we don't need to learn about all of them. The critical ones we will cover are "Cookie" and "User-Agent."
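Before spoofing anything, it helps to see what requests sends on our behalf by default. A minimal sketch, assuming the public echo service httpbin.org is reachable:

import requests

# httpbin.org/headers echoes back the headers it received
r = requests.get('https://httpbin.org/headers')

# The headers attached to our outgoing request,
# e.g., a User-Agent such as 'python-requests/2.x'
print(r.request.headers)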
User-Agent
It is an identifier that informs the server about the entity making the request. If the server fails to recognize the request as coming from a browser, it may block the request.
import requests

# We provide headers as a dictionary with key:value pairs
headers = {'User-Agent': 'python-requests'}
r = requests.get('https://www.shellhacks.com/', headers=headers)

print(r.text)
We encounter a 403 Forbidden error because the value we send as the User-Agent does not identify our request as coming from a browser. However, by changing the value to one commonly associated with a browser, we can bypass this restriction.
import requests

# This time we change it to a value a real browser would send
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
r = requests.get('https://www.shellhacks.com/', headers=headers)

print(r.status_code)
Note: Other headers are also important. Sometimes we will need to experiment to find which ones to send to bypass a block.
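For example, here is a sketch that sends a few extra browser-like headers alongside User-Agent. The values are illustrative, and which headers a site actually checks varies:

import requests

# Illustrative browser-like headers; a real site may check any subset of these
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/',
}
r = requests.get('https://www.shellhacks.com/', headers=headers)
print(r.status_code)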
Try it yourself
Run the code below and check the result: the server doesn't recognize the request. To make it work, fill in the correct headers and send them with the request.
import requests

# We provide headers as a dictionary with key:value pairs
headers = {}
r = requests.get('https://www.scrapethissite.com/pages/advanced/?gotcha=headers', headers=headers)

if not headers:
    print(r.text)

status = r.status_code
print(status)
HTTP cookies
A cookie holds data about the user's browsing session. The server sends it to the browser, which stores it and returns it with subsequent requests to the same server. The values in the cookies can be used to identify the sender.
Note: Check out this blog for additional knowledge about HTTP cookies.
We need to set the cookie values carefully to match those of the browser so that our script can bypass any cookie-based restrictions.
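As a quick illustration, requests collects any cookies the server sets on a response and accepts a cookies argument to send them back. A minimal sketch using Quotes to Scrape (the first dictionary may be empty if the server sets no cookies on a plain GET):

import requests

# Cookies the server set on this response, if any
r = requests.get('http://quotes.toscrape.com/')
print(r.cookies.get_dict())

# Cookies we pass are sent back to the server with the next request
r2 = requests.get('http://quotes.toscrape.com/', cookies=r.cookies)
print(r2.status_code)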
Try it yourself
After logging into Quotes to Scrape, inspect the network tool and copy the value of the session cookie. Then, paste it into the code widget below. We should see the logout label as if we had previously logged in.
import requests

cookies = {'session': ''}
r = requests.get('http://quotes.toscrape.com', cookies=cookies)

print(r.text[520:2000])

# Prints all the URLs visited, ending with the current one, after passing the login wall
print([obj.url for obj in r.history] + [r.url])
POST requests
We can use POST requests to send data, such as login form fields, to the server along with the HTTP request.
import requests

# We provide data as a dictionary with key:value pairs.
# The keys must match the names the server expects.
# We get these key names by inspecting the login page's form.
data = {'username': 'test', 'password': 'test'}
r = requests.post('http://quotes.toscrape.com/login', data=data)

print(r.status_code)

# Prints the history of URLs the user visited up to this point
print([obj.url for obj in r.history] + [r.url])
Note: The <input> tag specifies an input field where the user can enter data, and the name attribute is used as a reference when the data is submitted.
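A natural follow-up is keeping the logged-in state across several requests. requests.Session() stores the cookies returned by the login response and sends them automatically; a sketch using the same test credentials as above:

import requests

# A Session persists cookies across requests, so the login survives
with requests.Session() as s:
    s.post('http://quotes.toscrape.com/login',
           data={'username': 'test', 'password': 'test'})

    # The session cookie set at login is sent automatically here
    r = s.get('http://quotes.toscrape.com/')
    print('Logout' in r.text)  # True if the login wall was passed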
Conclusion
We have covered the basics of the requests library and its ability to send HTTP requests to websites and retrieve the DOM response. While the library has additional features that can be explored in its documentation, these are the essential components for our web scraping journey.