Solution Review: Scrape PayPal FAQs Using Selenium and XPath
Explore the solution for scraping FAQs.
Solution approach
To scrape the questions, we will start by retrieving the elements of the topics from the sidebar.
Once we have obtained these elements, we will proceed to click each one and locate the paths of the questions to extract them.
A common challenge in web scraping with Selenium is handling invisible items. Unlike visible elements, hidden elements require a different approach. For instance, reading the text via the `.text` property returns an empty string, and calling the `.click()` method on such an element raises an exception. To overcome this, we employ two alternative techniques: instead of `.text`, we use the `.get_attribute("innerHTML")` method to extract the desired content directly from the DOM, and instead of `.click()`, we rely on `driver.execute_script('arguments[0].click()', element)` to trigger the click through JavaScript.
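The reason `innerHTML` works is that a hidden element's text still exists in the DOM source even though the rendered page shows nothing for it. As a browser-free illustration (the markup below is a made-up sample, not PayPal's actual page), the standard-library HTML parser can recover the inner text of a CSS-hidden anchor, just as `get_attribute("innerHTML")` does in Selenium:

```python
from html.parser import HTMLParser

# Hypothetical markup: the anchor is hidden with CSS, so a browser renders
# no text for it, yet the text is still present in the DOM source.
SAMPLE = '<a href="/article/123" style="display:none">How do I open a dispute?</a>'

class InnerText(HTMLParser):
    """Collects the text content of every <a> tag, visible or not."""
    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.texts.append(data)

parser = InnerText()
parser.feed(SAMPLE)
print(parser.texts[0])  # the text is recoverable even though the element is hidden
```

Selenium's `.text` mirrors what the browser renders, which is why it comes back empty for this element, while the markup-level content remains intact.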
```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException, ElementNotInteractableException
from webdriver_manager.chrome import ChromeDriverManager

options = Options()  # Chrome options; configure as needed (e.g., headless mode)

def scrape():
    """
    Start the scraping process:
    Step 1: Visit the main URL.
    Step 2: Click each topic in the side table.
    Step 3: Scrape the questions that appear on the page.
    """
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get("https://www.paypal.com/us/cshelp/personal")
    data = []

    # Wait for the topics on the side to load
    try:
        topics = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.XPATH, "//ul/li/a[contains(@href, 'topic')]")))
    except TimeoutException:
        raise TimeoutException("Elements are not loaded")

    i = 0
    while i < len(topics):
        try:
            # Click through JavaScript since the element may be hidden
            driver.execute_script('arguments[0].click()', topics[i])
        except ElementNotInteractableException:
            i += 1
            continue

        # Scrape the questions and append the dictionaries to the data list
        try:
            questions = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.XPATH, "//a[contains(@href, 'article')]")))
        except TimeoutException:
            raise TimeoutException("Elements are not loaded")
        items = [{"question": x.get_attribute("innerHTML"), "url": x.get_attribute("href")}
                 for x in questions]
        data.extend(items)
        i += 1

        # Get the topics on the side again, since they are detached after the click
        try:
            topics = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.XPATH, "//ul/li/a[contains(@href, 'topic')]")))
        except TimeoutException:
            raise TimeoutException("Elements are not loaded")

        time.sleep(1)  # For demonstration purposes only; can be removed.

    driver.close()
    return data

output = scrape()
print("len of scraped items: ", len(output))
print("Output sample: ", output[0])
```
Code explanation
Lines 19–21: We wait for the topic elements to load (the first `WebDriverWait` call), using the XPath of the `<a>` tags associated with the topics. We use the `contains(@href, value)` function because the elements' class names are prone to change, so we need a stable value to match on: here, the `'topic'` (and later `'article'`) substring in the URL.

Note: Waiting on the parent `<li>` tag instead would not work as expected. This is because we click via `driver.execute_script('arguments[0].click()', topics[i])`, which dispatches the click directly to the element we pass it; unlike the `.click()` method, it does not pass the click behavior on to any other element, so we must target the clickable `<a>` element itself.
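The robustness argument above comes down to matching on a stable attribute value rather than auto-generated class names. The filtering logic that `//ul/li/a[contains(@href, 'topic')]` expresses can be sketched outside the browser with the standard library (the sidebar markup and class names below are hypothetical):

```python
from html.parser import HTMLParser

# Hypothetical sidebar markup; class names like "css-x1y2" are the kind of
# generated values that change between deployments, while the href pattern
# stays stable.
SAMPLE = """
<ul>
  <li><a class="css-x1y2" href="/us/cshelp/topic/payments">Payments</a></li>
  <li><a class="css-a9b8" href="/us/cshelp/topic/disputes">Disputes</a></li>
  <li><a class="css-q3w4" href="/us/cshelp/contact">Contact us</a></li>
</ul>
"""

class TopicLinks(HTMLParser):
    """Keeps anchors whose href contains 'topic', mirroring contains(@href, 'topic')."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "topic" in href:
                self.hrefs.append(href)

parser = TopicLinks()
parser.feed(SAMPLE)
print(parser.hrefs)
# ['/us/cshelp/topic/payments', '/us/cshelp/topic/disputes']
```

A class-based selector would silently break the first time the generated class names changed; the href-based match keeps working as long as the URL scheme does.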
Lines 34–38: We wait for the question items to load (the second `WebDriverWait` call). We follow a straightforward path from the `<li>` tag to the `<a>` tag that holds both the link information and the question text.

Line 39: We extract the information from each Selenium element using the `.get_attribute()` method: `"innerHTML"` for the question text and `"href"` for the link.

Lines 45–50: Lastly, we locate the topic elements again (the final `WebDriverWait` call), since clicking causes the previously found elements to become detached (stale) in the current session.
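The extraction step shapes each question element into a small dictionary and accumulates the results across topic pages with `data.extend(...)`. With hypothetical stand-in values in place of live Selenium elements, the pattern looks like this:

```python
# Hypothetical stand-ins for the (innerHTML, href) pairs that
# get_attribute() would return for the question elements on each topic page.
scraped_pages = [
    [("How do I issue a refund?", "https://example.com/article/1"),
     ("Where is my money?", "https://example.com/article/2")],
    [("How do I close my account?", "https://example.com/article/3")],
]

data = []
for questions in scraped_pages:  # one inner list per topic page
    items = [{"question": text, "url": url} for text, url in questions]
    data.extend(items)  # extend flattens the items; append would nest lists

print(len(data))            # 3
print(data[0]["question"])  # How do I issue a refund?
```

Using `extend` rather than `append` is what keeps `data` a flat list of question dictionaries instead of a list of per-page lists.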