Solution Review: Scrape PayPal FAQs Using Selenium and XPath

Explore the solution for scraping FAQs.

Solution approach

  • To scrape the questions, we will start by retrieving the topic elements from the sidebar.

  • Once we have obtained these elements, we will click each one and locate the question elements on the page to extract them.

  • A challenge that arises during web scraping with Selenium is handling invisible elements, which require a different approach than visible ones.

    • For instance, if we try to retrieve the text of a hidden element using the .text method, it will return an empty string. Similarly, attempting to click such an element using the .click() method will result in an exception. To overcome this issue, we will employ alternative techniques.

    • Instead of using the .text method, we will utilize the .get_attribute("innerHTML") method to extract the desired content from hidden elements. Similarly, instead of directly using the .click() method, we will rely on the driver.execute_script('arguments[0].click()', element) approach to interact with these elements successfully, as the sketch after this list shows.
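
The sketch below demonstrates both techniques. It is a standalone illustration, not part of the solution: it loads a throwaway inline page (a data: URL, not the PayPal site) containing a single hidden link, and it assumes Chrome and the webdriver-manager package are installed.

from urllib.parse import quote
from selenium import webdriver
from selenium.common.exceptions import ElementNotInteractableException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# A throwaway page with one hidden link, invented for this demonstration
page = "<a href='https://www.example.com' style='display:none'>Hidden question</a>"
driver.get("data:text/html," + quote(page))
link = driver.find_element(By.TAG_NAME, "a")

print(repr(link.text))                  # '' -- .text is empty for hidden elements
print(link.get_attribute("innerHTML"))  # Hidden question

try:
    link.click()                        # fails: the element is not visible
except ElementNotInteractableException:
    driver.execute_script("arguments[0].click()", link)  # a JavaScript click still works
driver.quit()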

Inspecting the DOM structure of the PayPal FAQs
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, ElementNotInteractableException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = Options()  # add options.add_argument("--headless=new") to run Chrome without a window

def scrape():
    """
    This function starts the scraping process,
    Step 1: visit the main URL
    Step 2: click on each topic in the side table
    Step 3: scrape the questions that appears in the page
    """
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get("https://www.paypal.com/us/cshelp/personal")
    data = []
    # Wait for the topics in the sidebar to load
    try:
        topics = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.XPATH,
            "//ul/li/a[contains(@href, 'topic')]")))
    except TimeoutException:
        raise TimeoutException("Elements are not loaded")
    i = 0
    while i < len(topics):
        try:
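            # A JavaScript click works even when the element is hidden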
            driver.execute_script('arguments[0].click()', topics[i])
        except ElementNotInteractableException:
            i += 1
            continue
        # Scrape the questions and append the dictionaries to the data list
        try:
            questions = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.XPATH,
                "//a[contains(@href, 'article')]")))
        except TimeoutException:
            raise TimeoutException("Elements are not loaded")
        items = [{"question": x.get_attribute("innerHTML"), "url": x.get_attribute("href")} for x in questions]
        data.extend(items)
        i += 1

        # Get the topic elements again since the click detaches the old references from the DOM
        try:
            topics = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.XPATH,
                "//ul/li/a[contains(@href, 'topic')]")))
        except TimeoutException:
            raise TimeoutException("Elements are not loaded")
        
        time.sleep(1)  # For demonstration purposes only; can be removed

    driver.quit()  # quit() ends the whole WebDriver session, unlike close()
    return data

output = scrape()
print("len of scrapped items: ", len(output))
print("Output sample: ", output[0])
Code solution

Code explanation

  • Lines 26–28: We wait for the topic elements to load. We accomplish this by using the XPath of the <a> tags associated with the topics. We utilize the contains(@href, value) function because the classes of these elements are prone to change, so we need a fixed value to search for, which in this case is the 'topic' substring of the URL (and 'article' for the questions later).

Note: It's important to note that if we had waited on the parent <li> tag instead, it would not work as expected. The driver.execute_script('arguments[0].click()', topics[i]) call dispatches the click on the exact element we pass in, so a click dispatched on the <li> never reaches the child <a> that actually performs the navigation, unlike a real user click at that spot. The sketch below illustrates the difference.
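
This minimal sketch again uses a throwaway inline data: URL page rather than the PayPal site; the page and its onclick handler are invented for the demonstration, and Chrome plus webdriver-manager are assumed to be installed.

from urllib.parse import quote
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# The <a> records a click by changing the document title
page = "<ul><li><a href='#' onclick=\"document.title='clicked'\">Topic</a></li></ul>"
driver.get("data:text/html," + quote(page))

li = driver.find_element(By.XPATH, "//ul/li")
driver.execute_script("arguments[0].click()", li)
print(repr(driver.title))  # '' -- the click fired on the <li>; the child <a> never saw it

link = driver.find_element(By.XPATH, "//ul/li/a")
driver.execute_script("arguments[0].click()", link)
print(repr(driver.title))  # 'clicked'
driver.quit()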

  • Lines 41–45: We wait for the question items to load, following a straightforward path to the <a> tags that hold the link information and the question text. Note that we wait for the presence of the elements rather than their visibility, since some of them may be hidden; see the sketch below.
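A minimal sketch of this design choice, assuming driver is the Chrome session from the solution above with the FAQ page loaded: presence_of_all_elements_located succeeds as soon as the nodes exist in the DOM, even if hidden, while the visibility-based condition would time out on hidden links.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

locator = (By.XPATH, "//a[contains(@href, 'article')]")
# Succeeds as soon as the links are in the DOM, visible or not:
questions = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(locator))
# Would raise TimeoutException if any matching link stays hidden:
# WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located(locator))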

  • Line 46: We extract the information from each Selenium element using the .get_attribute() function: "innerHTML" for the question text and "href" for the link, as shown in the sketch below.
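For a single located element, the extraction looks like this (a sketch; x stands for one of the question links returned by the wait above):

question_text = x.get_attribute("innerHTML")  # the question text, even when the element is hidden
question_url = x.get_attribute("href")        # the absolute URL of the article
# x.text would return "" here whenever the element is hidden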

  • Lines 51–56: Lastly, we locate the topic elements again, since clicking causes the previously located elements to become detached from the current session's DOM; reusing them would raise a StaleElementReferenceException, as the sketch below illustrates.
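A minimal sketch of that failure mode, again assuming driver is already on the FAQ page as in the solution:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

topics = driver.find_elements(By.XPATH, "//ul/li/a[contains(@href, 'topic')]")
driver.execute_script("arguments[0].click()", topics[0])
try:
    topics[1].get_attribute("href")  # reference located before the click; may be stale now
except StaleElementReferenceException:
    # Re-locate against the refreshed DOM instead of reusing old references
    topics = driver.find_elements(By.XPATH, "//ul/li/a[contains(@href, 'topic')]")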
