Scraping Yahoo Finance with Selenium

Learn how to scrape financial data from Yahoo using Selenium.

Having acquired knowledge about Selenium, let's put this understanding to use and extract financial news data from Yahoo Finance. Yahoo relies heavily on JavaScript to render its pages, which makes traditional scraping techniques ineffective.

To begin, we'll focus on extracting the initially rendered posts from the stock market news page:

Investigating financial news using the Inspection tool
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://finance.yahoo.com/topic/stock-market-news/")


try:
    news = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div[class="Py(14px) Pos(r)"]')))
except TimeoutException:
    raise TimeoutException("Elements are not loaded")

print("len of news: ", len(news))

data = []

for n in news:
    title = n.find_element(By.CSS_SELECTOR,
                           "div h3").text
    link = n.find_element(By.CSS_SELECTOR,
                          "div h3 a").get_attribute("href")

    d = {'title': title, 'link': link}
    data.append(d)

print("len of scraped data: ", len(data))
print("sample: ", data[0])

# We are using this only for demonstration purposes.
time.sleep(2)

driver.close()
Scraping the first rendered posts from Yahoo stock market news

Note: In the provided code, there is a hidden section that handles the imports and options for the driver, as covered earlier. For the purpose of this lesson, we will focus solely on the scraping part.
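For reference, the hidden setup likely resembles the sketch below. The specific Chrome options are assumptions for illustration, but the imports are the ones the scraping code above requires:

# A minimal sketch of the hidden setup section. The headless options
# are assumptions; the imports match what the code above uses.
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")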

  • Lines 1–2: We initialize the web driver and make a GET request to the Yahoo Finance stock market news URL.

  • Lines 5–9: Using the CSS selector div[class="Py(14px) Pos(r)"], we wait explicitly with the presence_of_all_elements_located condition to gather all the news elements, raising a TimeoutException if they fail to load within 10 seconds.

  • Lines 15–25: Next, we iterate through the news. Each article is represented as a Selenium element, enabling us to use the find_element() method similarly to the driver instance. Here's how we extract specific information:

    • To retrieve the title, we locate the div h3 element within the news element.

    • Lastly, we obtain the link from the <a> tag inside the title element. We then append this data as a dictionary to a list.

It's worth noting that, inside each news element, we rely on short structural CSS paths (such as div h3) rather than explicit class names. Class names like Py(14px) Pos(r) are auto-generated and can change at any time, which may break the script. Short structural paths, by contrast, remain stable unless the entire page layout is redesigned.
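To make the difference concrete, here is a small sketch; the class name Mb(5px) is invented for illustration:

from selenium.webdriver.common.by import By

def extract_title(news_element):
    # Brittle: pins the selector to an auto-generated class name
    # (the 'h3[class="Mb(5px)"]' path below is a hypothetical example).
    # news_element.find_element(By.CSS_SELECTOR, 'h3[class="Mb(5px)"]')

    # Stable: relies only on the tag hierarchy inside the news card.
    return news_element.find_element(By.CSS_SELECTOR, "div h3").text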

Scrolling

If our objective is to obtain more news, we can leverage the infinite-scrolling feature of the site design, which loads additional news as the user scrolls down the page. Consequently, we will employ Selenium to automate the scrolling process and capture the information that becomes visible. Notably, the number of <li> elements containing news increases as we scroll. However, to keep the scraping process bounded, we will stop once we have collected just over 100 news items.

Changes in sample DOM after scrolling
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://finance.yahoo.com/topic/stock-market-news/")

news = []

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(2)
    news = driver.find_elements(By.CSS_SELECTOR,
                                'div[class="Py(14px) Pos(r)"]')

    # Stop once we have collected enough news elements
    if len(news) > 100:
        break
data = []

for n in news:
    title = n.find_element(By.CSS_SELECTOR,
                           "div h3").text
    link = n.find_element(By.CSS_SELECTOR,
                          "div h3 a").get_attribute("href")

    d = {'title': title, 'link': link}
    data.append(d)

print("len of scraped data: ", len(data))
print("sample: ", data[-1])

time.sleep(1) # We are using this only for demonstration purposes.
driver.close()
Scraping the first 100 news posts from Yahoo stock market news

Note: The code for retrieving the data remains mostly the same as what we used previously, with the only change being the approach to obtaining the post elements.

  • Line 6: We enter an infinite loop, continuously scrolling down the page and waiting for the data to load. We break out of this loop when we have gathered the required number of news items or more.

  • Line 8:

    • .execute_script() is another valuable feature provided by Selenium. It enables the driver to execute JavaScript code directly within the browser (see the sketch after this list).

    • "window.scrollTo(0, document.body.scrollHeight);" is the JS code that instructs the browser to scroll the current window based on the given coordinates (x-coord, y-coord). The x-coord is 0 since we want to scroll vertically, and the y-coord is the current window height which depends on the screen resolution; we retrieve this value using "document.body.scrollHeight".

  • Line 11: We re-fetch all the currently rendered news elements using the CSS selector div[class="Py(14px) Pos(r)"] and store them in the news list, replacing the previous snapshot so that no duplicates accumulate.

  • Line 15: We break out of the loop once we have collected the required number of elements.

  • Lines 19–26: We extract the needed information for each element.
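As an aside, .execute_script() can also return values from the browser back to Python. A minimal sketch, assuming driver is an already-initialized WebDriver instance:

# Minimal sketch: execute_script() can return values from the browser
# and accept arguments passed in from Python.
page_height = driver.execute_script("return document.body.scrollHeight;")
print("current page height (px):", page_height)

# Python values are exposed to the JS snippet as arguments[0], arguments[1], ...
driver.execute_script("window.scrollTo(0, arguments[0]);", page_height // 2)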

In this scenario, we'll notice that we opted for time.sleep() instead of explicit or implicit waits. This choice is not merely for demonstration purposes but a necessity: the browser scrolls far faster than the new data loads, so without a pause the script could move on before the content appears. Besides, explicit/implicit waits are of little use here because there is no single condition we can give the browser to wait on.
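If we instead wanted to scroll until the page stops growing entirely, rather than stopping at a fixed count, a common pattern is to compare the page height before and after each scroll. A sketch under the same assumptions as above:

# Sketch: scroll until no new content loads at all.
# Assumes `driver` is open on the page and `time` is imported.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # a pause is still required for new content to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # page height unchanged, so no more news loaded
    last_height = new_height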

Conclusion

In this lesson, we have presented a practical example that demonstrates how Selenium can be utilized for web scraping on Yahoo Finance. At this stage, we should feel confident in our ability to scrape data from almost any website, whether it is static or dynamic.
