Scraping Yahoo Finance with Selenium
Learn how to scrape financial news data from Yahoo Finance using Selenium.
Having acquired some hands-on knowledge of Selenium, let's put it to use and extract financial news data from Yahoo Finance. Yahoo is known for relying heavily on JavaScript to render its pages, which makes traditional scraping techniques ineffective.
To begin, we'll focus on extracting the news articles that are initially rendered on the stock market news page:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://finance.yahoo.com/topic/stock-market-news/")

try:
    news = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, 'div[class="Py(14px) Pos(r)"]')))
except TimeoutException:
    raise TimeoutException("Elements are not loaded")

print("len of news: ", len(news))

data = []

for n in news:
    title = n.find_element(By.CSS_SELECTOR, "div h3").text
    link = n.find_element(By.CSS_SELECTOR, "div h3 a").get_attribute("href")
    d = {'title': title, 'link': link}
    data.append(d)

print("len of scraped data: ", len(data))
print("sample: ", data[0])

# We are using this only for demonstration purpose.
time.sleep(2)
driver.close()
Note: The provided code has a hidden section that handles the imports and the driver options, as covered earlier. For the purpose of this lesson, we will focus solely on the scraping part.
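For reference, that hidden setup section would look roughly like the following sketch. The imports follow directly from the names used in the code above, while the specific driver options are an assumption:

import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Assumed options; running headless is typical for scraping scripts.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")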
- Lines 1–2: We initialize the web driver and make a GET request to the stock market news URL.
- Lines 5–9: Using the CSS selector div[class="Py(14px) Pos(r)"], we use the presence_of_all_elements_located function to gather all the news elements.
- Lines 15–25: Next, we iterate through the news. Each article is represented as a Selenium element, which lets us call the find_element() method on it just as we do on the driver instance. Here's how we extract specific information:
  - To retrieve the title, we locate the div h3 element within the news element.
  - Lastly, we obtain the link from the <a> tag inside the title element. We then append this data as a dictionary to a list.
It's worth noting that in the provided code, we rely on short CSS paths rather than explicit ones. This is because class names such as Py(14px) Pos(r) can be randomly generated, and targeting them directly may break the script when they change. By using short paths without explicit names, the script remains stable unless the entire page structure is redesigned.
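To make the trade-off concrete, compare the two approaches below; the class name in the first selector is a hypothetical example of the kind of auto-generated atomic CSS Yahoo produces:

# Brittle: pins an auto-generated class name (hypothetical example),
# which Yahoo may regenerate at any time and silently break the script.
title = n.find_element(By.CSS_SELECTOR, 'h3[class="Mb(5px)"]').text

# Stable: relies only on the tag hierarchy inside the news element,
# which changes only if the page layout itself is redesigned.
title = n.find_element(By.CSS_SELECTOR, "div h3").text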
Scrolling
If our objective is to obtain more news, we can leverage the site's infinite-scroll design, which loads additional news as users scroll down the page. We will therefore use Selenium to automate the scrolling and capture the information as it becomes visible. Notably, the number of <li> elements containing news increases as we scroll. To keep the scraping process manageable, we will stop once we have collected more than 100 news elements; note that because each pass re-collects every element currently on the page, this count includes duplicates.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://finance.yahoo.com/topic/stock-market-news/")

news = []

while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait for the page to load new content
    time.sleep(2)
    news.extend(driver.find_elements(By.CSS_SELECTOR, 'div[class="Py(14px) Pos(r)"]'))

    # Stop once we have gathered enough news elements
    if len(news) > 100:
        break

data = []

for n in news:
    title = n.find_element(By.CSS_SELECTOR, "div h3").text
    link = n.find_element(By.CSS_SELECTOR, "div h3 a").get_attribute("href")
    d = {'title': title, 'link': link}
    data.append(d)

print("len of scraped data: ", len(data))
print("sample: ", data[-1])

time.sleep(1)  # We are using this only for demonstration purpose.
driver.close()
Note: The code for extracting the data remains mostly the same as what we used previously; the only change is the approach to obtaining the news elements.
- Line 6: We enter an infinite loop, continuously scrolling down the page and waiting for the data to load. We break out of this loop once we have gathered the required number of news items or more.
- Line 8: .execute_script() is another valuable feature provided by Selenium: it enables the driver to execute JavaScript code directly within the browser. "window.scrollTo(0, document.body.scrollHeight);" is the JS code that instructs the browser to scroll the window to the given (x-coord, y-coord) position. The x-coord is 0 since we only want to scroll vertically, and the y-coord is the full height of the page body, which we retrieve with document.body.scrollHeight.
- Line 11: We find the elements using the CSS selector div[class="Py(14px) Pos(r)"] and add them to the news list.
- Line 15: We break if we reach the required number of elements (an alternative stopping condition is sketched after this list).
- Lines 19–26: We extract the needed information for each element.
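As an aside, when the desired number of items isn't known in advance, a common stopping condition is to compare the page's scroll height before and after each scroll and stop once it no longer grows, meaning no new content was loaded. A minimal sketch, assuming the same driver setup as above:

# Scroll until the page stops growing (no target count needed).
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was loaded; we reached the end of the feed
    last_height = new_height

This trades the fixed element count for a "feed exhausted" signal, at the cost of one extra execute_script() call per iteration.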
In this scenario, notice that we opted for time.sleep() instead of explicit or implicit waits. This choice is not merely for demonstration purposes but a necessity: the browser scrolls far faster than the new data loads, so without a pause the script could move on before the content appears and miss it. Explicit and implicit waits are of little help here, since none of Selenium's built-in expected conditions map cleanly to "more news items have loaded after scrolling."
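That said, WebDriverWait also accepts any callable as a condition, so a custom wait is possible if fixed sleeps prove unreliable. A hedged sketch, assuming the news list from the scrolling loop above:

# Custom condition: wait until more news elements are present than before
# the scroll. This raises TimeoutException if the feed stops growing, so
# in practice you would wrap it in a try/except block.
prev_count = len(news)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CSS_SELECTOR, 'div[class="Py(14px) Pos(r)"]')) > prev_count
)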
Conclusion
In this lesson, we have presented a practical example that demonstrates how Selenium can be utilized for web scraping on Yahoo Finance. At this stage, we should feel confident in our ability to scrape data from almost any website, whether it is static or dynamic.