Yahoo Finance Web Scraper

2025-12-01

A web scraper that collects stock symbol data from Yahoo Finance.

This project is a stock market data scraper built around Yahoo Finance. It extracts structured financial and trading information for a large set of stock symbols, including core metrics such as price movement, volume, market capitalization, profitability ratios, balance-sheet values, and related news headlines.

Yahoo Finance was selected as the data source because it is widely used in real-world finance workflows and the target endpoint is permitted by the site's robots.txt policy. Because Yahoo Finance pages rely heavily on JavaScript-rendered content, Selenium was chosen over static scraping tools such as BeautifulSoup or Scrapy, which do not execute JavaScript on their own.

The scraper drives the browser through GeckoDriver, the WebDriver implementation for Firefox-based browsers, which matches LibreWolf as the primary browser environment. Headline collection was included alongside the numerical market data to support future work in sentiment analysis and predictive modeling.

Yahoo Finance's most-active list covers nearly 360 stock symbols, and each listing page displays at most 220 of them. To handle pagination, the scraper uses two URLs that differ only in the start offset:

https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=220

https://finance.yahoo.com/markets/stocks/most-active/?start=220&count=220

Once the first page has been scraped, Selenium automatically continues with the second URL.
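
The two URLs follow directly from the page size and the symbol count, so they can be generated rather than hard-coded. The snippet below is a minimal sketch of that idea; the names listing_urls, PAGE_SIZE, and TOTAL_SYMBOLS are illustrative and not taken from the project.

```python
from urllib.parse import urlencode

BASE_URL = "https://finance.yahoo.com/markets/stocks/most-active/"
PAGE_SIZE = 220      # maximum symbols per listing page, as stated above
TOTAL_SYMBOLS = 360  # approximate upper bound, as stated above


def listing_urls(total=TOTAL_SYMBOLS, page_size=PAGE_SIZE):
    """Return one listing URL per page, in scraping order."""
    return [
        f"{BASE_URL}?{urlencode({'start': start, 'count': page_size})}"
        for start in range(0, total, page_size)
    ]


for url in listing_urls():
    print(url)
```

With the defaults this yields exactly the two URLs listed above. If the list ever grew past 440 symbols, a third page would be added without code changes.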

Some fields, such as price, daily change, and market cap, are available directly on the main listing page. Additional details such as revenue, total cash, and news headlines require visiting each symbol's dedicated detail page. Navigating away from the listing page in the same tab would invalidate the element references already held for it (Selenium's stale element error), so the scraper opens each detail page in a new browser tab and leaves the listing tab untouched.
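
One way to structure the new-tab approach is a small context manager that opens the detail page, hands control to the caller, and always returns to the listing tab afterwards. This is a sketch, not the project's actual code: the name detail_tab is hypothetical, and it assumes a Selenium 4 driver, where switch_to.new_window("tab") is available.

```python
from contextlib import contextmanager


@contextmanager
def detail_tab(driver, url):
    """Open `url` in a new tab, then close it and return to the original tab."""
    listing_handle = driver.current_window_handle
    driver.switch_to.new_window("tab")  # Selenium 4: opens a tab and focuses it
    try:
        driver.get(url)
        yield driver
    finally:
        driver.close()                           # closes only the detail tab
        driver.switch_to.window(listing_handle)  # back to the listing tab
```

Used as `with detail_tab(driver, detail_url): ...`, the detail-page lookups run inside the block, and the listing page is never reloaded, so element references collected from it remain usable afterwards.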

Page elements are located using CSS selectors obtained through browser developer tools. The project also includes utility functions for cleaning and normalizing scraped values: removing unwanted characters (+, %, and ,), expanding abbreviated magnitudes (M, B, and T for million, billion, and trillion) into full numeric values, and handling missing entries.
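
The cleaning rules described above can be sketched as a single function. The name parse_value and the set of placeholders treated as missing are assumptions made for illustration; the project's own helpers may be organized differently.

```python
import math

# Multipliers for the abbreviated units mentioned above.
UNIT_MULTIPLIERS = {"M": 1e6, "B": 1e9, "T": 1e12}
# Placeholders treated as missing values (an assumption, adjust as needed).
MISSING_MARKERS = {"", "-", "--", "N/A"}


def parse_value(raw):
    """Convert a scraped string such as '+1.25%' or '1.5B' to a float.

    Returns NaN for missing or unparseable entries.
    """
    if raw is None:
        return math.nan
    text = raw.strip().replace(",", "").replace("+", "").replace("%", "")
    if text in MISSING_MARKERS:
        return math.nan
    multiplier = 1.0
    suffix = text[-1].upper()
    if suffix in UNIT_MULTIPLIERS:
        multiplier = UNIT_MULTIPLIERS[suffix]
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return math.nan
```

For example, parse_value("1.5B") returns 1500000000.0 and parse_value("+1.25%") returns 1.25. Returning NaN for missing entries keeps the column numeric once the values are loaded into a DataFrame.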

Because Yahoo Finance headlines sometimes fail to load on the first attempt, a retry loop refreshes the page and repeats the extraction when no headlines are found.
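
A retry loop of this kind can be written independently of Selenium by passing in the fetch and refresh steps as callables. The sketch below assumes a bounded number of attempts and a fixed delay; the name fetch_with_retry and both default values are illustrative, not taken from the project.

```python
import time


def fetch_with_retry(fetch, refresh, max_attempts=3, delay_seconds=2.0):
    """Call `fetch` until it returns a non-empty result, refreshing in between.

    Returns an empty list if every attempt comes back empty.
    """
    for attempt in range(1, max_attempts + 1):
        headlines = fetch()
        if headlines:
            return headlines
        if attempt < max_attempts:
            refresh()                  # e.g. driver.refresh with Selenium
            time.sleep(delay_seconds)  # give the page time to render again
    return []
```

With Selenium, fetch could wrap a driver.find_elements call with the headline selector and refresh could be driver.refresh. Capping the attempts avoids hanging forever on a symbol that simply has no headlines.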

All collected data is appended incrementally to a pandas DataFrame and saved to a CSV file on each iteration. This limits data loss: if an exception occurs mid-scrape, everything collected up to that point is already on disk.
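
The save-on-every-iteration pattern might look like the sketch below. It assumes rows are collected as dictionaries and that the CSV is simply rewritten each time, which is cheap for a few hundred symbols; the names save_progress, scrape_all, and scrape_one are hypothetical.

```python
import pandas as pd


def save_progress(rows, csv_path):
    """Build a DataFrame from all rows collected so far and overwrite the CSV."""
    frame = pd.DataFrame(rows)
    frame.to_csv(csv_path, index=False)
    return frame


def scrape_all(symbols, scrape_one, csv_path):
    """Scrape each symbol in turn, saving to disk after every one."""
    rows = []
    for symbol in symbols:
        rows.append(scrape_one(symbol))  # may raise mid-scrape
        save_progress(rows, csv_path)    # rows so far are already on disk
    return pd.DataFrame(rows)
```

If scrape_one raises on the third symbol, the CSV still holds the first two rows, so a later run can resume from there instead of starting over.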

Created with <3 by Khashayar Khosrosourmi.