Key Takeaways

Yahoo Finance, accessed through the open-source yfinance Python library, will serve as the primary market data provider. It provides historical and near-real-time OHLCV data (open, high, low, close, volume) for stocks, ETFs, and cryptocurrencies. The underlying endpoints are publicly reachable but unofficial, and their rate limits are not documented, so reliable data retrieval depends on optimization strategies such as caching, automated retries, and alternative storage solutions.

A key challenge is long-term data storage. Since we will collect and store vast amounts of data over time, this section introduces alternative storage options that are scalable, free, and fast while ensuring seamless integration with GitHub Actions for automated updates.

Phase 1: Setting Up the Environment

Setting Up the Project with a Virtual Environment

To keep dependencies isolated and avoid global installations, it is best practice to create a Python virtual environment (venv) for the project. This ensures all installed packages remain contained within the project directory, preventing conflicts with other Python projects on the system.

Create the Project Directory

Start by setting up a dedicated project folder to organize scripts, data, and configurations.

For macOS/Linux:

mkdir market_chrono && cd market_chrono
mkdir data  # Create a directory for storing JSON data files
touch fetch_data.py  # Create the script file

For Windows (Command Prompt):

mkdir market_chrono && cd market_chrono
REM Create a directory for storing JSON data files
mkdir data
REM Create an empty script file
type nul > fetch_data.py

For Windows (PowerShell):

New-Item -ItemType Directory -Path market_chrono
Set-Location market_chrono
New-Item -ItemType Directory -Path data
New-Item -ItemType File -Path fetch_data.py

Set Up a Virtual Environment

Create and activate a virtual environment within the project directory:

For macOS/Linux:

python3 -m venv venv  # Create a virtual environment
source venv/bin/activate  # Activate the virtual environment

For Windows (Command Prompt):

REM Create a virtual environment
python -m venv venv
REM Activate the virtual environment
venv\Scripts\activate

For Windows (PowerShell):

python -m venv venv
.\venv\Scripts\activate

When the virtual environment is activated, the terminal prompt may change to include (venv), indicating that all installed packages will now be contained within the project folder.
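Because the prompt does not always change, a quick check from inside Python can confirm that the environment is active. The snippet below is a minimal sketch; the helper name in_virtual_env is made up for this example.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Running inside a virtual environment: {in_virtual_env()}")
```

If this prints False after activation, the shell is probably still using the system interpreter.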


Install Required Dependencies

With the virtual environment activated, install the following required packages:

pip install yfinance requests ipfshttpclient
  • yfinance: Fetches market data from Yahoo Finance.
  • requests: Handles HTTP requests (useful for interacting with alternative APIs if needed).
  • ipfshttpclient: Allows uploading and managing files on IPFS (InterPlanetary File System).

To confirm installation:

pip list
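As an alternative to scanning the pip list output by eye, the same check can be done from Python with the standard library. This is a sketch only; installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the three packages installed above.
for name in ("yfinance", "requests", "ipfshttpclient"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```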

Fetch and Store Market Data

This section explains how to retrieve stock market data using the Yahoo Finance API and prepare it for storage. We are going to retrieve daily OHLCV (open, high, low, close, volume) data for tracked assets, then upload the JSON files to IPFS for decentralized storage.

What This Code Does

  1. Connects to Yahoo Finance to get stock/crypto price data.
  2. Gets the last trading day’s data (Open, High, Low, Close, Volume).
  3. Formats the data into a structured JSON object.
  4. Returns the data (or prints an error if something goes wrong).

Import Required Libraries

import yfinance as yf  # Yahoo Finance API wrapper
import json  # Handles saving data in JSON format
import ipfshttpclient  # Uploads data to IPFS (InterPlanetary File System)
from datetime import datetime  # Gets the current date
  • yfinance: Will be used to fetch market data from Yahoo Finance.
  • json: Converts the data into a structured format for storage.
  • ipfshttpclient: Uploads the JSON file to IPFS for decentralized storage.
  • datetime: Gets today’s date so we know when the data was fetched.

Define the Function to Fetch Data

def fetch_yahoo_finance_data(ticker):
    """Fetches OHLCV data for a given stock ticker from Yahoo Finance."""
  • This function takes one input, ticker (a stock symbol like "AAPL" for Apple).
  • It will return the latest market data for that stock.

Connect to Yahoo Finance

    try:
        stock = yf.Ticker(ticker)
        hist = stock.history(period="1d")
  • yf.Ticker(ticker) → Creates a Yahoo Finance object for the given stock.
  • stock.history(period="1d") → Fetches the last available trading day’s data.

Example: If you run fetch_yahoo_finance_data("AAPL") on March 15, 2025, the function will get Apple’s stock prices from March 14, 2025.
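The date arithmetic in this example can be sketched with the standard library. The helper below is hypothetical and only skips weekends; it knows nothing about market holidays or trading hours, so treat it as an approximation of what Yahoo Finance returns.

```python
from datetime import date, timedelta


def last_weekday_on_or_before(day: date) -> date:
    """Return `day` itself if it is Mon-Fri, else the preceding Friday."""
    candidate = day
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate -= timedelta(days=1)
    return candidate


saturday = date(2025, 3, 15)
print(last_weekday_on_or_before(saturday))  # 2025-03-14
```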

Check if Data Exists

        if hist.empty:
            print(f"No data available for {ticker}.")
            return None
  • hist.empty checks if the data is missing.
  • If the data is empty, print a message and return None.

🔹 Why?
Sometimes, Yahoo Finance returns an empty table for a ticker (e.g., if the symbol is invalid or delisted, or if no trading data exists for the requested period).

Extract the Latest Market Data

        latest_data = hist.iloc[-1]
  • .iloc[-1] → Gets the last row of the table (which contains the most recent trading day’s prices).

🔹 Example Output from Yahoo Finance (hist table):

Date Open High Low Close Volume
2025-03-14 175.2 178.9 174.5 177.6 89,000,000
  • The function extracts the last row (2025-03-14) and stores it in latest_data.

Format the Extracted Data for Storage

Once we have the latest trading day’s data, it needs to be structured in a dictionary format so it can be stored and used later.

        return {
            "open": float(latest_data["Open"]),
            "high": float(latest_data["High"]),
            "low": float(latest_data["Low"]),
            "close": float(latest_data["Close"]),
            "volume": int(latest_data["Volume"])
        }
  • The extracted values are converted into appropriate data types:
    • float(latest_data["Open"]) → Ensures stock prices remain decimal values.
    • int(latest_data["Volume"]) → Volume is stored as an integer since it represents whole shares traded.
  • This structured dictionary makes it easier to store and retrieve data efficiently.

🔹 Example Output of the Dictionary Returned by the Function:

{
    "open": 175.2,
    "high": 178.9,
    "low": 174.5,
    "close": 177.6,
    "volume": 89000000
}
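One edge case the conversion does not cover is a missing value: int(float("nan")) raises ValueError, so a row with a NaN volume would end up in the except branch. The sketch below shows one possible guard. row_to_record is a hypothetical helper, and it accepts any mapping with the five labels (a plain dict here) so that it runs without Pandas.

```python
import math

FIELDS = ("Open", "High", "Low", "Close", "Volume")


def row_to_record(row):
    """Convert OHLCV values to a JSON-friendly dict, or None if any is NaN."""
    record = {}
    for field in FIELDS:
        value = float(row[field])
        if math.isnan(value):
            return None  # incomplete row: let the caller skip it
        record[field.lower()] = value
    record["volume"] = int(record["volume"])  # whole shares traded
    return record


good_row = {"Open": 175.2, "High": 178.9, "Low": 174.5, "Close": 177.6, "Volume": 89000000.0}
bad_row = dict(good_row, Volume=float("nan"))

print(row_to_record(good_row))
print(row_to_record(bad_row))  # None
```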

Why Don’t We Just Return latest_data?

At first glance, returning latest_data directly might seem like a good idea because it already holds the latest stock market data. However, there are several key reasons why we process and format the data instead of returning it as-is:

  1. Data Type & Structure Issues

    • latest_data is a Pandas Series object, which is not inherently JSON-serializable.
    • JSON can only represent objects (dictionaries), arrays (lists), strings, numbers, booleans, and null, so passing latest_data directly to json.dump would raise a TypeError.
  2. Ensuring Data Consistency

    • A Pandas Series stores values in NumPy-specific types (e.g., numpy.float64, numpy.int64). Python’s json module rejects numpy.int64 outright, and relying on how it treats other NumPy types is fragile.
    • To prevent serialization errors, we convert them to Python-native types (float for prices and int for volume).
  3. Removing Unnecessary Metadata

    • latest_data contains extra information such as index labels and other internal Pandas attributes that we don’t need.
    • By returning only the key market metrics (open, high, low, close, volume), we streamline storage and retrieval.
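The serialization problem can be reproduced without Pandas or NumPy installed. In the sketch below, FakeNumpyInt is a made-up stand-in for a NumPy integer scalar, and to_native is a hypothetical hook passed to json.dumps through its default parameter. It relies on the fact that NumPy scalars expose an .item() method that returns the native Python value. Explicit float() and int() conversion, as used in this section, remains the simpler option.

```python
import json


class FakeNumpyInt:
    """Stand-in for a NumPy integer scalar (illustration only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        # NumPy scalars expose .item(), which returns the native Python value.
        return self._value


def to_native(obj):
    """json.dumps `default` hook: unwrap objects that expose .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


record = {"volume": FakeNumpyInt(89000000)}

try:
    json.dumps(record)
except TypeError as exc:
    print(f"Without the hook: {exc}")

serialized = json.dumps(record, default=to_native)
print(serialized)  # {"volume": 89000000}
```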

What Type is latest_data?

latest_data is a Pandas Series object. A Series in Pandas is similar to a dictionary, where each value is labeled by an index.

To check the type of latest_data, we can add the following line to our function:

print(type(latest_data))

If we run the function with this, it will print:

<class 'pandas.core.series.Series'>

This confirms that latest_data is a Pandas Series, meaning it needs to be converted before being stored in a JSON file.

How to Inspect latest_data?

If we want to see what latest_data contains, we can print it:

print(latest_data)

🔹 Example Output of latest_data:

Open      175.2
High      178.9
Low       174.5
Close     177.6
Volume    89000000
Name: 2025-03-14 00:00:00, dtype: object
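  • The Name line (2025-03-14 00:00:00) is the row’s label in hist, i.e., the trading date.
  • The labels on the left (Open, High, Low, Close, Volume) were the columns of hist; in the Series they form the index.
  • The dtype line reports a single data type for the whole Series. The exact value depends on the columns in hist (when integer and float columns are mixed, Pandas typically reports float64 and shows Volume as a float), which is one more reason to convert each value explicitly before saving.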

Now to Handle Errors Gracefully

Stock data retrieval might fail for several reasons, such as network issues, invalid tickers, or Yahoo Finance downtime. To prevent the script from crashing, we wrap the function body in a try-except block:

    except Exception as e:
        print(f"Error fetching data for {ticker}: {str(e)}")
        return None
  • If something goes wrong, the function:
    • Prints an error message to indicate which stock caused the failure.
    • Returns None, so the script can skip failed stocks instead of stopping entirely.

🔹 Example Error Output if the API Fails or the Ticker is Invalid (the text after the colon is the exception’s own message and will vary):

Error fetching data for XYZ123: No data available for the given ticker.
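Since the function returns None on failure, a caller can retry before giving up, which is one way to implement the automated retries mentioned in the Key Takeaways. The following is a minimal sketch of retrying with exponential backoff. fetch_with_retries and its parameters are hypothetical, and the delays are illustrative rather than tuned to any documented Yahoo Finance limit. The sleep function is injectable so the demo runs instantly.

```python
import time


def fetch_with_retries(fetch, ticker, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(ticker) until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = fetch(ticker)
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return None


# Demo with a fake fetcher that fails twice, then succeeds.
attempts_seen = []
delays = []


def flaky_fetch(ticker):
    attempts_seen.append(ticker)
    return {"close": 177.6} if len(attempts_seen) == 3 else None


result = fetch_with_retries(flaky_fetch, "AAPL", sleep=delays.append)
print(result, delays)  # {'close': 177.6} [1.0, 2.0]
```

In the real script, fetch would be fetch_yahoo_finance_data and sleep would be left at its default.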

Now that we have structured our data properly and included error handling, let’s move on to optimizing how we store and manage stock data efficiently.

Optimizing Data Storage for Scalability

At this point, we have successfully retrieved stock data, but how we store it determines long-term efficiency. Initially, our approach was to save each stock’s data in a separate file, leading to excessive storage use and slow data retrieval. Instead, we will consolidate all stock data for a given day into a single JSON file, making the system more scalable and organized. Let’s refine how we format and structure this data to ensure it remains efficient as we expand to tracking hundreds or thousands of stocks.

Why Our Original Data Format Was Inefficient and How to Fix It

Before optimizing the way we store stock market data, let’s look at how the original method worked and why it wasn’t scalable: every stock got its own JSON file, and a new file was created for each stock every day.

The Old Format: One File per Stock Per Day

Imagine you’re tracking 500 stocks (which is common for serious traders). If we saved one file per stock per day, running the script every calendar day, we would have:

  • 500 separate files per day
  • 15,000+ files per month
  • 180,000+ files per year
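The arithmetic behind these figures is simple multiplication. The sketch below reproduces it for calendar days and, as a comparison, for an assumed 252 trading days per year (a common approximation, not a figure from this project).

```python
STOCKS = 500


def files_created(stocks: int, days: int) -> int:
    """One file per stock per day."""
    return stocks * days


calendar_month = files_created(STOCKS, 30)   # 15000
calendar_year = files_created(STOCKS, 365)   # 182500
trading_year = files_created(STOCKS, 252)    # 126000 (assumed trading days)

print(calendar_month, calendar_year, trading_year)
```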

Example: Files in the “data/” Folder (March 15, 2025)

data/
│── AAPL_2025-03-15.json
│── TSLA_2025-03-15.json
│── GOOGL_2025-03-15.json
│── MSFT_2025-03-15.json
│── AMZN_2025-03-15.json
│── NVDA_2025-03-15.json
│── META_2025-03-15.json
│── ...

Example JSON for Each Stock

Each stock had its own file with a repeated "date" field:

AAPL_2025-03-15.json:

{
    "date": "2025-03-15",
    "ticker": "AAPL",
    "open": 175.2,
    "high": 178.9,
    "low": 174.5,
    "close": 177.6,
    "volume": 89000000
}

TSLA_2025-03-15.json:

{
    "date": "2025-03-15",
    "ticker": "TSLA",
    "open": 860.5,
    "high": 875.0,
    "low": 855.2,
    "close": 868.3,
    "volume": 65000000
}

The Problem

📂 File Bloat: Too many files, making it hard to manage.
🚀 Slow Lookups: If we need data for multiple stocks, we have to open and parse hundreds of files.
🔄 Redundant Metadata: The "date" field is repeated in every file, making storage inefficient.

The Better Format: A Nested Dictionary (One File Per Day)

Instead of creating separate files for every stock, we store all stock data for a given day in a single JSON file. This makes data access faster, cleaner, and more organized.

Example: New File Structure

data/
│── market_data_2025-03-15.json
│── market_data_2025-03-16.json
│── market_data_2025-03-17.json
│── ...

Example JSON Format (All Stocks in One File)

{
    "date": "2025-03-15",
    "stocks": {
        "AAPL": {
            "open": 175.2,
            "high": 178.9,
            "low": 174.5,
            "close": 177.6,
            "volume": 89000000
        },
        "TSLA": {
            "open": 860.5,
            "high": 875.0,
            "low": 855.2,
            "close": 868.3,
            "volume": 65000000
        },
        "GOOGL": {
            "open": 2805.0,
            "high": 2850.4,
            "low": 2788.2,
            "close": 2832.1,
            "volume": 1800000
        }
    }
}

Why This Format is Better

Only One JSON File Per Day → No more hundreds of files cluttering storage.
Fast Lookups → We can load all stock data at once, reducing file read time.
Easier to Expand → We can add more data fields like "market_cap" or "sector" later.
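To illustrate the lookup benefit, here is a sketch that reads one daily file and pulls the closing price for several tickers in a single pass. load_close_prices is a hypothetical helper, and the demo writes the sample data from this section to a temporary file so that it is self-contained.

```python
import json
import os
import tempfile


def load_close_prices(path, tickers):
    """Read one daily file and return {ticker: close} for the tickers it contains."""
    with open(path, "r", encoding="utf-8") as json_file:
        day = json.load(json_file)
    stocks = day["stocks"]
    return {ticker: stocks[ticker]["close"] for ticker in tickers if ticker in stocks}


sample_day = {
    "date": "2025-03-15",
    "stocks": {
        "AAPL": {"open": 175.2, "high": 178.9, "low": 174.5, "close": 177.6, "volume": 89000000},
        "TSLA": {"open": 860.5, "high": 875.0, "low": 855.2, "close": 868.3, "volume": 65000000},
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "market_data_2025-03-15.json")
    with open(sample_path, "w", encoding="utf-8") as json_file:
        json.dump(sample_day, json_file)
    # NVDA is not in the file, so it is skipped.
    closes = load_close_prices(sample_path, ["AAPL", "TSLA", "NVDA"])

print(closes)  # {'AAPL': 177.6, 'TSLA': 868.3}
```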

Updating the Code to Use This New Format

Now that we know why this method is better, let’s update our code to reflect this structure.

Modify the Data Fetching Function

Instead of saving individual stock data, we return a dictionary without the "date" field (since it’s already in the main structure).

import yfinance as yf
import json
from datetime import datetime

def fetch_yahoo_finance_data(ticker):
    """Fetches OHLCV data for a given stock ticker from Yahoo Finance."""
    try:
        stock = yf.Ticker(ticker)
        hist = stock.history(period="1d")

        if hist.empty:
            print(f"No data available for {ticker}.")
            return None

        latest_data = hist.iloc[-1]

        return {
            "open": float(latest_data["Open"]),
            "high": float(latest_data["High"]),
            "low": float(latest_data["Low"]),
            "close": float(latest_data["Close"]),
            "volume": int(latest_data["Volume"])
        }

    except Exception as e:
        print(f"Error fetching data for {ticker}: {str(e)}")
        return None

Fetch Multiple Tickers and Organize Data

This function fetches data for multiple stocks and returns it as one dictionary, with each stock stored under the "stocks" key. Writing that dictionary to a JSON file happens in a later step.

def fetch_multiple_tickers(tickers):
    """Fetches OHLCV data for multiple tickers and returns it as a nested dictionary."""
    market_data = {"date": datetime.now().strftime("%Y-%m-%d"), "stocks": {}}

    for ticker in tickers:
        stock_data = fetch_yahoo_finance_data(ticker)
        if stock_data:
            market_data["stocks"][ticker] = stock_data

    return market_data

Understanding market_data["stocks"][ticker] = stock_data

The assignment statement inside the loop:

market_data["stocks"][ticker] = stock_data

is storing data in a nested dictionary.

Breaking Down market_data

At the beginning of the function, market_data is initialized as:

market_data = {"date": datetime.now().strftime("%Y-%m-%d"), "stocks": {}}

This creates a dictionary with:

  • A "date" key → storing today’s date (the day the script runs, which can differ from the trading date of the data, e.g., on weekends).
  • A "stocks" key → initialized as an empty dictionary {}, where we will store stock data.

Example after initialization:

{
    "date": "2025-03-15",
    "stocks": {}
}

Understanding market_data["stocks"]

  • market_data["stocks"] refers to the nested dictionary inside market_data that will store stock information.
  • Initially, it is empty:
"stocks": {}

What Happens Inside the Loop?

for ticker in tickers:
    stock_data = fetch_yahoo_finance_data(ticker)
    if stock_data:
        market_data["stocks"][ticker] = stock_data
  1. The loop goes through each stock symbol (ticker) in tickers (e.g., "AAPL", "TSLA", "GOOGL").
  2. It fetches data for that ticker using fetch_yahoo_finance_data(ticker).
  3. If valid stock_data is returned, we store it inside market_data["stocks"] under its ticker.

Example:

If we run:

tickers = ["AAPL", "TSLA"]
market_data = fetch_multiple_tickers(tickers)

Here’s what happens:

First loop iteration (ticker = "AAPL"):

stock_data = { "open": 175.2, "high": 178.9, "low": 174.5, "close": 177.6, "volume": 89000000 }
  • This gets stored as:
market_data["stocks"]["AAPL"] = stock_data

Second loop iteration (ticker = "TSLA"):

stock_data = { "open": 860.5, "high": 875.0, "low": 855.2, "close": 868.3, "volume": 65000000 }
  • This gets stored as:
market_data["stocks"]["TSLA"] = stock_data

Now, market_data looks like this:

{
    "date": "2025-03-15",
    "stocks": {
        "AAPL": {
            "open": 175.2,
            "high": 178.9,
            "low": 174.5,
            "close": 177.6,
            "volume": 89000000
        },
        "TSLA": {
            "open": 860.5,
            "high": 875.0,
            "low": 855.2,
            "close": 868.3,
            "volume": 65000000
        }
    }
}

So What Does [ticker] Do?

When we write:

market_data["stocks"][ticker] = stock_data
  • ticker is the stock symbol (e.g., "AAPL", "TSLA").
  • market_data["stocks"] is a dictionary of stock data.
  • market_data["stocks"][ticker] creates a new key-value pair inside stocks, using the ticker as the key.

Effectively, this dynamically assigns stock data to its corresponding ticker symbol in the dictionary.

Think of market_data["stocks"] as a folder (dictionary) that holds multiple stock records.
Each ticker (like "AAPL", "TSLA") is a separate file (another dictionary inside).

Instead of creating multiple files for each stock, we organize everything under one structure, making it easier to store and retrieve.
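Reading values back out of the nested structure uses the same keys in the same order. The short sketch below uses the sample values from this section; the variable names are illustrative.

```python
market_data = {
    "date": "2025-03-15",
    "stocks": {
        "AAPL": {"open": 175.2, "high": 178.9, "low": 174.5, "close": 177.6, "volume": 89000000},
        "TSLA": {"open": 860.5, "high": 875.0, "low": 855.2, "close": 868.3, "volume": 65000000},
    },
}

# Chain the keys to reach a single value.
aapl_close = market_data["stocks"]["AAPL"]["close"]

# .get() returns None for a ticker that was skipped, instead of raising KeyError.
missing = market_data["stocks"].get("NVDA")

# .items() walks every ticker and its record.
volumes = {ticker: record["volume"] for ticker, record in market_data["stocks"].items()}

print(aapl_close, missing, volumes)
```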

Save Data in a Single JSON File

Instead of saving separate files, we save one file per day.

def save_to_json(filename, data):
    """Saves the nested stock data to a JSON file."""
    with open(filename, "w") as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Market data saved to {filename}")
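Note that open(filename, "w") fails with FileNotFoundError if the data/ folder does not exist, for example on a fresh checkout. A possible variant that creates the folder first is sketched below. save_to_json_safe is a hypothetical name, and the demo writes to a temporary directory rather than the project folder.

```python
import json
import tempfile
from pathlib import Path


def save_to_json_safe(filename, data):
    """Save data as JSON, creating the parent directory if needed."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=4)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    # The "data" subfolder does not exist yet; the function creates it.
    target = Path(tmp_dir) / "data" / "market_data_2025-03-15.json"
    saved_path = save_to_json_safe(target, {"date": "2025-03-15", "stocks": {}})
    round_trip = json.loads(saved_path.read_text(encoding="utf-8"))

print(round_trip)  # {'date': '2025-03-15', 'stocks': {}}
```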

Run the Code

Finally, we fetch data for multiple stocks and store it in one JSON file per day.

if __name__ == "__main__":
    tickers = ["AAPL", "TSLA", "GOOGL", "AMZN", "MSFT"]
    market_data = fetch_multiple_tickers(tickers)

    filename = f"data/market_data_{datetime.now().strftime('%Y-%m-%d')}.json"
    save_to_json(filename, market_data)
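Because each day maps to exactly one file, that file can double as a simple cache: if it already exists, the script can load it instead of calling Yahoo Finance again. The sketch below shows the idea under that assumption. load_or_fetch and fake_fetch are hypothetical, and fake_fetch stands in for a call such as fetch_multiple_tickers(tickers).

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return cached data from path if present, else call fetch() and save it."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf-8") as json_file:
            return json.load(json_file)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=4)
    return data


calls = []


def fake_fetch():
    calls.append(1)  # count how often the "network" is hit
    return {"date": "2025-03-15", "stocks": {}}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / "market_data_2025-03-15.json"
    first = load_or_fetch(cache_path, fake_fetch)   # fetches and saves
    second = load_or_fetch(cache_path, fake_fetch)  # served from the file

print(len(calls))  # 1
```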

Full code:

import yfinance as yf			# Yahoo Finance API Wrapper
import json						# Handles saving data as a JSON
import ipfshttpclient			# Uploads data to IPFS (InterPlanetary File System)
from datetime import datetime	# Gets the current date

class Main:

	def __init__(self):
		ticker_list = ["AAPL", "TSLA", "GOOGL", "AMZN", "MSFT"]

		market_data = self.fetch_multiple_yahoo_finance_tickers(ticker_list)

		filename = f"data/market_data_{datetime.now().strftime('%Y-%m-%d')}.json"
		self.save_to_json(filename, market_data)

	def fetch_yahoo_finance_ticker_data(self, ticker):
		"""Fetches OHLCV for a given stock ticker from Yahoo Finance"""

		try:
			stock = yf.Ticker(ticker)
			hist = stock.history(period="1d")

			# check if data exists
			if hist.empty:
				print(f"{ticker} has no history available")
				return None 
			else:
				latest_data = hist.iloc[-1]

			return {
				"open": float(latest_data["Open"]),
				"high": float(latest_data["High"]),
				"low":  float(latest_data["Low"]),
				"close": float(latest_data["Close"]),
				"volume": int(latest_data["Volume"])
			}

		except Exception as e: 
			print(f"Error fetching data for {ticker}: {str(e)}")
			return None

	def fetch_multiple_yahoo_finance_tickers(self, ticker_list):
		"""Fetches OHLCV for a passed list of tickers from Yahoo Finance"""

		# Makes the dictionary that will store the pulled market data
		market_data = {"date": datetime.now().strftime("%Y-%m-%d"), "stocks": {}}

		for ticker in ticker_list:
			try:
				stock_data = self.fetch_yahoo_finance_ticker_data(ticker)
				if stock_data:
					market_data["stocks"][ticker] = stock_data

			except Exception as e:
				print(f"Stock data was not found for {ticker}")

		return market_data

	def save_to_json(self, filename, data):
		"""Saves the market data to a JSON file"""
		with open(filename, "w") as json_file:
			json.dump(data, json_file, indent=4)
		print(f"Market data saved to {filename}")


if __name__ == "__main__":
	m = Main()

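This script:

  • Fetches the last available trading day’s OHLCV data for each ticker in the list.
  • Collects the results in one nested dictionary under the "stocks" key.
  • Saves that dictionary to a single JSON file for the day, printing a message for any ticker whose retrieval fails.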

Dynamically Handling Ticker Management

How Can We Retrieve All U.S. Ticker Symbols?

Instead of relying on APIs with potential rate limits or scraping websites, we can use the public FTP server where NASDAQ hosts daily-updated stock listing files. This is a free, reliable, and legal way to get all currently active ticker symbols without any API keys or costs.

Understanding NASDAQ’s FTP Data

NASDAQ maintains an FTP server that publicly provides ticker lists. The primary files of interest are:

Exchange       File Name
NASDAQ         nasdaqlisted.txt
NYSE & AMEX    otherlisted.txt

These files contain a list of active ticker symbols along with company names and other metadata.

Public NASDAQ FTP Server URL:

ftp://ftp.nasdaqtrader.com/SymbolDirectory/

Download & Process the NASDAQ Ticker List

To fetch the ticker list, we will download and parse these files with pandas, whose read_csv function can read directly from an FTP URL.

Install Required Libraries

If you haven’t installed pandas, install it first:

pip install pandas

Fetch and Process NASDAQ Data

import pandas as pd

NASDAQ_TICKER_URL = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt"
OTHER_TICKER_URL = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/otherlisted.txt"

def fetch_nasdaq_tickers():
    """Fetches all active U.S. stock tickers from NASDAQ's public FTP server."""
    try:
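        # Load NASDAQ tickers. keep_default_na=False stops pandas from
        # reading a symbol such as "NA" as a missing value; iloc[:-1]
        # drops the "File Creation Time" footer line at the end of the file.
        nasdaq_df = pd.read_csv(NASDAQ_TICKER_URL, sep="|", keep_default_na=False).iloc[:-1]
        nasdaq_tickers = nasdaq_df["Symbol"].tolist()

        # Load NYSE & AMEX tickers (this file ends with the same footer line)
        other_df = pd.read_csv(OTHER_TICKER_URL, sep="|", keep_default_na=False).iloc[:-1]
        other_tickers = other_df["ACT Symbol"].tolist()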

        # Combine all tickers
        all_tickers = nasdaq_tickers + other_tickers
        print(f"Successfully retrieved {len(all_tickers)} tickers.")
        return all_tickers

    except Exception as e:
        print(f"Error fetching tickers: {str(e)}")
        return None

# Example Usage
if __name__ == "__main__":
    tickers = fetch_nasdaq_tickers()
    if tickers:
        print(tickers[:10])  # Display the first 10 tickers

How This Works

  1. We download NASDAQ and NYSE/AMEX ticker files from NASDAQ’s public FTP server.
  2. Files are stored in a simple text format, separated by | (pipe symbol).
  3. Pandas reads the files and extracts only the ticker symbols from each dataset.
  4. We combine NASDAQ, NYSE, and AMEX tickers into one clean list.
  5. The function returns a full list of stock symbols, ensuring up-to-date coverage.
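The parsing step can also be sketched with only the standard library, which makes the file layout easier to see. The sample text below is made up and abbreviated (the real files have more columns), and parse_symbols is a hypothetical helper. It assumes each file ends with a "File Creation Time" footer line and marks test securities with Y in a "Test Issue" column.

```python
import csv
import io

# Made-up, abbreviated sample in the same pipe-delimited layout.
SAMPLE = (
    "Symbol|Security Name|Test Issue\n"
    "AAA|Example Corp A|N\n"
    "TESTX|Example Test Security|Y\n"
    "BBB|Example Corp B|N\n"
    "File Creation Time: 0315202518:01||\n"
)


def parse_symbols(text, symbol_column="Symbol"):
    """Return ticker symbols, skipping the footer line and test issues."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    symbols = []
    for row in reader:
        symbol = row[symbol_column]
        if symbol.startswith("File Creation Time"):
            continue  # footer line, not a ticker
        if row.get("Test Issue") == "Y":
            continue  # exchange test security
        symbols.append(symbol)
    return symbols


symbols = parse_symbols(SAMPLE)
print(symbols)  # ['AAA', 'BBB']
```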

Expected Output

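When you run the script, the output has the following shape. The count and the symbols shown here are illustrative; the actual values change daily and follow the order of the source files: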

Successfully retrieved 7200 tickers.
['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'NVDA', 'META', 'NFLX', 'AMD', 'PYPL']

You’ll now have every active stock ticker from NASDAQ, NYSE, and AMEX in one list.
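A list of several thousand symbols is too large to request in one burst when the rate limits are undocumented. One common approach is to process the list in fixed-size batches with a pause between them. The sketch below shows only the batching part; chunked and the batch size are assumptions, not values recommended by Yahoo Finance.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


sample_tickers = ["AAPL", "TSLA", "GOOGL", "AMZN", "MSFT"]
batches = list(chunked(sample_tickers, 2))
print(batches)  # [['AAPL', 'TSLA'], ['GOOGL', 'AMZN'], ['MSFT']]
```

Each batch could then be passed to fetch_multiple_tickers, with a time.sleep call between batches.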