Key Takeaways

Yahoo Finance, accessed through the yfinance Python library, will serve as the primary market data provider. yfinance is an unofficial, community-maintained wrapper around Yahoo's publicly reachable endpoints rather than an official API. It provides historical and near-real-time OHLCV data (open, high, low, close, volume) for stocks, ETFs, and cryptocurrencies. Because Yahoo publishes no official documentation on rate limits, reliable data retrieval requires optimization strategies such as caching, automated retries, and alternative storage solutions.

A key challenge is long-term data storage. Since we will collect and store vast amounts of data over time, this section introduces alternative storage options that are scalable, free, and fast while ensuring seamless integration with GitHub Actions for automated updates.

Phase 1: Setting Up the Environment

Setting Up the Project with a Virtual Environment

To keep dependencies isolated and avoid global installations, it is best practice to create a Python virtual environment (venv) for the project. This ensures all installed packages remain contained within the project directory, preventing conflicts with other Python projects on the system.

Create the Project Directory

Start by setting up a dedicated project folder to organize scripts, data, and configurations.

For macOS/Linux:

mkdir market_chrono && cd market_chrono
mkdir data  # Create a directory for storing JSON data files
touch fetch_data.py  # Create the script file

For Windows (Command Prompt):

mkdir market_chrono && cd market_chrono
REM Create a directory for storing JSON data files
mkdir data
REM Create an empty script file
type nul > fetch_data.py

For Windows (PowerShell):

New-Item -ItemType Directory -Path market_chrono
Set-Location market_chrono
New-Item -ItemType Directory -Path data
New-Item -ItemType File -Path fetch_data.py

Set Up a Virtual Environment

Create and activate a virtual environment within the project directory:

For macOS/Linux:

python3 -m venv venv  # Create a virtual environment
source venv/bin/activate  # Activate the virtual environment

For Windows (Command Prompt):

REM Create a virtual environment
python -m venv venv
REM Activate the virtual environment
venv\Scripts\activate

For Windows (PowerShell):

python -m venv venv
.\venv\Scripts\activate

When the virtual environment is activated, the terminal prompt may change to include (venv), indicating that all installed packages will now be contained within the project folder.
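
Because the prompt change is not guaranteed, it can help to ask Python itself whether it is running inside the environment. The following is a minimal sketch using only the standard library; the function name in_virtual_environment is a hypothetical helper, not part of any tool mentioned here.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(in_virtual_environment())
```

Running this with the environment activated should print True; running it with the system interpreter should print False.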

Install Required Dependencies

With the virtual environment activated, install the following required packages:

pip install yfinance requests ipfshttpclient

yfinance: Fetches market data from Yahoo Finance.
requests: Handles HTTP requests (useful for interacting with alternative APIs if needed).
ipfshttpclient: Allows uploading and managing files on IPFS (InterPlanetary File System).

To confirm installation:

pip list
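
As an alternative to scanning the pip list output by eye, the installed versions can be checked from Python. This is a small sketch built on the standard library's importlib.metadata; installed_version is a hypothetical helper name chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the three packages this project installs.
for name in ("yfinance", "requests", "ipfshttpclient"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```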

Fetch and Store Market Data

This section explains how to retrieve stock market data using the Yahoo Finance API and prepare it for storage. We are going to retrieve daily OHLCV (open, high, low, close, volume) data for tracked assets, then upload the JSON files to IPFS for decentralized storage.

What This Code Does

Connects to Yahoo Finance to get stock/crypto price data.
Gets the last trading day’s data (Open, High, Low, Close, Volume).
Formats the data into a structured JSON object.
Returns the data (or prints an error if something goes wrong).

Import Required Libraries

import yfinance as yf  # Yahoo Finance API wrapper
import json  # Handles saving data in JSON format
import ipfshttpclient  # Uploads data to IPFS (InterPlanetary File System)
from datetime import datetime  # Gets the current date

yfinance: Will be used to fetch market data from Yahoo Finance.
json: Converts the data into a structured format for storage.
ipfshttpclient: Uploads the JSON file to IPFS for decentralized storage.
datetime: Gets today’s date so we know when the data was fetched.

Define the Function to Fetch Data

def fetch_yahoo_finance_data(ticker):
    """Fetches OHLCV data for a given stock ticker from Yahoo Finance."""

This function takes one input → ticker (stock symbol like "AAPL" for Apple).
It will return the latest market data for that stock.

Connect to Yahoo Finance

    try:
        stock = yf.Ticker(ticker)
        hist = stock.history(period="1d")

yf.Ticker(ticker) → Creates a Yahoo Finance object for the given stock.
stock.history(period="1d") → Fetches the last available trading day’s data.

❗ Example: If you run fetch_yahoo_finance_data("AAPL") on March 15, 2025, the function will get Apple’s stock prices from March 14, 2025.

Check if Data Exists

        if hist.empty:
            print(f"No data available for {ticker}.")
            return None

hist.empty checks if the data is missing.
If the data is empty, print a message and return None.

🔹 Why?
Sometimes Yahoo Finance returns no rows for a symbol, for example when the ticker is misspelled, has been delisted, or has no recent trading history.

Extract the Latest Market Data

        latest_data = hist.iloc[-1]

.iloc[-1] → Gets the last row of the table, which contains the most recent trading day’s prices (not necessarily yesterday’s).

🔹 Example Output from Yahoo Finance (hist table):

Date	Open	High	Low	Close	Volume
2025-03-14	175.2	178.9	174.5	177.6	89,000,000

The function extracts the last row (2025-03-14) and stores it in latest_data.

Format the Extracted Data for Storage

Once we have the latest trading day’s data, it needs to be structured in a dictionary format so it can be stored and used later.

        return {
            "open": float(latest_data["Open"]),
            "high": float(latest_data["High"]),
            "low": float(latest_data["Low"]),
            "close": float(latest_data["Close"]),
            "volume": int(latest_data["Volume"])
        }

The extracted values are converted into appropriate data types:
- float(latest_data["Open"]) → Ensures stock prices remain decimal values.
- int(latest_data["Volume"]) → Volume is stored as an integer since it represents whole shares traded.
This structured dictionary makes it easier to store and retrieve data efficiently.

🔹 Example Output of the Dictionary Returned by the Function:

{
    "open": 175.2,
    "high": 178.9,
    "low": 174.5,
    "close": 177.6,
    "volume": 89000000
}
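
One case the plain float/int conversion does not cover is a missing value: int() raises on NaN, and json.dumps would otherwise write the non-standard token NaN. The sketch below is one possible guard, not the tutorial's own function; to_record is a hypothetical name, and a plain dict stands in for the pandas row so the example runs without a network call.

```python
import json
import math

FIELDS = ("Open", "High", "Low", "Close", "Volume")


def to_record(row):
    """Build a JSON-ready dict from a row with OHLCV labels, or None if any value is NaN."""
    values = {key: float(row[key]) for key in FIELDS}
    if any(math.isnan(value) for value in values.values()):
        return None
    return {
        "open": values["Open"],
        "high": values["High"],
        "low": values["Low"],
        "close": values["Close"],
        "volume": int(values["Volume"]),  # whole shares, stored as an integer
    }


sample_row = {"Open": 175.2, "High": 178.9, "Low": 174.5, "Close": 177.6, "Volume": 89000000.0}
print(json.dumps(to_record(sample_row)))
```

Returning None for incomplete rows matches how the fetch function already signals "no usable data", so callers can skip the ticker in the same way.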

Why Don’t We Just Return `latest_data`?

At first glance, returning latest_data directly might seem like a good idea because it already holds the latest stock market data. However, there are several key reasons why we process and format the data instead of returning it as-is:

Data Type & Structure Issues
- latest_data is a Pandas Series object, which Python's json module cannot serialize directly.
- JSON can only represent objects (dictionaries), arrays (lists), strings, numbers, booleans, and null, so passing latest_data straight to json.dump would raise a TypeError.
Ensuring Data Consistency
- A Pandas Series stores values in NumPy-specific types (e.g., numpy.float64, numpy.int64). numpy.int64 in particular is rejected by Python's json module.
- To prevent serialization errors, we convert them to Python-native types (float for prices and int for volume).
Removing Unnecessary Metadata
- latest_data contains extra information such as index labels and other internal Pandas attributes that we don’t need.
- By returning only the key market metrics (open, high, low, close, volume), we streamline storage and retrieval.
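
If NumPy values do reach json.dumps, one fallback is a default hook that unwraps them. The sketch below relies on the fact that NumPy scalars expose an .item() method returning the native Python value; to_native and FakeInt64 are hypothetical names, and FakeInt64 is only a stand-in so the example runs without NumPy installed.

```python
import json


def to_native(value):
    """json.dumps `default` hook: unwrap NumPy-style scalars through .item()."""
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


class FakeInt64:
    """Stand-in for numpy.int64 (an assumption for this offline demo)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


print(json.dumps({"volume": FakeInt64(89000000)}, default=to_native))
```

Converting explicitly with float() and int(), as the function does, is still the simpler choice when the set of fields is small and fixed.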

What Type is `latest_data`?

latest_data is a Pandas Series object. A Series in Pandas is similar to a dictionary, where each value is labeled by an index.

To check the type of latest_data, we can add the following line to our function:

print(type(latest_data))

If we run the function with this, it will print:

<class 'pandas.core.series.Series'>

This confirms that latest_data is a Pandas Series, meaning it needs to be converted before being stored in a JSON file.

How to Inspect `latest_data`?

If we want to see what latest_data contains, we can print it:

print(latest_data)

🔹 Example Output of latest_data:

Open      175.2
High      178.9
Low       174.5
Close     177.6
Volume    89000000
Name: 2025-03-14 00:00:00, dtype: object

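The Name line (2025-03-14 00:00:00) is the row's index label, which is the trading date.
The labels (Open, High, Low, Close, Volume) identify each value in the Series. Depending on the yfinance version, the row may also include fields such as Dividends and Stock Splits.
The dtype line reports the Series' NumPy data type. Depending on the pandas version and the columns returned, it may read float64 rather than object; either way, the values are NumPy types and should be converted to native Python types before being written to JSON.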

Now to Handle Errors Gracefully

Stock data retrieval might fail for several reasons, such as network issues, invalid tickers, or Yahoo Finance downtime. To prevent the script from crashing, the body of the function sits inside the try block opened earlier, and a matching except clause closes it:

    except Exception as e:
        print(f"Error fetching data for {ticker}: {e}")
        return None

If something goes wrong, the function:
- Prints an error message to indicate which stock caused the failure.
- Returns None, so the script can skip failed stocks instead of stopping entirely.

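🔹 Example Error Output if a Request Fails (the text after the colon is whatever message the raised exception carries, so this one is illustrative):

Error fetching data for XYZ123: No data available for the given ticker.

Note that an invalid ticker typically produces an empty table rather than an exception, in which case the earlier hist.empty check prints "No data available for XYZ123." instead.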
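
Printing and returning None gives up after a single attempt. For transient failures such as network hiccups, a retry with exponential backoff is a common complement. The following is a minimal sketch, not part of yfinance; with_retries, its parameters, and the default delays are hypothetical choices made for illustration.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # real code should catch narrower exceptions
            last_error = error
            if attempt < attempts - 1:
                # Waits base_delay, then 2x, then 4x, and so on.
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

A call might look like with_retries(lambda: yf.Ticker("AAPL").history(period="1d")). Retrying only helps when the failure is raised as an exception; an empty table still has to be handled by the hist.empty check.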

Now that we have structured our data properly and included error handling, let’s move on to optimizing how we store and manage stock data efficiently.

Optimizing Data Storage for Scalability

At this point, we have successfully retrieved stock data, but how we store it determines long-term efficiency. Initially, our approach was to save each stock’s data in a separate file, leading to excessive storage use and slow data retrieval. Instead, we will consolidate all stock data for a given day into a single JSON file, making the system more scalable and organized. Let’s refine how we format and structure this data to ensure it remains efficient as we expand to tracking hundreds or thousands of stocks.

Why Our Original Data Format Was Inefficient and How to Fix It

Before we jump into optimizing the way we store stock market data, let’s first understand how our original method worked and why it wasn’t scalable. Initially, the approach was to store each stock’s data in a separate JSON file, meaning for every single stock, a new file was created each day.

The Old Format: One File per Stock Per Day

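Imagine you’re tracking 500 stocks (which is common for serious traders). If we saved one file per stock and ran the script every calendar day, we would have:

500 separate files per day
15,000 files per 30-day month
180,000+ files per year (closer to 126,000 if the script runs only on the roughly 252 U.S. trading days)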
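
The file counts are plain multiplication, which the short sketch below reproduces. The figure of 252 trading sessions per year is an approximation commonly used for U.S. markets and is an assumption here, as is the helper name files_created.

```python
def files_created(tickers, days):
    """One file per ticker per day."""
    return tickers * days


TICKERS = 500
for label, days in (("30-day month", 30), ("365-day year", 365), ("252-session year", 252)):
    print(f"{label}: {files_created(TICKERS, days):,} files")
```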

Example: Files in the “data/” Folder (March 15, 2025)

data/
│── AAPL_2025-03-15.json
│── TSLA_2025-03-15.json
│── GOOGL_2025-03-15.json
│── MSFT_2025-03-15.json
│── AMZN_2025-03-15.json
│── NVDA_2025-03-15.json
│── META_2025-03-15.json
│── ...

Example JSON for Each Stock

Each stock had its own file with a repeated "date" field:

{
    "date": "2025-03-15",
    "ticker": "AAPL",
    "open": 175.2,
    "high": 178.9,
    "low": 174.5,
    "close": 177.6,
    "volume": 89000000
}

{
    "date": "2025-03-15",
    "ticker": "TSLA",
    "open": 860.5,
    "high": 875.0,
    "low": 855.2,
    "close": 868.3,
    "volume": 65000000
}

The Problem

📂 File Bloat: Too many files, making it hard to manage.
🚀 Slow Lookups: If we need data for multiple stocks, we have to open and parse hundreds of files.
🔄 Redundant Metadata: The "date" field is repeated in every file, making storage inefficient.

The Better Format: A Nested Dictionary (One File Per Day)

Instead of creating separate files for every stock, we store all stock data for a given day in a single JSON file. This makes data access faster, cleaner, and more organized.

Example: New File Structure

data/
│── market_data_2025-03-15.json
│── market_data_2025-03-16.json
│── market_data_2025-03-17.json
│── ...

Example JSON Format (All Stocks in One File)

{
    "date": "2025-03-15",
    "stocks": {
        "AAPL": {
            "open": 175.2,
            "high": 178.9,
            "low": 174.5,
            "close": 177.6,
            "volume": 89000000
        },
        "TSLA": {
            "open": 860.5,
            "high": 875.0,
            "low": 855.2,
            "close": 868.3,
            "volume": 65000000
        },
        "GOOGL": {
            "open": 2805.0,
            "high": 2850.4,
            "low": 2788.2,
            "close": 2832.1,
            "volume": 1800000
        }
    }
}
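
If files in the old one-per-stock layout already exist, their records can be folded into the nested layout. This is a sketch under the assumption that each old record is a flat dictionary with "date" and "ticker" keys, as in the earlier example; consolidate is a hypothetical helper name.

```python
def consolidate(records):
    """Group flat per-stock records into one nested dictionary per date."""
    by_date = {}
    for record in records:
        day = by_date.setdefault(record["date"], {"date": record["date"], "stocks": {}})
        # Everything except the date and ticker becomes the per-stock payload.
        fields = {key: value for key, value in record.items() if key not in ("date", "ticker")}
        day["stocks"][record["ticker"]] = fields
    return by_date


old_records = [
    {"date": "2025-03-15", "ticker": "AAPL", "open": 175.2, "high": 178.9,
     "low": 174.5, "close": 177.6, "volume": 89000000},
    {"date": "2025-03-15", "ticker": "TSLA", "open": 860.5, "high": 875.0,
     "low": 855.2, "close": 868.3, "volume": 65000000},
]
print(list(consolidate(old_records)["2025-03-15"]["stocks"]))
```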

Why This Format is Better

✅ Only One JSON File Per Day → No more hundreds of files cluttering storage.
✅ Fast Lookups → We can load all stock data at once, reducing file read time.
✅ Easier to Expand → We can add more data fields like "market_cap" or "sector" later.
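
To make the lookup benefit concrete, here is a minimal sketch of reading values back out of one day's file. The names load_day and closing_prices are hypothetical helpers; the inline dictionary mirrors the shape of the example file so the demo runs without touching disk.

```python
import json


def load_day(path):
    """Load one day's consolidated market data file into a dictionary."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def closing_prices(day, tickers):
    """Return {ticker: close} for the requested tickers that exist in the file."""
    stocks = day.get("stocks", {})
    return {ticker: stocks[ticker]["close"] for ticker in tickers if ticker in stocks}


# Same structure that load_day("data/market_data_2025-03-15.json") would return.
day = {"date": "2025-03-15", "stocks": {"AAPL": {"close": 177.6}, "TSLA": {"close": 868.3}}}
print(closing_prices(day, ["AAPL", "NVDA"]))
```

One file read gives access to every ticker for that date, and tickers missing from the file are simply left out of the result.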

Updating the Code to Use This New Format

Now that we know why this method is better, let’s update our code to reflect this structure.

Modify the Data Fetching Function

Instead of saving individual stock data, we return a dictionary without the "date" field (since it’s already in the main structure).

import yfinance as yf
import json
from datetime import datetime

def fetch_yahoo_finance_data(ticker):
    """Fetches OHLCV data for a given stock ticker from Yahoo Finance."""
    try:
        stock = yf.Ticker(ticker)
        hist = stock.history(period="1d")

        if hist.empty:
            print(f"No data available for {ticker}.")
            return None

        latest_data = hist.iloc[-1]

        return {
            "open": float(latest_data["Open"]),
            "high": float(latest_data["High"]),
            "low": float(latest_data["Low"]),
            "close": float(latest_data["Close"]),
            "volume": int(latest_data["Volume"])
        }

    except Exception as e:
        print(f"Error fetching data for {ticker}: {str(e)}")
        return None

Fetch Multiple Tickers and Organize Data

This function fetches data for multiple stocks and returns it as a single dictionary, with each ticker's data stored under the "stocks" key. That dictionary is what we later write to one JSON file.

def fetch_multiple_tickers(tickers):
    """Fetches OHLCV data for multiple tickers and stores it in a nested JSON format."""
    market_data = {"date": datetime.now().strftime("%Y-%m-%d"), "stocks": {}}

    for ticker in tickers:
        stock_data = fetch_yahoo_finance_data(ticker)
        if stock_data:
            market_data["stocks"][ticker] = stock_data

    return market_data

Understanding `market_data["stocks"][ticker] = stock_data`

The assignment statement inside the loop:

market_data["stocks"][ticker] = stock_data

is storing data in a nested dictionary.

Breaking Down `market_data`

At the beginning of the function, market_data is initialized as:

market_data = {"date": datetime.now().strftime("%Y-%m-%d"), "stocks": {}}

This creates a dictionary with:

A "date" key → storing today’s date.
A "stocks" key → initialized as an empty dictionary {}, where we will store stock data.

Example after initialization:

{
    "date": "2025-03-15",
    "stocks": {}
}

Understanding `market_data["stocks"]`

market_data["stocks"] refers to the nested dictionary inside market_data that will store stock information.
Initially, it is empty:

"stocks": {}

What Happens Inside the Loop?

for ticker in tickers:
    stock_data = fetch_yahoo_finance_data(ticker)
    if stock_data:
        market_data["stocks"][ticker] = stock_data

The loop goes through each stock symbol (ticker) in tickers (e.g., "AAPL", "TSLA", "GOOGL").
It fetches data for that ticker using fetch_yahoo_finance_data(ticker).
If valid stock_data is returned, we store it inside market_data["stocks"] under its ticker.

Example:

If we run:

tickers = ["AAPL", "TSLA"]
market_data = fetch_multiple_tickers(tickers)

Here’s what happens.

First loop iteration (ticker = "AAPL"):

stock_data = { "open": 175.2, "high": 178.9, "low": 174.5, "close": 177.6, "volume": 89000000 }

This gets stored as:

market_data["stocks"]["AAPL"] = stock_data

Second loop iteration (ticker = "TSLA"):

stock_data = { "open": 860.5, "high": 875.0, "low": 855.2, "close": 868.3, "volume": 65000000 }

This gets stored as:

market_data["stocks"]["TSLA"] = stock_data

Now, market_data looks like this:

{
    "date": "2025-03-15",
    "stocks": {
        "AAPL": {
            "open": 175.2,
            "high": 178.9,
            "low": 174.5,
            "close": 177.6,
            "volume": 89000000
        },
        "TSLA": {
            "open": 860.5,
            "high": 875.0,
            "low": 855.2,
            "close": 868.3,
            "volume": 65000000
        }
    }
}

So What Does `[ticker]` Do?

When we write:

market_data["stocks"][ticker] = stock_data

ticker is the stock symbol (e.g., "AAPL", "TSLA").
market_data["stocks"] is a dictionary of stock data.
market_data["stocks"][ticker] = stock_data creates a new key-value pair inside stocks, using the ticker as the key (or overwrites the entry if that ticker is already present).

Effectively, this dynamically assigns stock data to its corresponding ticker symbol in the dictionary.

Think of market_data["stocks"] as a folder (dictionary) that holds multiple stock records.
Each ticker (like "AAPL", "TSLA") is a separate file (another dictionary inside).

Instead of creating multiple files for each stock, we organize everything under one structure, making it easier to store and retrieve.
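
The loop can be tried without a network connection by passing in the fetch function. The sketch below is a hypothetical variation for experimenting, not the tutorial's fetch_multiple_tickers itself: collect takes the fetcher and the date as parameters, and fake_fetch returns canned values in place of a Yahoo Finance call.

```python
def collect(tickers, fetch, date):
    """Build the nested structure, skipping tickers whose fetch returns None."""
    market_data = {"date": date, "stocks": {}}
    for ticker in tickers:
        stock_data = fetch(ticker)
        if stock_data:
            market_data["stocks"][ticker] = stock_data
    return market_data


def fake_fetch(ticker):
    """Offline stand-in for the real fetch function; unknown tickers give None."""
    canned = {
        "AAPL": {"open": 175.2, "high": 178.9, "low": 174.5, "close": 177.6, "volume": 89000000},
        "TSLA": {"open": 860.5, "high": 875.0, "low": 855.2, "close": 868.3, "volume": 65000000},
    }
    return canned.get(ticker)


result = collect(["AAPL", "XYZ123", "TSLA"], fake_fetch, "2025-03-15")
print(list(result["stocks"]))
```

The failed ticker never becomes a key, so the saved file only ever contains tickers with usable data.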

Save Data in a Single JSON File

Instead of saving separate files, we save one file per day.

def save_to_json(filename, data):
    """Saves the nested stock data to a JSON file."""
    with open(filename, "w") as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Market data saved to {filename}")
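
The function above assumes the data folder already exists and writes the file in place. A slightly more defensive variant is sketched here: it creates the folder when needed and writes to a temporary file first, so an interrupted run does not leave a half-written JSON file. The name save_to_json_safely is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_to_json_safely(filename, data):
    """Write `data` as JSON, creating the parent folder and swapping the file in at the end."""
    path = Path(filename)
    # Create data/ (or any parent folder) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary sibling, then rename it over the target.
    temp_path = path.with_name(path.name + ".tmp")
    with open(temp_path, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=4)
    temp_path.replace(path)
    return path


# Demonstration in a throwaway folder so nothing in the project is touched.
with tempfile.TemporaryDirectory() as folder:
    saved = save_to_json_safely(
        Path(folder) / "data" / "market_data_2025-03-15.json",
        {"date": "2025-03-15", "stocks": {}},
    )
    print(saved.name)
```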

Run the Code

Finally, we fetch data for multiple stocks and store it in one JSON file per day.

if __name__ == "__main__":
    tickers = ["AAPL", "TSLA", "GOOGL", "AMZN", "MSFT"]
    market_data = fetch_multiple_tickers(tickers)

    filename = f"data/market_data_{datetime.now().strftime('%Y-%m-%d')}.json"
    save_to_json(filename, market_data)
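
Since each run targets one dated file, that file can double as a simple cache: if it already exists, the script can reuse it instead of querying Yahoo Finance again. The following is a minimal sketch of that idea; load_or_fetch is a hypothetical helper, and fake_fetch stands in for the real fetch so the demo runs offline.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the data stored at `path` if present; otherwise call fetch() and save the result."""
    path = Path(path)
    if path.exists():
        with open(path, "r", encoding="utf-8") as json_file:
            return json.load(json_file)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=4)
    return data


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "market_data_2025-03-15.json"
    calls = []

    def fake_fetch():
        calls.append("fetched")
        return {"date": "2025-03-15", "stocks": {}}

    load_or_fetch(target, fake_fetch)
    load_or_fetch(target, fake_fetch)  # served from the file, no second fetch
    print(len(calls))
```

In the real script the second argument could be lambda: fetch_multiple_tickers(tickers). The trade-off is that a file written before the market closes would be reused for the rest of the day.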

Full code:

import yfinance as yf          # Yahoo Finance API wrapper
import json                    # Handles saving data as JSON
import ipfshttpclient          # Uploads data to IPFS (InterPlanetary File System)
from datetime import datetime  # Gets the current date

class Main:

    def __init__(self):
        ticker_list = ["AAPL", "TSLA", "GOOGL", "AMZN", "MSFT"]

        market_data = self.fetch_multiple_yahoo_finance_tickers(ticker_list)

        filename = f"data/market_data_{datetime.now().strftime('%Y-%m-%d')}.json"
        self.save_to_json(filename, market_data)

    def fetch_yahoo_finance_ticker_data(self, ticker):
        """Fetches OHLCV for a given stock ticker from Yahoo Finance"""

        try:
            stock = yf.Ticker(ticker)
            hist = stock.history(period="1d")

            # Check if data exists
            if hist.empty:
                print(f"{ticker} has no history available")
                return None

            # The last row holds the most recent trading day
            latest_data = hist.iloc[-1]

            return {
                "open": float(latest_data["Open"]),
                "high": float(latest_data["High"]),
                "low": float(latest_data["Low"]),
                "close": float(latest_data["Close"]),
                "volume": int(latest_data["Volume"])
            }

        except Exception as e:
            print(f"Error fetching data for {ticker}: {e}")
            return None

    def fetch_multiple_yahoo_finance_tickers(self, ticker_list):
        """Fetches OHLCV for a passed list of tickers from Yahoo Finance"""

        # Makes the dictionary that will store the pulled market data
        market_data = {"date": datetime.now().strftime("%Y-%m-%d"), "stocks": {}}

        for ticker in ticker_list:
            try:
                stock_data = self.fetch_yahoo_finance_ticker_data(ticker)
                if stock_data:
                    market_data["stocks"][ticker] = stock_data

            except Exception as e:
                print(f"Stock data was not found for {ticker}: {e}")

        return market_data

    def save_to_json(self, filename, data):
        """Saves the market data to a JSON file"""
        with open(filename, "w") as json_file:
            json.dump(data, json_file, indent=4)
        print(f"Market data saved to {filename}")


if __name__ == "__main__":
    m = Main()

This script:

Fetches the last available trading day’s OHLCV data for each ticker in the list.
Collects the results into one nested dictionary and saves it as a dated JSON file.
Prints an error and skips the ticker if retrieval fails.

Dynamically Handling Ticker Management

How Can We Retrieve All U.S. Ticker Symbols?

Instead of relying on APIs with potential rate limits or scraping websites, NASDAQ provides a public FTP server where it hosts daily-updated stock listing files. This is a free, reliable, and legal way to get all currently active ticker symbols without any API keys or costs.

Understanding NASDAQ’s FTP Data

NASDAQ maintains an FTP server that publicly provides ticker lists. The primary files of interest are:

Exchange	File Name
NASDAQ	`nasdaqlisted.txt`
NYSE & AMEX	`otherlisted.txt`

These files contain a list of active ticker symbols along with company names and other metadata.

Public NASDAQ FTP Server URL:

ftp://ftp.nasdaqtrader.com/SymbolDirectory/

Download & Process the NASDAQ Ticker List

To fetch the ticker list, we will download and parse these files with pandas, which can read the pipe-delimited text directly from the FTP URLs.

Install Required Libraries

If you haven’t installed pandas, install it first:

pip install pandas

Fetch and Process NASDAQ Data

import pandas as pd

NASDAQ_TICKER_URL = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt"
OTHER_TICKER_URL = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/otherlisted.txt"

def fetch_nasdaq_tickers():
    """Fetches all active U.S. stock tickers from NASDAQ's public FTP server."""
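    try:
        # Load NASDAQ tickers; keep_default_na=False stops pandas from
        # reading a symbol such as "NA" as a missing value
        nasdaq_df = pd.read_csv(NASDAQ_TICKER_URL, sep="|", keep_default_na=False)
        nasdaq_tickers = nasdaq_df["Symbol"].tolist()

        # Load NYSE & AMEX tickers
        other_df = pd.read_csv(OTHER_TICKER_URL, sep="|", keep_default_na=False)
        other_tickers = other_df["ACT Symbol"].tolist()

        # Combine all tickers, dropping blanks and the trailing
        # "File Creation Time" footer row that each file ends with
        all_tickers = [
            symbol for symbol in nasdaq_tickers + other_tickers
            if symbol and not str(symbol).startswith("File Creation Time")
        ]
        print(f"Successfully retrieved {len(all_tickers)} tickers.")
        return all_tickers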

    except Exception as e:
        print(f"Error fetching tickers: {str(e)}")
        return None

# Example Usage
if __name__ == "__main__":
    tickers = fetch_nasdaq_tickers()
    if tickers:
        print(tickers[:10])  # Display the first 10 tickers

How This Works

We download NASDAQ and NYSE/AMEX ticker files from NASDAQ’s public FTP server.
Files are stored in a simple text format, separated by | (pipe symbol).
Pandas reads the files and extracts only the ticker symbols from each dataset.
We combine NASDAQ, NYSE, and AMEX tickers into one clean list.
The function returns a full list of stock symbols, ensuring up-to-date coverage.
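
The parsing step can also be illustrated offline with the standard csv module. In the sketch below, the sample text is made up and shortened (it is not real file content), and the "Test Issue" column and "File Creation Time" footer are assumptions about the file layout that should be checked against the downloaded files; parse_symbols is a hypothetical helper name.

```python
import csv
import io

# Made-up, abbreviated sample in the pipe-delimited layout described above.
SAMPLE = """Symbol|Security Name|Test Issue
AAPL|Apple Inc. - Common Stock|N
MSFT|Microsoft Corporation - Common Stock|N
ZZZZT|Hypothetical Test Security|Y
File Creation Time: 0315202518:00||
"""


def parse_symbols(text, symbol_column="Symbol"):
    """Extract ticker symbols, skipping blanks, test issues, and the footer row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    symbols = []
    for row in reader:
        symbol = (row.get(symbol_column) or "").strip()
        if not symbol or symbol.startswith("File Creation Time"):
            continue
        if (row.get("Test Issue") or "").strip() == "Y":
            continue
        symbols.append(symbol)
    return symbols


print(parse_symbols(SAMPLE))
```

For the second file, the same function would be called with symbol_column="ACT Symbol".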

Expected Output

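When you run the script, it should print output in the following form. The count and symbols shown here are illustrative: the real count changes as listings change, and the first ten symbols follow the order of the files rather than company size.

Successfully retrieved 7200 tickers.
['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'NVDA', 'META', 'NFLX', 'AMD', 'PYPL']

You’ll now have the symbols listed on NASDAQ, NYSE, and AMEX in one list. The files also cover ETFs and other listed securities, so filter the list if you only want common stocks.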
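
With several thousand symbols and no documented rate limits, it is usually safer to work through the list in batches rather than in one uninterrupted loop. The following is a minimal sketch of a batching helper; chunked is a hypothetical name, and any batch size or pause between batches is an arbitrary choice to tune by experiment.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


for batch in chunked(["AAPL", "MSFT", "GOOGL", "AMZN", "TSLA"], 2):
    print(batch)
```

Each batch could be passed to fetch_multiple_tickers, with a short time.sleep between batches.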
