This repository hosts the backend for the Capital Markets Loaders service. It is designed to handle the extraction, transformation, and loading (ETL) of financial data from various sources.
The Capital Markets Loaders Service is responsible for:
- Extracting market data from Yahoo Finance for equities, bonds, real estate, and commodities.
- Extracting cryptocurrency data from Binance API.
- Extracting macroeconomic data from the FRED API.
- Scraping and processing financial news from Yahoo News.
- Scraping and processing Reddit submissions from financial subreddits using PRAW (Python Reddit API Wrapper).
- Extracting stablecoin market cap data from CoinGecko.
- Generating portfolio performance (emulation).
- Transforming and loading the extracted data into MongoDB for further analysis.
MongoDB stands out as an ideal database solution for this Capital Markets Loaders service due to its exceptional ability to handle diverse data types within a single database platform:
- Market Data (Time Series Collections): The service retrieves market data for various asset classes (equities, bonds, real estate, commodities) from Yahoo Finance. MongoDB's Time Series collections are perfectly suited for this data, offering optimized storage, efficient querying, and automatic data expiration for historical market information (see the sketch below).
- Cryptocurrency Data (Time Series Collections): Real-time cryptocurrency data from the Binance API is stored in MongoDB Time Series collections, providing efficient storage and querying capabilities for high-frequency crypto market data.
- Financial News (Unstructured Data & Vector Search): MongoDB excels at storing and querying unstructured data like financial news articles. Additionally, the document model seamlessly accommodates article embeddings generated by the service, enabling powerful Vector Search capabilities for semantic similarity searches and AI-driven insights.
- Social Media Sentiment (Reddit Data & Vector Search): Reddit submissions from financial subreddits are stored with their embeddings and sentiment scores, leveraging MongoDB's document model and Vector Search capabilities to analyze social sentiment trends.
- Macroeconomic Indicators: Data from the FRED API is efficiently stored in standard MongoDB collections, allowing for flexible querying and aggregation of economic indicators that can be correlated with market movements.
- Stablecoin Market Caps: CoinGecko stablecoin market cap data is stored in MongoDB collections, providing insights into the stablecoin ecosystem and liquidity metrics.
- Portfolio Performance Data: The emulated portfolio performance data fits naturally into MongoDB's document model, enabling efficient storage and retrieval of complex investment performance metrics.
This versatility eliminates the need for multiple specialized databases, reducing architectural complexity while providing performance-optimized storage for each data type—a significant advantage for financial applications that process and analyze diverse datasets.
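As a minimal sketch of the Time Series setup, the snippet below creates a market data collection with `pymongo`. The collection name comes from this README; the `timestamp` time field, `symbol` meta field, and granularity are illustrative assumptions rather than the loaders' actual schema.

```python
import os

from pymongo import MongoClient

# Connect using the URI from the .env file described in the configuration section below.
client = MongoClient(os.environ["MONGODB_URI"])
db = client["agentic_capital_markets"]

# Create a Time Series collection for market data.
# Field names here are assumptions; adjust them to match the loaders' documents.
db.create_collection(
    "yfinanceMarketData",
    timeseries={
        "timeField": "timestamp",  # when the data point was observed
        "metaField": "symbol",     # which asset the data point belongs to
        "granularity": "hours",
    },
)
```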
- Easy: MongoDB's document model naturally fits with object-oriented programming, utilizing BSON documents that closely resemble JSON. This design simplifies the management of complex data structures such as user accounts, allowing developers to build features like account creation, retrieval, and updates with greater ease.
- Fast: Following the principle that "data that is accessed together should be stored together," MongoDB enhances query performance. This approach ensures that related data, like user and account information, can be quickly retrieved, optimizing the speed of operations such as account look-ups or status checks, which is crucial in services demanding real-time access to operational data.
- Flexible: MongoDB's schema flexibility allows account models to evolve with changing business requirements. This adaptability lets financial services update account structures or add features without expensive and disruptive schema migrations, avoiding the costly downtime often associated with structural changes.
- Versatile: The document model in MongoDB effectively handles a wide variety of data types, such as strings, numbers, booleans, arrays, objects, and even vectors. This versatility empowers applications to manage diverse account-related data, facilitating comprehensive solutions that integrate user, account, and transactional data seamlessly.
- Time Series collections (More info): For storing market data in a time series format.
- Atlas Vector Search (More info): For enabling vector search on financial news data.
- MongoDB Atlas for the database.
- FastAPI for the backend framework.
- Poetry for dependency management.
- Uvicorn for ASGI server.
- Docker for containerization.
- yfinance for extracting market data from Yahoo Finance.
- pyfredapi for extracting macroeconomic data from the FRED API.
- requests for making HTTP requests to Binance and CoinGecko APIs.
- praw for accessing Reddit's API to scrape financial subreddit submissions.
- pandas for data manipulation.
- scheduler for job scheduling.
- transformers for NLP tasks.
- FinBERT for sentiment score calculation.
- voyage-finance-2 for generating article embeddings.
- Yahoo Finance Market Data ETL: Extracts, transforms, and loads market data for various asset types (equities, bonds, real estate, commodities) using the `yfinance` Python package.
- Binance API Crypto Data ETL: Extracts, transforms, and loads cryptocurrency market data for major crypto assets using direct HTTP requests to Binance API endpoints.
- FRED API Macroeconomic Data ETL: Extracts, transforms, and loads macroeconomic data using the `pyfredapi` Python package.
- Financial News Processing: Scrapes financial news from Yahoo News, generates embeddings using the `voyage-finance-2` model from Voyage AI, and calculates sentiment scores using FinBERT, a pre-trained NLP model for analyzing the sentiment of financial text.
- Reddit Submissions Processing: Scrapes submissions from financial subreddits using PRAW, generates embeddings using `voyage-finance-2`, and calculates sentiment scores using FinBERT.
- CoinGecko Stablecoin Market Caps: Extracts daily stablecoin market cap data using direct HTTP requests to CoinGecko API endpoints.
- Portfolio Performance Generation: Generates simulated portfolio performance data based on asset allocations and market data.
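To make the market data ETL concrete, here is a simplified sketch of pulling daily prices with `yfinance` and writing them into the `yfinanceMarketData` collection. It is not the service's actual loader; the ticker list, lookback period, and document field names are assumptions.

```python
import os

import yfinance as yf
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["agentic_capital_markets"]["yfinanceMarketData"]

# Hypothetical subset of the portfolio's tickers.
for symbol in ["QQQ", "HYG", "VGIT"]:
    # Download recent daily OHLCV data from Yahoo Finance.
    history = yf.Ticker(symbol).history(period="5d", interval="1d")

    documents = [
        {
            "timestamp": index.to_pydatetime(),  # time field of the Time Series collection
            "symbol": symbol,                    # meta field
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
        }
        for index, row in history.iterrows()
    ]
    if documents:
        collection.insert_many(documents)
```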
The financial news sentiment analysis follows a sophisticated pipeline to extract meaningful insights from news articles:
- Data Ingestion: The system scrapes financial news articles from Yahoo Search, storing them in the `financial_news` collection in MongoDB. For Phase 1, we use a fixed dataset of approximately 255 articles covering the 10 assets in the portfolio (about 20 articles per asset).
- Text Processing: For each article, we construct a comprehensive `article_string` by concatenating multiple fields, for example:
  Headline: QQQ Leads Inflows as VGIT, HYG Jump: ETF Flows as of Feb. 27
  \n Description: Top 10 Creations (All ETFs) Ticker Name Net Flows ($, mm) AUM ($, mm) AUM % Change QQQ Invesco QQQ...
  \n Source: etf.com · via Yahoo Finance
  \n Ticker: HYG
  \n Link: https://finance.yahoo.com/news/qqq-leads-inflows-vgit-hyg-005429051.html?fr=sycsrp_catchall
- Sentiment Analysis: The `article_string` is processed by FinBERT, a financial-domain-specific language model trained to understand financial text sentiment. This generates a sentiment score for each article.
- Data Enrichment: The sentiment scores are stored back in the `financial_news` collection, associating each article with its computed sentiment.
- Vector Embedding Generation: The same `article_string` is passed to the `voyage-finance-2` model, generating a 1024-dimensional vector representation (`article_embedding`) that captures the semantic meaning of the article.
- Semantic Search Implementation: Using MongoDB's Vector Search capability, the system can find semantically similar news articles based on these embeddings, identifying both explicit mentions of a ticker symbol and contextually relevant articles that don't directly reference it.
- Portfolio Sentiment Calculation: For each asset in the portfolio, the system calculates an average sentiment score from its related articles, providing a consolidated sentiment indicator that helps assess market perception of that asset.
This approach enables both explicit keyword matching and deeper semantic understanding of financial news, offering more comprehensive insights than traditional text-based searches.
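A condensed sketch of the scoring and embedding steps is shown below. It assumes the Hugging Face `transformers` pipeline with the `ProsusAI/finbert` checkpoint and the `voyageai` Python client; the production loaders may wire these up differently.

```python
import os

import voyageai
from transformers import pipeline

# FinBERT sentiment classifier. ProsusAI/finbert is a commonly used checkpoint;
# the exact model used by this service is an assumption here.
finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")

# Voyage AI client for the voyage-finance-2 embedding model.
voyage = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

def process_article(article_string: str) -> dict:
    """Return the sentiment score and embedding for one article_string."""
    sentiment = finbert(article_string)[0]  # e.g. {"label": "positive", "score": 0.93}
    embedding = voyage.embed([article_string], model="voyage-finance-2").embeddings[0]  # 1024 floats
    return {
        "sentiment_label": sentiment["label"],
        "sentiment_score": sentiment["score"],
        "article_embedding": embedding,
    }
```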
The Reddit sentiment analysis pipeline processes social media data to capture market sentiment from retail investors:
- Data Extraction: The system uses PRAW (Python Reddit API Wrapper) to scrape submissions from financial subreddits, collecting posts related to the tracked assets.
- Content Processing: For each Reddit submission, a comprehensive text string is created by combining the title and text content of the post.
- Embedding Generation: Similar to news articles, each submission is processed through the `voyage-finance-2` model to generate a 1024-dimensional vector embedding, capturing the semantic meaning of the Reddit post.
- Sentiment Analysis: The submission text is analyzed using FinBERT to generate sentiment scores, providing insights into the community's perception of specific assets.
- Data Enrichment: The embeddings and sentiment scores are stored in the `subredditSubmissions` collection, enabling both keyword-based and semantic searches.
- Time-based Cleanup: A cleaner process removes Reddit submissions older than a specified threshold to maintain data relevance and optimize storage.
This social sentiment data complements traditional financial news analysis by providing grassroots investor sentiment, which can be particularly valuable for identifying emerging trends or retail investor behavior patterns.
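For reference, extracting submissions with PRAW looks roughly like the sketch below. The subreddit name and the combined-text format are assumptions; the scraper in this repository may select subreddits and fields differently. The credentials map to the environment variables listed in the configuration section.

```python
import os

import praw

# Authenticate against the Reddit API using the credentials from the .env file.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_SECRET"],
    username=os.environ["REDDIT_USERNAME"],
    password=os.environ["REDDIT_PASSWORD"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)

# "wallstreetbets" is only an example; the loaders define their own subreddit list.
for submission in reddit.subreddit("wallstreetbets").new(limit=25):
    # Combine title and body, mirroring the "Content Processing" step above.
    submission_text = f"{submission.title}\n{submission.selftext}"
    print(submission.id, submission.created_utc, submission_text[:80])
```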
The service uses the `scheduler` Python package to schedule and manage ETL processes with the following schedule (all times in UTC):
- Financial News Processing: Weekly on Mondays at 4:00 AM UTC
- Yahoo Finance Market Data ETL: Tuesday-Saturday at 4:00 AM UTC
- PyFredAPI Macroeconomic Data ETL: Daily at 4:05 AM UTC
- Portfolio Performance Generation: Daily at 4:10 AM UTC
- CoinGecko Stablecoin Market Caps: Daily at 4:15 AM UTC
- Reddit Submissions Processing: Daily at 4:20 AM UTC
- Reddit Embedder (re-processing): Daily at 4:40 AM UTC
- Reddit Sentiment Analysis (re-processing): Daily at 4:45 AM UTC
- Reddit Data Cleanup: Daily at 5:00 AM UTC
- Binance API Crypto Data ETL: Daily at 5:10 AM UTC
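A minimal sketch of how such a schedule could be expressed with the `scheduler` package is shown below; it assumes the DigonIO `scheduler` library and placeholder job functions, while the actual job wiring lives in the service's scheduler module.

```python
import datetime as dt
import time

from scheduler import Scheduler
from scheduler.trigger import Monday

def run_financial_news_etl():
    ...  # placeholder for the real ETL entry point

def run_macro_etl():
    ...  # placeholder for the real ETL entry point

# Schedule jobs in UTC, mirroring the times listed above.
schedule = Scheduler(tzinfo=dt.timezone.utc)
schedule.weekly(Monday(dt.time(hour=4, tzinfo=dt.timezone.utc)), run_financial_news_etl)
schedule.daily(dt.time(hour=4, minute=5, tzinfo=dt.timezone.utc), run_macro_etl)

# Run pending jobs in a simple loop.
while True:
    schedule.exec_jobs()
    time.sleep(30)
```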
Before you begin, ensure you have met the following requirements:
- MongoDB Atlas account - Register Here
- Python 3.10 or higher
- Poetry (install via Poetry's official documentation)
- Log in to MongoDB Atlas and create a database named `agentic_capital_markets`. Ensure the name is reflected in the environment variables.
- Create the following collections (sample data for most of them can be imported from the JSON files under `backend/loaders/db/collections/`, as sketched after this list):
  - `financial_news` (for storing financial news data) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.financial_news.json` file.
  - `pyfredapiMacroeconomicIndicators` (for storing macroeconomic data) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.pyfredapiMacroeconomicIndicators.json` file.
  - `yfinanceMarketData` (for storing market data) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.yfinanceMarketData.json` file. Additionally, there are more backup files in `backend/loaders/backup/*` that you can use to populate the collection.
  - `binanceCryptoData` (for storing cryptocurrency data) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.binanceCryptoData.json` file.
  - `subredditSubmissions` (for storing Reddit submissions with sentiment) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.subredditSubmissions.json` file.
  - `stablecoin_market_caps` (for storing stablecoin market cap data) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.stablecoin_market_caps.json` file.
  - `portfolio_allocation` (for storing portfolio allocation data)
  - `portfolio_performance` (for storing portfolio performance data) - You can import sample data into this collection from the `backend/loaders/db/collections/agentic_capital_markets.portfolio_performance.json` file.
  - `chartMappings` (for storing chart mappings)
Note: For creating the Time Series collections, you can run the `mdb_timeseries_coll_creator.py` Python script located in the `backend/loaders/db/` directory. Make sure to parametrize the script accordingly.
- Create vector search indexes for the following collections:
  - `financial_news` collection
  - `subredditSubmissions` collection
Note: For creating the vector search indexes, you can run the `mdb_vector_search_idx_creator.py` Python script located in the `backend/loaders/db/` directory. Make sure to parametrize the script accordingly for each collection.
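Once the index on `financial_news` exists (named `financial_news_VS_IDX` on the `article_embedding` field, per the environment variables below), semantic search runs through a `$vectorSearch` aggregation stage. The sketch below uses a placeholder query vector, and the projected field names are assumptions.

```python
import os

from pymongo import MongoClient

collection = MongoClient(os.environ["MONGODB_URI"])["agentic_capital_markets"]["financial_news"]

# query_vector would normally come from embedding the user's query with voyage-finance-2.
query_vector = [0.0] * 1024  # placeholder 1024-dimensional vector

pipeline = [
    {
        "$vectorSearch": {
            "index": "financial_news_VS_IDX",
            "path": "article_embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }
    },
    # "headline" and "ticker" are assumed field names for illustration.
    {"$project": {"headline": 1, "ticker": 1, "score": {"$meta": "vectorSearchScore"}}},
]

for doc in collection.aggregate(pipeline):
    print(doc)
```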
Follow MongoDB's guide to create a user with `readWrite` access to the `agentic_capital_markets` database.
Create a `.env` file in the `/backend` directory with the following content:
MONGODB_URI="your_mongodb_uri"
DATABASE_NAME="agentic_capital_markets"
APP_NAME="ist.demo.capital_markets.loaders"
VOYAGE_API_KEY=
FRED_API_KEY=
PORTFOLIO_PERFORMANCE_COLLECTION="portfolio_performance"
YFINANCE_TIMESERIES_COLLECTION="yfinanceMarketData"
BINANCE_TIMESERIES_COLLECTION="binanceCryptoData"
PYFREDAPI_COLLECTION="pyfredapiMacroeconomicIndicators"
NEWS_COLLECTION="financial_news"
VECTOR_INDEX_NAME="financial_news_VS_IDX"
VECTOR_FIELD="article_embedding"
SCRAPE_NUM_ARTICLES=1
REDDIT_CLIENT_ID=
REDDIT_SECRET=
REDDIT_USERNAME=
REDDIT_PASSWORD=
REDDIT_USER_AGENT=
REDDIT_REDIRECT_URI="http://localhost:8080"
REDDIT_STATE_RANDOM_STRING=
REDDIT_PERMANENT_AUTHORIZATION_CODE=
REDDIT_AUTH_URI=
ASSET_MAPPINGS_COLLECTION="assetMappings"
SUBREDDIT_SUBMISSIONS_COLLECTION="subredditSubmissions"
COINGECKO_STABLECOIN_COLLECTION="stablecoin_market_caps"
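If you want to sanity-check the configuration before starting the service, a quick way is the sketch below. It assumes `python-dotenv`, which is a common companion to FastAPI projects but not necessarily a dependency of this repository.

```python
import os

from dotenv import load_dotenv
from pymongo import MongoClient

# Read backend/.env into the process environment.
load_dotenv("backend/.env")

client = MongoClient(os.environ["MONGODB_URI"], appname=os.environ.get("APP_NAME"))
db = client[os.environ["DATABASE_NAME"]]

# Ping the cluster and list the collections created earlier.
client.admin.command("ping")
print(db.list_collection_names())
```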
- Open a terminal in the project root directory.
- Run the following commands:
make poetry_start
make poetry_install
- Verify that the `.venv` folder has been generated within the `/backend` directory.
To start the backend service, run:
poetry run uvicorn main:app --host 0.0.0.0 --port 8004
The default port is `8004`; modify the `--port` flag if needed.
Run the following command in the root directory:
make build
To remove the container and image:
make clean
The service provides a comprehensive set of API endpoints for managing data loads and backfills. You can access the interactive API documentation (Swagger UI) by visiting:
http://localhost:<PORT_NUMBER>/docs
E.g. http://localhost:8004/docs
The API includes endpoints for:
- Loading and backfilling Yahoo Finance market data (by date or symbol)
- Loading and backfilling Binance cryptocurrency data (by date or symbol)
- Loading and backfilling PyFredAPI macroeconomic data (by date or series)
- Loading and backfilling portfolio performance data
- Loading CoinGecko stablecoin market cap data
- Loading recent financial news
- Loading and processing Reddit submissions (with embeddings and sentiment)
- Managing individual ETL processes
- Viewing scheduler overview
Note: Make sure to replace `<PORT_NUMBER>` with the port number you are using and ensure the backend is running. The Swagger UI provides detailed information about request/response schemas and allows you to test endpoints directly.
- Check that you've created a `.env` file containing the required environment variables.
This project is for educational and demonstration purposes.