Developed by: Sean Choi (https://github.com/007seann)
This project was initiated in response to the increasing importance of real-time sentiment analysis in financial markets. As a retail investor with five years of experience and a 7.5% compound annual growth rate, I encountered firsthand how sensitive assets like Nvidia and Bitcoin are to sentiment shocks. Existing systems failed to capture the breadth and speed of multi-source financial sentiment. This gap motivated the development of an end-to-end, real-time sentiment scoring system capable of integrating multiple perspectives from financial text—such as regulatory filings, earnings call transcripts, and expert analysis reports—for better forecasting of stock returns and volatility.
- Financial sentiment is fragmented across multiple textual sources and stakeholder viewpoints.
- Existing sentiment systems often focus only on one source (e.g., news) and ignore broader market voices.
- Scalability: Handling 300,000+ documents (~1.5 billion tokens) across different formats and APIs.
- Orchestration: Building a real-time system that automates data collection, preprocessing, and model inference.
- Supervision: Labeling sentiment metrics for both return and volatility without human annotation.
- Efficiency: Real-time preprocessing and data transformation under heavy computational constraints.
- Lexicon-Based vs. Machine Learning Approaches: Lexicon methods offer transparency but lack nuance. ML models, particularly supervised learning with financial labels, provide better predictive power.
- Sequential vs. Parallel Data Processing: Given the volume of data, a parallel and distributed system using Apache Spark and Dockerized pipelines was explored.
- Single-Source vs. Multi-Source Input: Initial implementations using only 10-K filings (Minf1) were expanded in Minf2 to include earnings calls and expert reports for better performance and robustness.
- Cloud-Hosted vs. Local Deployment: Azure Cloud was chosen for scalability, supported by Docker and Apache Airflow for CI/CD and orchestration.
The final solution is a real-time, fully automated, and multi-source sentiment scoring system deployed on Microsoft Azure.
It comprises:
- FTRM (Financial Text Retrieval Model): Fetches data from EDGAR, Seeking Alpha, and APIs.
- PDCM (Parallel Data Construction Model): Uses Apache Spark to transform and structure data efficiently.
- SSPM (Sentiment Score Prediction Model): Produces return- and volatility-predictive sentiment scores using a supervised model adapted from Zheng et al.’s lexicon learning method.
This system not only supports real-time decision-making for traders and analysts but also provides a transparent, extensible, and scalable framework for future research and deployment.
conda activate seanchoi
If the system hasn't been set up yet, choose one of the following methods to install dependencies.
Recommended: Option 2 (Docker image)
Option 1: Install Locally
pip download --dest /data/seanchoi/airflow/airflow_docker/deps -r /data/seanchoi/airflow/airflow_docker/requirements.txt --use-deprecated=legacy-resolver
Option 2: Load Prebuilt Docker Image
docker load -i /data/seanchoi/airflow/airflow_docker/seanchoi_image.tar
Note for Hao: You can skip this step if you're testing the system on the server — the setup and image are already in place.
Navigate to the project directory and start the services:
cd /data/seanchoi/airflow
docker compose up
(Optional) If an image is not built, build an image first:
docker compose build
(Optional) If the system is already running, stop it first:
docker compose down
Open in browser:
http://20.77.80.201:3000/home
Login credentials:
- Username: Request to the author
- Password: Request to the author
Root path:
/data/seanchoi/airflow/data/SP500/
├── AnalysisReports/
├── constituents/
│ ├── firms/
│ └── market/
├── logs/
├── NASDAQ100/
└── SP500/
├── calls/
│ ├── firm/
│ └── market/
├── reports/
│ ├── firm/
│ └── market/
├── sec/
├── firm/
└── market/
Contains data for individual S&P 500 companies.
company_df/
— Firm-specific data by CIK (e.g.,0001045810/
)dtm/
— Document-Term Matrices_SUCCESS
: Spark job completion marker.parquet
files: DTM data
filtered/
— Preprocessed datasets (e.g.,batch_filtered_0.parquet
) which remains only active SP500 firm per year. The output of SP500 Dynamism filter (Please check the Financial Text Retrieval Model at the thesis)html/
— SEC filings in HTML formatintermediate/
— Temporary processing outputsoutcome/
— Final metricsmod_sent_ret.csv
: Sentiment-based returnsmod_sent_vol.csv
: Sentiment-based volatility
processed/
— Clean datasets ready for modelingtxt/
— Plain text SEC filingsjson/
— Transcripts and Reports in JSON format
Contains aggregated data across the S&P 500.
Structure mirrors the firm/
folder:
company_df/
: Aggregated data by CIKdtm/
,filtered/
,html/
,intermediate/
,outcome/
,processed/
,txt/
,json/
: Same function as infirm/
Place raw files into:
html/
for HTML filingstxt/
for plain text
- Store intermediate results in
intermediate/
,filtered/
- Save final metrics in
outcome/
Use:
outcome/mod_sent_ret.csv
andoutcome/mod_sent_vol.csv
for analysis and plotting.
Check filtered/
and intermediate/
to verify preprocessing steps and troubleshoot issues.
Root path:
/data/seanchoi/airflow/data/constituents/
-
nvidia_constituents_final.csv
- Final processed list of NVIDIA-related firms.
- Columns: CIK, Ticker, Company Name, Sector, Inclusion Date, Exclusion Date
- Use Case: Firm-level analysis of NVIDIA-related companies.
-
nvidia_constituents.csv
- Raw unfiltered list of NVIDIA-related firms.
- Columns: CIK, Ticker, Company Name, Sector
- Use Case: Starting point for preprocessing NVIDIA data.
-
QQQ_constituents_test.csv
- Test set of NASDAQ-100 constituents.
- Columns: CIK, Ticker, Company Name, Sector
- Use Case: Testing NASDAQ-100 workflows.
-
sp500_constituents.csv
- Current list of S&P 500 firms.
- Columns: CIK, Ticker, Company Name, Sector
- Use Case: Current firm-level S&P 500 analysis.
-
sp500_total_constituents.csv
- Historical list including past and present S&P 500 members.
- Columns: CIK, Ticker, Company Name, Sector, Inclusion Date, Exclusion Date
- Use Case: Historical S&P 500 analysis.
-
sp500_union_constituents.csv
- Unified dataset of all S&P 500 members.
- Columns: CIK, Ticker, Company Name, Sector
- Use Case: Comprehensive S&P 500 constituent analysis.
-
top10_QQQ_constituents_test.csv
- Test dataset for top 10 NASDAQ-100 firms.
- Columns: CIK, Ticker, Company Name, Sector
- Use Case: Workflow testing on top NASDAQ firms.
-
top10_QQQ_constituents.csv
- Top 10 NASDAQ-100 firms by market cap or other criteria.
- Columns: CIK, Ticker, Company Name, Sector
- Use Case: Targeted analysis of major NASDAQ constituents.
The plugins directory contains custom modules, scripts, and utilities that extend the functionality of the real-time system. It is organized into main subfolders:
/data/seanchoi/airflow/plugins
├── __init__.py
├── common/
├── packages/
└── shell/
Contains shared utility functions and scripts used across the system.
__init__.py
: Initializes the modulecommon_func.py
: General utilities (e.g., file importers, SFTP handlers)sec10k_item1a_extractor.py
: Extracts "Item 1A: Risk Factors" from SEC filings (text + HTML)time_log_decorator.py
: Decorator to log function execution times
process_files_for_cik
: Processes filings for a specific CIKprocess_files_for_cik_with_italic
: Handles italicized SEC contentsave_error_info
: Logs errors during processing
- Use
common_func.py
for file operations - Use
sec10k_item1a_extractor.py
for SEC section extraction and preprocessing
Specialized modules organized by functionality
-
FTRM/ (Financial Text Retrieval Model)
sec_crawler.py
: Crawls SEC filingsextract_report_ids.py
: Retrieves report IDsannual_report_reader.py
: Preprocesses annual reports- Use case: Retrieving and preparing financial documents
-
PDCM/ (Processing and Document Construction Module)
constructDTM.py
: Builds Document-Term Matricesvol_reader_fun.py
: Processes volatility and returns- Use case: Preprocessing and feature construction
-
SSPM/ (Sentiment and Signal Processing Module)
-
model.py
: Trains sentiment models; Kalman filters -
sent_predictor_firm.py
: Predicts sentiment at firm-level -
sent_predictor_market_deprecated.py
: (Deprecated) Market-level sentiment. Check the local version which is used in the system (sec_sent_predictor_local.py at /data/seanchoi/SSPM_local) -
Use case: Sentiment modeling and signal generation
-