This repository contains an end-to-end solution for building a master financial statement database for US public companies. The project supports fundamental analysis using the SEC Financial Statement Data Sets and covers data scraping, storage design, validation, Airflow pipelines for ETL, and deployed applications for data access via Streamlit and FastAPI.
The workflow diagram for the application includes the following components:
- User: The end-user interacts with the application via the Streamlit frontend.
- Streamlit App: The frontend built using Streamlit.
- FastAPI Backend: The backend server that handles data processing.
- Data Extraction:
- Python: Used for extracting multiple ZIP archives from the SEC website.
- AWS S3 Bucket: Used for storing raw, unprocessed data.
- Snowflake: Used for storing JSON and relational (RDBMS) data.
- Google Cloud Run: Used for deploying the FastAPI application.
- Streamlit In-built Deployment: Used for deploying the Streamlit application for UI/UX.
- Streamlit Application URL: http://34.56.233.252:8501/
- Airflow Application URL: http://34.56.233.252:8080/login/?next=http%3A%2F%2F34.56.233.252%3A8080%2Fhome
- Google CodeLabs URL: https://codelabs-preview.appspot.com/?file_id=1tVv4J83L46ZG3zxy8CkdNmQFbEnVXTgbeXhP5TdlYJ0#0
- Project_Brief_Video: https://teams.microsoft.com/l/meetingrecap?driveId=b%21FayNqOa36EqT25ce1C895cHOMeHKQoJHgOYzg_brHsgUmPc9DOTzRZxUhXvmml6L&driveItemId=01I6MVIPIVNFJLDNZ5PNEISEIFJIHAC2HZ&sitePath=https%3A%2F%2Fnortheastern-my.sharepoint.com%2F%3Av%3A%2Fg%2Fpersonal%2Fmate_r_northeastern_edu%2FERVpUrG3PXtIiREFSg4BaPkBJy8QEIC4uqOi_5xpJqXhZw&fileUrl=https%3A%2F%2Fnortheastern-my.sharepoint.com%2F%3Av%3A%2Fg%2Fpersonal%2Fmate_r_northeastern_edu%2FERVpUrG3PXtIiREFSg4BaPkBJy8QEIC4uqOi_5xpJqXhZw&iCalUid=040000008200e00074c5b7101a82e00800000000b7a5a34d6a6fdb010000000000000000100000001e7bfd83f44da9448a1dbbab193014f7&threadId=19%3Ameeting_MzlkMDNhNGItOWJjMi00Mjk3LThmOTUtN2FlMzRhYzdiOGZi%40thread.v2&organizerId=aa87e9b3-28f3-4532-93b8-06c36bf6da04&tenantId=a8eec281-aaa3-4dae-ac9b-9a398b9215e7&callId=4fb55145-7ea6-46ed-8e2d-c76fee86de98&threadType=Meeting&meetingType=Scheduled&subType=RecapSharingLink_RecapChiclet
- Python 3.7+
- Diagrams library for generating the workflow diagram.
- AWS account with S3 bucket access.
- Streamlit and FastAPI installed for frontend and backend development.
- Google Cloud SDK installed.
- Docker installed.
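
As an illustration of the data extraction and raw storage steps listed in the workflow above, the following is a minimal sketch, not the repository's actual code. It plans which quarterly ZIP archives to fetch and where each would land in S3, and inspects an archive held in memory. The URL pattern in `SEC_BASE_URL`, the `raw/sec` key prefix, and the function names `plan_downloads` and `list_zip_members` are assumptions made for this example and should be checked against the SEC site and the project's own bucket layout.

```python
import io
import zipfile

# Assumption: quarterly archives follow this URL pattern; verify on the SEC site.
SEC_BASE_URL = "https://www.sec.gov/files/dera/data/financial-statement-data-sets"


def plan_downloads(years, quarters=(1, 2, 3, 4), prefix="raw/sec"):
    """Return (url, s3_key) pairs, one per requested year and quarter.

    The S3 key layout (prefix/year/file) is hypothetical.
    """
    plan = []
    for year in years:
        for quarter in quarters:
            name = f"{year}q{quarter}.zip"
            plan.append((f"{SEC_BASE_URL}/{name}", f"{prefix}/{year}/{name}"))
    return plan


def list_zip_members(zip_bytes):
    """Return the file names inside a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()
```

In a real pipeline, each URL from `plan_downloads` would be fetched with an HTTP client and the bytes uploaded to the bucket under the paired key (for example with the boto3 S3 client). Those network calls are left out here so the sketch runs without credentials.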