This project involves the creation of an Extract, Load, Transform (ELT) data pipeline that fetches data from the Polygon Finance website using its API, loads it into a PostgreSQL database, transfers it to a Google Cloud Storage (GCS) bucket, and finally moves it to BigQuery, the analytical database. The entire pipeline is orchestrated with Airflow using the Celery executor on Docker Compose. The tasks are grouped into three categories: stock, forex, and crypto, each with three different workflows: extraction to PostgreSQL, upload to GCS, and upload to BigQuery. In addition, a guide on how to load data from the Polygon Stock API to a PostgreSQL destination is found here.
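For a flavour of that first step, here is a minimal sketch of fetching daily aggregates from the Polygon API and inserting them into PostgreSQL. The table name, column list, and credential handling are illustrative assumptions rather than the exact code used in this project; see the linked guide for the full walkthrough.

```python
# Minimal sketch of the extract-to-PostgreSQL step (illustrative only).
# The table name, column list, and connection details are assumptions,
# not the exact implementation used in this project.
import os
import requests
import psycopg2

POLYGON_API_KEY = os.environ["POLYGON_API_KEY"]

def fetch_daily_aggregates(ticker: str, start: str, end: str) -> list[dict]:
    """Fetch daily OHLC aggregates for a ticker from the Polygon API."""
    url = f"https://api.polygon.io/v2/aggs/ticker/{ticker}/range/1/day/{start}/{end}"
    resp = requests.get(url, params={"adjusted": "true", "apiKey": POLYGON_API_KEY})
    resp.raise_for_status()
    return resp.json().get("results", [])

def load_to_postgres(ticker: str, rows: list[dict]) -> None:
    """Insert the fetched aggregates into a raw stock table in PostgreSQL."""
    conn = psycopg2.connect(os.environ["POSTGRES_DSN"])  # e.g. postgresql://user:pass@host/db
    with conn, conn.cursor() as cur:
        for r in rows:
            cur.execute(
                """
                INSERT INTO raw_stock_daily (ticker, open, high, low, close, volume, ts)
                VALUES (%s, %s, %s, %s, %s, %s, to_timestamp(%s / 1000.0))
                """,
                (ticker, r["o"], r["h"], r["l"], r["c"], r["v"], r["t"]),
            )
    conn.close()

if __name__ == "__main__":
    bars = fetch_daily_aggregates("AAPL", "2023-01-01", "2023-01-31")
    load_to_postgres("AAPL", bars)
```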
Check out the links below for recordings of my presentation of the project:
- For non-technical/business individuals: Non-Technical_link
- For technical individuals, Part 1: Technical_link-part_1
- For technical individuals, Part 2: Technical_link-part_2
The pipeline provides a robust and automated way to gather, process, and analyze financial data from various markets, including stocks, forex, and cryptocurrencies. This data can be used by businesses and investors to make informed decisions based on real-time and historical market trends. The pipeline's automation reduces the time and effort required to gather and process this data, allowing businesses to focus more on analysis and decision-making.
For a detailed understanding of the data fields in the tables containing daily open, high, low, and close (OHLC) data, refer to the provided Data Dictionary Link. The dictionary encompasses information for stocks/equities, forex, and cryptocurrencies, with specific details for each market, such as adjustments for splits, request IDs, exchange symbols, and more.
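As a rough companion to the data dictionary (not a replacement for it), the sketch below shows how a single daily aggregate record could map from Polygon's short response keys to descriptive column names; the chosen field names are assumptions for illustration.

```python
# Illustrative mapping of a daily aggregate record to descriptive column
# names; the field selection is an assumption -- see the Data Dictionary
# link for the authoritative definitions.
from typing import TypedDict

class DailyBar(TypedDict):
    open: float            # "o"  - opening price for the day
    high: float            # "h"  - highest price for the day
    low: float             # "l"  - lowest price for the day
    close: float           # "c"  - closing price for the day
    volume: float          # "v"  - trading volume
    vwap: float            # "vw" - volume-weighted average price
    transactions: int      # "n"  - number of transactions in the aggregate
    timestamp_ms: int      # "t"  - start of the aggregate window (Unix ms)

def to_daily_bar(raw: dict) -> DailyBar:
    """Rename the short response keys into descriptive names used downstream."""
    return DailyBar(
        open=raw["o"], high=raw["h"], low=raw["l"], close=raw["c"],
        volume=raw["v"], vwap=raw["vw"], transactions=raw["n"],
        timestamp_ms=raw["t"],
    )
```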
A notable enhancement to this pipeline is the incorporation of robust data quality checks and testing within dbt. These checks ensure the reliability and accuracy of the processed data, contributing to the overall integrity of the analytics and insights derived from the pipeline.
The dbt tool is used to perform transformations and generate metrics. For each of the three financial data categories (stock, crypto, forex), a metric was developed to compute the average trading prices and the number of transactions from the growing historical data. These metrics are represented in the agg_stock_avg, agg_crypto_avg, and agg_forex_avg tables.
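As a hedged example of how these metric tables might be consumed downstream, the snippet below reads one of them with the BigQuery Python client; the project and dataset names are placeholders, not the ones used in this pipeline.

```python
# Illustrative read of a dbt-generated metric table; the project and dataset
# names below are placeholders, not the ones used in this pipeline.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT *
    FROM `my-gcp-project.finance_analytics.agg_stock_avg`
    LIMIT 10
"""

for row in client.query(query).result():
    print(dict(row))
```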
The pipeline uses various technologies and tools to perform its tasks. The Polygon Finance API is used to extract the data, which is then loaded into a PostgreSQL database. The data is then transferred to a GCS bucket using the PostgresToGCSOperator in Airflow. Subsequently, the GCSToBigQueryOperator is employed to move the data from GCS to BigQuery. Finally, dbt is used within BigQuery for further transformations, metrics generation, and rigorous data quality checks. The entire flow is orchestrated with Airflow, from data extraction down to the final dbt stage (a dbt job is triggered after the data finishes loading into BigQuery).
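To make that orchestration concrete, here is a minimal sketch of what one task group (stock) could look like in the DAG, with the dbt job triggered once loading finishes. Connection IDs, bucket and dataset names, and the dbt Cloud job ID are placeholders, and the real DAG repeats the same pattern for forex and crypto.

```python
# Minimal sketch of the orchestration for one market (stock); the same pattern
# is repeated for forex and crypto. Connection IDs, bucket/dataset names, and
# the dbt Cloud job ID are placeholders, not the project's actual values.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

def extract_stock_to_postgres(**_):
    """Call the Polygon API and load the results into PostgreSQL (see sketch above)."""
    ...

with DAG(
    dag_id="polygon_finance_elt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    with TaskGroup(group_id="stock") as stock_group:
        extract = PythonOperator(
            task_id="extract_to_postgres",
            python_callable=extract_stock_to_postgres,
        )
        to_gcs = PostgresToGCSOperator(
            task_id="postgres_to_gcs",
            postgres_conn_id="postgres_default",
            sql="SELECT * FROM raw_stock_daily",
            bucket="finance-pipeline-bucket",
            filename="stock/raw_stock_daily.csv",
            export_format="csv",
        )
        to_bq = GCSToBigQueryOperator(
            task_id="gcs_to_bigquery",
            bucket="finance-pipeline-bucket",
            source_objects=["stock/raw_stock_daily.csv"],
            destination_project_dataset_table="my-gcp-project.finance_raw.stock_daily",
            source_format="CSV",
            autodetect=True,
            write_disposition="WRITE_APPEND",
        )
        extract >> to_gcs >> to_bq

    run_dbt_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_job",
        job_id=12345,  # placeholder dbt Cloud job ID
    )

    stock_group >> run_dbt_job
```

In the full DAG, the forex and crypto task groups would follow the same extract → GCS → BigQuery chain, all fanning into the single dbt trigger.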
Below are images showing the Airflow logs, complete table lineage, data at BigQuery, and data at GCS. These images will provide a visual representation of the pipeline's operation and the data it processes.
- Raw tables in BigQuery
A separate README file provides detailed instructions on how to install and set up Airflow with the necessary connections. Check out the Airflow setup instructions here.
Like any data engineer, I'm always looking to improve and scale my data solutions to solve business problems efficiently. Some improvements I have in mind:
- Expansion of metrics based on specific business needs.
- Integration of additional financial markets for broader insights.
- Optimization of pipeline performance for scalability.
This project stands as a robust and automated solution, offering a streamlined approach to gather, process, and analyze financial data from diverse markets. Beyond providing valuable insights, it serves as a testament to the power and flexibility of modern data pipeline technologies. Open to feedback and improvement, the project maintains an adaptive stance, with opportunities for further enhancements, scalability, and a strong commitment to data quality. Feel free to contribute through pull requests to shape the evolution of this dynamic data engineering solution.