An end-to-end data pipeline built with Terraform, Airflow, Docker, Kubernetes, and AWS resources.
This project consists of two separate layers: one for the data pipeline and one for the CI/CD workflows.
The AWS resources used are as follows:
1.) S3 buckets: Two S3 modules are created for this project, one for the raw data bucket and one for the bucket that stores the transformed data.
2.) Redshift: Data from the transformed S3 bucket is loaded into a Redshift cluster for downstream users such as data analysts, BI analysts, and data scientists.
3.) ECR: Elastic Container Registry is the registry to which the CI/CD layer pushes the built Docker images.
4.) EC2: An EC2 instance runs the Airflow Docker container. The instance pulls the latest image built and pushed to ECR by the GitHub Actions workflow.
5.) EKS: Elastic Kubernetes Service is used to orchestrate the Docker containers running on the EC2 instance.
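The load from the transformed S3 bucket into Redshift (item 2) is commonly done with Redshift's COPY command. Below is a minimal sketch of how such a statement could be assembled, assuming the transformed data is stored as CSV files with a header row. The function name, table name, bucket, prefix, and IAM role ARN are hypothetical placeholders and are not taken from this project.

```python
def build_copy_statement(table, bucket, prefix, iam_role_arn):
    """Return a Redshift COPY statement that loads CSV files from S3.

    All arguments are illustrative; substitute the real table, bucket,
    and role created by the Terraform modules.
    """
    s3_uri = f"s3://{bucket}/{prefix}"
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"       # assumption: the transformed files are CSV
        "IGNOREHEADER 1;"       # assumption: each file has one header row
    )


if __name__ == "__main__":
    print(
        build_copy_statement(
            table="analytics.sales",                  # hypothetical table
            bucket="my-transformed-bucket",           # hypothetical bucket
            prefix="sales/",
            iam_role_arn="arn:aws:iam::123456789012:role/redshift-copy",
        )
    )
```

The resulting statement would be executed against the cluster by whichever task performs the load, for example an Airflow task with a Redshift connection.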
The architecture diagram for the data infrastructure and the CI/CD infrastructure is shown below.
The GitHub Actions setup contains two workflow files. Workflow.yml authenticates to AWS using OIDC (OpenID Connect), creates the Terraform resources, builds the Docker images, and pushes the built images to AWS ECR.
Destroy.yml destroys the Terraform resources that were originally created.
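To make the build-and-push step more concrete, here is a minimal sketch of how the ECR image URI is formed and which Docker commands a workflow step would typically run. The account ID, region, repository, and image names are hypothetical; the actual steps in Workflow.yml may differ.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the fully qualified ECR image URI for a repository and tag."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def push_commands(local_image, image_uri):
    """Return the Docker commands that build, tag, and push an image.

    Assumes `docker login` against the ECR registry has already succeeded
    (for example via `aws ecr get-login-password`).
    """
    return [
        ["docker", "build", "-t", local_image, "."],
        ["docker", "tag", local_image, image_uri],
        ["docker", "push", image_uri],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    uri = ecr_image_uri("123456789012", "us-east-1", "airflow-pipeline")
    for command in push_commands("airflow-pipeline", uri):
        print(" ".join(command))
```

The EC2 instance can then pull the same URI to run the most recently pushed Airflow image.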
Also, the code includes a toggle: enabling it creates all the resources, and disabling it deletes them.

The Airflow webserver can be reached at the public IP of your EC2 instance. There you can view the DAG graph or trigger your DAGs manually.
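As a quick way to confirm the webserver is reachable, the sketch below builds the URL from the instance's public IP and queries a health endpoint. It assumes the webserver listens on Airflow's default port 8080, that the instance's security group allows inbound traffic on that port, and that the deployment exposes the `/health` endpoint (as Airflow 2.x webservers do). The IP address is a placeholder.

```python
import json
import urllib.request


def webserver_url(public_ip, port=8080):
    """Return the base URL of the Airflow webserver (8080 is the default port)."""
    return f"http://{public_ip}:{port}"


def check_health(base_url, timeout=5.0):
    """Fetch and decode the JSON document served at <base_url>/health.

    Raises urllib.error.URLError if the webserver cannot be reached.
    """
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)
```

For example, `check_health(webserver_url("203.0.113.10"))` would return the health status of the scheduler and metadata database if the instance at that (placeholder) address is up.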
To verify that the data was actually uploaded to S3, the image below shows the data in the S3 bucket.
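The same check can be done programmatically. The following is a minimal sketch that lists the object keys in a bucket using a boto3 S3 client and its `list_objects_v2` paginator. The helper name and the bucket name in the usage comment are hypothetical, and boto3 must be installed and configured with AWS credentials for the commented usage to work.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return all object keys in `bucket` that start with `prefix`.

    `s3_client` is expected to be a boto3 S3 client; the paginator handles
    buckets with more than 1000 objects.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for empty results carry no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage (requires boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   print(list_keys(boto3.client("s3"), "my-raw-bucket"))
```

An empty list would indicate that the DAG has not yet written anything under the given prefix.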
This is only the first version of the project. Subsequent builds will include code improvements and extra modules for visualization. I also plan to create an RDS module whose database credentials will be used to configure airflow.cfg, and to change the executor from the default SequentialExecutor to CeleryExecutor for improved parallelism.
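As a rough illustration of the planned airflow.cfg changes, the sketch below renders the relevant settings from database credentials. It assumes the RDS instance runs PostgreSQL, that a Celery broker (for example Redis) is available, and that the Airflow version keeps `sql_alchemy_conn` under `[database]` (Airflow 2.3 and later; older versions use `[core]`). The function name and all credential values are hypothetical.

```python
import configparser
import io
from urllib.parse import quote_plus


def render_airflow_overrides(db_user, db_password, db_host, db_name,
                             broker_url, db_port=5432):
    """Render airflow.cfg sections for a PostgreSQL backend and CeleryExecutor."""
    # URL-encode the credentials so special characters do not break the URI.
    auth = f"{quote_plus(db_user)}:{quote_plus(db_password)}"
    location = f"{db_host}:{db_port}/{db_name}"

    # interpolation=None keeps '%' characters from URL encoding intact.
    config = configparser.ConfigParser(interpolation=None)
    config["core"] = {"executor": "CeleryExecutor"}
    config["database"] = {
        "sql_alchemy_conn": f"postgresql+psycopg2://{auth}@{location}",
    }
    config["celery"] = {
        "broker_url": broker_url,
        "result_backend": f"db+postgresql://{auth}@{location}",
    }

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values; real credentials would come from the RDS module.
    print(render_airflow_overrides(
        "airflow", "p@ss/word", "example-db.internal", "airflow",
        broker_url="redis://example-redis.internal:6379/0",
    ))
```

In practice the same values can also be supplied through environment variables such as `AIRFLOW__CORE__EXECUTOR`, which avoids writing credentials into the image.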