Welcome to the documentation for our case study project, where we successfully built a robust Data Platform for one of BuildItAll's clients in Belgium. The platform was designed to support big data analytics, empowering the client to become more data-driven by efficiently handling the massive volumes of data they generate daily. Below you will find a detailed description of the problem statement, the solution approach, the tools used, and the impact of the project on the business problem.
The BuildItAll team approached us on behalf of one of its clients in Belgium with the need to establish a Data Platform capable of supporting big data analytics. The goal was to enhance the client's data-driven decision-making by efficiently managing and analyzing the substantial volumes of data produced every day.
Our solution involved establishing a comprehensive data platform, utilizing a variety of tools to ensure seamless data processing and analytics.
- AWS CloudWatch: For resource monitoring
- Amazon S3: For data storage
- PySpark (AWS EMR): For big data processing
- Airflow (Amazon MWAA): For data orchestration
- Terraform: For infrastructure automation
- GitHub Actions: For CI/CD deployment
Below is a step-by-step breakdown of our solution approach:
CloudWatch was employed to monitor the performance of the resources provisioned on AWS, including the Amazon Managed Workflows for Apache Airflow (MWAA) DAGs, with email alerts sent on failure. This allowed us to promptly address any failures or issues within the workflow. The monitoring resources were provisioned using Terraform.
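In the project itself this alerting was provisioned with Terraform. Purely to illustrate the pattern, the sketch below shows an equivalent CloudWatch alarm created with boto3; the namespace, metric name, dimensions, environment name, and SNS topic ARN are assumptions for illustration, not values taken from the project.

```python
# Illustrative only: the project provisions this with Terraform.
# Namespace, metric, dimensions, and topic ARN below are assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mwaa-dag-failure-alarm",               # hypothetical alarm name
    Namespace="AmazonMWAA",                           # assumed MWAA metric namespace
    MetricName="TaskInstanceFailures",                # assumed failure metric
    Dimensions=[{"Name": "Environment", "Value": "builditall-mwaa"}],  # hypothetical environment
    Statistic="Sum",
    Period=300,                                       # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # An SNS topic with an email subscription delivers the failure alert
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:mwaa-failure-alerts"],
)
```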
The provisioned S3 bucket held the folders and files associated with the other AWS resources, including the CloudWatch metric log files, the MWAA DAGs folder, the MWAA plugins folder, the folder containing the PySpark scripts, the MWAA requirements file, and the folders containing the raw and processed forms of the dataset. The bucket was also provisioned using Terraform.
The raw text data files, in a zipped folder, were downloaded programmatically from a public link and stored in the S3 bucket.
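A minimal sketch of this ingestion step is shown below, assuming the public dataset link and the bucket/prefix used later in the setup section; the project's actual script may differ.

```python
# Minimal ingestion sketch (bucket name and object key are assumptions,
# mirroring the paths used in the setup section below).
from pathlib import Path
import urllib.request

import boto3

DATASET_URL = (
    "https://archive.ics.uci.edu/static/public/507/"
    "wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset.zip"
)
LOCAL_PATH = Path("data/raw/dataset.zip")
BUCKET = "builditall-bucket"                 # assumed bucket name
KEY = "builditall/raw_data/dataset.zip"      # assumed object key

# Download the zipped dataset from the public link, then push it to S3
LOCAL_PATH.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(DATASET_URL, LOCAL_PATH)
boto3.client("s3").upload_file(str(LOCAL_PATH), BUCKET, KEY)
```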
An AWS EMR (Elastic MapReduce) cluster was created and used for the data processing tasks, leveraging PySpark scripts for efficient handling of large datasets. The Smartphone and Smartwatch Activity and Biometrics Dataset used contains accelerometer and gyroscope time-series sensor data from 51 test subjects performing 18 activities for 3 minutes each.
PySpark was utilized to retrieve the raw data from S3, process and transform it, and then push it back to S3 as Parquet files. This approach ensured scalability and reliability in handling over 15 million records.
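The sketch below illustrates the kind of raw-to-Parquet transformation described here, assuming the raw text files have been unpacked under the raw data prefix. The S3 paths, column names, and cleaning rules are assumptions for illustration, not the project's exact logic.

```python
# Simplified PySpark sketch of the raw-to-Parquet step.
# Paths, column names, and parsing rules are assumptions, not the project's exact code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("builditall-sensor-etl").getOrCreate()

RAW_PATH = "s3://builditall-bucket/builditall/raw_data/"            # assumed location
PROCESSED_PATH = "s3://builditall-bucket/builditall/processed_data/"  # assumed location

# Read the raw comma-delimited sensor readings (parsing simplified for illustration)
raw_df = (
    spark.read.csv(RAW_PATH, header=False, inferSchema=True)
    .toDF("subject_id", "activity_code", "timestamp", "x", "y", "z")  # assumed columns
)

# Basic cleaning: drop incomplete rows and normalize the timestamp type
clean_df = raw_df.dropna().withColumn("timestamp", F.col("timestamp").cast("long"))

# Write back to S3 as Parquet, partitioned for downstream analytics
clean_df.write.mode("overwrite").partitionBy("activity_code").parquet(PROCESSED_PATH)
```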
We used Apache Airflow to orchestrate the data pipeline. Airflow was first tested locally to ensure stability and functionality before being deployed in production using Amazon Managed Workflows for Apache Airflow (MWAA). Notably, the EMR cluster that executed the PySpark data processing scripts was created from within Airflow and terminated as soon as the processing job completed. Airflow's flexibility and ability to handle complex workflows made it an ideal choice for orchestrating our pipeline.
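A trimmed-down sketch of this create-process-terminate pattern, using the Amazon provider's EMR operators, is shown below. The DAG id, cluster configuration, and script path are illustrative assumptions rather than the project's actual DAG.

```python
# Hedged sketch of the create-process-terminate pattern described above.
# Requires the apache-airflow-providers-amazon package; names and config are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [{
    "Name": "process_sensor_data",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        # Assumed script location in the project bucket
        "Args": ["spark-submit", "s3://builditall-bucket/builditall/scripts/process_sensor_data.py"],
    },
}]

with DAG(
    dag_id="builditall_sensor_pipeline",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        # Trimmed config: a real cluster also needs instance, role, and network settings
        job_flow_overrides={"Name": "builditall-emr", "ReleaseLabel": "emr-6.9.0"},
    )

    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEP,
    )

    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
    )

    # The cluster is torn down as soon as processing finishes
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
        trigger_rule="all_done",
    )

    create_cluster >> add_step >> wait_for_step >> terminate_cluster
```

Terminating the cluster with `trigger_rule="all_done"` tears it down even if the Spark step fails, which keeps EMR costs bounded between runs.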
All infrastructure components were deployed in AWS using Terraform. Terraform's infrastructure-as-code approach allowed us to easily provision the required resources and resource dependencies, ensuring consistency and repeatability across environments.
We implemented a CI/CD pipeline using GitHub Actions, which ensured that changes in the repository were tested before being deployed to production.
To avoid rebuilding the pipelines because of unrelated code changes, our CI/CD pipeline was split into two:
- The Terraform CI/CD Deployment Pipeline:
This pipeline is triggered only when changes are made to the Terraform folder. It lints, validates, and formats the Terraform scripts, shows a preview of the resources that Terraform would provision, and then deploys them.
- The Airflow CI/CD Deployment Pipeline:
This pipeline runs only when changes are made to the Airflow folder containing our DAGs and PySpark scripts. It lints the scripts and runs basic pytest unit tests that check expectations about the DAGs (a minimal example is sketched after this list). When the tests pass, the DAGs, PySpark scripts, and requirements file are redeployed to the S3 bucket so that the MWAA service syncs the changes.
Having two separate deployments gave us greater control and saved time: running both pipelines on every change is slower than running only the one affected when changes touch just the Airflow or the Terraform folder. The unit tests also helped maintain code quality in the DAGs.
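As referenced above, here is a hedged sketch of the kind of pytest checks the Airflow pipeline could run; the project's actual assertions and DAG folder path may differ.

```python
# Sketch of DAG expectation tests for the Airflow CI/CD pipeline.
# The DAG folder path and the specific assertions are assumptions.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="module")
def dag_bag():
    # Load DAGs from the repository's airflow folder (assumed relative path)
    return DagBag(dag_folder="airflow/dags", include_examples=False)


def test_dags_import_without_errors(dag_bag):
    # Any syntax or import error in a DAG file surfaces here
    assert dag_bag.import_errors == {}


def test_every_dag_has_tasks(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert len(dag.tasks) > 0, f"{dag_id} has no tasks"
```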
- Apache Airflow: Offers flexibility and scalability in orchestrating complex workflows.
- Terraform: Provides automation and version control for infrastructure deployments.
- GitHub Actions: Facilitates seamless CI/CD processes with automated testing.
- AWS EMR: Efficiently handles big data processing with PySpark integration.
- Amazon S3: Ensures reliable storage and retrieval of large datasets.
- CloudWatch: Provides robust monitoring capabilities for workflow management.
- Apache Airflow: Initial setup can be complex and requires careful configuration.
- Terraform: Requires familiarity with infrastructure-as-code concepts for effective use.
- GitHub Actions: May involve a learning curve for setting up complex workflows.
- AWS EMR: Cost considerations for extensive usage.
- Amazon S3: Requires management of storage costs as data volume increases.
- CloudWatch: Can incur additional costs for extensive monitoring.
├── .github/ # Holds the GitHub workflow YAML files and pull_request template
├── airflow/ # Holds the Airflow DAGs and PySpark scripts
├── terraform/ # Holds the Terraform scripts that automatically provision AWS resources
├── .gitignore # Lists irrelevant files to be ignored
├── README.md # Documentation
├── fix-flake.sh # Fixes any code quality issues in files in the airflow folder
Before starting, ensure you have:
- An AWS account
- VS Code
- Terraform v1.10.4+ installed
- AWS CLI installed and configured
- Airflow (for local testing)
Clone the Repository:
git clone https://github.com/Nancy9ice/Data_Platform_Engineering_Case_Study
Download sample data (replace with your source):
wget -O data/raw/dataset.zip https://archive.ics.uci.edu/static/public/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset.zip
Upload data to S3:
aws s3 sync data/raw/ s3://builditall-bucket/builditall/raw_data/
Provide the following as repository secrets for your CI/CD Actions:
AWS_ACCESS_KEY
AWS_ACCOUNT_ID
AWS_REGION
AWS_ROLE
AWS_SECRET_ACCESS_KEY
BUCKET_NAME
TF_API_TOKEN
TF_CLOUD_ORGANIZATION
Then trigger the CI/CD workflow.
The implementation of this Data Platform has significantly improved the BuildItAll client's ability to be data-driven.
By enabling efficient big data analytics, the client can now derive actionable insights from their massive datasets, leading to improved decision-making processes. The automation and optimization strategies employed reduced manual intervention, increased operational efficiency, and provided a scalable solution to handle future data growth.
This case study project demonstrates our commitment to delivering high-quality solutions that address complex business needs. By leveraging cloud technologies and best practices, we successfully built a Data Platform that empowers the BuildItAll client to harness the power of big data analytics.