Northcoders Data Engineering Bootcamp group project
Following an intensive 10-week "Data Engineering in Python" training course with Northcoders, this project is the group's first attempt at setting up an ETL (extract, transform and load) pipeline. Alongside developing technical skills, we got to grips with agile working practices, including using a kanban board.
A project brief was supplied by Northcoders to create a minimum viable product (MVP) of an ETL pipeline for a fictional company, 'Totesys', to give them commercial insights. The pipeline should extract data from an operational database, archive it in a data lake and make it available in a remodelled data warehouse. The databases and Lambda functions are hosted in the cloud using Amazon Web Services (AWS).
To deliver this brief, we used Terraform to create the AWS infrastructure as code. We built an ETL pipeline that is triggered by EventBridge and uses AWS Lambda functions to ingest data into an S3 bucket, transform the data and load it into the warehouse.
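As a rough illustration of the ingestion step (the function names, bucket layout and key scheme below are illustrative assumptions, not the project's actual code), an ingest Lambda might serialise each table to JSON and write it to the S3 bucket under a timestamped key:

```python
import json
from datetime import datetime, timezone


def build_s3_key(table_name: str, extracted_at: datetime) -> str:
    """Build a timestamped S3 key so each scheduled extraction lands
    in its own prefix and never overwrites earlier data."""
    stamp = extracted_at.strftime("%Y-%m-%dT%H-%M")
    return f"ingestion/{stamp}/{table_name}.json"


def ingest_table(s3_client, bucket: str, table_name: str, rows: list[dict]) -> str:
    """Serialise rows to JSON and upload them to the ingestion bucket.

    `s3_client` is a boto3 S3 client, injected as a parameter so the
    function can be unit tested with a stub instead of real AWS calls.
    """
    key = build_s3_key(table_name, datetime.now(timezone.utc))
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(rows, default=str))
    return key
```

Injecting the S3 client (rather than creating it inside the function) is one way to make these handlers testable without touching real AWS resources.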
Although we made good progress, we didn't quite complete the brief within the given two weeks. The majority of the AWS infrastructure can be successfully deployed and data ingested; however, the load function is not fully complete, so full automation of the pipeline and delivery of the MVP remain outstanding.
Some of the key challenges included:
- Deploying the project successfully as a whole with the many components.
- Learning and adopting good engineering practices, such as creating and merging git branches, satisfying PEP8 standards, achieving 90% test coverage and setting up a GitHub Actions workflow.
- Updating the data every 30 minutes without duplication or corruption.
- Setting up suitable development, testing and production environments and deploying these in the pipeline.
- Testing functions that interact with operational databases.
- Provisioning multiple resources via Terraform that communicate with each other.
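One way to approach the 30-minute update challenge (a sketch under assumptions, not necessarily the project's actual approach; the `last_updated` column name and checkpoint handling are assumed) is to keep a checkpoint timestamp between scheduled runs and only extract rows that changed since the last run:

```python
from datetime import datetime


def extract_new_rows(
    rows: list[dict], last_checkpoint: datetime
) -> tuple[list[dict], datetime]:
    """Return only rows changed since the last run, plus the new checkpoint.

    Assumes each row carries a `last_updated` timestamp. The checkpoint
    would be persisted between scheduled invocations (e.g. in S3 or
    Parameter Store) so each run picks up where the previous one stopped,
    avoiding duplicated or re-ingested data.
    """
    new_rows = [row for row in rows if row["last_updated"] > last_checkpoint]
    if new_rows:
        last_checkpoint = max(row["last_updated"] for row in new_rows)
    return new_rows, last_checkpoint
```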
- src: the ingest, transform and load Lambda functions (Python).
- terraform: policies for the S3 bucket, main, Lambda functions, CloudWatch, Secrets Manager and EventBridge.
- test: unit tests for the Lambda functions.
- test_utils: tests for the utility functions.
- util_func: utility functions for the Lambda functions.
N.B. the util_func folder is in a Lambda layer: filepath: NC-DataEng-ETL-Project/tree/main/util_func/python
To install and explore our project please follow this step-by-step guide.
- Python 3.12.7 (follow the installation steps below to install the required packages and modules listed in the requirements.txt file)
- AWS account with necessary permissions
- AWS CLI configured with your credentials
- Virtual environment (optional, but recommended)
- VS Code or equivalent
git clone https://github.com/LucyMilligan/NC-DataEng-ETL-Project
python -m venv venv
source venv/bin/activate
make requirements
make run-security
make run-checks
On a git push or a pull request, the project is deployed via a GitHub Actions workflow (as defined in the .github/workflow/deploy.yml file).
To deploy locally, follow these steps:
terraform init
terraform plan
terraform apply
To tear down the AWS infrastructure:
terraform destroy
MIT Licence.
Northcoders for curating and delivering the course.
Project group: Iris Araneta, Callum Bain, Dani Ghinchevici, Ollie Lawn, Lucy Milligan and Joshua Sessions.
We learnt about good data engineering practices and got to grips with git branching, creating pull requests and communicating as a team. We also put into practice the technical knowledge we'd gained throughout the bootcamp, delving deeper into some topics and bringing it all together for a single purpose.
We also learnt that things often take longer than expected - if we anticipate a job will take about half an hour, realistically that will quickly turn into three hours!
A big takeaway was that it is often easier to start simply and build up complexity, rather than adding complexity straight away.
The jazz music while we wait in Slack is rather lovely.
It's a shame we didn't finish, but we are proud of what we've learnt and how we worked together. Our biggest challenge was getting all the moving parts successfully working together.