This project serves as my learning documentation.
This project sets up a simple serverless data lake using AWS services, managed with Terraform. As of this writing, the project is limited to simple example data such as the provided `data.csv`.
- Raw Data (S3): Stores incoming raw data.
- Trigger (Lambda): Watches for new data and triggers the transformation process.
- Transform Data (Glue ETL): Cleans and processes the raw data, storing the results in another S3 bucket.
- Transformed Data (S3): Stores the transformed data.
- Discover Data (Glue Crawlers): Scans the raw and transformed data and creates metadata.
- Catalog Data (Glue Data Catalog): Stores metadata, making data available for querying.
- Explore Data (Athena): Allows querying of the processed data via SQL.
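For example, once the crawlers have run, the transformed data can be queried straight from the CLI. The database, table, and results-bucket names below are just placeholders, not names this project creates, so adjust them to your setup:

```sh
# Hypothetical example: query the transformed data with Athena from the CLI.
# All names in angle brackets are placeholders; use the ones from your deployment.
aws athena start-query-execution \
  --query-string "SELECT * FROM <transformed_table> LIMIT 10;" \
  --query-execution-context Database=<glue-database-name> \
  --result-configuration OutputLocation=s3://<athena-results-bucket>/
```

The command returns a query execution ID, which can then be passed to `aws athena get-query-results` to fetch the rows.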
Terraform automates the provisioning of the AWS resources. The project was built with the following tool versions:
- Terraform (1.10.5)
- AWS CLI (2.17.32)
- Python (3.11)
- GNU Make (3.81)
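A quick way to sanity-check the toolchain (these are the versions I used; nearby versions will likely work too):

```sh
# Verify the required tools are installed and on the PATH
terraform version
aws --version
python3 --version
make --version
```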
- Clone the repository to your local machine.
- Ensure you have AWS credentials configured on your machine.
- Navigate to the `serverless-data-lake` directory, then into the `terraform` directory.
- Create a `.tfvars` file in the `terraform` directory using the `.tfvars.template` file.
- Adjust the variable `aws_profile` if you have multiple profiles on your machine; otherwise, just fill in `default` as the value. Adjust the other variables as needed (see the sketch after this list).
- Run `terraform init` to initialize all Terraform resources.
- Run `terraform plan -var-file=.tfvars -out=plan.tfplan` to create an execution plan.
- Run `terraform apply "plan.tfplan"` to apply the execution plan.
- Run `terraform show` to inspect the current state.
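For reference, here is a rough sketch of those steps in one place. I'm assuming the template sits in the `terraform` directory; `aws_profile` is the only variable named in this README, so fill in the rest from the template:

```sh
# Create .tfvars from the template and fill in the values
cp .tfvars.template .tfvars
# Edit .tfvars: at minimum set aws_profile (use "default" for a single profile)
#   aws_profile = "default"
#   ...fill in the remaining variables from the template...

# Provision the data lake
terraform init
terraform plan -var-file=.tfvars -out=plan.tfplan
terraform apply "plan.tfplan"
terraform show   # inspect the resulting state
```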
- Upload `data.csv` to the `data/` folder in the source bucket using the AWS CLI or the AWS Console.
- Check your AWS Glue ETL job and your AWS Glue Data Catalog (example commands below).
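Here's a hedged example of that last step using the AWS CLI. The bucket, job, and database names are placeholders, so use the actual names from your Terraform variables or outputs:

```sh
# Upload the sample file to the data/ folder in the source bucket
aws s3 cp data.csv s3://<source-bucket-name>/data/ --profile default

# Inspect the Glue ETL job runs triggered by the upload
aws glue get-job-runs --job-name <glue-etl-job-name>

# List the tables the crawlers created in the Data Catalog
aws glue get-tables --database-name <glue-database-name>
```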
Ah, there you go!