Belgian Traffic Accidents Data Pipeline

This project implements an ETL pipeline for processing a dataset of Belgian traffic accidents. The technologies used are Amazon S3 and AWS Glue.

Architecture Overview

This diagram outlines the pipeline's architecture, showing how raw data flows from Amazon S3 through an AWS Glue Crawler and an AWS Glue ETL job, then back to S3, ready for analysis.

[Architecture diagram: etl_excalidraw]

Pipeline Overview

  1. Data Source: The Belgian Traffic Accidents dataset (a .txt file) is used.

  2. Raw Storage: Amazon S3 object storage holds the raw data.

  3. Catalog with AWS Glue Crawler: an AWS Glue Crawler scans the S3 data, infers its schema, and registers it in the AWS Glue Data Catalog (steps 2 and 3 are sketched in code after this list).

  4. ETL Processing: an AWS Glue ETL job cleans, transforms, and converts the raw data into the Parquet file format with Snappy compression, optimizing it for performance and storage.
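
As referenced in step 3, steps 2 and 3 can also be done programmatically. Below is a minimal boto3 sketch; the file, bucket, role, database, and crawler names are placeholder assumptions, not values taken from this repository.

```python
import boto3

# All names below are placeholder assumptions.
RAW_BUCKET = "traffic-accidents-raw"
CRAWLER_NAME = "traffic-accidents-crawler"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueServiceRole"

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Step 2: upload the raw dataset to the S3 raw bucket.
s3.upload_file("accidents.txt", RAW_BUCKET, "raw/accidents.txt")

# Step 3: point a crawler at the raw prefix and run it so the inferred
# schema is registered in the Glue Data Catalog.
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName="traffic_accidents_db",
    Targets={"S3Targets": [{"Path": f"s3://{RAW_BUCKET}/raw/"}]},
)
glue.start_crawler(Name=CRAWLER_NAME)
```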

    Transforming the data

[Screenshot: traffic_csv_schema]

AWS Glue Studio (visual ETL editor)

[Screenshot: traffic_etl_job_visual]
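
A minimal script equivalent of the visual job might look like the following. The catalog database, table name, and output path are assumptions carried over from the crawler sketch above; Snappy is requested explicitly even though it is the default Parquet codec.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered (database and table names are assumptions).
accidents = glue_context.create_dynamic_frame.from_catalog(
    database="traffic_accidents_db",
    table_name="raw",
)

# A simple cleaning step: drop rows that are entirely null.
cleaned = DynamicFrame.fromDF(
    accidents.toDF().dropna(how="all"), glue_context, "cleaned"
)

# Write Parquet with Snappy compression to the clean bucket.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://traffic-accidents-clean/parquet/"},
    format="parquet",
    format_options={"compression": "snappy"},
)

job.commit()
```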

  5. Clean Storage: Amazon S3 is used again to store the clean Parquet file, ready for analysis.

[Screenshot: s3_final]
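
A quick way to confirm the Parquet files landed (same assumed bucket and prefix as above):

```python
import boto3

s3 = boto3.client("s3")

# List the cleaned Parquet objects (bucket and prefix are assumptions).
response = s3.list_objects_v2(Bucket="traffic-accidents-clean", Prefix="parquet/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```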

Result

The transformed Parquet data is now accessible for analysis with tools like Amazon Athena, Power BI, and Python, enabling efficient and scalable insights.
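
For instance, a minimal Python read of the output, assuming the placeholder path from the sketches above (needs pandas plus pyarrow and s3fs):

```python
import pandas as pd

# Read the cleaned Parquet dataset straight from S3 into a DataFrame.
df = pd.read_parquet("s3://traffic-accidents-clean/parquet/")
print(df.head())
```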
