
ELT Pipeline in Datalake

ELT (Extract, Load, Transform) pipeline that ingests data from a database into a single data lake location for transform processing.

Diagram

Introduction & Goals

  • Ingest data from an RDBMS into a data lake bucket (a minimal ingestion sketch follows this list).
  • Process transaction files in a single location
    • The data set is a salesperson's customer list
    • We use AWS Glue
    • Create a Data Catalog, then transform and load into the data lake
    • Data is transformed from CSV to Parquet format
    • Parquet stores the schema in the file metadata and is easy to work with because it is supported by so many different projects.
    • The transformed data is ready to be queried from a data warehouse
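
A minimal sketch of the ingestion step, assuming an RDS PostgreSQL source; the endpoint, credentials, bucket, and table names below are placeholders, not values from this repository:

```python
# Export a table from RDS PostgreSQL as CSV and upload it to the
# data lake bucket's input folder. All names here are placeholders.
import csv
import io

import boto3
import psycopg2

conn = psycopg2.connect(
    host="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="salesdb",      # placeholder
    user="etl_user",       # placeholder
    password="change-me",  # placeholder
)

buf = io.StringIO()
writer = csv.writer(buf)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM customer")
    writer.writerow([desc[0] for desc in cur.description])  # header row
    writer.writerows(cur.fetchall())

# Land the CSV in the input folder for the Glue crawler to pick up.
boto3.client("s3").put_object(
    Bucket="my-datalake-bucket",  # placeholder
    Key="input/customer.csv",
    Body=buf.getvalue().encode("utf-8"),
)
```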

The Data Set

  • We use the transaction data set customer.csv.
  • The data set is a salesperson's customer list.
  • At roughly 197 KB, it is an appropriate size for a Glue ETL demo.

Used Tools

  • Amazon RDS PostgreSQL, Amazon S3, AWS Glue, Glue crawler, Amazon Athena

    • Load the .csv file from RDS into an input folder inside a bucket.
    • Create an output folder as the target inside the same bucket.
    • Add a Glue crawler to crawl the .csv file (in the input folder) into the Data Catalog.
    • Run a Glue job to transform the data to Parquet (a sketch follows this list).
    • Set the target to load into the output folder.
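
A minimal sketch of such a Glue job (PySpark), assuming the crawler registered the CSV under a catalog database; the database, table, and bucket names are placeholders:

```python
# Glue job sketch: read the crawled CSV table from the Data Catalog
# and write it to the bucket's output folder as Parquet.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the table the crawler registered from input/customer.csv.
source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_db",  # placeholder catalog database
    table_name="input",      # placeholder crawled table
)

# Target: Parquet files in the output folder of the same bucket.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/output/"},  # placeholder
    format="parquet",
)

job.commit()
```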
  • Add a Glue crawler to crawl the .parquet files (in the output folder) into the Data Catalog.

  • In Athena, check the transformed data with a view or query (see the example below).
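
For example, the verification query can be run from Python with boto3; the database, table, and result location are placeholders:

```python
# Query the crawled Parquet table via Athena to verify the transform.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM output LIMIT 10",        # placeholder table
    QueryExecutionContext={"Database": "datalake_db"},  # placeholder database
    ResultConfiguration={
        "OutputLocation": "s3://my-datalake-bucket/athena-results/"  # placeholder
    },
)
print(response["QueryExecutionId"])
```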

Pipelines

  • Batch processing pipeline for bulk import.
  • The AWS Glue job source code (Python) drives the transform (a driver sketch follows this list).
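
A minimal sketch of driving a batch run with boto3; the crawler and job names are placeholders:

```python
# Batch driver sketch: crawl the input CSV, then run the Glue job.
import boto3

glue = boto3.client("glue")

# Register/refresh the CSV schema in the Data Catalog.
# start_crawler is asynchronous; in practice, wait for the crawler
# to finish before starting the job.
glue.start_crawler(Name="csv-input-crawler")  # placeholder name

# Kick off the CSV-to-Parquet transform.
run = glue.start_job_run(JobName="csv-to-parquet")  # placeholder name
print(run["JobRunId"])
```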

Conclusion

  • The data set has been transformed inside a single data lake bucket.
  • The files can be loaded later into a data warehouse for analysis.
  • Glue jobs can also be assigned to aggregate, join, and filter tables (see the sketch after this list).
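
As a sketch of that kind of work, assuming a second crawled table and placeholder table and field names, a Glue job could filter and join DynamicFrames like this:

```python
# Sketch of further Glue transforms: filter rows, then join two tables.
# All database, table, and field names are placeholders.
from awsglue.context import GlueContext
from awsglue.transforms import Filter, Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

customers = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_db", table_name="customers",  # placeholders
)
orders = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_db", table_name="orders",     # placeholders
)

# Keep only active customers, then join them to their orders.
active = Filter.apply(frame=customers, f=lambda row: row["active"] == "true")
joined = Join.apply(active, orders, "customer_id", "customer_id")
```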

Follow Me On

https://www.linkedin.com/in/jirasak-pakdeeto-900665214/
