I am super thrilled to have completed the Data Engineering Nanodegree on Udacity. In this repo, you can find all the projects I coded and worked on for each unit.
Technologies, libraries and frameworks I used:
- Python, Jupyter Notebook
- Pandas
- SQL
- PostgreSQL
- Apache Cassandra
- Apache Spark
- Apache Airflow
- Amazon Web Services (AWS):
  - S3
  - Redshift
  - Setup of Credentials, Roles & Users
  - Budget monitoring
The data engineering program covers the following topics: 1) data modeling, 2) data warehouses, 3) data lakes and 4) data pipelines. Each unit has its own project based on a problem from a fictional company named Sparkify.
Sparkify is a fictional company with a music streaming mobile application (like Spotify).
The initial data collected by Sparkify is stored directly in S3 and comes in two types of JSON files (a small loading sketch follows the list):
- log_data: user activity on the app, with data specific to each user and their actions (the page they are on, the songs they listen to and when, whether they are on the premium or free tier, etc.)
- song_data: metadata on the songs owned by Sparkify (song name, artist, album, etc.)
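A minimal sketch of loading one log file with Pandas; the path and column names here are assumptions based on the typical layout of the dataset, not the project's exact code:

```python
import pandas as pd

# Hypothetical path -- adjust to wherever the raw log files live.
# The log files are newline-delimited JSON, hence lines=True.
log_df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

# A few of the columns used later in the projects (names are assumptions).
print(log_df[["userId", "page", "song", "level", "ts"]].head())
```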
Project 1: Data Modeling with PostgreSQL
Skills: PostgreSQL, Jupyter, Python, SQL
Sparkify Problem that needed to be tackled: Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Currently, they don't have an easy way to query their data. They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis.
Task: Create a Postgres database, a star schema and an ETL pipeline that make analyses easier.
Project:
- Explore and understand the raw data with Pandas
- Define fact and dimension tables for a star schema (the sketch after this list shows the fact table DDL)
- Write an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL.
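As an illustration, a minimal sketch of what the fact-table DDL could look like with psycopg2; the connection parameters and column names are assumptions, not the project's exact schema:

```python
import psycopg2

# Placeholder connection string for a local Postgres instance.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Fact table of the star schema; the dimension tables (users, songs,
# artists, time) follow the same CREATE TABLE pattern.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  TEXT
    );
""")
conn.commit()
conn.close()
```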
Project 2: Data Modeling with Apache Cassandra
Skills: Apache Cassandra, Python, Jupyter
Sparkify Problem that needed to be tackled: Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Currently, they don't have an easy way to query their data. They'd like a data engineer to create an Apache Cassandra database to optimize their analyses.
Task: Create a non-relational database and an ETL pipeline that make analyses easier.
Project:
- Denormalize the raw data
- Create an Apache Cassandra database
- Load data into the tables and verify the quality by running queries (see the sketch below)
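A minimal sketch of the query-first modelling idea with the DataStax Python driver; the keyspace, table and column names are illustrative assumptions:

```python
from cassandra.cluster import Cluster

# Local single-node cluster; keyspace and table names are placeholders.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# The table is modelled around the query it serves (songs played in a given
# session), so the primary key starts with session_id and item_in_session.
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id int,
        item_in_session int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

# Verify the model by running the query it was designed for.
row = session.execute(
    "SELECT artist, song FROM songs_by_session WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
).one()
print(row)
cluster.shutdown()
```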
Project 3: Cloud Data Warehouses with AWS
Skills: Amazon Web Services (S3, Redshift), Python, SQL
Sparkify Problem that needed to be tackled: Sparkify has grown its user base and song database and wants to move its processes and data onto the cloud.
Task: Build an ETL pipeline that extracts Sparkify's data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for their analytics team.
Project:
- Set up all the required credentials, roles and permissions in AWS
- Copy the raw data from S3 into staging tables in Redshift
- Transform the staged data into the star schema from Project 1 with SQL
- Load the resulting tables into Redshift (see the sketch below)
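A minimal sketch of the stage-then-transform step, assuming a psycopg2 connection to the Redshift cluster; the endpoint, credentials, IAM role ARN, bucket paths and column names are placeholders:

```python
import psycopg2

# All connection details and the IAM role ARN below are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    dbname="dev",
    user="awsuser",
    password="replace-me",
    port=5439,
)
cur = conn.cursor()

# Bulk-load the raw JSON events from S3 into a staging table.
cur.execute("""
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 's3://my-bucket/log_json_path.json'
    REGION 'us-west-2';
""")

# Insert-select from staging into the star-schema fact table.
cur.execute("""
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    LEFT JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
""")
conn.commit()
conn.close()
```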
Project 4: Data Lake with Apache Spark
Skills: Amazon Web Services (S3), Apache Spark, Python
Sparkify Problem that needed to be tackled: Sparkify has grown its user base and song database even more and wants to move its data warehouse to a data lake.
Task: Build an ETL pipeline for a data lake hosted on S3
Project:
- Set up all the required credentials, roles and permissions in AWS
- Read the raw data from S3
- Transform the raw data into the star schema from Project 1 with Spark
- Write the new tables back to S3 (see the sketch below)
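A minimal sketch of building one dimension table in the data-lake version, assuming PySpark with S3 access already configured; the bucket paths and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Read the raw song metadata straight from S3 (placeholder bucket path).
song_df = spark.read.json("s3a://my-input-bucket/song_data/*/*/*/*.json")

# One dimension table of the star schema: distinct songs.
songs_table = (
    song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Write back to the data lake as partitioned Parquet files.
songs_table.write.partitionBy("year", "artist_id").parquet(
    "s3a://my-output-bucket/songs/", mode="overwrite"
)
```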
Project 5: Data Pipeline with Apache Airflow
Skills: Amazon Web Services (S3), Apache Airflow
Sparkify Problem that needed to be tackled: Sparkify decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines.
Task: Build data pipelines that are dynamic, can be monitored, and allow easy backfills. Ensure data quality by implementing tests against the datasets to catch any discrepancies.
Project:
- Set up all the required credentials, roles and permissions in AWS, as well as the Airflow hooks
- Create empty tables in Redshift
- Create custom operators for the DAG:
  - an operator that loads raw data from S3 into the empty staging tables in Redshift
  - an operator that transforms the staging data into the star schema from Project 1 and loads it into new tables in Redshift
  - an operator that checks the quality of the data (watches for empty tables and runs custom SQL queries)
- Organize all the operators into a pipeline with dependencies (see the DAG sketch below)
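A minimal sketch of how such custom operators might be wired together into a DAG; the operator names, import path and constructor arguments are illustrative assumptions, not the project's exact code:

```python
from datetime import datetime

from airflow import DAG

# Hypothetical import path: the custom operators described above would live
# in the project's Airflow plugins folder.
from operators import (
    StageToRedshiftOperator,
    LoadFactOperator,
    DataQualityOperator,
)

with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Load the raw events from S3 into a Redshift staging table.
    stage_events = StageToRedshiftOperator(
        task_id="stage_events",
        s3_path="s3://my-bucket/log_data",  # placeholder bucket
        table="staging_events",
    )

    # Transform the staged data into the star-schema fact table.
    load_songplays = LoadFactOperator(
        task_id="load_songplays_fact_table",
        table="songplays",
    )

    # Fail the run if any of the final tables is empty.
    quality_checks = DataQualityOperator(
        task_id="run_data_quality_checks",
        tables=["songplays", "users", "songs", "artists", "time"],
    )

    # Dependencies: stage, then load the fact table, then check quality.
    stage_events >> load_songplays >> quality_checks
```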