The purpose of this project is to create an automated ETL (extract, transform, load) pipeline for the Amazing Prime Hackathon contest.
Three data sources needed to be cleaned and merged:
- Wikipedia JSON - This data was acquired by web-scraping Wikipedia sidebars.
- Movie Metadata CSV - This data was compiled from IMDB and acquired from Kaggle.
- Ratings CSV - This data was compiled from Metacritic and acquired from Kaggle.
The steps taken to clean and merge these datasets were as follows (illustrative code sketches for the key steps appear after the list):
- Read the files into Pandas.
- Combine wiki columns that were similar (`Directed by` was combined with `Director`, and other similar combinations).
- Drop wiki columns that had insufficient data points.
- Use regular expressions to parse numeric and date columns into uniform formats.
- Drop metadata columns that were unnecessary.
- Update data types for numeric/datetime columns in metadata.
- Merge wiki and metadata.
- Identify duplicate columns from wiki and metadata.
- Fill in missing metadata values with the corresponding wiki values, where the wiki values were not null.
- Drop the duplicated wiki columns, since the metadata had more consistent data.
- Reshape the ratings to show the count of each rating value per movie (using a pivot).
- Merge ratings with wiki/metadata.
- Export data to PostgreSQL.
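The sketches below show what each stage might look like in pandas. File paths, column names, and thresholds are illustrative assumptions, not the project's actual values. A minimal load step:

```python
import json

import pandas as pd

# File names and the data/ directory are placeholders, not the
# project's actual paths.
with open("data/wikipedia_movies.json") as file:
    wiki_movies_raw = json.load(file)
wiki_movies_df = pd.DataFrame(wiki_movies_raw)

metadata_df = pd.read_csv("data/movies_metadata.csv", low_memory=False)
ratings_df = pd.read_csv("data/ratings.csv")
```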
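Continuing from the load above, consolidating alternate sidebar keys and dropping sparse columns might look like this (the key pairs and the 90%-null cutoff are assumptions):

```python
def change_column_name(movie, old_name, new_name):
    # Move a value from an alternate key to the canonical key.
    if old_name in movie:
        movie[new_name] = movie.pop(old_name)

# Consolidate a couple of alternate sidebar keys; the real pipeline
# handles many more pairs.
clean_movies = []
for movie in wiki_movies_raw:
    movie = dict(movie)  # copy so the raw records stay untouched
    change_column_name(movie, "Directed by", "Director")
    change_column_name(movie, "Written by", "Writer(s)")
    clean_movies.append(movie)

wiki_movies_df = pd.DataFrame(clean_movies)

# Keep only columns that are mostly populated; the 90%-null cutoff
# is an illustrative assumption.
keep = [col for col in wiki_movies_df.columns
        if wiki_movies_df[col].isnull().sum() < 0.9 * len(wiki_movies_df)]
wiki_movies_df = wiki_movies_df[keep]
```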
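For the regex step, a simplified parser for a hypothetical `Box office` column holding strings like `$123.4 million` or `$1,234,567` could be:

```python
import re

def parse_dollars(value):
    # Normalize dollar strings to floats. These two patterns are a
    # simplification; real Wikipedia values are messier.
    if not isinstance(value, str):
        return float("nan")
    match = re.match(r"\$\s*(\d+(?:\.\d+)?)\s+(million|billion)",
                     value, flags=re.IGNORECASE)
    if match:
        scale = 1e6 if match.group(2).lower() == "million" else 1e9
        return float(match.group(1)) * scale
    match = re.match(r"\$\s*([\d,]+)", value)
    if match:
        return float(match.group(1).replace(",", ""))
    return float("nan")

wiki_movies_df["box_office"] = wiki_movies_df["Box office"].map(parse_dollars)
```

Date columns can be handled the same way: extract a recognizable pattern with a regex, then hand the matches to `pd.to_datetime`.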
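Updating the metadata dtypes is mostly `pd.to_numeric`/`pd.to_datetime` with coercion; the column names below match the public Kaggle dump but should be treated as assumptions here:

```python
# Kaggle stores several numeric fields as strings; coerce them,
# turning unparseable values into NaN/NaT rather than raising.
for col in ["budget", "id", "popularity"]:
    metadata_df[col] = pd.to_numeric(metadata_df[col], errors="coerce")
metadata_df["release_date"] = pd.to_datetime(metadata_df["release_date"],
                                             errors="coerce")
```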
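The merge-and-reconcile pattern for the duplicated columns might look like the following (the `imdb_id` join key and the column pairs are assumptions about the two schemas):

```python
movies_df = wiki_movies_df.merge(metadata_df, on="imdb_id",
                                 suffixes=("_wiki", "_kaggle"))

def fill_missing_metadata(df, metadata_col, wiki_col):
    # Prefer the metadata value; fall back to the wiki value where the
    # metadata is null, then drop the now-redundant wiki column.
    df[metadata_col] = df[metadata_col].fillna(df[wiki_col])
    df.drop(columns=wiki_col, inplace=True)

fill_missing_metadata(movies_df, "runtime", "Running time")
fill_missing_metadata(movies_df, "revenue", "box_office")
```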
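A sketch of the ratings pivot, assuming the ratings file has `movieId` and `rating` columns:

```python
# Count how many times each rating value was given to each movie,
# then pivot so each rating value becomes its own column.
rating_counts = (ratings_df.groupby(["movieId", "rating"], as_index=False)
                 .size()
                 .pivot(index="movieId", columns="rating", values="size")
                 .fillna(0))
# Name each column after the rating value it counts, e.g. "rating_3.5".
rating_counts.columns = [f"rating_{col}" for col in rating_counts.columns]
```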
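Attaching the rating counts back onto the combined movie table could then be a left join (the `kaggle_id` link column is an assumption):

```python
# A left join keeps movies that have no ratings at all.
movies_with_ratings = movies_df.merge(rating_counts, left_on="kaggle_id",
                                      right_index=True, how="left")

# Movies with no ratings get zero counts instead of NaN.
rating_cols = [c for c in movies_with_ratings.columns
               if c.startswith("rating_")]
movies_with_ratings[rating_cols] = movies_with_ratings[rating_cols].fillna(0)
```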
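Finally, a sketch of the PostgreSQL export via SQLAlchemy; the connection string and table names are placeholders:

```python
from sqlalchemy import create_engine

# Placeholder connection string; substitute real credentials.
engine = create_engine("postgresql://user:password@localhost:5432/movie_data")
movies_with_ratings.to_sql("movies", engine, if_exists="replace", index=False)

# The raw ratings file is large, so it can be appended in chunks
# rather than loaded into memory all at once.
for chunk in pd.read_csv("data/ratings.csv", chunksize=1_000_000):
    chunk.to_sql("ratings", engine, if_exists="append", index=False)
```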