The purpose of this project is to create an automated ETL (extract, transform, load) pipeline for the Amazing Prime Hackathon contest.
Three data sources needed to be cleaned and merged:
- Wikipedia JSON - This data was acquired by web-scraping Wikipedia sidebars.
- Movie Metadata CSV - This data was compiled from IMDB and acquired from Kaggle.
- Ratings CSV - This data was compiled from Metacritic and acquired from Kaggle.
The steps taken to clean and merge these datasets were as follows (illustrative code sketches for the key steps appear after the list):
- Read the files into Pandas.
- Combine wiki columns that were similar (`Directed by` was combined with `Director`, and other similar combinations).
- Drop wiki columns that had insufficient data points.
- Use regular expressions to parse numeric and date columns into uniform formats.
- Drop metadata columns that were unnecessary.
- Update data types for numeric/datetime columns in metadata.
- Merge wiki and metadata.
- Identify duplicate columns from wiki and metadata.
- Fill in missing metadata values with the corresponding wiki values, where the wiki values were not null.
- Drop the duplicated wiki columns, since the metadata had more consistent data.
- Reshape the ratings to show the count of each rating value per movie (using a pivot).
- Merge ratings with wiki/metadata.
- Export data to PostgreSQL.
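The sketches below show what each stage might look like in pandas. File paths, column names, and thresholds are illustrative assumptions, not the project's actual values. A minimal load step:

```python
import json

import pandas as pd

# File names and the data/ directory are placeholders, not the
# project's actual paths.
with open("data/wikipedia_movies.json") as file:
    wiki_movies_raw = json.load(file)
wiki_movies_df = pd.DataFrame(wiki_movies_raw)

metadata_df = pd.read_csv("data/movies_metadata.csv", low_memory=False)
ratings_df = pd.read_csv("data/ratings.csv")
```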
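Continuing from the load above, consolidating alternate sidebar keys and dropping sparse columns might look like this (the key pairs and the 90%-null cutoff are assumptions):

```python
def change_column_name(movie, old_name, new_name):
    # Move a value from an alternate key to the canonical key.
    if old_name in movie:
        movie[new_name] = movie.pop(old_name)

# Consolidate a couple of alternate sidebar keys; the real pipeline
# handles many more pairs.
clean_movies = []
for movie in wiki_movies_raw:
    movie = dict(movie)  # copy so the raw records stay untouched
    change_column_name(movie, "Directed by", "Director")
    change_column_name(movie, "Written by", "Writer(s)")
    clean_movies.append(movie)

wiki_movies_df = pd.DataFrame(clean_movies)

# Keep only columns that are mostly populated; the 90%-null cutoff
# is an illustrative assumption.
keep = [col for col in wiki_movies_df.columns
        if wiki_movies_df[col].isnull().sum() < 0.9 * len(wiki_movies_df)]
wiki_movies_df = wiki_movies_df[keep]
```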
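For the regex step, a simplified parser for a hypothetical `Box office` column holding strings like `$123.4 million` or `$1,234,567` could be:

```python
import re

def parse_dollars(value):
    # Normalize dollar strings to floats. These two patterns are a
    # simplification; real Wikipedia values are messier.
    if not isinstance(value, str):
        return float("nan")
    match = re.match(r"\$\s*(\d+(?:\.\d+)?)\s+(million|billion)",
                     value, flags=re.IGNORECASE)
    if match:
        scale = 1e6 if match.group(2).lower() == "million" else 1e9
        return float(match.group(1)) * scale
    match = re.match(r"\$\s*([\d,]+)", value)
    if match:
        return float(match.group(1).replace(",", ""))
    return float("nan")

wiki_movies_df["box_office"] = wiki_movies_df["Box office"].map(parse_dollars)
```

Date columns can be handled the same way: extract a recognizable pattern with a regex, then hand the matches to `pd.to_datetime`.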
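Updating the metadata dtypes is mostly `pd.to_numeric`/`pd.to_datetime` with coercion; the column names below match the public Kaggle dump but should be treated as assumptions here:

```python
# Kaggle stores several numeric fields as strings; coerce them,
# turning unparseable values into NaN/NaT rather than raising.
for col in ["budget", "id", "popularity"]:
    metadata_df[col] = pd.to_numeric(metadata_df[col], errors="coerce")
metadata_df["release_date"] = pd.to_datetime(metadata_df["release_date"],
                                             errors="coerce")
```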
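The merge-and-reconcile pattern for the duplicated columns might look like the following (the `imdb_id` join key and the column pairs are assumptions about the two schemas):

```python
movies_df = wiki_movies_df.merge(metadata_df, on="imdb_id",
                                 suffixes=("_wiki", "_kaggle"))

def fill_missing_metadata(df, metadata_col, wiki_col):
    # Prefer the metadata value; fall back to the wiki value where the
    # metadata is null, then drop the now-redundant wiki column.
    df[metadata_col] = df[metadata_col].fillna(df[wiki_col])
    df.drop(columns=wiki_col, inplace=True)

fill_missing_metadata(movies_df, "runtime", "Running time")
fill_missing_metadata(movies_df, "revenue", "box_office")
```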
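A sketch of the ratings pivot, assuming the ratings file has `movieId` and `rating` columns:

```python
# Count how many times each rating value was given to each movie,
# then pivot so each rating value becomes its own column.
rating_counts = (ratings_df.groupby(["movieId", "rating"], as_index=False)
                 .size()
                 .pivot(index="movieId", columns="rating", values="size")
                 .fillna(0))
# Name each column after the rating value it counts, e.g. "rating_3.5".
rating_counts.columns = [f"rating_{col}" for col in rating_counts.columns]
```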
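Attaching the rating counts back onto the combined movie table could then be a left join (the `kaggle_id` link column is an assumption):

```python
# A left join keeps movies that have no ratings at all.
movies_with_ratings = movies_df.merge(rating_counts, left_on="kaggle_id",
                                      right_index=True, how="left")

# Movies with no ratings get zero counts instead of NaN.
rating_cols = [c for c in movies_with_ratings.columns
               if c.startswith("rating_")]
movies_with_ratings[rating_cols] = movies_with_ratings[rating_cols].fillna(0)
```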
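Finally, a sketch of the PostgreSQL export via SQLAlchemy; the connection string and table names are placeholders:

```python
from sqlalchemy import create_engine

# Placeholder connection string; substitute real credentials.
engine = create_engine("postgresql://user:password@localhost:5432/movie_data")
movies_with_ratings.to_sql("movies", engine, if_exists="replace", index=False)

# The raw ratings file is large, so it can be appended in chunks
# rather than loaded into memory all at once.
for chunk in pd.read_csv("data/ratings.csv", chunksize=1_000_000):
    chunk.to_sql("ratings", engine, if_exists="append", index=False)
```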