Movie Analysis on AWS
This repository contains the code and configuration files for setting up a movie recommendation system on AWS.
Overview
The movie recommendation system is designed to recommend movies to users based on their preferences. It utilizes both structured and unstructured datasets, performs data merging, cleansing, and loading, and leverages AWS services such as S3, Lambda, RDS, Glue, and SNS.
Setup Instructions
Create two S3 buckets to store data
Note: To distribute functionality, create IAM users and assign roles to them.
1. Structured Dataset
- Download the structured dataset from MovieLens.
- Use Python (movie_csv.ipynb) to upload the dataset to an S3 bucket named final-project-rawdata-group-5-sec-2.
2. Unstructured Dataset
- Use Python (api_call_upload_S3.ipynb) to call the TMDB API and obtain the unstructured dataset.
- Upload the dataset to the same S3 bucket (final-project-rawdata-group-5-sec-2).
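The API call above might look like the following sketch; it is an assumption about api_call_upload_S3.ipynb — the TMDB endpoint is real, but the selected fields and output filename are illustrative.

```python
# Sketch: fetch movie records from the TMDB API and keep a few fields.
import pandas as pd
import requests

def records_to_frame(results: list) -> pd.DataFrame:
    """Keep only the fields we need from the raw TMDB JSON results."""
    fields = ["id", "title", "popularity", "vote_average"]
    return pd.DataFrame([{f: r.get(f) for f in fields} for r in results])

def fetch_popular_movies(api_key: str, page: int = 1) -> pd.DataFrame:
    resp = requests.get(
        "https://api.themoviedb.org/3/movie/popular",
        params={"api_key": api_key, "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    return records_to_frame(resp.json()["results"])

if __name__ == "__main__":
    df = fetch_popular_movies("YOUR_TMDB_API_KEY")  # placeholder key
    df.to_csv("api_movie_data.csv", index=False)  # then upload to the raw bucket
```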
3. Data Merging with AWS Lambda
- Add the necessary layers (Amazon SDK) to your Lambda function.
- Assign an IAM role with full access to S3 and use it in the Lambda function.
- Create an AWS Lambda function (s3_upload_using_lambda) to merge and clean the datasets from the S3 bucket and load the result into another S3 bucket (final-project-mergedata-group-5-sec-2).
- Configure Lambda triggers to run the function every 30 days (rate(30 days)) and on file upload.
- Ensure proper naming conventions for files (api_movie_data.csv and movies.csv).
- Use Python with Pandas and NumPy for data cleaning and merging.
The screenshot below shows the data successfully loaded into the S3 bucket; the upload timestamp confirms it.
Note: The Lambda function replaces the old file with the new one.
4. Database Setup
- Create a database instance in RDS (make sure to create an initial database when creating the instance).
- Use the pandas .to_sql() function in Python to create a table from the merged dataset.
- Insert the table into an RDS instance.
- Design the conceptual model of the database.
- Identify dimensions and facts for the data warehouse schema.
- Define primary keys, foreign keys, and relationships between tables.
- Connect MySQL Workbench to the AWS RDS instance.
- Create the dimension and fact tables in RDS using the DDL script (create_table_OLAP.sql).
Note: You have to provide your AWS access key and secret key.
To verify that the tables were created, check the database; the screenshot below shows the data successfully inserted into the tables.
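The .to_sql() load above can be sketched as below; the connection URL, credentials, and table name are placeholders, not the project's actual values.

```python
# Sketch: load the merged CSV into an RDS MySQL table via pandas + SQLAlchemy.
import pandas as pd
from sqlalchemy import create_engine

def load_to_rds(csv_path: str, connection_url: str, table: str) -> int:
    """Read the merged CSV and write it to a database table, replacing old data."""
    df = pd.read_csv(csv_path)
    engine = create_engine(connection_url)
    df.to_sql(table, engine, if_exists="replace", index=False)
    return len(df)

if __name__ == "__main__":
    # Placeholder endpoint and credentials for the RDS MySQL instance.
    url = "mysql+pymysql://admin:<password>@<rds-endpoint>:3306/movie_details"
    load_to_rds("merged_movie_data.csv", url, "movie_data")
```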


5. AWS Glue Service(ETL)
- Assign appropriate policies to a role and grant that role to a user for ETL operations.
Note: Below are the policies that you need to attach to the role and grant to the user for managing ETL.
- Create MySQL connection to the RDS instance in AWS Glue.
- Use a crawler to map the data to the RDS database and table (use a JDBC data source, a proper IAM role, and an on-demand schedule).
- Use Visual ETL to create two ETL jobs, dimension_job and fact_job, to load data into the dimension and fact tables, respectively (create dimension_job first).
- Extract the data from the movie_details database, transform it, and load it into the data warehouse (refer to ETL using glue).
Note: The diagram below confirms that the data was successfully loaded into the database.
6. Email Notifications
- Create an SNS topic, select it, create a subscription to it, and confirm the subscription.
- In your RDS instance, create an event subscription and choose the SNS topic you created earlier; then update your database to trigger a confirmation email.
7. Visualization with Tableau
- Connect the RDS database to Tableau for visualization.
- Create visualizations to showcase relevant results (Movie Analysis.twb).
Prerequisites
- AWS account with appropriate permissions.
- Python environment with Pandas and NumPy for the Lambda function.
- MySQL Workbench
- Access to Tableau for visualization.