A recommendation engine using the item-based collaborative filtering technique, built on top of the Hadoop Distributed File System (HDFS) and MapReduce, to recommend movies from a repository of Netflix movie rating data.
The whole system runs on a cluster of AWS EC2 instances with the SunU-Hadoop-Image v1.3 AMI, configured with 1 master node and 5 slave nodes, all of which use the t2.large instance type.
Task orchestration follows the flow below:
stateDiagram
Netflix_Dataset(Input) --> UserList
Netflix_Dataset(Input) --> DataDividedByMovie
DataDividedByMovie --> MovieVector
UserList --> MovieVector
MovieVector --> CosineSimilarity
CosineSimilarity --> Similarity_Table(Output)
Each stage emits key/value pairs as follows:

| Stage | Key | Pair | Example |
| --- | --- | --- | --- |
| UserList | "$User_List" | UserID | "$User_List" = 1025579 |
| DataDividedByMovie | MovieTitle | (UserID, Rating) | Character = (1025579, 4) |
| MovieVector | MovieTitle | VectorSpace | Character = [4, 0, 5, 0, 0, ...] |
| CosineSimilarity | (MovieTitle, MovieTitle) | SimilarityScore | (Character, Captain Blood) = 0.2121 |
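For reference, the SimilarityScore produced by the CosineSimilarity stage is the standard cosine similarity between two movie vectors. Below is a minimal Python sketch with made-up toy vectors; in the real job each vector has one slot per UserID from the UserList stage.

```python
import math

# Toy movie vectors in the style of the MovieVector stage (illustrative values).
character = [4, 0, 5, 0, 0]
captain_blood = [0, 3, 4, 0, 2]

def cosine_similarity(v1, v2):
    """SimilarityScore = (v1 . v2) / (|v1| * |v2|), or 0.0 for an all-zero vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(round(cosine_similarity(character, captain_blood), 4))
```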
- Ensure the EC2 cluster has been configured with MapReduce and HDFS.
# check if it's empty
hadoop fs -ls /
# create the directories
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hadoop
hadoop fs -mkdir /user/hadoop/netflix_data
hadoop fs -mkdir /user/hadoop/results
- Clone the repository on the master node
git clone "https://github.com/Grg0rry/MapReduce-Recommendation-System"
- Download the dataset [follow here]
- Upload the data to HDFS
hadoop fs -put data/cleaned_moviesTitles.csv netflix_data/
hadoop fs -put data/sample netflix_data/
IMPORTANT: Double-check that it follows the file structure [follow here].
(Proceed only if the file structure matches.)
- Run chmod to make the scripts executable
chmod +x script/*
- Execute the script (a sketch of one of the underlying jobs follows this list)
script/<script-file.sh>
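The scripts wrap the different implementations listed under src/. As a point of reference, the jobs in src/py_mrjob can also be launched directly with mrjob's Hadoop runner; below is a minimal sketch of a UserList-style job, where the input column layout (MovieTitle,UserID,Rating) is an assumption rather than the repository's guaranteed format.

```python
from mrjob.job import MRJob

class UserList(MRJob):
    """Collect the distinct UserIDs under the single "$User_List" key."""

    def mapper(self, _, line):
        fields = line.strip().split(',')  # assumed layout: MovieTitle,UserID,Rating
        if len(fields) >= 2:
            yield '$User_List', fields[1]

    def reducer(self, key, user_ids):
        for user_id in sorted(set(user_ids)):
            yield key, user_id

if __name__ == '__main__':
    UserList.run()
```

Such a job would run against HDFS with something like `python src/py_mrjob/UserList.py -r hadoop hdfs:///user/hadoop/netflix_data/sample -o hdfs:///user/hadoop/results/py_mrjob`.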
The original dataset comes from the Netflix Kaggle Competition Page [Click Here].
|-- data
| |-- combined_data_1.txt
| |-- combined_data_2.txt
| |-- combined_data_3.txt
| |-- combined_data_4.txt
| |-- movies_titles.csv
| |-- ...
Data cleaning was done using the script/preprocessing.sh script, which merges and aggregates the data into the structure below.
|-- data
| |-- ...
| |-- cleaned_ratings.csv
| |-- cleaned_titles.csv
| |-- cleaned_moviesTitles.csv
| |-- sample
| |-- ...
The sample is obtained by running:
split -b 500M data/cleaned_moviesTitles.csv sample
(split writes 500 MB chunks named sampleaa, sampleab, and so on.)
To access the cleaned dataset directly:
- Download the zip file: [Download]
- Access through AWS-CLI
aws s3 ls s3://public-netflix-data-store/data
# aws s3 sync takes a prefix, not a glob
aws s3 sync s3://public-netflix-data-store/data <local-folder>
Note:
The data cleaning can only be done on a local instance without MapReduce, due to the format and structure of the combined_data_*.txt files. Preferably perform it on an instance of type t2.xlarge with an EBS volume of at least 20 GiB.
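The combined_data_*.txt files follow the Netflix Prize layout: a `MovieID:` header line introduces each block, followed by `UserID,Rating,Date` rows for that movie. A minimal sketch of the kind of flattening the preprocessing has to perform (the function name and output layout are illustrative, not the repository's exact code):

```python
import csv

def flatten_combined(in_path, out_path):
    """Rewrite a combined_data_*.txt file as flat MovieID,UserID,Rating rows."""
    with open(in_path) as fin, open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        movie_id = None
        for raw in fin:
            line = raw.strip()
            if line.endswith(':'):      # e.g. "123:" starts a new movie block
                movie_id = line[:-1]
            elif line:
                user_id, rating, _date = line.split(',')
                writer.writerow([movie_id, user_id, rating])

flatten_combined('data/combined_data_1.txt', 'data/flat_ratings_1.csv')
```

Because each rating row only makes sense relative to the `MovieID:` header above it, the file cannot be split across mappers naively, which is why the cleaning runs on a single local instance.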
Below is the structure used to organise files in both HDFS and the local Linux directory.
Directory Structure (Local):
/home/hadoop/recommendation-system
|-- src
| |-- java_mapreduce
| | |-- javamr.jar
| | `-- solution
| | |-- UserList.java
| | |-- DataDividedByMovie.java
| | |-- MoviesVector.java
| | `-- CosineSimilarity.java
| |-- py_mapreduce
| | |-- UserList_Mapper.py
| | |-- UserList_Reducer.py
| | |-- DataDividedByMovie_Mapper.py
| | |-- DataDividedByMovie_Reducer.py
| | |-- MoviesVector_Mapper.py
| | |-- MoviesVector_Reducer.py
| | |-- CosineSimilarity_Mapper.py
| | `-- CosineSimilarity_Reducer.py
| |-- py_mrjob
| | |-- UserList.py
| | |-- DataDividedByMovie.py
| | |-- MoviesVector.py
| | `-- CosineSimilarity.py
| `-- main.py
|-- preprocessing
| |-- RatingsPreprocessing.py
| |-- TitlesPreprocessing.py
| `-- CombineMovieTitles.py
|-- script
| |-- preprocessing.sh
| |-- local_mapreduce.sh
| |-- py_mapred_streaming.sh
| |-- py_mrjob.sh
| |-- java_mapreduce.sh
| |-- java_py_mapreduce.sh
| `-- recommendMovie.sh
|...
Directory Structure (HDFS):
/user/hadoop
|-- netflix_data
| |-- sample
| `-- cleaned_moviesTitles.csv
`-- results
|-- local_mapreduce
|-- py_mapred
|-- py_mrjob
|-- java_mapreduce
`-- java_py_mapreduce
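For reference, the jobs under src/py_mapreduce follow the Hadoop Streaming contract: a mapper reads raw lines on stdin and prints tab-separated key/value pairs, and a reducer receives those pairs sorted by key. Below is a minimal sketch of a DataDividedByMovie-style pair, again assuming MovieTitle,UserID,Rating input rows.

```python
#!/usr/bin/env python3
# DataDividedByMovie_Mapper.py (sketch): key each rating by its movie title.
import sys

for line in sys.stdin:
    fields = line.strip().split(',')  # assumed layout: MovieTitle,UserID,Rating
    if len(fields) >= 3:
        print(f"{fields[0]}\t{fields[1]},{fields[2]}")
```

```python
#!/usr/bin/env python3
# DataDividedByMovie_Reducer.py (sketch): collect (UserID, Rating) pairs per movie.
import sys

current_title, pairs = None, []
for line in sys.stdin:
    title, _, pair = line.strip().partition('\t')
    if title != current_title and current_title is not None:
        print(f"{current_title}\t{';'.join(pairs)}")
        pairs = []
    current_title = title
    pairs.append(pair)
if current_title is not None:
    print(f"{current_title}\t{';'.join(pairs)}")
```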