This repository presents a scalable, end-to-end Book Recommendation System built on a big data pipeline following the Medallion architecture (Bronze → Silver → Gold) in a modern Data Lakehouse design. It uses Apache Spark for ETL and model training, Hadoop HDFS as the data lake storage layer, and Spark MLlib for collaborative filtering. The project demonstrates a practical, reproducible implementation of large-scale data engineering and machine learning workflows using open-source tools.
🔍 Key Components Included:
- **Data Lake Design**: implements the Medallion Architecture (Bronze → Silver → Gold) to structure raw, cleaned, and enriched datasets.
- **Data Engineering Workflows**: ingests and processes data using Apache Spark, transforming source CSVs into efficient analytical formats (Parquet/ORC).
- **Machine Learning Pipeline**: builds a Collaborative Filtering model (ALS) to generate personalized book recommendations.
| Layer | Technology | Purpose |
|---|---|---|
| Storage | HDFS | Distributed file storage |
| Processing | Apache Spark | Data processing & ML |
| Metadata | Apache Hive | Data warehouse & SQL |
| Analytics | PySpark | Data analysis |
| ML | Spark MLlib | Machine learning |
Data lake architecture
- **Bronze**: raw CSV files in HDFS
- **Silver**: cleaned & joined data in Parquet
- **Gold**: model output (Top-N recommendations) in ORC/Parquet
| File | Records | Columns | Description |
|---|---|---|---|
| books.csv | ~271k | 23 | Detailed book information |
| ratings.csv | ~1.1M | 3 | User ratings for books |
| users.csv | ~278k | 3 | User information |
# Clone the repository
git clone https://github.com/sains-data/Sistem-Rekomendasi-Buku.git
# Start all services in the background
docker-compose up -d
# Verify the containers are running
docker ps
| Service | Function | Port | URL |
|---|---|---|---|
| HDFS NameNode | View file & directory status (ingest results) | 9870 | http://localhost:9870 |
| Spark Master | View cluster status & application list | 8080 | http://localhost:8080 |
| Spark Application | Details & progress of Spark jobs (ingest, ML, etc.) | 4040 * | http://localhost:4040 |
| HiveServer2 | Run SQL queries & view sessions | 10002 | http://localhost:10002 |
# Create the HDFS directories
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation/bronze
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation/silver
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation/gold
# Make script executable
chmod +x scripts/ingest.sh
# Run ingestion
./scripts/ingest.sh
# Run Bronze → Silver ETL script
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/etl_bronze_to_silver.py
# Run Silver → Gold transformation script
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/transform_silver_to_gold.py
# Train the ALS model
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/train_als_model.py
# Run model evaluation script
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/evaluate_model.py
Once you're done working with the pipeline and its services, you can gracefully stop and remove all running containers and networks (add the `-v` flag to also remove volumes) using:
docker-compose down
book-recommendation-system/
├── README.md
├── airflows/
│   └── dags/
│       └── airflow_dag.py
├── books_data/
│   ├── books.csv
│   ├── users.csv
│   └── ratings.csv
├── Docs/
│   ├── data-lake-architecture.png
│   └── README.md
├── src/
│   ├── ingest.sh
│   ├── etl_spark.py
│   ├── train_model.py
│   ├── evaluate_model.py
│   ├── book_recommendation.py
│   └── airflow_dag.py
├── docker-compose.yml
└── requirements.txt
- Mayada (121450145)
- Natasya Ega Lina Marbun (122450024)
- Syalaisha Andini Putriansyah (122450111)
- Anwar Muslim (122450117)