This repository presents a scalable, end-to-end Book Recommendation System built on a big data pipeline following the Medallion architecture (Bronze → Silver → Gold) in a modern Data Lakehouse design. It uses Apache Spark for ETL and model training, Hadoop HDFS as the data lake storage layer, and Spark MLlib for collaborative filtering. The project demonstrates a practical, reproducible implementation of large-scale data engineering and machine learning workflows using open-source tools.
🔍 Key Components Included:
- **Data Lake Design**: implements the Medallion Architecture (Bronze → Silver → Gold) to structure raw, cleaned, and enriched datasets.
- **Data Engineering Workflows**: ingests and processes data using Apache Spark, transforming source CSVs into efficient analytical formats (Parquet/ORC).
- **Machine Learning Pipeline**: builds a Collaborative Filtering model (ALS) to generate personalized book recommendations.
| Layer | Technology | Purpose |
|---|---|---|
| Storage | HDFS | Distributed file storage |
| Processing | Apache Spark | Data processing & ML |
| Metadata | Apache Hive | Data warehouse & SQL |
| Analytics | PySpark | Data analysis |
| ML | Spark MLlib | Machine learning |
Data lake architecture
- **Bronze**: raw CSV files in HDFS
- **Silver**: cleaned & joined data in Parquet
- **Gold**: model output (Top-N recommendations) in ORC/Parquet
| File | Records | Columns | Description |
|---|---|---|---|
| books.csv | ~271k | 23 | Detailed book information |
| ratings.csv | ~1.1M | 3 | User ratings for books |
| users.csv | ~278k | 3 | User information |
# Clone the repository
git clone https://github.com/sains-data/Sistem-Rekomendasi-Buku.git
# Start all services in the background
docker-compose up -d
# Verify the containers are running
docker ps
| Service | Function | Port | URL |
|---|---|---|---|
| HDFS NameNode | View file & directory status (ingest results) | 9870 | http://localhost:9870 |
| Spark Master | View cluster status & application list | 8080 | http://localhost:8080 |
| Spark Application | Details & progress of Spark jobs (ingest, ML, etc.) | 4040 * | http://localhost:4040 |
| HiveServer2 | Run SQL queries & view sessions | 10002 | http://localhost:10002 |
# Create the HDFS directories
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation/bronze
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation/silver
docker exec -it namenode hdfs dfs -mkdir -p /user/book-recommendation/gold
# Make script executable
chmod +x scripts/ingest.sh
# Run ingestion
./scripts/ingest.sh
# Run Bronze → Silver ETL script
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/etl_bronze_to_silver.py
# Run Silver → Gold transformation script
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/transform_silver_to_gold.py
# Train the ALS model
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/train_als_model.py
# Run model evaluation script
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/evaluate_model.py
Once you're done working with the pipeline and its services, you can gracefully stop and remove all running containers and networks (add the `-v` flag to also remove volumes) using:
docker-compose down
book-recommendation-system/
├── README.md
├── airflows/
│   └── dags/
│       └── airflow_dag.py
├── books_data/
│   ├── books.csv
│   ├── users.csv
│   └── ratings.csv
├── Docs/
│   ├── data-lake-architecture.png
│   └── README.md
├── src/
│   ├── ingest.sh
│   ├── etl_spark.py
│   ├── train_model.py
│   ├── evaluate_model.py
│   ├── book_recommendation.py
│   └── airflow_dag.py
├── docker-compose.yml
└── requirements.txt
- Mayada (121450145)
- Natasya Ega Lina Marbun (122450024)
- Syalaisha Andini Putriansyah (122450111)
- Anwar Muslim (122450117)