- Design and implement a data pipeline using Apache Spark with the PySpark Library for pre-processing the raw dataset.
- Compare the performance of three supervised classification techniques to suggest an efficient model for predicting whether the customer placed an order
Dataset: Link
Summary: This data set represents a day's worth of visit to a fictional website. Each row represents a unique customer, identified by their unique UserID. The columns represent feature of the users visit (such as the device they were using) and things the user did on the website in that day. It’s a classic dataset to explore and expand your analytics skills and sensitive with business case in the future
Problem: Predict whether the customer placed an order
Describe: It has 445,402 rows in train set and 161,656 rows in test set. Each have 25 columns (features) and can get some describe on the link above but in this data a few of the features we need to consider are:
- basket_add_detail: Did the customer add a product to their shopping basket from the product detail page?
- sign_in: Did the customer sign in to the website?
- saw_homepage: Did the customer visit the website's homepage?
- returning_user: Is this visitor new, or returning?
- ordered: Did the customer place an order?
git clone https://github.com/nghoanglong/customer-prosensity.git
cd customer-prosensity
bash buil-image.sh
docker-compose up
- hadoop cluster: http://localhost:50070
- hadoop cluster - resource manager: http://localhost:8088
- spark cluster: https://localhost:8080
- jupyter notebook: https://localhost:8888
docker cp [dataset.csv] hadoop-spark-master:/root
docker exec -it hadoop-spark-master bash
hdfs dfs -put dataset.csv [hadoop_path]
Language: Python
Framework/Libary: Apache Spark, PySpark, Pandas, Numpy
Machine Learning Algorithms: Decision Tree Regression
Metrics: Accuracy, F1-score, Recall, Precision
Author: Nguyen Hoang Long
Facebook: https://www.facebook.com/umhummmm/