Data source : https://www.kaggle.com/prachi13/customer-analytics
Table of Contents
• Stage 1: Data Exploration, Exploratory Data Analysis, Business Insight, and Visualization
• Stage 2: Data Cleansing and Feature Engineering
• Stage 3: Modeling and Evaluation
Overall Project :
• Sought insights from the dataset with Exploratory Data Analysis
• Performed data cleansing, data processing, data engineering to prepare data before modeling
• Built a model to predict whether the shipping deliveries will be received late or on time by the customers
• Developed recommendations and a benefit analysis based on the insights and model predictions
An international e-commerce company that sells electronic products wants to discover key insights from its customer database. Currently, most of its shipping deliveries are late.
| Variable | Type | Definition | Example |
|---|---|---|---|
| ID | Nominal | Customer ID Number | 10, 15, 10995, 10996 |
| Warehouse_block | Nominal | Warehouse to Store the Product | A, B, C, D, F |
| Mode_of_Shipment | Nominal | Mode of Product Shipping | Flight, Road, Ship |
| Customer_care_calls | Discrete | Number of Calls Made | 1, 2, 5, 6 |
| Customer_rating | Ordinal | Company Rating by Customers | 5: Best - 4: Better - 3: Neutral - 2: Bad - 1: Worst |
| Cost_of_the_Product | Discrete | Cost of Product in US Dollars | 177, 216, 236, 182 |
| Prior_purchases | Discrete | Number of Prior Purchase | 3, 2, 6 |
| Product_importance | Ordinal | Product Importance Parameter | Low, Medium, High |
| Gender | Nominal | Customer Gender | Male, Female |
| Discount_offered | Discrete | Product Discount in US Dollars | 65, 10, 16 |
| Weight_in_gms | Continuous | Product Weight in grams | 4953, 5676, 2171 |
| Reached.on.Time_Y.N | Nominal | Target Variable, 1: NOT reached on time - 0: REACHED on time | 1, 0 |
• 59.7% of shipping deliveries are received late by customers (6,563 of 10,999).
• Shipments by Ship and from Warehouse F have the highest number of deliveries, but the percentage of late deliveries is almost the same across shipment modes and warehouse blocks. This suggests that lateness is influenced by other factors.
• Every product with a discount above 10 was delivered late. One hypothesis is that this happens in specific months, but it needs further checking.
• Every product with a weight between 2 and 4 kg was delivered late.
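The insights above come down to a few grouped aggregations on the target column. The following is a minimal sketch of how they could be computed with pandas. The rows are made up for illustration (they are not taken from the Kaggle file), and only the column names come from the data dictionary.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the data dictionary.
df = pd.DataFrame({
    "Mode_of_Shipment": ["Ship", "Ship", "Ship", "Flight", "Road", "Ship"],
    "Discount_offered": [5, 44, 3, 12, 8, 65],
    "Weight_in_gms": [4953, 1233, 5676, 3088, 2171, 1500],
    "Reached.on.Time_Y.N": [0, 1, 0, 1, 1, 1],
})
target = "Reached.on.Time_Y.N"  # 1 = NOT reached on time (late), 0 = on time

# Overall share of late deliveries.
late_share = df[target].mean()

# Delivery count and late rate per shipment mode.
by_mode = df.groupby("Mode_of_Shipment")[target].agg(
    deliveries="count", late_rate="mean"
)

# Late rate among products with a discount above 10.
late_rate_high_discount = df.loc[df["Discount_offered"] > 10, target].mean()

# Late rate among products weighing between 2 and 4 kg (bounds inclusive).
in_band = df["Weight_in_gms"].between(2000, 4000)
late_rate_mid_weight = df.loc[in_band, target].mean()

print(f"late share: {late_share:.1%}")
print(by_mode)
```

Comparing `deliveries` with `late_rate` per group is what separates "most frequent" from "most often late": a group can dominate the counts while its late rate stays close to the others.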

• Check missing & duplicate values
• Remove outliers with z-score
• Apply ordinal encoding to the Product_importance column and encode the remaining categorical columns
• Select best features for modeling
• Normalize & standardize all selected features
• Split features & target
• Split the data into training and test sets
• Train models with 6 different algorithms: Decision Tree, Logistic Regression, Random Forest, XGBoost, KNN, and LightGBM
• Evaluate the models with Accuracy, Precision, Recall, F1-Score, and AUC, with a focus on AUC
• Hyperparameter tuning
• Select the best model
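The steps above can be sketched end to end with pandas and scikit-learn. This is an illustration under stated assumptions, not the project's actual code:

- The data is synthetic, with one injected outlier.
- The outlier cut-off is |z| < 3.
- `Product_importance` is ordered low < medium < high.
- `Mode_of_Shipment` is one-hot encoded.
- The split is 80/20.
- Random Forest stands in for the full set of models.

Feature selection and hyperparameter tuning are omitted. The scaler is fitted on the training split only, which is a common way to keep test statistics out of training.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (random values, NOT the Kaggle data).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Product_importance": rng.choice(["low", "medium", "high"], size=n),
    "Mode_of_Shipment": rng.choice(["Flight", "Road", "Ship"], size=n),
    "Discount_offered": rng.integers(1, 66, size=n),
    "Weight_in_gms": rng.integers(1000, 7000, size=n),
})
df.loc[0, "Weight_in_gms"] = 50_000  # injected outlier for the z-score step
target = "Reached.on.Time_Y.N"
# Toy label: late when the discount exceeds 10, with 5% of labels flipped.
df[target] = ((df["Discount_offered"] > 10) ^ (rng.random(n) < 0.05)).astype(int)

# 1. Check missing and duplicate values.
n_missing = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())

# 2. Remove outliers: keep rows whose numeric features all have |z| < 3.
num_cols = ["Discount_offered", "Weight_in_gms"]
z = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
clean = df[(z.abs() < 3).all(axis=1)].copy()

# 3. Ordinal-encode importance (assumed order), one-hot encode shipment mode.
clean["Product_importance"] = clean["Product_importance"].map(
    {"low": 0, "medium": 1, "high": 2}
)
clean = pd.get_dummies(clean, columns=["Mode_of_Shipment"], dtype=int)

# 4. Split features and target, then split into training and test sets.
X = clean.drop(columns=[target])
y = clean[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5. Standardize, fitting the scaler on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 6. Train one model and evaluate it with AUC on the held-out split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_s, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test_s)[:, 1])
print(f"rows kept: {len(clean)} of {len(df)}, test AUC: {auc:.3f}")
```

The AUC printed here reflects the toy label rule only and says nothing about performance on the real dataset.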
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Decision Tree | 0.65 | 0.72 | 0.66 | 0.69 | 0.65 |
| Logistic Regression | 0.58 | 0.58 | 1.00 | 0.73 | 0.50 |
| LightGBM | 0.66 | 0.76 | 0.60 | 0.67 | 0.739 |
| KNN | 0.66 | 0.78 | 0.56 | 0.65 | 0.67 |
| Random Forest | 0.68 | 0.82 | 0.56 | 0.67 | 0.70 |
| XGBoost | 0.65 | 0.71 | 0.67 | 0.69 | 0.65 |
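As a reading aid for the table, the sketch below derives Accuracy, Precision, Recall, and F1-Score from confusion-matrix counts. The counts are hypothetical: a test set of 100 deliveries with 58 late, scored by a classifier that labels every delivery as late. That setup reproduces the Logistic Regression row, which suggests (as an inference from the table, not something stated in the source) that this model may be predicting a single class. That would also fit its AUC of 0.50.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, precision, recall, f1


# Hypothetical counts: 58 late deliveries, 42 on time, everything predicted late.
acc, prec, rec, f1 = classification_metrics(tp=58, fp=42, fn=0, tn=0)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
# 0.58 0.58 1.0 0.73
```

A perfect recall can therefore coexist with no real discriminative power, which is one reason to focus on AUC rather than on Recall or F1-Score alone.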
• Add an estimated arrival time to help ensure that packages arrive on time
• Give credit points as compensation to retain customer loyalty
• Add more features to the dataset to give more specific and accurate insights
• Perform an operational audit based on the insights