This project focuses on predicting bike rental demand for the Capital Bikeshare program in Washington, D.C. by analyzing historical usage patterns and weather data.
The objective is to forecast bike rental demand by leveraging data science techniques, including exploratory data analysis, feature engineering, and machine learning models.
- Om Jodhpurkar
- Sandesh Sachdev
- Aayush Chandak
- Hevesh Lakhwani
- Python 🐍 - Programming Language
- Amazon SageMaker ☁️ - Jupyter Notebook environment for model training
- Amazon S3 📂 - Cloud storage for datasets
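Since the datasets sit in Amazon S3 and training runs in a SageMaker notebook, loading them might look like the sketch below; the bucket and key names are placeholders, not the project's actual paths.

```python
import pandas as pd

# Placeholder S3 location -- substitute the project's actual bucket and key.
BUCKET = "my-bikeshare-bucket"
KEY = "data/train.csv"

# pandas can read directly from S3 when s3fs is available (preinstalled in the
# standard SageMaker data-science kernels).
df = pd.read_csv(f"s3://{BUCKET}/{KEY}", parse_dates=["datetime"])
print(df.shape)
```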
Column Name | Type | Description |
---|---|---|
DATETIME | Datetime | "yyyy/mm/dd hh:mm" format |
SEASON | Integer | 1 = Spring, 2 = Summer, 3 = Fall, 4 = Winter |
HOLIDAY | Integer | 1 = Holiday, 0 = Not a holiday |
WORKINGDAY | Integer | 1 = Working day, 0 = Weekend/Holiday |
WEATHER | Integer | 1 = Clear/Few Clouds, 2 = Mist/Cloudy, 3 = Light Snow/Rain, 4 = Heavy Rain |
TEMP | Float | Hourly temperature (°C) |
ATEMP | Float | "Feels like" temperature (°C) |
HUMIDITY | Float | Relative humidity (%) |
WINDSPEED | Float | Wind speed |
Column Name | Type | Description |
---|---|---|
REGISTERED | Integer | Number of registered users |
CASUAL | Integer | Number of non-registered users |
COUNT | Integer | Total rentals (registered + casual) |
📌 Note: Modeling was done separately for `casual` and `registered` values to predict the total `count`.
EDA techniques used in this project include:
- Handling Missing Values ❌
- Removing Duplicates 🗑️
- Outlier Treatment 📏
- Data Normalization & Scaling 📉
- Encoding Categorical Variables 🔠
- Bivariate Analysis 📈
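A condensed sketch of what these steps might look like with pandas and scikit-learn; the file name, lowercase column names, and the 3-sigma outlier rule are illustrative assumptions rather than the project's exact choices.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("train.csv", parse_dates=["datetime"])  # assumed file name

# Handling missing values: forward-fill gaps in the weather-related columns.
weather_cols = ["temp", "atemp", "humidity", "windspeed"]
df[weather_cols] = df[weather_cols].ffill()

# Removing duplicates.
df = df.drop_duplicates()

# Outlier treatment: drop rows more than 3 standard deviations above the mean count.
upper = df["count"].mean() + 3 * df["count"].std()
df = df[df["count"] <= upper]

# Normalization & scaling of the continuous features.
df[weather_cols] = MinMaxScaler().fit_transform(df[weather_cols])

# Encoding categorical variables: season and weather are nominal codes.
df = pd.get_dummies(df, columns=["season", "weather"], drop_first=True)
```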
To train the model effectively, the `DATETIME` column was broken down into:
Feature | Type | Range |
---|---|---|
HOUR | Integer | 0-23 |
DAY | Integer | 0-6 (weekday representation) |
MONTH | Integer | 1-12 |
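A minimal pandas sketch of this decomposition, assuming the Kaggle-style lowercase `datetime` column name:

```python
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["datetime"])  # assumed file name

# Break the timestamp into the three engineered features.
df["hour"] = df["datetime"].dt.hour         # 0-23
df["day"] = df["datetime"].dt.dayofweek     # 0-6, Monday = 0
df["month"] = df["datetime"].dt.month       # 1-12
```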
- Hourly Trends: Peak rental hours differ for working vs. non-working days.
- Seasonal Trends: Fall (`Season 3`) has the highest rentals, while Spring (`Season 1`) has the least.
- Temperature Relation: Higher temperatures result in more bike rentals.
- New Feature `PEAK`: Created based on high-demand hours (see the sketch below).
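The repository does not spell out the exact rule for `PEAK`, so the sketch below shows one plausible definition based on commute-hour windows; the hour ranges are assumptions, not the project's actual thresholds.

```python
import pandas as pd

def add_peak_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Flag high-demand hours: assumed commute windows on working days,
    an assumed midday window otherwise."""
    commute_hours = {7, 8, 9, 17, 18, 19}   # assumed working-day peaks
    leisure_hours = set(range(10, 19))      # assumed weekend/holiday peaks

    def is_peak(row) -> int:
        hours = commute_hours if row["workingday"] == 1 else leisure_hours
        return int(row["hour"] in hours)

    df["peak"] = df.apply(is_peak, axis=1)
    return df
```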
Why Feature Selection?
- Reduces training time ⏳
- Reduces algorithm complexity ⚡
- Avoids misleading data ❌
- Minimizes redundancy 🔄
- Prevents overfitting 📉
- Improves model accuracy ✅
- Univariate Selection 📊
- Feature Importance 🌟
- Correlation Heatmap 🔥
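Hedged scikit-learn examples of the three techniques; the feature list, `k`, and the use of the casual target are illustrative, and `df` refers to the preprocessed frame from the earlier sketches.

```python
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

features = ["hour", "day", "month", "temp", "humidity", "windspeed", "workingday", "peak"]
X, y = df[features], df["casual"]

# Univariate selection: score each feature independently against the target.
selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print(dict(zip(features, selector.scores_)))

# Feature importance from a tree ensemble.
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
print(dict(zip(features, forest.feature_importances_)))

# Correlation heatmap of features and target.
sns.heatmap(df[features + ["casual"]].corr(), annot=True, cmap="coolwarm")
```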
For Casual Count | For Registered Count |
---|---|
Hour ⏳ | Hour ⏳ |
Humidity 💧 | Humidity 💧 |
Temperature 🌡️ | Month 📅 |
Working Day 🏢 | Working Day 🏢 |
Peak 🚀 | Peak 🚀 |
We explored various regression models, including:
- Linear Regression 📉
- Decision Trees 🌳
- Random Forest 🌲
- Adaptive Boosting (AdaBoost) 🔥
- Gradient Boosting (GBM) 📈
- XGBoost 🚀
We used Root Mean Square Error (RMSE) as our key evaluation metric.
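For reference, RMSE is the square root of the mean squared difference between predicted and actual counts; a small helper (names are ours, not from the repository):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred) -> float:
    """Root Mean Square Error: sqrt(mean((y_true - y_pred)^2))."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))
```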
1️⃣ Create: Declare the model
2️⃣ Train: Fit the model on the training data
3️⃣ Evaluate: Measure performance using RMSE
4️⃣ Predict: Generate predictions on the test data
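A hedged end-to-end sketch of the four steps for a single target using XGBoost; the split size and hyperparameters are illustrative, and the same cycle is repeated for the `casual` and `registered` models before summing their predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Assumed: X holds the selected features, y one target (casual or registered).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Create: declare the model.
model = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=42)

# 2. Train: fit the model on the training data.
model.fit(X_train, y_train)

# 3. Evaluate: measure performance using RMSE on the held-out split.
val_pred = model.predict(X_val)
print("Validation RMSE:", np.sqrt(mean_squared_error(y_val, val_pred)))

# 4. Predict: generate predictions on the test data (X_test would hold the
# engineered test features; shown here with the validation split).
predictions = model.predict(X_val)
```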
Cross-validation was used to evaluate model generalizability. The process included:
- Shuffling the dataset randomly
- Splitting it into `k` groups
- Training on `k-1` groups and testing on the remaining group
- Averaging the performance across all groups
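With scikit-learn this corresponds to shuffled k-fold cross-validation; a minimal sketch with an assumed k of 5:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# X, y as in the workflow sketch above; k = 5 is an assumption.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(random_state=42),
    X, y,
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
    cv=cv,
)
print("Mean CV RMSE:", -scores.mean())
```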
We performed hyperparameter tuning using:
- Grid Search 🔍
- Random Search 🎲
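A hedged sketch of both approaches with scikit-learn; the parameter grid and distributions are illustrative, not the values actually searched.

```python
from scipy.stats import randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor

# Grid Search: exhaustively evaluates every combination in the grid.
param_grid = {"n_estimators": [100, 200, 400], "max_depth": [3, 5, 7]}
grid = GridSearchCV(XGBRegressor(random_state=42), param_grid,
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X, y)  # X, y as in the earlier sketches

# Random Search: samples a fixed number of combinations from distributions.
param_dist = {"n_estimators": randint(100, 500), "max_depth": randint(3, 10)}
rand = RandomizedSearchCV(XGBRegressor(random_state=42), param_dist, n_iter=20,
                          scoring="neg_root_mean_squared_error", cv=5,
                          random_state=42)
rand.fit(X, y)

print("Best grid params:", grid.best_params_)
print("Best random-search params:", rand.best_params_)
```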
Model | RMSE (Before) | RMSE (After Tuning) |
---|---|---|
XGBRegressor (Casual) | 16.31 | 14.91 |
RandomForestRegressor (Casual) | 16.73 | 15.92 |
XGBRegressor (Registered) | 53.52 | 51.95 |
RandomForestRegressor (Registered) | 57.06 | 54.27 |
Our analysis predicted bike rental demand with improved accuracy after feature selection, hyperparameter tuning, and model selection. XGBRegressor achieved the lowest RMSE for both the casual and registered targets, with RandomForestRegressor a close second.