This repository contains a detailed analysis and predictive modeling project aimed at forecasting the number of days an Airbnb listing in Los Angeles will be booked in the next 30 days. Below, you’ll find an overview of the project, methodologies, results, and actionable insights.
Table of Contents for the Jupyter Notebook (`airbnb_booking_days_prediction.ipynb`)
- Introduction
- Library Import
- Data Import and Basic Preprocessing
- Data Split
- Exploratory Data Analysis (EDA)
- 5.1 Data Overview
- 5.2 Numerical Features
- 5.3 Categorical Features
- Preprocessing and Transformations
- Define Functions for Comparing Results
- Baseline Model
- Ridge and Hyperparameter Tuning
- Elastic Net and Hyperparameter Tuning
- Check Permutation Importance
- More Feature Engineering
- Compare Model Performance After More Feature Engineering
- Non-Linear Model: Random Forest and Hyperparameter Tuning
- Permutation Importance for the Best Model
- Final Prediction
- Insights and Recommendations
- Improvements
The data for this project is sourced from Inside Airbnb, providing extensive information about Airbnb properties listed in Los Angeles, California. The dataset includes over 70 features related to listings, hosts, and booking patterns. For a detailed understanding of the data, refer to the data dictionary here: link.
Unlike the common focus on predicting listing prices, the primary goal of this project is to predict the number of days a given Airbnb listing will be booked in the next 30 days. This shift aims to provide actionable insights for hosts and the platform to optimize booking performance.
The dataset contains over 70 features, and it is unclear which are most relevant for predicting booking days. This necessitates multiple rounds of feature engineering to identify and prioritize the most impactful variables, addressing issues like multicollinearity, missing data, and feature relevance.
- Data Preprocessing Techniques: Imputing missing values, normalization, discretization, clamping outliers, log transformation, one-hot encoding, missing indicator, checking residuals, and creating interaction terms.
- Building Data Transformation Pipelines: Streamlined workflows for consistent preprocessing across models.
- Regularization: Ridge and Elastic Net to handle overfitting and feature selection.
- Cross Validation: To ensure robust model evaluation.
- Hyperparameter Tuning: GridSearchCV for optimizing model parameters.
- Feature Importance: Permutation importance to assess feature contributions.
- Custom Functions: Self-defined functions for comparing and visualizing results.
Initial exploration of the dataset reveals its structure, including feature types and distributions, setting the stage for targeted preprocessing.
Certain numerical features, such as `id`, columns with minimal variance (e.g., `calendar_updated`), and the target column (`days_booked`), provide little predictive value and are dropped from the training set.
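A minimal sketch of this step, assuming the data has been loaded into a pandas DataFrame and the target column is named `days_booked` as described above (the file name is also an assumption):

```python
import pandas as pd

# Load the Inside Airbnb listings export (file name assumed).
df = pd.read_csv("listings.csv")

# Separate the target, then drop it and the low-value columns
# from the feature matrix.
y = df["days_booked"]
X = df.drop(columns=["days_booked", "id", "calendar_updated"])
```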
Histogram analysis highlights distribution patterns, identifying features requiring transformation:
- Features like `beds`, `bedrooms`, and `bathrooms` have low cardinality and are treated as discrete.
- Maximum and minimum nights features show clustering and outliers, warranting clamping to 30 and 365 days.
- Features like `price`, `number_of_reviews`, and `calculated_host_listings_count` exhibit right-skewed distributions, benefiting from log transformation (see the sketch after this list).
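A minimal sketch of the clamping and log transforms, assuming the listed column names, that `price` has already been parsed to a numeric type, and that the 30-day bound applies to minimum nights and the 365-day bound to maximum nights:

```python
import numpy as np
import pandas as pd

def transform_numericals(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy()
    # Clamp stay-length outliers; the mapping of 30 to minimum nights
    # and 365 to maximum nights is an assumption from the text.
    X["minimum_nights"] = X["minimum_nights"].clip(upper=30)
    X["maximum_nights"] = X["maximum_nights"].clip(upper=365)
    # log1p handles zero counts safely for right-skewed features.
    for col in ["price", "number_of_reviews", "calculated_host_listings_count"]:
        X[col] = np.log1p(X[col])
    return X
```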
- Hand-picked categorical features include `property_type`, `room_type`, `neighbourhood_cleansed`, `instant_bookable`, `host_is_superhost`, `host_identity_verified`, `host_has_profile_pic`, `host_response_time`, and `host_verifications`, chosen for their non-textual nature and manageable complexity (unlike `amenities`).
- Informative missing values in `neighborhood_overview`, `last_review`, and `host_about` are tagged to indicate lower trustworthiness or popularity (a sketch of this tagging follows the list).
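A minimal sketch of the missing-value tagging, assuming the three column names above:

```python
import pandas as pd

def add_missing_flags(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy()
    for col in ["neighborhood_overview", "last_review", "host_about"]:
        # Absence itself is informative: a host who never wrote a bio or a
        # listing that was never reviewed may signal lower trust/popularity.
        X[f"{col}_missing"] = X[col].isna().astype(int)
    return X
```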
The identified transformations are assembled into a ColumnTransformer to create a consistent preprocessing pipeline. This pipeline is used to build a baseline model, followed by regularization techniques (Ridge and Elastic Net) to reduce overfitting and improve test set performance. Hyperparameter tuning with GridSearchCV optimizes model parameters.
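A minimal sketch of such a pipeline; the column groupings and steps are assumptions based on the transformations described above, not the notebook's literal code:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column groups; the notebook uses many more.
numeric_cols = ["price", "number_of_reviews", "availability_30"]
categorical_cols = ["room_type", "property_type", "instant_bookable"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
```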
- A baseline model is established using the initial preprocessing pipeline to provide a reference point for subsequent improvements. However, the large gap between train and test R² scores indicates overfitting.
- Ridge and Elastic Net are applied with hyperparameter tuning to balance model complexity and performance, reducing overfitting. Elastic Net stands out with the highest test R² score.
- Permutation importance is calculated to evaluate the contribution of each feature to the model's predictions (a combined sketch of the tuning and importance steps follows this list).
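A minimal combined sketch of the tuning and permutation-importance steps, reusing the `preprocessor` from the sketch above; the grids are illustrative, and `X_train`/`X_test`/`y_train`/`y_test` are assumed from the Data Split step:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

enet = Pipeline([("prep", preprocessor), ("model", ElasticNet(max_iter=10_000))])
enet_search = GridSearchCV(
    enet,
    param_grid={
        "model__alpha": np.logspace(-3, 1, 10),
        "model__l1_ratio": [0.1, 0.5, 0.9],
    },
    scoring="r2",
    cv=5,
)
enet_search.fit(X_train, y_train)

# Permutation importance on the held-out set: larger mean drops in R²
# indicate more influential features.
result = permutation_importance(
    enet_search.best_estimator_, X_test, y_test,
    n_repeats=10, random_state=42, scoring="r2",
)
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(X_test.columns[idx], round(result.importances_mean[idx], 5))
```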
- Discretize Nights and Availability: To introduce nonlinearity, additional discretization is applied to features like `minimum_nights`, `availability_30`, and others, based on their importance scores.
- Add Interaction Terms for Latitude and Longitude: Residual analysis of latitude and longitude suggests potential nonlinear relationships, prompting the creation of interaction terms.
- Add Interaction Terms for Price and Bedrooms: Similar analysis for price and bedrooms indicates nonlinear interactions, leading to additional feature engineering (a sketch of these steps follows this list).
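A minimal sketch of these steps; the bin count is an illustrative assumption, while the feature names and interaction pairs follow the text:

```python
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Discretize to let a linear model pick up nonlinear effects.
discretizer = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
nights_bins = discretizer.fit_transform(X_train[["minimum_nights", "availability_30"]])

# Interaction terms (latitude x longitude, price x bedrooms).
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
geo_terms = inter.fit_transform(X_train[["latitude", "longitude"]])
price_terms = inter.fit_transform(X_train[["price", "bedrooms"]])
```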
A Random Forest model is explored for its ability to handle nonlinear relationships, its robustness to outliers, and its built-in feature importance. Hyperparameter tuning is conducted to optimize performance.
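A minimal sketch of the Random Forest search, reusing the `preprocessor` and data split assumed earlier; the grid values are illustrative, except `max_depth: 10` and `max_features: 0.1`, which appear in the results below:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rf = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestRegressor(random_state=42)),
])
rf_search = GridSearchCV(
    rf,
    param_grid={
        "model__n_estimators": [200, 500],
        "model__max_depth": [5, 10, 20],
        "model__max_features": [0.1, 0.3, "sqrt"],
    },
    scoring="r2",
    cv=5,
    n_jobs=-1,
)
rf_search.fit(X_train, y_train)
```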
The project evaluated multiple models, with performance metrics (R² scores) provided below for train and test sets:
| Split | Linear Regression | Ridge | Elastic Net | Elastic Net 2 | Random Forest |
| --- | --- | --- | --- | --- | --- |
| Train | 0.284392 | 0.246037 | 0.241600 | 0.251292 | 0.580813 |
| Test | 0.181947 | 0.223522 | 0.225277 | 0.223712 | 0.254365 |
Additionally, cross-validation results:

| Model | Mean Test Score | Std Test Score |
| --- | --- | --- |
| Linear Regression | 0.216175 | 0.010494 |
| Ridge | 0.216902 | 0.010908 |
| Elastic Net | 0.219403 | 0.01124 |
| Random Forest | 0.229249 | 0.01124 |

Best model-specific hyperparameters include Ridge `alpha`: 417.53189365604004 and Random Forest `max_depth`: 10, `max_features`: 0.1.
The Random Forest model outperformed others on the training set (0.580813), though its test performance (0.254365) indicates some overfitting. Its cross-validation mean score (0.229249) suggests it generalizes better than linear models, making it the preferred choice.
Most Important Features:

| Category | Feature | Importance |
| --- | --- | --- |
| Guest Engagement | Number of Reviews | 0.017971 |
| Guest Engagement | Number of Reviews LTM | 0.017780 |
| Availability | Availability_60 | 0.013940 |
| Availability | Availability_90 | 0.011775 |
| Availability | Availability_30 | 0.008913 |
| Guest Engagement | Reviews per Month | 0.008624 |
| Stay Rules | Minimum Nights | 0.005177 |
| Host Metrics | Calculated Host Listings Count (Entire Homes) | 0.004620 |
| Guest Satisfaction | Review Scores Value | 0.004369 |
| Availability | Availability_365 | 0.004075 |
| Guest Satisfaction | Review Scores Communication | 0.004058 |
| Guest Satisfaction | Review Scores Check-in | 0.003777 |
| Guest Satisfaction | Review Scores Location | 0.003672 |
| Pricing | Price | 0.003523 |
| Guest Satisfaction | Review Scores Cleanliness | 0.003362 |
| Guest Engagement | Number of Reviews L30D | 0.003031 |
| Guest Satisfaction | Review Scores Rating | 0.002993 |

Least Important Features:

| Category | Feature | Importance |
| --- | --- | --- |
| Host Metrics | Calculated Host Listings Count | -0.000011 |
| Stay Rules | Maximum Nights | -0.000040 |
| Host Metrics | Host Listings Count | -0.000062 |
| Guest Satisfaction | Review Scores Accuracy | -0.000116 |
| Host Profile | Host Has Profile Pic | -0.000152 |
| Host Profile | Host Identity Verified | -0.000220 |
| Host Profile | Host About | -0.000394 |
| Host Responsiveness | Host Response Rate | -0.000592 |
| Booking Logistics | Instant Bookable | -0.001187 |
| Host Responsiveness | Host Response Time | -0.002024 |
| Location | Neighbourhood Cleansed | -0.002765 |
- Reviews and Activity Rule: Listings with high and recent review counts signal trust and demand, driving bookings.
- Availability is a Lever: Short-term availability (30-90 days) is critical, reflecting LA’s spontaneous travel trends.
- Quality Supports, Doesn’t Lead: Review scores (value, communication, check-in) matter but are secondary to engagement and availability.
- Price is Contextual: It influences decisions but is secondary to perceived value.
- Host Profile and Logistics Fade: Features like host profile details or instant booking have minimal impact.
- Maximize Review Volume and Recency: Prompt guests for reviews post-stay and highlight recent feedback.
- Keep Short-Term Availability High: Open calendars for 30-90 days, adjusting dynamically.
- Optimize Minimum Nights: Set to 1-2 nights to attract more guests.
- Prioritize Value and Operations: Price competitively, streamline check-ins, and enhance communication.
- Highlight Entire-Home Strengths: Market privacy and unique features of entire homes.
- Don’t Over-Focus on Profile or Instant Booking: Focus on listing quality over minor host traits.
- Boost Review Engagement: Automate review requests and gamify guest participation.
- Refine Availability Tools: Provide dashboards to optimize 30-90 day availability.
- Promote Value-Driven Pricing: Enhance Smart Pricing with review score integration.
- Simplify Guest Experience Metrics: Focus host education on communication and value.
- Rethink Instant Booking Push: Offer more control over instant booking options.
- Downplay Neighborhood Granularity: Emphasize broader LA appeal in marketing.
While the Random Forest model provided the best performance, there are several areas for improvement to enhance predictive accuracy and generalization:
- More Hyperparameter Tuning for Random Forest: The Random Forest model shows signs of overfitting (train R²: 0.580813, test R²: 0.254365). Expand hyperparameter tuning using GridSearchCV to explore a wider range of parameters, such as increasing `min_samples_split`, reducing `max_depth`, or adjusting `max_features`, to improve generalization and reduce the train-test performance gap. A sketch of such a grid follows.
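  A minimal sketch of a regularization-oriented grid, reusing the `rf` pipeline and data split from the earlier sketches; the specific values are illustrative assumptions:

  ```python
  from sklearn.model_selection import GridSearchCV

  expanded_grid = {
      "model__max_depth": [4, 6, 8, 10],          # shallower trees generalize better
      "model__min_samples_split": [10, 20, 50],   # require more samples per split
      "model__min_samples_leaf": [5, 10, 20],     # larger leaves smooth predictions
      "model__max_features": [0.05, 0.1, 0.2],    # decorrelate trees further
  }
  rf_search_expanded = GridSearchCV(rf, expanded_grid, scoring="r2", cv=5, n_jobs=-1)
  rf_search_expanded.fit(X_train, y_train)
  ```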
- Explore Other Non-Linear Models: Test additional non-linear models such as XGBoost and LightGBM, which often handle complex datasets well and may outperform Random Forest. Compare their performance using the same cross-validation framework to see whether they offer better generalization or higher test R² scores. A sketch follows.
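  A minimal sketch of slotting a gradient-boosted model into the same pipeline and cross-validation framework; the hyperparameters are illustrative assumptions:

  ```python
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import Pipeline
  from xgboost import XGBRegressor

  xgb = Pipeline([
      ("prep", preprocessor),
      ("model", XGBRegressor(
          n_estimators=500,
          learning_rate=0.05,
          max_depth=6,
          subsample=0.8,
          random_state=42,
      )),
  ])
  scores = cross_val_score(xgb, X_train, y_train, scoring="r2", cv=5)
  print(f"mean R2: {scores.mean():.4f} +/- {scores.std():.4f}")
  ```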
- Enhanced Feature Engineering with Amenities: The `amenities` column contains complex textual data (e.g., lists of features like "Wi-Fi," "parking," "pool"). Process this column by extracting key amenities through text parsing or NLP techniques (e.g., tokenization, one-hot encoding of common amenities). This could uncover valuable predictors, especially for non-linear models like Random Forest or XGBoost, which can capture intricate patterns. A parsing sketch follows.
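  A minimal sketch of one-hot encoding amenities, assuming the column holds JSON-style string lists as in Inside Airbnb exports (an assumption worth verifying against the actual data):

  ```python
  import json
  import pandas as pd
  from sklearn.preprocessing import MultiLabelBinarizer

  # Parse the JSON-style string lists into Python lists.
  amenity_lists = X_train["amenities"].apply(json.loads)

  mlb = MultiLabelBinarizer()
  encoded = pd.DataFrame(
      mlb.fit_transform(amenity_lists),
      columns=[f"amenity_{a}" for a in mlb.classes_],
      index=X_train.index,
  )
  # Keep only common amenities to limit dimensionality.
  common = encoded.columns[encoded.mean() > 0.05]
  X_train = X_train.join(encoded[common])
  ```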
- Incorporate NLP on Text Columns: The current model only uses numerical and categorical features, ignoring text columns such as listing descriptions or host bios. Apply NLP techniques (e.g., TF-IDF, word embeddings) to these columns to extract sentiment, keywords, or thematic elements (e.g., "luxury," "cozy"). This could improve model performance by capturing qualitative factors that influence bookings. A TF-IDF sketch follows.
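  A minimal sketch of extracting TF-IDF features from the listing description; the column name `description` and the feature cap are assumptions:

  ```python
  from sklearn.feature_extraction.text import TfidfVectorizer

  # Vectorize the listing description into a sparse TF-IDF matrix.
  tfidf = TfidfVectorizer(max_features=200, stop_words="english")
  desc_matrix = tfidf.fit_transform(X_train["description"].fillna(""))
  # desc_matrix can be hstacked with the numeric features, or the vectorizer
  # can be wired into the ColumnTransformer as another branch.
  ```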
This project successfully predicts Airbnb booking days in LA, with the Random Forest model offering the best performance after extensive feature engineering and hyperparameter tuning. The insights highlight the importance of reviews, availability, and guest experience, providing a roadmap for hosts and Airbnb to optimize bookings. Future improvements, such as deeper hyperparameter tuning, testing additional models, and leveraging textual data, could further enhance predictive power.