The purpose of this project is to predict hotel cancellations and ADR (average daily rate) values for two separate Portuguese hotels (H1 and H2). Included in the GitHub repository are the datasets and notebooks for all models run.
The original datasets and research by Antonio et al. can be found here: Hotel Booking Demand Datasets (2019). All other relevant references are cited in the articles below.
Average daily rate (ADR) is the average rate per day paid by a staying customer at a hotel. This is an important metric for a hotel, as it serves as a proxy for the profitability of each customer. In this example, auto_arima is used in Python to forecast the average daily rate over time for a hotel chain.
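
A forecasting model such as auto_arima expects a regularly spaced series, so booking-level records first have to be aggregated. The sketch below shows one way this might be done with pandas. The column names (`year`, `week`, `adr`, `is_canceled`) and the toy values are assumptions for illustration, not the schema or data of the original datasets.

```python
import pandas as pd

# Toy booking-level records (hypothetical columns and values).
bookings = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015, 2015],
    "week": [27, 27, 28, 28, 28],
    "adr": [100.0, 80.0, 120.0, 90.0, 90.0],
    "is_canceled": [0, 1, 0, 0, 1],
})

# Collapse individual bookings into one row per (year, week):
# the mean ADR and the number of cancellations in that week.
weekly = (
    bookings.groupby(["year", "week"], as_index=False)
    .agg(mean_adr=("adr", "mean"), cancellations=("is_canceled", "sum"))
)

print(weekly)
```

The resulting `mean_adr` column is the kind of weekly series that could then be passed to a time series model.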
Handling Imbalanced Classification Data: Predicting Hotel Cancellations Using Support Vector Machines
When building a classification algorithm, one must often contend with an unbalanced dataset. An unbalanced dataset is one where the classes have unequal sample sizes, which can bias the classifier's predictions toward the majority class. This example illustrates the use of a Support Vector Machine to classify hotel booking customers in terms of cancellation risk.
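
One common way to counteract class imbalance is to weight each class inversely to its frequency, so that errors on the minority class cost more during training. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies when `class_weight="balanced"` is set. Whether the original notebooks used this exact setting is an assumption; the function name and the toy labels are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 8 non-cancellations (0) and 2 cancellations (1).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, the rarer cancellation class receives a larger weight than the majority class, which pushes the classifier to trade some overall accuracy for better recall on cancellations.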
When it comes to hotel bookings, average daily rate (ADR) is a particularly important metric. It reflects the average rate per day that a particular customer pays throughout their stay. In this example, a neural network is built in Keras to solve a regression problem, i.e. one where the dependent variable (y) is in interval format and the aim is to predict the value of y as accurately as possible.
- Used pandas to collate over 115,000 individual cancellation and ADR entries into a weekly time series format.
- Identified lead time, country of origin, market segment, deposit type, customer type, required car parking spaces, and week of arrival as the most important features in explaining the variation in hotel cancellations.
- Trained classification models on the H1 dataset and tested against the H2 dataset. Used boto3 and botocore to import data from an AWS S3 bucket to SageMaker.
- Used the Explainable Boosting Classifier by InterpretML, KNN, Naive Bayes, Support Vector Machines, and XGBoost to predict cancellations across the test set.
- SVM demonstrated the best performance overall with an f1-score accuracy of 71%, and 66% recall across the cancellation class.
- An ANN model was also trained in conjunction with dice_ml to identify Diverse Counterfactual Explanations for hotel bookings, i.e. changes in feature parameters that would cause a non-canceling customer to cancel, and vice versa.
- Used regression modelling to predict ADR (average daily rate) for each customer.
- Trained regression models on the H1 dataset and tested against the H2 dataset.
- Regression-based neural network with elu activation function showed the best performance, with a mean absolute error of 29 compared to the mean ADR of 105 across the test set.
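
The evaluation figures quoted in the list above (recall, F1-score, mean absolute error) can be computed as in the following minimal sketch. The function names and the toy inputs are illustrative assumptions, not the project's data or code. For scale, a mean absolute error of 29 against a mean ADR of 105 corresponds to roughly 28% of the mean.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1-score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy classification example: 1 = canceled, 0 = not canceled.
recall, f1 = recall_and_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])

# Toy regression example: actual vs. predicted ADR values.
actual = [100.0, 120.0, 80.0]
predicted = [110.0, 105.0, 85.0]
mae = mean_absolute_error(actual, predicted)
mae_share_of_mean = mae / (sum(actual) / len(actual))

print(recall, f1, mae, mae_share_of_mean)
```

Reporting the error as a share of the mean makes results easier to compare across hotels whose typical rates differ.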