Introduction

This analysis aims to find the best-performing predictive model for Airbnb rental prices in New York, using listings from April and October 2023. The dataset covers nightly rates together with features expected to affect them, including guest capacity, bedroom and bathroom counts, bed quantities, minimum and maximum nights for rental stays, and the number of reviews left by previous guests. We took a regression-based approach with three methods: Ordinary Least Squares (OLS), LASSO, and Random Forests. Within each method, three models were built and cross-validated to gauge their predictive performance.

Cleaning the Dataset and Data Preparation

Before the analysis, the dataset went through a data cleaning and preparation phase to ensure its accuracy and consistency.

One significant step was the handling of the 'property_type' feature. The column was transformed to group property types into three categories: 'Private room', 'House', and 'Shared Room'. Entries were reassigned based on specific keywords found within them, and the resulting counts showed the distribution of property types in the dataset. Following categorization, the analysis focused solely on the 'House' category, which encompasses apartment-type structures, and discarded the other property types to keep the study specific.

We also explored the 'amenities' feature to understand which amenities are prevalent in Airbnb rentals. This involved examining a sample of the amenities, parsing the 'amenities' column, and building a table of the top 10 amenities most frequently offered in New York City rentals. To make the 'amenities' data usable for modelling, we converted it into a binary indicator matrix recording the presence or absence of each amenity across the dataset. Together, the pre-processing of the 'property_type' and 'amenities' columns laid the groundwork for the analysis of rental prices.

Columns deemed unnecessary for the analysis were removed. These included 'host_verifications', 'latitude', 'longitude', and other non-essential features.
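The property-type grouping and the amenity indicator matrix described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the keyword rules in `categorize_property`, the assumption that 'amenities' is stored as a JSON list string, and the sample rows are all hypothetical. The `d_` prefix follows the naming used for amenity dummies later in this README.

```python
import json

import pandas as pd

# Toy listings (hypothetical); the notebooks read the listings CSV instead.
listings = pd.DataFrame({
    "property_type": [
        "Entire rental unit",
        "Private room in home",
        "Shared room in hostel",
        "Entire condo",
    ],
    "amenities": [
        '["Wifi", "Kitchen"]',
        '["Wifi"]',
        '["Heating", "Wifi"]',
        '["Kitchen", "Heating", "Wifi"]',
    ],
})


def categorize_property(value: str) -> str:
    """Assign a coarse category from keywords (assumed rules)."""
    text = value.lower()
    if "shared room" in text:
        return "Shared Room"
    if "private room" in text:
        return "Private room"
    return "House"


listings["property_type"] = listings["property_type"].map(categorize_property)
property_counts = listings["property_type"].value_counts()

# Keep only the 'House' category for the rest of the analysis.
houses = listings[listings["property_type"] == "House"].copy()

# Parse each amenities string into a Python list (assumes JSON formatting).
amenity_lists = listings["amenities"].map(json.loads)

# Frequency table of the most common amenities.
top_amenities = amenity_lists.explode().value_counts().head(10)

# One row per listing, one 0/1 column per amenity.
amenity_matrix = (
    pd.get_dummies(amenity_lists.explode())
    .groupby(level=0)
    .max()
    .astype(int)
    .add_prefix("d_")
)
```

Exploding the lists before calling `pd.get_dummies` keeps the code short; grouping by the original index then collapses the one-row-per-amenity frame back to one row per listing.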
The data was then filtered to retain only observations meeting specific criteria: the number of guests accommodated ('accommodates') had to be between 2 and 6, and the property type had to be 'House'. Binary categorical columns such as 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', and 'has_availability' were converted to Boolean values for ease of analysis.

Several new numeric columns were created from existing ones. For instance, 'n_bathroom' was derived from the 'bathrooms_text' column, and 'host_acceptance_rate' and 'host_response_rate' were converted to float values after removing the percentage sign (%). Missing values in columns such as the review scores, bedrooms, and acceptance rates were imputed with the column mean to keep the dataset complete.

A function was implemented to merge amenities containing specific keywords into broader categories, giving a more meaningful representation of amenities across listings. Dummy variables were then created for the merged amenities, again as a binary indicator matrix of presence or absence, and the 150 most frequent amenities were selected for further analysis.
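The filtering, type conversion, and imputation steps above might look like the sketch below. It runs on toy data and makes several assumptions that are not confirmed by this README: Boolean columns are encoded as 't'/'f' strings, missing Boolean values are treated as False, and the bathroom count is the first number appearing in 'bathrooms_text'.

```python
import pandas as pd

# Toy raw data (hypothetical values).
raw = pd.DataFrame({
    "accommodates": [2, 4, 8, 3],
    "host_is_superhost": ["t", "f", "t", None],
    "bathrooms_text": ["1 bath", "1.5 shared baths", "2 baths", "1 private bath"],
    "host_acceptance_rate": ["95%", None, "80%", "100%"],
    "bedrooms": [1.0, None, 3.0, 2.0],
})

# Keep listings that accommodate between 2 and 6 guests (inclusive).
data = raw[raw["accommodates"].between(2, 6)].copy()

# Convert 't'/'f' flags to Booleans; missing values become False here.
data["host_is_superhost"] = data["host_is_superhost"].eq("t")

# Extract the first number in the bathrooms description as a float.
data["n_bathroom"] = (
    data["bathrooms_text"]
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

# Strip the percent sign and convert the rate to a float.
data["host_acceptance_rate"] = pd.to_numeric(
    data["host_acceptance_rate"].str.rstrip("%")
)

# Impute remaining missing values with the column mean.
for col in ["host_acceptance_rate", "bedrooms"]:
    data[col] = data[col].fillna(data[col].mean())
```

Note that the means are computed after filtering, so the imputed values reflect only the listings that remain in the analysis sample.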

Defining Predictor Variables:

Three sets of predictor variables were constructed to be used in the model-building process:

• Basic Variables: This set includes fundamental features such as accommodation details ('n_accommodates', 'n_bedrooms', 'n_bathroom', 'n_beds'), host-related attributes ('host_is_superhost', 'host_has_profile_pic', 'n_host_acceptance_rate', 'n_host_response_rate'), and other essential variables related to the property and availability ('n_availability_365', 'n_minimum_nights', 'n_maximum_nights', 'f_neighbourhood_group_cleansed').
• Reviews: Comprising features related to reviews and ratings, this set includes various review-related metrics ('n_review_scores_value', 'n_review_scores_location', 'n_review_scores_communication', 'n_review_scores_checkin', 'n_review_scores_cleanliness', 'n_reviews_per_month') alongside flags indicating missing values in review scores.
• Dummy Variables and Interactions: This set includes the dummy variables representing amenities (prefixed 'd_') and interactions between specific features. Interactions are created to capture potential combined effects between different variables, enhancing the predictive power of the model.

The dataset ('data_apr') was divided into training and holdout sets using a train-test split, allocating 80% of the data to the training set ('data_train') and 20% to the holdout set ('data_holdout'). This partitioning strategy ensures the model is trained on a subset of the data and evaluated on unseen data to assess its generalizability.
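The 80/20 split and the cross-validated comparison of the three methods could be sketched as below with scikit-learn. The data is synthetic, and several choices are assumptions made only for illustration: the target column name 'price', the reduced `basic_vars` list, 5-fold cross-validation, RMSE as the score, and all hyperparameters.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for data_apr (hypothetical values and target name).
rng = np.random.default_rng(0)
n = 200
data_apr = pd.DataFrame({
    "n_accommodates": rng.integers(2, 7, n),
    "n_bedrooms": rng.integers(1, 4, n),
    "n_bathroom": rng.integers(1, 3, n),
})
data_apr["price"] = (
    40 * data_apr["n_accommodates"]
    + 25 * data_apr["n_bedrooms"]
    + rng.normal(0, 10, n)
)

# 80% training set, 20% holdout set.
data_train, data_holdout = train_test_split(
    data_apr, test_size=0.2, random_state=42
)

basic_vars = ["n_accommodates", "n_bedrooms", "n_bathroom"]
X_train, y_train = data_train[basic_vars], data_train["price"]

models = {
    "OLS": LinearRegression(),
    "LASSO": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Cross-validated RMSE on the training set only; the holdout set stays unseen.
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
```

Keeping the holdout set out of cross-validation means the final comparison on 'data_holdout' remains an honest estimate of performance on unseen listings.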
