Jacopo21/AirbPrice_ML

Introduction

This analysis aims to find the best-performing predictive model for Airbnb rental prices in New York for April and October 2023. The dataset contains the nightly rate along with features that affect pricing: guest capacity, bedroom and bathroom counts, number of beds, minimum and maximum nights per stay, and the number of reviews left by previous guests. We took a regression-based approach with three methods: Ordinary Least Squares (OLS), LASSO, and Random Forests. For each method, three models were built and cross-validated to gauge their predictive performance.
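As a rough illustration of the comparison described above, the sketch below cross-validates the three methods with scikit-learn. It is not the repository's code: the data are synthetic, only one feature set is used per method, and the fold count, the LASSO penalty (`alpha`), and the forest size are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the listings data (NOT the real Airbnb dataset).
rng = np.random.default_rng(42)
n_rows = 300
X = rng.normal(size=(n_rows, 5))
y = 100 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=n_rows)

# One estimator per method; hyperparameters are illustrative assumptions.
models = {
    "OLS": LinearRegression(),
    "LASSO": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validation (the fold count is an assumption).
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    # scikit-learn reports negated RMSE, so flip the sign back.
    cv_rmse[name] = -scores.mean()

for name, rmse in cv_rmse.items():
    print(f"{name}: cross-validated RMSE = {rmse:.2f}")
```

The model with the lowest cross-validated RMSE would be the candidate to evaluate on held-out data.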

Cleaning the Dataset and Data Preparation

Before the analysis, we cleaned and prepared the data to ensure the dataset was accurate and consistent.

One significant step was the handling of the 'property_type' feature. The column was transformed to group property types into three categories: 'Private room', 'House', and 'Shared Room'. Entries were reassigned based on specific keywords found in their text, and the resulting counts showed how the property types are distributed in the dataset. After categorization, the analysis kept only the 'House' category, which includes apartment-type structures, and discarded the other property types to keep the study specific.

We also explored the 'amenities' feature to understand which amenities are prevalent in Airbnb rentals. We examined a sample of the amenities, parsed the 'amenities' column to build a table of the top 10 amenities most frequently available in New York City rentals, and converted the column into a binary indicator matrix that records the presence or absence of each amenity for every listing. These pre-processing and exploration steps laid the groundwork for a more refined analysis of Airbnb rental prices in New York.

Finally, columns deemed unnecessary for the analysis were removed. These included 'host_verifications', 'latitude', 'longitude', and other non-essential features.
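The keyword-based regrouping of 'property_type' could look roughly like the pandas sketch below. The raw labels, the keyword rules, and the helper name `categorize_property_type` are assumptions made for illustration; the repository's actual rules may differ.

```python
import pandas as pd

# Toy listings; the raw labels are illustrative, not taken from the real data.
listings = pd.DataFrame({
    "property_type": [
        "Entire rental unit",
        "Private room in home",
        "Shared room in hostel",
        "Entire condo",
        "Private room in rental unit",
    ]
})


def categorize_property_type(raw_label: str) -> str:
    """Map a raw label to one of three groups using assumed keyword rules."""
    label = raw_label.lower()
    if "shared" in label:
        return "Shared Room"
    if "private room" in label:
        return "Private room"
    # Everything else (entire units, condos, apartments) is treated as 'House'.
    return "House"


listings["property_type"] = listings["property_type"].map(categorize_property_type)

# Distribution of the three categories.
category_counts = listings["property_type"].value_counts()
print(category_counts)

# Keep only the 'House' category for the rest of the analysis.
houses = listings[listings["property_type"] == "House"].copy()
```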
The data were then filtered to retain only observations meeting specific criteria: the number of guests accommodated ('accommodates') had to be between 2 and 6, and the property type had to be 'House'.

Binary categorical columns such as 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', and 'has_availability' were converted to Boolean values. Several new numeric columns were derived from existing ones: for instance, 'n_bathroom' was extracted from the 'bathrooms_text' column, and 'host_acceptance_rate' and 'host_response_rate' were converted to floats after removing the percentage sign (%).

Missing values in selected columns, including the review scores, bedrooms, and acceptance rates, were imputed with the column mean to keep the dataset complete.

A function was implemented to merge amenities containing specific keywords into broader categories, giving a more meaningful representation of amenities across listings. Dummy variables were then created for the amenities, producing a binary indicator matrix of their presence or absence, and the 150 most frequent amenities were selected for further analysis.
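A minimal pandas sketch of these preparation steps follows. It assumes the amenities are already parsed into Python lists and that binary columns are stored as 't'/'f' strings; the keyword map `KEYWORD_GROUPS`, the helper `merge_amenity`, and the tiny `TOP_N` are hypothetical stand-ins (the analysis itself kept the top 150 amenities).

```python
import pandas as pd

# Toy data; values and amenity names are illustrative only.
listings = pd.DataFrame({
    "host_is_superhost": ["t", "f", "t", "f"],
    "host_response_rate": ["100%", "90%", None, "80%"],
    "bathrooms_text": ["1 bath", "1.5 baths", "2 baths", "1 shared bath"],
    "bedrooms": [1.0, None, 3.0, 2.0],
    "amenities": [
        ["Wifi", "Kitchen"],
        ["Fast wifi", "Heating"],
        ["Kitchen", "Heating", "Wifi"],
        ["Wifi", "Kitchen"],
    ],
})

# 't'/'f' strings -> Booleans.
listings["host_is_superhost"] = listings["host_is_superhost"].map(
    {"t": True, "f": False}
)

# '90%' -> 90.0 (missing values stay missing for now).
listings["host_response_rate"] = pd.to_numeric(
    listings["host_response_rate"].str.rstrip("%")
)

# '1.5 baths' -> 1.5: take the first number found in the text.
listings["n_bathroom"] = (
    listings["bathrooms_text"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# Mean imputation for columns with gaps.
for column in ["host_response_rate", "bedrooms"]:
    listings[column] = listings[column].fillna(listings[column].mean())

# Hypothetical keyword -> broader category map for merging amenities.
KEYWORD_GROUPS = {"wifi": "Wifi"}


def merge_amenity(name: str) -> str:
    """Return the broader category if a keyword matches, else the name itself."""
    lowered = name.lower()
    for keyword, group in KEYWORD_GROUPS.items():
        if keyword in lowered:
            return group
    return name


merged = listings["amenities"].apply(
    lambda items: sorted({merge_amenity(item) for item in items})
)

# One row per (listing, amenity), one-hot encode, then collapse back per listing.
dummies = (
    pd.get_dummies(merged.explode(), prefix="d").groupby(level=0).max().astype(int)
)

# Keep only the most frequent amenity dummies.
TOP_N = 2
top_columns = dummies.sum().nlargest(TOP_N).index
listings = listings.join(dummies[top_columns])
print(listings.drop(columns="amenities"))
```

Merging before encoding means variants such as "Fast wifi" and "Wifi" count toward the same dummy column, which is why the merge step comes first.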

Defining Predictor Variables

Three sets of predictor variables were constructed for the model-building process:

• Basic Variables: fundamental features such as accommodation details ('n_accommodates', 'n_bedrooms', 'n_bathroom', 'n_beds'), host-related attributes ('host_is_superhost', 'host_has_profile_pic', 'n_host_acceptance_rate', 'n_host_response_rate'), and other essential variables related to the property and its availability ('n_availability_365', 'n_minimum_nights', 'n_maximum_nights', 'f_neighbourhood_group_cleansed').
• Reviews: review and rating metrics ('n_review_scores_value', 'n_review_scores_location', 'n_review_scores_communication', 'n_review_scores_checkin', 'n_review_scores_cleanliness', 'n_reviews_per_month'), alongside flags indicating missing values in the review scores.
• Dummy Variables and Interactions: dummy variables representing amenities (prefixed 'd_') and interactions between specific features. The interactions are meant to capture combined effects between variables that the individual terms would miss.

The dataset ('data_apr') was divided into training and holdout sets using a train-test split, with 80% of the data allocated to the training set ('data_train') and 20% to the holdout set ('data_holdout'). The model is therefore trained on one subset of the data and evaluated on unseen data to assess how well it generalizes.
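The predictor sets and the 80/20 split could be sketched as follows, using scikit-learn's `train_test_split`. The toy values, the nesting of the three sets into `model_1` to `model_3`, the interaction column name, and the random seed are assumptions for illustration; only the column prefixes and the names 'data_apr', 'data_train', and 'data_holdout' come from the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for 'data_apr'; values are illustrative only.
data_apr = pd.DataFrame({
    "price": [120, 95, 210, 150, 80, 175, 130, 99, 160, 140],
    "n_accommodates": [2, 2, 6, 4, 2, 5, 3, 2, 4, 4],
    "n_bedrooms": [1, 1, 3, 2, 1, 2, 1, 1, 2, 2],
    "n_review_scores_value": [4.8, 4.5, 4.9, 4.7, 4.2, 4.6, 4.9, 4.4, 4.7, 4.8],
    "d_Wifi": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "d_Kitchen": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})

# A hypothetical interaction term: guest capacity combined with an amenity dummy.
data_apr["n_accommodates_x_d_Wifi"] = data_apr["n_accommodates"] * data_apr["d_Wifi"]

basic_vars = ["n_accommodates", "n_bedrooms"]
reviews = ["n_review_scores_value"]
# Amenity dummies are picked up by their 'd_' prefix.
amenities = [column for column in data_apr.columns if column.startswith("d_")]
interactions = ["n_accommodates_x_d_Wifi"]

# Assumed nesting: each model adds one more group of predictors.
predictor_sets = {
    "model_1": basic_vars,
    "model_2": basic_vars + reviews,
    "model_3": basic_vars + reviews + amenities + interactions,
}

# 80% training, 20% holdout; the seed makes the split reproducible.
data_train, data_holdout = train_test_split(
    data_apr, test_size=0.2, random_state=42
)

print(len(data_train), "training rows,", len(data_holdout), "holdout rows")
```

Cross-validation would then run on `data_train` only, leaving `data_holdout` untouched until the final evaluation.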

About

Second Assignment Prediction
