Milanpeter-77/Coursework-Vienna-Airbnb

Coursework - Airbnb popularity in Vienna

Introduction

As part of the Data Analysis in Scientific Practice course, I was required to conduct a research project using a dataset aligned with my interests. For this paper, I built a regression model and applied descriptive statistical tools based on methods learned in the course. I used Stata, the statistical software package introduced in the class. The dataset I chose lists Airbnb rentals located in Vienna.

About the Dataset

Airbnb Inc., an American company, offers an online marketplace for accommodations, primarily vacation rentals and tourism activities. The company doesn’t own any of these properties but profits by collecting a commission on each booking. Accommodation information on Airbnb's website is accessible to everyone, as customers rely on it to make booking decisions. However, Airbnb itself does not publish this information in aggregated form. That task has been taken on by Inside Airbnb, a watchdog website launched by Murray Cox in 2016, focusing mainly on properties that may operate illegally. Inside Airbnb has gathered data from dozens of cities worldwide over the past few years.

Given the extensive volume of data collected by Inside Airbnb, I narrowed my focus to a single city and chose Vienna because it is close by and a city I personally like. Here’s an overview of the key fields in the dataset: id (Airbnb’s unique ID for the listing), name (listing name), host_name (host’s first name), host_since (host’s registration date), host_response_time (host’s typical response time), host_acceptance_rate (host’s booking acceptance rate), host_is_superhost (whether the host is a "superhost"), host_has_profile_pic (whether the host has a profile picture), host_identity_verified (whether the host’s identity is verified), neighbourhood_cleansed (the neighborhood where the listing is located), property_type (type of property), room_type (room type), accommodates (capacity), bathrooms (number of bathrooms), bedrooms (number of bedrooms), beds (number of beds), price (nightly rate in euros), minimum_nights and maximum_nights (minimum and maximum number of nights per booking), number_of_reviews (review count), and various review scores (review_scores_rating, review_scores_accuracy, etc.) for aspects like cleanliness, check-in, and communication, each scored on a scale from 1 to 5.

Research Question

While analyzing the factors influencing each rental’s price would be an obvious choice, I am more interested in what drives a rental’s popularity. To define popularity, I will look at the number of reviews and the ratings for various aspects associated with each listing and host. Not all variables in the dataset are relevant for this purpose, so I will focus on the ones that provide meaningful insights. My goal is to create a model that effectively explains what makes a listing popular.

Descriptive Statistics

The dataset contains both qualitative and quantitative variables, which include nominal, ordinal, and ratio scales. Some variables, such as id, name, and host_name, serve only to distinguish listings and do not hold substantive value for the analysis. Although the id field is numerical, it is purely an identifier and cannot be treated as a quantitative variable for analytical purposes. Additionally, the analysis assumes that each host has only one listing, since several listings from the same host would not be independent observations and could introduce bias.

Key ratio-scaled, quantitative variables include host_since, host_response_rate, host_acceptance_rate, accommodates, bathrooms, bedrooms, beds, price, minimum_nights, maximum_nights, and number_of_reviews. I summarized these variables in Table 1, which reports the sample size (1,288 listings), mean, standard deviation, minimum, and maximum of each variable. For example, ratings are based on a 1-5 scale, while response and acceptance rates are stored as proportions between 0 and 1 (that is, 0% to 100%). The table also shows that the lowest-priced listing costs 13 euros per night.
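The five figures reported per variable in Table 1 can be illustrated with a minimal Python sketch. This is not the Stata workflow used for the paper; the function name `summarize` and the toy price list are hypothetical and are not taken from the Vienna dataset.

```python
import statistics


def summarize(values):
    """Return sample size, mean, sample standard deviation, minimum and maximum."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD, i.e. n - 1 in the denominator
        "min": min(values),
        "max": max(values),
    }


# Hypothetical nightly prices in euros, invented purely for illustration.
toy_prices = [20, 45, 60, 75, 100]
summary = summarize(toy_prices)

for label, value in summary.items():
    print(f"{label}: {value}")
```

Applied to each quantitative column in turn, a helper like this would produce one row of a summary table.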

Several qualitative variables, including host_response_time, neighbourhood_cleansed, property_type, and room_type, provide categorical information. These variables are best analyzed using frequency tables and histograms. For instance, Figure 1 shows that most hosts respond within an hour, though some take several hours or days. Figure 2 displays the distribution of listings across Vienna's districts, with the highest concentration in Landstraße. Figures 3 and 4 show that “Entire rental unit” is the predominant property type and “Entire home/apt” is the most common room type. Binary variables, such as host_is_superhost, host_has_profile_pic, and host_identity_verified, are shown in Figures 5 to 7. For instance, 35.79% of hosts are “superhosts,” nearly all hosts have profile pictures, and 84.32% have verified identities.
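As a rough illustration of what a frequency table for a categorical variable contains, the following Python sketch counts each category and converts the counts to percentages. It is a sketch only: the function name `frequency_table` and the five toy values are assumptions made for this example and do not reproduce the figures above.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (count, percentage), most frequent category first."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, round(100 * count / total, 2))
        for category, count in counts.most_common()
    }


# Hypothetical room types, invented purely for illustration.
toy_room_types = [
    "Entire home/apt",
    "Entire home/apt",
    "Private room",
    "Entire home/apt",
    "Shared room",
]
table = frequency_table(toy_room_types)

for category, (count, percent) in table.items():
    print(f"{category}: {count} ({percent}%)")
```

The same function applied to a binary variable such as host_is_superhost would yield the kind of percentage share quoted in the text.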

Another key component of descriptive statistics is correlation, which indicates the strength and direction of linear relationships between variables. Table 2 contains the correlation matrix, which reveals several notable correlations. For instance, the numbers of bathrooms, bedrooms, and beds are strongly positively correlated, as expected: more bedrooms mean more beds. Because these explanatory variables overlap in the information they carry (multicollinearity), this must be accounted for in the modeling. Figure 8 and Figure 9 illustrate these relationships. Additionally, there is a moderate positive correlation between response and acceptance rates, suggesting that hosts who respond more frequently also tend to accept bookings more often.
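Each entry of a correlation matrix like Table 2 is a Pearson correlation coefficient. The sketch below shows how one such coefficient is computed; the function name `pearson_r` and the toy bedroom and bed counts are hypothetical and are not values from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical listings, invented purely for illustration.
toy_bedrooms = [1, 1, 2, 3, 4]
toy_beds = [1, 2, 3, 4, 6]
r = pearson_r(toy_bedrooms, toy_beds)
print(f"r = {r:.3f}")
```

A value close to +1, as in this toy example, is the kind of strong positive relationship described for bedrooms and beds.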

Regression Model Preparations

To investigate my research question on popularity, I created a new variable to aggregate the relevant data. The number of reviews and the ratings for various aspects of the listing together define popularity, but the ratings need to be weighted by the number of reviews. This new outcome variable, popularity, has a theoretical minimum of 0, as listings without any reviews score zero. The largest value it could take in this dataset is 19,320: the highest review count (552) multiplied by the maximum possible rating sum across the seven review categories (7 × 5 = 35).
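The construction of the outcome variable can be sketched in Python as follows. This assumes that popularity is the review count multiplied by the sum of the seven review scores, which is what the stated upper bound of 552 × 35 implies. Only review_scores_rating and review_scores_accuracy are named in the text; the other five column names and the function name `popularity` are assumptions made for this example.

```python
# Seven review-score columns. Only the first two names appear in the text;
# the remaining five are assumed for illustration.
REVIEW_SCORE_COLUMNS = [
    "review_scores_rating",
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value",
]


def popularity(listing):
    """Review count times the sum of the seven review scores (0 if unreviewed)."""
    n_reviews = listing.get("number_of_reviews") or 0
    if n_reviews == 0:
        return 0.0  # a listing without reviews has no scores to weight
    score_sum = sum(listing[column] for column in REVIEW_SCORE_COLUMNS)
    return n_reviews * score_sum


# Hypothetical best case: the highest review count with a perfect score everywhere.
best_case = {"number_of_reviews": 552, **{c: 5.0 for c in REVIEW_SCORE_COLUMNS}}
unreviewed = {"number_of_reviews": 0}

print(popularity(best_case))
print(popularity(unreviewed))
```

Under this assumption the best case reproduces the upper bound quoted above, and an unreviewed listing scores zero.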

Complete Linear Regression Model

In my model, the outcome variable is popularity, while the explanatory variables include host_since, accommodates, bathrooms, bedrooms, beds, price, minimum_nights, maximum_nights, host_response_rate, host_acceptance_rate, binary variables (host_is_superhost, host_has_profile_pic, and host_identity_verified), and categorical variables (host_response_time, neighbourhood_cleansed, property_type, and room_type). The results are summarized in Table 3, and the complete linear regression equation is given in Equation 2.

Some key findings: holding the other variables constant, listings whose host is a superhost score on average 834.34 popularity points higher, and listings whose host responds within an hour score 282.94 points higher than those with slower response times. Increasing the minimum nights requirement reduces popularity. The intercept of the model is -2083.59, but this parameter only serves the model fit and has no practical interpretation (a listing cannot exist without essential features like bathrooms, bedrooms, or beds).
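The reading of a dummy-variable coefficient can be made concrete with a small Python sketch. It uses only the three coefficients quoted above; every other term of the fitted model is left out because it cancels when two otherwise identical listings are compared. The dictionary keys and the function name `partial_prediction` are hypothetical, and the function does not return a full model prediction.

```python
# Coefficients quoted in the text. All other terms of the fitted model are
# omitted on purpose: they cancel when comparing otherwise identical listings.
REPORTED_COEFFICIENTS = {
    "intercept": -2083.59,
    "host_is_superhost": 834.34,
    "responds_within_an_hour": 282.94,  # hypothetical name for this dummy
}


def partial_prediction(is_superhost, responds_within_an_hour):
    """Sum of the reported terms only; NOT a complete prediction of popularity."""
    c = REPORTED_COEFFICIENTS
    return (
        c["intercept"]
        + c["host_is_superhost"] * int(is_superhost)
        + c["responds_within_an_hour"] * int(responds_within_an_hour)
    )


# Two listings that differ only in superhost status.
superhost_gap = partial_prediction(True, True) - partial_prediction(False, True)
print(round(superhost_gap, 2))
```

Whatever values the omitted variables take, the difference between the two listings equals the superhost coefficient, which is what "on average 834.34 points more popular" expresses.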

The most basic measure of model quality is $R^2$, the coefficient of determination, which shows how well the model fits the data. In this case, with all variables included, $R^2 = 0.1802$, meaning that the explanatory variables together account for 18.02% of the variance in popularity. The adjusted $R^2$, which penalizes the number of explanatory variables, is 0.1692 (16.92%) and will be used for comparison across simplified models.
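As a quick consistency check, the adjusted value can be recomputed from the reported $R^2$ with the standard textbook formula. The Python sketch below assumes 1,288 observations and 17 regressors, the latter taken from the degrees of freedom (17 and 1270) reported for the global F-test; the function names are hypothetical. The same three quantities also determine the global F statistic, so it is recomputed alongside.

```python
def adjusted_r_squared(r2, n_obs, n_regressors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


def global_f_statistic(r2, n_obs, n_regressors):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for the all-slopes-zero test."""
    df_model = n_regressors
    df_residual = n_obs - n_regressors - 1
    return (r2 / df_model) / ((1 - r2) / df_residual)


R2 = 0.1802        # reported coefficient of determination
N_OBS = 1288       # sample size from the descriptive statistics
N_REGRESSORS = 17  # assumed from the F-test degrees of freedom (17, 1270)

adj_r2 = adjusted_r_squared(R2, N_OBS, N_REGRESSORS)
f_stat = global_f_statistic(R2, N_OBS, N_REGRESSORS)

print(f"adjusted R^2 = {adj_r2:.4f}")
print(f"global F     = {f_stat:.2f}")
```

Under these assumptions both recomputed values agree with the reported figures after rounding, which suggests the reported numbers are internally consistent.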

Model quality was also assessed using the global F-test. The null hypothesis states that all slope coefficients equal zero, while the alternative states that at least one slope differs from zero. The global F statistic for the complete model is 16.42, with 17 and 1270 degrees of freedom, and a p-value reported as 0.0000 (that is, below 0.00005). Given the p-value, we reject the null hypothesis at all conventional significance levels, indicating that at least one explanatory variable is...