Data Description

Airbnb is an online service that allows property owners to rent out spaces for short periods. We are working with a dataset from Kaggle called ‘Airbnb Prices in European Cities’, which provides data on 51,707 Airbnb listings in various European cities.

The cities this dataset covers are:

The variables in the dataset are:

Research Questions

Graphics

For our first research question, we wanted to examine how the relationship between different variables and price differs from city to city. One important variable is distance from the city center: we expect it to have a negative association with price, but the strength of that association could differ by city. Thus, we look at scatter plots of price and distance from the city center, faceted by city, excluding the top 10% most expensive listings to limit the influence of outliers.
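As a rough illustration of the trimming and faceting step described above, here is a minimal sketch in plain Python. It assumes the trim is applied to the pooled listings rather than within each city (the text does not say which), and the column names `city` and `dist` are assumptions; `realSum` is the price variable. The toy rows are made up and are not taken from the dataset.

```python
from collections import defaultdict


def drop_top_fraction(rows, price_col="realSum", fraction=0.10):
    """Return the rows with the most expensive `fraction` of listings removed."""
    ranked = sorted(rows, key=lambda row: row[price_col])
    n_drop = int(len(ranked) * fraction)  # number of listings to discard
    return ranked[: len(ranked) - n_drop]


def facet_by_city(rows, x_col="dist", y_col="realSum"):
    """Group (x, y) pairs by city, giving one list of points per facet panel."""
    panels = defaultdict(list)
    for row in rows:
        panels[row["city"]].append((row[x_col], row[y_col]))
    return dict(panels)


# Made-up toy rows (hypothetical values, assumed column names).
toy_rows = [
    {"city": "Rome" if i % 2 else "Berlin", "dist": 0.5 * i, "realSum": 100.0 + 10.0 * i}
    for i in range(20)
]
trimmed = drop_top_fraction(toy_rows)
panels = facet_by_city(trimmed)
print(len(trimmed), {city: len(points) for city, points in panels.items()})
```

Each entry of `panels` holds the points for one city's scatter plot; trimming within each city instead would simply mean calling `drop_top_fraction` on each city's rows separately.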

We find some interesting differences in the trends. For most cities, the listings closest to the city center, usually within a few miles, are the most expensive, and prices decline with distance from there. Beyond a few miles, the trend flattens for many cities. However, some cities show a different pattern. In Rome, for example, the listings closest to the city center are, on average, the cheapest. Berlin also does not show a steep downward trend in price as distance from the city center increases.

Next, we look at the relationship between distance from the nearest metro station and guest satisfaction, faceted by city. This helps us answer how the location of an Airbnb listing relates to guest satisfaction, and how that relationship differs by city.

Ultimately, for most cities we see little trend between guest satisfaction and distance from the nearest metro station. For London, however, we see, strangely, a positive correlation between distance from the nearest metro station and guest satisfaction. This could be due to confounding variables. We also get a sense of which cities have many listings far from the nearest metro station and in which cities the stations are easier to reach, such as Athens and Paris, where there is a station relatively close to every Airbnb listing.

This information can help us determine which cities may be more worth visiting depending on whether we have a car or another easy mode of transportation, especially if we plan to spend most of our time in the city center. For example, if we expect to be able to travel 10 miles a day easily, it may be worth booking an Airbnb in Amsterdam or Barcelona, where listings that far out are heavily discounted on average. We could also look at London, where guest satisfaction is higher far from metro stations, since we would not need to use them in this scenario. On the other hand, if we need to stay close to everything we want to do, we should look at Rome or Berlin, where listings near the city center are still cheap, or at a city like Athens, where many listings with high guest satisfaction are near a metro station.

For our second research question, we wanted to see on which Airbnb listing characteristics the different European cities differ. This suggested a PCA biplot, which lets us visualize the linear relationships between variables; by coloring the points by city, we can also see how geography relates to those variables.

First and foremost, the biplot indicates a clear degree of differentiation between the European cities in their Airbnb characteristics. The variables rest_index, rest_index_norm, attr_index, and attr_index_norm all point to the right, signaling that listings with a high first principal component score tend to have higher values of these variables. Listings in London and Rome seem to have high values of these variables, while listings in Berlin and Budapest seem to have the greatest distances to both the city center and metro stops. This also seems to be the case for Vienna, where, in addition, listings seem to have a high number of bedrooms, high cleanliness ratings, high person capacity, and high overall guest satisfaction. Athens and Budapest also seem to rank high in number of bedrooms per listing, cleanliness, person capacity, and overall guest rating, and these variables are all correlated with each other. The group of variables related to restaurants and attractions is nearly orthogonal to the group containing bedrooms, person_capacity, guest_satisfaction, and cleanliness_rating.

It is perhaps unsurprising that the restaurant and attraction indexes are highly correlated with each other, as some areas simply have more to do than others. Likewise, it is not surprising that the number of bedrooms is highly correlated with capacity, or that cleanliness is highly correlated with overall guest satisfaction.
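To make the link between correlated variables and biplot arrows concrete, here is a minimal sketch of how first-component loadings can be computed. It assumes PCA on standardized variables (that is, on the correlation matrix) and uses power iteration in plain Python rather than whatever routine produced the biplot; it recovers only the first component, and the toy columns are made up. Power iteration from a uniform start is a simplification that can fail if that start is orthogonal to the leading component.

```python
from math import sqrt


def standardize(columns):
    """Center each column and scale it to unit sample standard deviation."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        result.append([(v - mean) / sd for v in col])
    return result


def correlation_matrix(z_columns):
    """Correlation matrix of already standardized columns."""
    n = len(z_columns[0])
    return [
        [sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z_columns]
        for zi in z_columns
    ]


def first_component(matrix, iterations=500):
    """Leading eigenvector (loadings) and eigenvalue via power iteration."""
    p = len(matrix)
    v = [1.0 / sqrt(p)] * p  # uniform starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return v, eigenvalue


# Made-up toy columns: two perfectly correlated indexes and one other variable.
attr_index = [1.0, 2.0, 3.0, 4.0, 5.0]
rest_index = [2.0, 4.0, 6.0, 8.0, 10.0]
bedrooms = [1.0, 3.0, 1.0, 3.0, 2.0]

corr = correlation_matrix(standardize([attr_index, rest_index, bedrooms]))
loadings, variance = first_component(corr)
print(loadings, variance)
```

In this toy example the two perfectly correlated indexes receive identical loadings, which is the numerical counterpart of arrows pointing in the same direction on a biplot.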

To further investigate the restaurant and attraction indexes, we look at the correlation of each index with price, by city.

The graph shows that, unsurprisingly given the PCA biplot, the attraction index and the restaurant index have nearly the same correlation with price within an individual city. However, the correlation is much stronger for some cities than for others. Airbnb listings in Rome that are near more restaurants and attractions are more expensive than those that are not, whereas in Barcelona the correlation is weaker. This could be helpful if we are looking to travel to Europe: listings near restaurants and attractions in Barcelona and Vienna are not necessarily more expensive than those farther away, so those could be cities to look into more closely.
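A per-city correlation of this kind could be computed as in the sketch below. It assumes the Pearson correlation coefficient (the text does not name the coefficient used) and a `city` column; `attr_index` and `realSum` are variable names from the dataset, and the toy rows are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_by_city(rows, index_col, price_col="realSum"):
    """Correlation between one index column and price, computed per city."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row["city"], []).append((row[index_col], row[price_col]))
    return {
        city: pearson([p[0] for p in pairs], [p[1] for p in pairs])
        for city, pairs in by_city.items()
    }


# Made-up toy rows: price rises with the index in one city but not the other.
toy_rows = (
    [{"city": "Rome", "attr_index": float(i), "realSum": 100.0 * i} for i in range(1, 5)]
    + [
        {"city": "Barcelona", "attr_index": float(i), "realSum": price}
        for i, price in zip(range(1, 5), [200.0, 100.0, 200.0, 100.0])
    ]
)
city_corr = correlation_by_city(toy_rows, "attr_index")
print(city_corr)
```

Running the same function with `rest_index` as `index_col` would give the second set of correlations to compare against the first.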

We next investigate the two variables in this dataset that interest us most: the overall guest satisfaction with the listing and the rental price of the listing. To do so, we make two point-referenced plots.

In our first point-referenced plot, we see a map of Europe with a point at the city center of each city in the dataset. We averaged the overall guest satisfaction scores within each city and represented this average by the color of the point, while the size of the point represents the number of listings in that city. Athens and Budapest have the highest average guest satisfaction ratings, closely followed by Amsterdam and Berlin. At the opposite end of the spectrum, London, Lisbon, and Barcelona have the lowest average guest satisfaction ratings. We also note that London and Rome have the most listings, at more than 9,000, while Berlin, Amsterdam, and Barcelona each have fewer than 3,000. We also built a similar visualization for average rental price per city.

We note that Amsterdam has the highest average rental price (defined as the full price of accommodation for two people and two nights) by a substantial margin. The average total price of a listing there is 573 euros, compared with 392 euros in Paris, the second most expensive city, and 363 euros in London. Most of the other cities have average prices between 200 and 300 euros, except for Budapest and Athens, at 176 and 151 euros respectively.
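The per-city aggregation behind both maps can be sketched as follows. This is a minimal illustration, not the code used for the plots: the column names `city` and `guest_satisfaction` are assumptions, `realSum` is the price variable, and the toy values are made up rather than drawn from the dataset.

```python
def summarize_by_city(rows):
    """Listing count, mean guest satisfaction, and mean price for each city."""
    totals = {}
    for row in rows:
        t = totals.setdefault(row["city"], {"n": 0, "satisfaction": 0.0, "price": 0.0})
        t["n"] += 1
        t["satisfaction"] += row["guest_satisfaction"]
        t["price"] += row["realSum"]
    return {
        city: {
            "n_listings": t["n"],  # would drive the point size
            "mean_satisfaction": t["satisfaction"] / t["n"],  # point color, first map
            "mean_price": t["price"] / t["n"],  # point color, second map
        }
        for city, t in totals.items()
    }


# Made-up toy rows (hypothetical values).
toy_rows = [
    {"city": "Athens", "guest_satisfaction": 90.0, "realSum": 100.0},
    {"city": "Athens", "guest_satisfaction": 100.0, "realSum": 200.0},
    {"city": "Amsterdam", "guest_satisfaction": 80.0, "realSum": 500.0},
]
summary = summarize_by_city(toy_rows)
print(summary)
```

Each city's summary would then be joined to the coordinates of its city center before plotting.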

For our third research question, we want to analyze how the distribution of Airbnb prices depends on the time of week, that is, whether or not the stay falls on a weekend. Because some prices are far greater than others, we remove the listings with the top 10% of prices before comparing the distributions.

The ECDF plot above displays the empirical cumulative distribution of the “realSum” (price) variable for weekend and weekday Airbnb listings. A cumulative distribution shows the proportion of observations that fall at or below a given price point. The ECDF for weekday listings is consistently higher than that for weekend listings; this means that, at each price point, a higher proportion of weekday listings fall at or below that price (implying that weekend prices tend to be slightly higher).
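For readers less familiar with ECDFs, the following minimal sketch shows how one is evaluated and why a higher curve corresponds to lower prices. The weekday and weekend prices are made-up toy values, not values from the dataset.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the share of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def evaluate(x):
        # bisect_right counts how many sorted observations are <= x
        return bisect_right(ordered, x) / n

    return evaluate


# Made-up toy prices (hypothetical values).
weekday_prices = [100.0, 150.0, 200.0, 250.0]
weekend_prices = [120.0, 180.0, 240.0, 300.0]

weekday_ecdf = ecdf(weekday_prices)
weekend_ecdf = ecdf(weekend_prices)
print(weekday_ecdf(200.0), weekend_ecdf(200.0))
```

In the toy data a larger share of weekday prices sits at or below 200, so the weekday curve lies above the weekend curve at that point, which is the pattern described for the real plot.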

Two-sample Kolmogorov-Smirnov test

          Test Statistic    P-Value
Values    0.033054          1.826e-11

Displayed above are the results of a two-sample KS test, which compares the distributions of price for weekdays and weekends. The test yields a p-value of \(1.826 \times 10^{-11}\), which is below any reasonable significance level \(\alpha\). This is consistent with what we saw in the ECDF plot: we have sufficient evidence to conclude that the distributions of price for weekdays and weekends are different.
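The KS test statistic is the largest vertical gap between the two ECDFs. The sketch below computes it directly, together with a p-value from the standard asymptotic Kolmogorov series; this is an approximation under our own assumptions and is not necessarily the exact computation behind the table above. The function name `ks_two_sample` and the toy samples are hypothetical.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic two-sided p-value."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The largest ECDF gap is attained at one of the observed values.
    for x in a_sorted + b_sorted:
        gap = abs(bisect_right(a_sorted, x) / n - bisect_right(b_sorted, x) / m)
        d = max(d, gap)
    lam = sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(2.0 * series, 0.0), 1.0)


# Made-up toy samples (hypothetical values).
d_separated, p_separated = ks_two_sample([1, 2, 3, 4], [5, 6, 7, 8])
d_same, p_same = ks_two_sample([1, 2, 3, 4], [1, 2, 3, 4])
print(d_separated, p_separated, d_same, p_same)
```

Because the p-value depends on the sample sizes as well as on D, a small statistic can still be highly significant when each group contains many thousands of listings.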

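Together with the ECDF plot, these results support our suspicion that Airbnb prices are higher on weekends than on weekdays, although the test statistic of 0.033 indicates that the largest gap between the two distributions is small. Thus, when traveling to Europe in the future, we may want to keep in mind that renting on weekdays will tend to be cheaper than renting on weekends.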

Future Work

This report covers many trends across different European cities in Airbnb listing data. However, the data contains only a subset of major cities. In the future, we could look into how these trends differ in more European cities. Additionally, there are several variables we may want to analyze further, such as the capacity or number of bedrooms of each listing, since those are key factors in booking an Airbnb. Furthermore, we could try to predict price from a number of these variables, which would let us find listings that cost much less or much more than expected.

Citations

Kahle, D., & Wickham, H. (2013). ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Gyódi, Kristóf, & Nawaro, Łukasz. (2021). Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4446043