Airbnb has revolutionized the way people travel and experience cities. With its vast range of listings, Airbnb offers affordable options for travelers and provides an alternative revenue stream for property owners. In this project, we will analyze Airbnb prices in European cities and identify the factors that affect them. We will focus on Amsterdam and Athens and use the datasets of Airbnb prices on weekdays and weekends. Our goal is to understand how factors such as room type, distance from the city center, and guest satisfaction impact Airbnb prices.
The dataset contains information about Airbnb listings in Amsterdam and Athens. The data is divided into two files for each city, one for weekdays and one for weekends. Each file contains the following columns:
realSum - the total price of the listing (quantitative)
room_type - private/shared/entire home/apt (categorical)
room_shared - whether the room is shared or not (categorical)
room_private - whether the room is private or not (categorical)
person_capacity - the maximum number of people that can stay in the room (quantitative)
host_is_superhost - whether the host is a superhost or not (categorical)
multi - whether the listing is for multiple rooms or not (categorical)
biz - whether the listing is for business purposes or not (categorical)
cleanliness_rating - the cleanliness rating of the listing (ordinal)
guest_satisfaction_overall - the overall guest satisfaction rating of the listing (ordinal)
bedrooms - the number of bedrooms in the listing (quantitative)
dist - the distance from the city center (quantitative)
metro_dist - the distance from the nearest metro station (quantitative)
lng - the longitude of the listing (quantitative)
lat - the latitude of the listing (quantitative)
To understand which factors affect Airbnb listing prices, we have decided to examine the variables realSum, dist, room_type, lng, and lat. The dataset has 7 quantitative variables, 6 categorical, and 2 ordinal.
On top of this, we pull additional information from a supplementary dataset (source: insideairbnb.com) to gather details on the descriptions of each listing within this time period, as well as the past 3 quarters. These are the columns of interest within the supplementary dataset:
name - name plus short description of each listing (non-categorical text)
neighbourhood - the region of Amsterdam that the listing is located in (categorical)
availability_365 - the number of days in a year that the listing is available (quantitative)
This dataset has 9 quantitative variables and 3 categorical variables.
Here is a summary of the insights we will seek to derive from the data:
1) Time-differentiated factors that contribute to Airbnb prices (analysis of weekends vs. weekdays)
2) Geospatial factors that contribute to Airbnb prices (how geographic properties affect Airbnb prices)
3) Factors that affect the availability of a listing
Introduction: Amsterdam is one of the most popular cities for Airbnb stays, and we want to explore what factors contribute to higher prices on weekends compared to weekdays. We will compare the distributions of Airbnb prices on weekends and weekdays and look at the relationship between price and distance from the city center.
After removing outliers from the data, we can see a difference between the density of Airbnb prices on weekends and on weekdays. The weekday density peaks at around 260-265, while the weekend density peaks at around 285-290. The shapes of the density curves also differ slightly: the weekend curve is flatter and shifted toward higher prices, indicating an overall higher density of more expensive Airbnbs. We can run a t-test to check whether the average Airbnb price differs between weekends and weekdays.
##
## Welch Two Sample t-test
##
## data: weekdayrealsum$realSum and weekendrealsum$realSum
## t = -4.229, df = 1935.8, p-value = 2.457e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -65.87325 -24.13274
## sample estimates:
## mean of x mean of y
## 485.9452 530.9482
Running this test, we can see that the weekend average (530.95) exceeds the weekday average (485.95) by about 45. This result is statistically significant: the p-value is 2.457e-05 and the 95 percent confidence interval for the difference does not contain 0. Through this analysis we can confirm that there is a statistically significant difference between Airbnb prices on weekdays and on weekends.
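As a minimal sketch of how such a comparison can be run, the block below applies base R's t.test() to two synthetic price vectors. The vectors weekday_price and weekend_price are made up for illustration and are not the report's data; only the numbers in the last two lines are taken from the output above.

```r
set.seed(42)

# Synthetic, right-skewed prices for illustration only (NOT the Airbnb data).
weekday_price <- rlnorm(1000, meanlog = 6.0, sdlog = 0.5)
weekend_price <- rlnorm(1000, meanlog = 6.1, sdlog = 0.5)

# t.test() runs Welch's unequal-variance test by default (var.equal = FALSE).
res <- t.test(weekday_price, weekend_price)

res$estimate   # the two sample means
res$conf.int   # 95 percent confidence interval for mean(x) - mean(y)
res$p.value

# For a two-sided test, "the 95% interval excludes 0" and "p < 0.05"
# always lead to the same decision.
ci_excludes_zero <- res$conf.int[1] > 0 || res$conf.int[2] < 0
significant <- res$p.value < 0.05

# Arithmetic check on the means reported in the output above.
reported_diff <- 530.9482 - 485.9452                # weekend minus weekday
reported_ci_mid <- (-65.87325 + -24.13274) / 2      # centre of the reported interval
```

The centre of the reported interval is the weekday mean minus the weekend mean, which is why it is negative while the weekend average is the larger one.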
From this scatter plot we can see that there is more variability in price among Airbnb listings that are closer to the city center. This may be because, closer to the center, more features come into play that could affect the price, such as whether the Airbnb is in a safe area or near any notable attractions. We can also see that prices are higher and the variability is greater for listings on the weekends, regardless of the distance from the city center.
From this visualization, we can see that entire homes/apartments and private rooms are the most common types of Airbnb listing in Amsterdam. For entire homes/apartments, there seem to be more options on weekends, most likely because bigger parties tend to travel for vacations, which happen more often on weekends. For private rooms, there are more listings on weekdays. We believe this could be because vacations and work events that happen on weekdays are usually shorter, so private rooms are in higher demand than anything bigger. Shared rooms are uncommon, possibly because safety concerns keep demand for them very low.
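One way to put a number on "more variability closer to the center" is to bin listings by distance and compare the standard deviation of price in each bin. The sketch below uses synthetic data in which the noise is deliberately built to shrink with distance; the bin edges and all values are assumptions for illustration, not the report's data.

```r
set.seed(11)
n <- 600

# Synthetic stand-in for the dist and realSum columns (NOT the Airbnb data).
dist <- runif(n, 0, 10)
# Demo assumption: price falls with distance and its noise shrinks with distance.
realSum <- 600 - 30 * dist + rnorm(n, sd = 200 / (1 + dist))

# Arbitrary illustrative bins: near, mid, and far from the center.
bins <- cut(dist, breaks = c(0, 2, 5, 10), include.lowest = TRUE)

# Standard deviation of price within each distance bin.
spread <- tapply(realSum, bins, sd)
spread
```

Applied to the real listings, the same tapply() call could be run separately for the weekday and weekend data to compare their spreads at each distance.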
Introduction: In this research question, we want to explore how Airbnb listings vary across different neighborhoods in Amsterdam. We will also use PCA to identify groups of variables that affect price, and a choropleth map to visualize the spatial distribution of listing prices across the city. In addition, we believe a word cloud of listing descriptions can help us identify key words that are associated with a higher realSum price.
A geospatial analysis of the city reveals that most of the listings are concentrated around the Amsterdam city center (as indicated by the magenta ring). This is generally aligned with the common consensus that listings around the city center are easier to upsell than those further away. Markers were added to identify key metro stations and attractions within the city. Interestingly, most of the listings were concentrated around key attractions in the west end of the city center rather than around major metro stations. One reason could be that residential areas near metro stations experience a lot of noise, so fewer listings are observed in those areas.
Some of the most common vocabulary used to describe listings in Amsterdam includes “spacious”, “city”, “canal”, “garden”, “beautiful”, and more. These words signal that Amsterdam is a great place to travel with “family” and bigger groups, because the listings are spacious and the city has a calmer lifestyle with gardens and canals, which are more often enjoyed by families than by people in their early twenties. There are also words such as “luxury”, “central” and “modern”, which paint a picture of the architecture and overall atmosphere of the city. This suggests that there is a lot to do and that the views are beautiful.
##                            PC1        PC2         PC3          PC4           PC5
## attr_index          0.66707944 -0.2206643 -0.03923374  0.085803103 -0.7052735886
## rest_index          0.67066538 -0.2016607 -0.05377438  0.067228711  0.7086109779
## lng                -0.16229521 -0.7099294 -0.19278902 -0.657641575 -0.0006686515
## lat                 0.28037002  0.5811990  0.16648973 -0.745389739 -0.0166028680
## cleanliness_rating  0.01630612  0.2623893 -0.96471892 -0.004452304 -0.0135478741
A scatterplot of the first two principal components, with the color of each point representing cleanliness_rating, helps us visualize the variation in the data and identify potential clusters or patterns. After extracting the loadings, we see that ‘attr_index’ and ‘rest_index’ are the largest contributors to the first principal component, with loadings of about 0.67 each. These variables capture each listing’s proximity to attractions and to restaurants, respectively. Note that in the PCA plot the cleanliness rating scale includes negative values because each variable is centered at zero during the scaling step: since the mean rating is greater than zero, ratings below the mean become negative after centering. The sign of the loading for cleanliness rating still indicates the direction of its relationship with each principal component. With this in mind, we observe that listings with a poor cleanliness rating are concentrated around 0.00 on both PC1 and PC2, so they contribute little to the variance captured by these components. Cleanliness rating and latitude both load positively on PC2 (0.26 and 0.58), so together they appear to shift listings along that component.
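The centering and loading steps described above can be sketched with base R's prcomp(). The data frame below is synthetic: the column names follow the report, but every value is invented, and attr_index and rest_index are deliberately constructed to be correlated so that they dominate the first component.

```r
set.seed(7)
n <- 200

# Synthetic stand-in for the five PCA inputs (NOT the Airbnb data).
attr_index <- rlnorm(n, meanlog = 5, sdlog = 0.5)
listings <- data.frame(
  attr_index = attr_index,
  rest_index = 2 * attr_index + rnorm(n, sd = 20),  # built to track attr_index
  lng = rnorm(n, mean = 4.9, sd = 0.03),
  lat = rnorm(n, mean = 52.37, sd = 0.02),
  cleanliness_rating = sample(6:10, n, replace = TRUE)
)

# center = TRUE subtracts each column mean; scale. = TRUE divides by its sd.
pca <- prcomp(listings, center = TRUE, scale. = TRUE)

pca$rotation                                         # loadings, one column per PC
summary(pca)$importance["Proportion of Variance", ]  # variance share per PC

# Centering is why an all-positive rating shows negative values in a PCA plot.
scaled_clean <- scale(listings$cleanliness_rating)
range(listings$cleanliness_rating)  # raw ratings: all positive
range(scaled_clean)                 # centered ratings: negative below the mean
```

Each column of pca$rotation has unit length, so a loading near 0.67 on two variables at once, as in the output above, means those two variables account for most of that component.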
Introduction: So far we’ve studied how prices of listings can vary across weekends and weekdays. We’ve also taken a look at the general geographic distribution of listings, some key descriptive terms that may affect the listing prices, and other discrete numerical variables that may influence the price of a listing. Now we seek to understand what affects the availability of a listing so we have a better sense for the balance between the price of a listing, its availability and the factors that may influence this availability.
##
## Pearson's product-moment correlation
##
## data: airbnb_data$neighbourhood_numeric and airbnb_data$availability_365
## t = -3.5525, df = 6996, p-value = 0.0003841
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06579914 -0.01902321
## sample estimates:
## cor
## -0.04243443
The correlation test shows a negative correlation between the numerically encoded neighbourhood and the availability of a listing. The correlation coefficient is -0.0424, which is close to 0 and indicates a very weak negative correlation. The p-value of 0.0003841 means the correlation is statistically significant at the 5% significance level, which is unsurprising with roughly 7,000 listings, since even very small correlations become detectable in large samples. The magnitude of the coefficient, however, suggests that there is little to no linear relationship between neighbourhood and availability. It is also important to note that, because neighbourhood_numeric is a numeric encoding of a categorical variable, the sign of the coefficient depends on how the neighbourhoods are numbered. A negative correlation therefore does not by itself mean that availability is lower in one particular neighbourhood than in another, only that there is a weak tendency for availability to differ across neighbourhoods.
From q1 to q2, price increases as availability increases. From q2 to q3 we see the opposite relationship: price increases slightly while availability decreases. Finally, from q3 to q4, price plateaus while availability increases. These differences across quarters could be attributed to the weather in each quarter and to the cycle of the workforce (people are more likely to go on summer vacations, so q2-q3 has a dip in availability as price continues a steady increase). Price might also increase across the quarters because people are given more breaks and holidays as the year goes on.
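The sketch below illustrates two points about this test using base R's cor.test(): how a categorical neighbourhood might be turned into numbers, and how little variance a coefficient of this size explains. The data frame is synthetic and the alphabetical factor encoding is an assumption, since the report does not say how neighbourhood_numeric was built. Only r and df in the last lines come from the output above.

```r
set.seed(3)
n <- 500

# Synthetic stand-in; neighbourhood names "A".."D" are placeholders.
airbnb_data <- data.frame(
  neighbourhood = sample(c("A", "B", "C", "D"), n, replace = TRUE),
  availability_365 = sample(0:365, n, replace = TRUE)
)

# Assumed encoding: factor levels (alphabetical order) mapped to 1, 2, 3, ...
airbnb_data$neighbourhood_numeric <- as.numeric(factor(airbnb_data$neighbourhood))

res <- cor.test(airbnb_data$neighbourhood_numeric, airbnb_data$availability_365)
r_original <- unname(res$estimate)

# Reversing the (arbitrary) numbering flips the sign of the coefficient.
code <- airbnb_data$neighbourhood_numeric
reversed <- max(code) + 1 - code
r_reversed <- cor(reversed, airbnb_data$availability_365)

# Check on the reported values: t = r * sqrt(df / (1 - r^2)).
r_reported <- -0.04243443
df_reported <- 6996
t_check <- r_reported * sqrt(df_reported / (1 - r_reported^2))
r_squared <- r_reported^2   # share of variance explained by the linear fit
```

Here t_check reproduces the reported t statistic of -3.5525, while r_squared is about 0.0018, so the linear relationship accounts for under 0.2% of the variance in availability.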
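A quarter-by-quarter comparison like this one can be sketched by averaging price and availability within each quarter. The layout below (one row per listing per quarterly snapshot, with columns quarter, price, and availability_365) is a hypothetical arrangement of the data, and all numbers are invented for illustration.

```r
# Hypothetical layout and made-up values (NOT the Airbnb data).
snapshots <- data.frame(
  quarter = rep(c("q1", "q2", "q3", "q4"), each = 3),
  price = c(100, 120, 110,  130, 150, 140,  150, 160, 155,  150, 165, 150),
  availability_365 = c(60, 80, 70,  120, 140, 130,  100, 110, 105,  130, 150, 140)
)

# Mean price and mean availability per quarter.
by_quarter <- aggregate(cbind(price, availability_365) ~ quarter,
                        data = snapshots, FUN = mean)
by_quarter

# Quarter-over-quarter changes show where the two series move together or apart.
price_change <- diff(by_quarter$price)
availability_change <- diff(by_quarter$availability_365)
```

A positive price_change paired with a negative availability_change marks a quarter in which the two moved in opposite directions.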
Our goal with this analysis through visualization was to determine which features of Airbnb listings in Amsterdam are correlated with other features. We specifically looked into days of the week, location, and time of the year, as well as how some of these features relate to the price of the listing. With our first research question, we found that there is a statistically significant difference in the average price of a listing on weekends versus weekdays. This follows logical reasoning because of the higher demand for travel on weekends. With our second research question, we found that Amsterdam has a relatively good reputation regarding atmosphere and overall location, as reflected in the listing descriptions on Airbnb. We also found that most listings are in the city center, and ones closer to the center tend to be priced higher because of the access to transportation. With our third question, we found that the first quarter has low availability and price, while the second quarter has higher availability but only a slightly higher price. The last two quarters are fairly flat relative to the second quarter in both availability and price. To summarize, we found that Airbnb listing prices in Amsterdam are higher on weekends, especially when the listings are closer to the city center and in the middle of the year. We could use this to recommend that travelers on a budget go during the off season, such as the first quarter, choose weekdays, and try to find options farther from the city center that do not carry an opportunity cost of expensive transportation. The correlation between neighborhood and availability of a listing is statistically significant but very weak, so the availability of a listing is only slightly associated with the neighborhood an individual looks in. To conclude, this project was really useful in analyzing the Airbnb market in Amsterdam, which we can use to make better travel decisions.
We found this data to be very interesting, with a lot more potential to analyze other trends and variables that affect the prices of Airbnb listings. For example, we could extend the analysis to Airbnb hosts and see whether their appearance or reviews about them affect an individual’s willingness to pay for that host’s property. We could also go further with the location analysis to determine whether other features near a listing affect the price. For example, we could look at how many restaurants are near the Airbnb or how diverse those restaurants are. We could also note specific popular attractions in Amsterdam and look into how far listings are from them, and whether listings are more expensive in areas that tourists frequent or in places where locals live. We could also use similar visualization strategies for listings in other cities, to see whether different factors dominate prices there. For these other cities, we could repeat what we did for this project and add analyses tailored to what makes each city unique. For example, Greece is known more for its overall vibe with white buildings, so attractions might matter less, and with Paris we could look into how far listings are from the Eiffel Tower. There are many directions in which we could extend this analysis, and it would be interesting to go further and answer some of these questions.