Data Description

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. This dataset describes all the listing activity of homestays in New York City up to 12/31/2019. It has 85,444 observations across 26 attributes, which range from host information, geographic information, and booking logistics to house/room information, reviews, and availability.

We explore this dataset to give travelers to New York City guidance before they choose an Airbnb. Specifically, we want travelers to have a general understanding of where the cheapest and best-reviewed homestays are. We also analyze the text in the data to find common trends in the house rules and the listing names, which can help hosts write more appealing listing names and inform people about the most popular house rules. We believe that after reading our analysis of the New York City Airbnb dataset, people will make much more informed decisions, whether they are choosing an Airbnb or creating a new listing.

Here are the variables that we are focusing on:

construction.year - the year when the Airbnb was constructed;

price - the price of a one-night stay;

minimum.nights - the minimum number of nights a customer is required to stay;

number.of.reviews - the number of reviews of this Airbnb since its listing;

reviews.per.month - the number of reviews of this Airbnb per month;

review.rate.number - the current review rating of this Airbnb (a whole number);

calculated.host.listings.count - the number of Airbnbs the same host has listed on the Airbnb website;

availability.365 - the number of days per year the Airbnb is available;

neighbourhood.group - the borough of New York City (one of five) where the Airbnb is located;

house.rule - the house rules of the Airbnb listed;

NAME - the name of the listing.

Question 1: PCA: Which continuous variables contribute the most to the variance in this dataset?

No two Airbnbs are the same. We are interested in which continuous variables contribute the most to the variance in this dataset. Some of the quantitative variables overlap, and some of this redundancy is visible by inspection: for example, total.fee is just the sum of service.fee and price. A dimension-reduction technique is therefore useful for reducing potential multicollinearity in this dataset. Here, we focus on the continuous variables in the dataset: construction.year, price, service.fee, minimum.nights, number.of.reviews, reviews.per.month, review.rate.number, calculated.host.listings.count, and availability.365.

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7
## Standard deviation     1.4144 1.2806 1.0834 1.0056 0.9986 0.9733 0.91295
## Proportion of Variance 0.2223 0.1822 0.1304 0.1124 0.1108 0.1053 0.09261
## Cumulative Proportion  0.2223 0.4045 0.5349 0.6473 0.7581 0.8633 0.95593
##                            PC8      PC9
## Standard deviation     0.62979 0.003016
## Proportion of Variance 0.04407 0.000000
## Cumulative Proportion  1.00000 1.000000

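The elbow plot above shows the proportion of variance explained by each principal component of our NYC Airbnb dataset. After component 7, the proportion of variance drops below 5%: PC8 explains about 4.4% and PC9 essentially none, while the first 7 components together explain about 95.6% of the variance. Hence, it is reasonable to use 7 principal components for this dataset.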

The above figure plots PC1 against PC2, with points grouped by neighbourhood group. There is no obvious clustering pattern, so PC1 and PC2 appear to be at most very weakly associated with neighbourhood group. We also investigated whether PC1 and PC2 show any clustering pattern for other categorical variables, such as room type and whether the listing is instantly bookable, but again we did not observe any obvious pattern. We do observe that increasing PC1 is strongly correlated with an increase in price and service fee, while increasing PC2 is strongly correlated with an increase in minimum nights and host listing count and with a decrease in reviews per month and yearly availability. We also observe that a decrease in PC1 and a decrease in PC2 are both correlated with an increase in review.rate.number and construction year.

Additionally, we investigated whether PC1 and PC3, or PC2 and PC3, would show a clearer clustering pattern, but we did not notice any such pattern regardless of the categorical variable we chose. Hence, it seems that the dataset is extremely mixed and is hard to separate and cluster even after PCA.
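The component-selection rule used in this section (keep the components that each explain at least 5% of the variance) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `listings` is simulated, its values are made up, and the near-collinearity between price and service.fee is built in by hand to mimic the redundancy discussed above. It is not the code that produced the output in this report.

```r
# Simulated stand-in for the Airbnb data; every value here is made up.
set.seed(1)
n <- 500
price <- runif(n, 50, 1200)
listings <- data.frame(
  price             = price,
  # Assumption: service fee is almost a fixed fraction of price.
  service.fee       = 0.2 * price + rnorm(n, sd = 0.5),
  minimum.nights    = rpois(n, 5),
  number.of.reviews = rpois(n, 25),
  availability.365  = sample(0:365, n, replace = TRUE)
)

# Center and scale so that no variable dominates because of its units.
pca <- prcomp(listings, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total.
prop.var <- pca$sdev^2 / sum(pca$sdev^2)
cum.var  <- cumsum(prop.var)
print(round(rbind(prop.var, cum.var), 4))

# Keep the components whose individual share of variance is at least 5%.
n.keep <- sum(prop.var >= 0.05)
print(n.keep)
```

In this simulated example the last component carries almost no variance because two columns are nearly collinear, which is the same pattern that a near-zero final component signals in a real PCA summary.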

Question 2: What are the Differences of Airbnbs in 5 Boroughs of New York City?

In this section, we want to explore three different dimensions (reviews, cost and availability) for the 5 boroughs of New York City. Which borough has the best reviews? Which borough has the lowest price? Which borough has the best availability throughout the year? Answering these questions will help us make more informed decisions when choosing the general area to stay in New York City when travelling.

1) Boroughs and Reviews

First, we explore the relationship between boroughs and reviews (review rate number and total number of reviews).

From the graph we can see that Brooklyn and Manhattan have the highest number of reviews, which makes sense as these are the most popular boroughs for travellers. On the other hand, Staten Island and the Bronx have the lowest number of reviews, which suggests that they are not popular travel destinations. In terms of review rating, all 5 boroughs have a similar proportion of every rating (5 is the best rating and 1 is the worst). Hence we conclude that the quality of Airbnbs is very similar across all 5 boroughs.

2) Boroughs and Cost

Next, we explore the relationship between the total cost per night and the boroughs. The total cost per night is calculated as: total cost = price + service fee.

Surprisingly, the total costs are basically the same across the 5 boroughs. Manhattan and Brooklyn have slightly more yellow dots (higher total costs) than the other boroughs, but the trend is not obvious. To check whether there is a statistically significant difference in total cost, we conduct two statistical tests.

1. Bartlett’s Test

## 
##  Bartlett test of homogeneity of variances
## 
## data:  total.fee by neighbourhood.group
## Bartlett's K-squared = 5.9335, df = 4, p-value = 0.2042

Since the p-value (0.2042) is greater than the significance level (alpha = 0.05), we fail to reject the null hypothesis of Bartlett's test, and we find no evidence that the variance of total cost differs among the 5 boroughs.
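As a rough sketch of how such a test can be run, the base R snippet below applies `bartlett.test` to simulated data. The data frame `sim` and all of its values are assumptions made for illustration; only the column names follow the report. The last lines recompute the p-value from the reported statistic (K-squared = 5.9335, df = 4) using the chi-squared distribution, which is the reference distribution of Bartlett's statistic.

```r
# Simulated total cost per borough; the values are made up for illustration.
set.seed(2)
boroughs <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
sim <- data.frame(
  neighbourhood.group = factor(rep(boroughs, each = 200)),
  total.fee           = rnorm(1000, mean = 750, sd = 400)
)

# Null hypothesis: the variance of total.fee is the same in every borough.
bt <- bartlett.test(total.fee ~ neighbourhood.group, data = sim)
print(bt)

# Recompute the p-value from the statistic reported in the output above.
# Under the null, K-squared follows a chi-squared distribution with
# (number of groups - 1) = 4 degrees of freedom.
p.reported <- pchisq(5.9335, df = 4, lower.tail = FALSE)
print(round(p.reported, 4))
```

The recomputed value agrees with the p-value of 0.2042 printed in the R output.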

2. One-way Test

## 
##  One-way analysis of means
## 
## data:  total.fee and neighbourhood.group
## F = 1.4661, num df = 4, denom df = 85012, p-value = 0.2095

Since the p-value (0.2095) is greater than the significance level (alpha = 0.05), we fail to reject the null hypothesis of the one-way test, and we find no evidence that the mean total cost differs among the 5 boroughs.
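A minimal sketch of a one-way test of means in base R is shown below, again on simulated data (the data frame `sim` and its values are assumptions). The call uses `var.equal = TRUE`, which matches the integer denominator degrees of freedom and the plain "One-way analysis of means" header in the output above. With that setting the test coincides with the classical one-way ANOVA F test, which the sketch checks against `aov`.

```r
# Simulated total cost per borough; the values are made up for illustration.
set.seed(3)
boroughs <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
sim <- data.frame(
  neighbourhood.group = factor(rep(boroughs, each = 200)),
  total.fee           = rnorm(1000, mean = 750, sd = 400)
)

# Null hypothesis: the mean of total.fee is the same in every borough.
# var.equal = TRUE assumes a common variance across boroughs.
ow <- oneway.test(total.fee ~ neighbourhood.group, data = sim, var.equal = TRUE)
print(ow)

# The same F statistic from a classical one-way ANOVA table.
f.aov <- summary(aov(total.fee ~ neighbourhood.group, data = sim))[[1]][["F value"]][1]
print(f.aov)

# Recompute the p-value from the statistic reported in the output above.
p.reported <- pf(1.4661, df1 = 4, df2 = 85012, lower.tail = FALSE)
print(round(p.reported, 4))
```

Assuming equal variances is a design choice that the preceding Bartlett's test supports; leaving `var.equal` at its default of `FALSE` would instead run Welch's version of the test.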

Based on the analysis above, we conclude that there is no statistically significant difference in total cost among Airbnbs in the 5 boroughs of New York City.

3) Boroughs and Availability

Finally, we explore the distribution of yearly availability (availability.365) across the boroughs.

From the graph we can see that Airbnbs in Staten Island and the Bronx typically have the most available days in a year, as their median, first quartile (Q1), and third quartile (Q3) are the highest among all boroughs. On the other hand, Airbnbs in Manhattan and Brooklyn have the fewest available days. This makes sense: Manhattan and Brooklyn are travelers' major destinations, so listings there are booked more often and have fewer available days.
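The quartiles that a box plot displays can also be tabulated directly, which makes the comparison between boroughs easier to read off. The sketch below does this in base R on simulated data; the data frame `sim` and its uniformly drawn availability values are assumptions, so the printed numbers say nothing about the real boroughs.

```r
# Simulated yearly availability per borough; the values are made up.
set.seed(4)
boroughs <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
sim <- data.frame(
  neighbourhood.group = rep(boroughs, each = 100),
  availability.365    = sample(0:365, 500, replace = TRUE)
)

# Q1, median and Q3 of availability.365 within each borough.
# The result is a matrix with one row per quartile and one column per borough.
quartiles <- sapply(
  split(sim$availability.365, sim$neighbourhood.group),
  quantile, probs = c(0.25, 0.5, 0.75)
)
print(quartiles)
```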

Question 3: What are Some Prominent Word Choices of Airbnb Listings in New York City?

In this section, we explore the text information in the Airbnb dataset. What trends can we find in the names of the Airbnb listings? What do hosts care about the most in their house rules?

1) Uncovering the Most Common Additional House Rules

First, we explore which additional house rules are the most common.

We can see that no smoking is the most frequently mentioned rule, followed by no pets and no guests. We then compare the rules of private rooms with the rules of entire homes/apartments.

The comparison word cloud shows that both private room and entire home/apt listings have rules on smoking. For private rooms, there is also a focus on “noise” and “respect”, which makes sense because the guests are most likely sharing the space with someone else. For entire home/apt listings, the focus is more on enjoying the stay, as “enjoy” and “blast” are among the top words.
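The counts behind a word cloud of house rules can be approximated with a simple phrase tally. The base R sketch below counts phrases of the form "no <word>" in a few invented rule texts. The vector `rules` is hypothetical, and this regular-expression approach is an assumption chosen for brevity; it is not necessarily the text-mining pipeline used for the plots in this report.

```r
# Hypothetical house-rule texts, invented for illustration.
rules <- c(
  "No smoking. No pets.",
  "No smoking, no parties, and no extra guests.",
  "Please respect the neighbors. No smoking inside."
)

# Lowercase the text, then extract every "no <word>" phrase.
text    <- tolower(rules)
phrases <- unlist(regmatches(text, gregexpr("\\bno [a-z]+", text, perl = TRUE)))

# Tally the phrases and sort from most to least frequent.
freq <- sort(table(phrases), decreasing = TRUE)
print(freq)
```

The sorted table could then be passed to a plotting function of choice to size each phrase by its frequency.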

2) What makes a good listing title?

In this section we are exploring some commonalities among the listing titles.

As we can see from the plot, “bedroom”/“room”, “private”, and “apartment” are the most common words in the NAME column, which means that most listings put the room type directly in the title. We can also see that “brooklyn” and “manhattan” are common in listing names. We can infer that Brooklyn and Manhattan have more listings than the other boroughs and are in high demand (appealing to customers).

We then generate a comparison word cloud based on room type: Private room vs. Entire home/apt.

The comparison word cloud analysis revealed that the following words were effective in generating interest for Private room listings: “private”, “cozy”, “spacious”, and “near”. For Entire home/apt listings, the words “modern”, “luxurious”, and “loft” were found to be particularly effective in attracting potential guests.

Conclusions

We began by investigating which continuous variables account for the most variance in this dataset. Our PCA revealed that the dataset is mixed overall and can hardly be separated or clustered by its categorical variables, as shown by the highly mixed scatter plots. We found that an increase in PC1 is strongly correlated with an increase in price and service fee, while an increase in PC2 is strongly correlated with an increase in minimum nights and host listing count and with a decrease in reviews per month and yearly availability.

Next, we analyzed Airbnbs across the 5 boroughs. We found that the quality (review ratings) and total cost of Airbnbs are very similar in all 5 boroughs. However, the number of reviews for Manhattan and Brooklyn is much higher than average, while their availability is lower than average, suggesting that these are the most popular locations for Airbnb stays in New York City.

For the text analysis, the most common house rule in the Airbnb dataset is no smoking. Private room listings often focus on respect for the surroundings, while entire home/apt listings focus on enjoying the stay. The most common words in listing titles describe room type and location. In particular, the effective words for private room listings include private, cozy, spacious, and near, while effective words for entire home/apt listings include modern, luxurious, and loft.

Limitations and Words to Future Researchers

An interesting question of notable socio-economic value is how the price and service fee of different Airbnbs evolve over time. Given the surge in inflation in the US over the past few years, it would be very interesting to see how the Airbnb industry adjusts to it, and whether consumers are satisfied with such price adjustments.

Moreover, we have already explored the determining factors of the reviews of a single Airbnb, as well as the common features of the house rules. A natural follow-up question is whether the influence of these factors and of the house rules is consistent over time. Exploring this question would give people a better understanding of how consumer preferences and owner habits evolve over time in this rapidly changing world.

Unfortunately, since this dataset has no time-related data other than construction year and the timestamp of the last review, there is not much we can do. Future researchers could start by collecting the aforementioned time-series data to tackle the two questions we proposed.