Data Description:

“AB_NYC_2019” contains summary information and metrics for Airbnb listings in New York City in 2019. The data set has 16 columns and 48,895 rows. It contains two kinds of ID: the listing ID (unique) and the host ID (not necessarily unique, since one host can have several listings). Besides the listing ID and host ID, there are 5 qualitative variables and 9 quantitative variables. The qualitative variables are the name of the listing, the name of the host, the name of the location, the name of the neighbourhood area and the listing space type. The quantitative variables are latitude, longitude, price (in dollars), minimum number of nights, number of reviews, date of the latest review, number of reviews per month, number of listings per host and the number of days on which the listing is available for booking.

Research Question 1: How does price vary across locations? Is there any specific location with higher prices?

We want to learn whether the price of an Airbnb listing varies by location, so we examine a geographical plot of the listings.

We are interested in how location affects the price of an Airbnb listing because we expect the central region, Manhattan, to have higher prices than other areas, controlling for other variables. We color the plot by the price quantile each listing belongs to, because there are several outliers with extremely high prices. We notice that the Manhattan area has higher prices than other areas. Beyond that, prices do not seem to vary systematically with longitude or latitude.
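
The quantile coloring described above can be sketched as follows. This is only an illustration under assumptions: the prices are made up, the choice of four bins and the labels "Q1" to "Q4" are ours, and the report does not state which tool or how many quantile groups were actually used.

```python
import pandas as pd

# Made-up prices for illustration only; the last two mimic extreme outliers.
prices = pd.Series([50, 80, 100, 150, 200, 300, 1000, 10000], name="price")

# Split the prices into four equal-sized groups (quartiles).
# Each label records which quartile a listing's price falls into,
# so an outlier simply lands in the top group instead of stretching
# a continuous color scale.
price_bin = pd.qcut(prices, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"price": prices, "price_bin": price_bin}))
```

Mapping the resulting bin label to a color, rather than the raw price, keeps the few very expensive listings from washing out the differences among the rest.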

We wanted to learn whether the average price changes across neighbourhoods, which suggests we should examine the log price in the different neighbourhoods.

We choose to examine the log price because the price data is extremely skewed. The above graph suggests that Manhattan has the highest median price, as well as the highest 25th and 75th percentiles, followed by Brooklyn and Queens. This trend does not change across room types: we observe a similar trend in all the boxplots, including the marginal distribution. Thus, we conclude that the average price changes from one neighbourhood to another. Manhattan, in general, has the highest prices, as we expected.
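
The quartiles that a boxplot of log price displays can be computed directly, as in the minimal sketch below. The column names `neighbourhood_group` and `price` and all values are assumptions for illustration. We also use `log1p`, that is log(1 + price), which is our own choice so that a price of 0 stays finite; the report does not say how zero prices were handled.

```python
import numpy as np
import pandas as pd

# Made-up listings; column names are assumed, not taken from the report.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "price": [120, 150, 200, 3000, 60, 90, 110, 0],
})

# log(1 + price) compresses the long right tail and is defined at price 0.
listings["log_price"] = np.log1p(listings["price"])

# 25th, 50th and 75th percentiles of log price per neighbourhood group,
# i.e. the three lines that make up each box in a boxplot.
summary = (
    listings.groupby("neighbourhood_group")["log_price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(summary.round(2))
```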

Research Question 2: What words appear most frequently in the title of a listing? What words appear most frequently for high-price houses and low-price houses?

We wanted to learn what kinds of words appear most often in the title of a listing, which suggests that we should draw a word cloud based on the listing names.

The above graph suggests that the most frequent words fall into three types: locations, features of the house and positive adjectives. For instance, words like “nyc”, “brooklyn” and “soho” describe the location of the house; words like “gym”, “garden” and “kitchen” describe the layout of the house; and words like “cozy”, “gorgeous” and “large” are positive adjectives. Thus, we conclude that listing names contain both words that describe the room and positive comments.
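
A word cloud is a picture of word frequencies, so the underlying counts can be sketched with the standard library alone. The titles, the stopword list and the helper name `count_words` below are all hypothetical; the report does not describe its tokenization or stopword handling.

```python
import re
from collections import Counter

# Made-up listing titles for illustration only.
names = [
    "Cozy room in Brooklyn",
    "Large cozy apartment near gym",
    "Gorgeous Brooklyn loft with kitchen",
]

# A tiny, hand-picked stopword list (an assumption, not the report's list).
STOPWORDS = {"a", "and", "in", "near", "the", "with"}


def count_words(titles):
    """Count lowercase alphabetic words across titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


word_counts = count_words(names)
print(word_counts.most_common(3))
```

The size of each word in a word cloud is then driven by these counts.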

We also wanted to learn what kinds of words appear most often for houses with higher prices and for houses with lower prices, which suggests that we should examine the name and price variables and draw a word cloud for each price group.

We find that in the listing titles of houses with higher prices, adjectives that convey luxury and superiority appear more frequently (e.g. luxurious, spectacular, classic). Moreover, neighbourhoods like “midtown/downtown Manhattan”, “greenwich” and “chelsea” appear more often in the high-price group. In contrast, adjectives that convey coziness and affordability appear more frequently in the low-price group (e.g. cozy, convenient, cheap, safe). Likewise, neighbourhoods like “queen”, “bronx” and “brooklyn” appear more often in the low-price group.
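
One way to compare the two price groups is to count words separately on each side of a cutoff. The sketch below splits at the median price, which is purely an assumption: the report does not state how "higher" and "lower" prices were defined. The titles, prices and stopword list are made up.

```python
import re
from collections import Counter
from statistics import median

# Made-up (title, price) pairs for illustration only.
listings = [
    ("Luxurious loft in Chelsea", 450),
    ("Spectacular Midtown apartment", 600),
    ("Cozy room in Queens", 55),
    ("Cozy and cheap Bronx studio", 40),
]

STOPWORDS = {"a", "and", "in", "the"}

# Assumed cutoff: the median price of the sample.
cutoff = median(price for _, price in listings)

high_words = Counter()
low_words = Counter()
for title, price in listings:
    words = [w for w in re.findall(r"[a-z]+", title.lower())
             if w not in STOPWORDS]
    # Route the title's words to the counter for its price group.
    (high_words if price > cutoff else low_words).update(words)

print("high:", high_words.most_common(3))
print("low:", low_words.most_common(3))
```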

Research Question 3: What is the relationship between price and all other variables?

We wanted to investigate how price varies with the other variables. We divided the remaining variables into a quantitative group and a qualitative group. The quantitative variables include minimum number of nights, number of reviews, number of reviews per month, number of host listings and availability over the year.

No strong relationship stands out in the graph above. Most correlation coefficients are very weak and close to 0, and most of the data points are clustered together. However, the correlation between number of reviews and number of reviews per month is the highest, at 0.55: a listing's total number of reviews increases with its number of reviews per month. In addition, price seems to have an inverse relationship with minimum number of nights, number of reviews, number of reviews per month and number of host listings. Listings with a higher number of reviews have relatively low prices, and listings with extremely high prices, above $5000, have very few reviews. Across different availabilities, the distribution of prices is fairly even.
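
Pairwise correlation coefficients like those discussed above can be obtained as in this minimal sketch. The column names and values are invented for illustration, and we assume the Pearson coefficient (the pandas default); the report does not say which coefficient it used.

```python
import pandas as pd

# Made-up listings; the last row mimics a very expensive, rarely reviewed one.
listings = pd.DataFrame({
    "price": [50, 80, 120, 200, 5000],
    "minimum_nights": [1, 2, 3, 30, 1],
    "number_of_reviews": [120, 90, 60, 30, 1],
    "reviews_per_month": [4.0, 3.5, 2.0, 1.0, 0.1],
})

# Pearson correlation between every pair of numeric columns.
corr = listings.corr()

print(corr.round(2))
```

Each entry lies between -1 and 1; a negative entry in the `price` row corresponds to the inverse relationships described in the text.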

Since price ranges from 0 to 10,000 and the third quartile is at 175, the boxes in the boxplots for the qualitative variables are squeezed into a line. To make the boxplots more readable, we removed the 1,044 listings with a price higher than $500. Here are the two boxplots for the qualitative variables, room type and neighbourhood group.
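
The trimming step can be sketched as below, on made-up prices and with an assumed column name `price`. Only the $500 threshold comes from the text; on the real data the removed count would be the 1,044 reported above, not the toy count here.

```python
import pandas as pd

# Made-up prices for illustration only.
listings = pd.DataFrame({"price": [0, 60, 100, 175, 250, 500, 800, 10000]})

# Keep listings priced at $500 or less; only prices above $500 are dropped.
trimmed = listings[listings["price"] <= 500]
n_removed = len(listings) - len(trimmed)

print(f"removed {n_removed} of {len(listings)} listings")
```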

Based on the boxplots, entire homes have the highest prices, followed by private rooms, with shared rooms the lowest. For neighbourhood groups, Manhattan clearly has the highest prices. Brooklyn has the second highest, and the rest all have similar prices.