Airbnb Data Analysis - New York City

Jamie Chen, Yunseong Jung, Chang Bum Lee, Michelle Zhu

Dataset Description

Our report uses data from the “Inside Airbnb” website, which scrapes data from the property rental marketplace Airbnb. The site publishes datasets for multiple locations, but we focus on the data from New York City. This dataset contains information about Airbnb listings in New York City, New York, United States as of November 1st, 2023. It consists of 38,780 rows, each representing a single listing, and 74 columns, one for each variable recorded for a listing. For this analysis, we will mainly focus on geographic location, prices, room type, host type, and ratings. The variables of interest are listed below:

  • latitude: a quantitative variable indicating the latitude a listing is located at
  • longitude: a quantitative variable indicating the longitude a listing is located at
  • neighbourhood_cleansed: a categorical variable indicating the neighborhood a listing is located in
  • neighbourhood_group_cleansed: a categorical variable indicating the borough a listing is located in
  • room_type: a categorical variable indicating whether a listing is an entire home/apt, private room, shared room, or hotel room
  • price: a quantitative variable indicating the daily price of a listing
  • host_is_superhost: a boolean variable indicating whether the host is a superhost
  • review_scores_rating: a quantitative variable indicating the review score of a listing
  • neighborhood_overview: a text variable with the host’s description of the neighborhood
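
As a minimal sketch of how these variables might be pulled out of the listings file, the Python snippet below keeps only the columns of interest and converts the price to a number. It is not the code used for this report. It assumes the price column is named price and is stored as a currency-formatted string such as "$1,250.00"; the helper names and the one-row sample are hypothetical.

```python
import csv
import io

# Variables of interest from the list above ("price" as the column name is an assumption).
COLUMNS = [
    "latitude", "longitude", "neighbourhood_cleansed",
    "neighbourhood_group_cleansed", "room_type", "price",
    "host_is_superhost", "review_scores_rating", "neighborhood_overview",
]


def parse_price(raw):
    """Convert a currency string like '$1,250.00' to a float; return None if empty."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def select_columns(rows):
    """Keep only the variables of interest and parse the price field."""
    selected = []
    for row in rows:
        # Columns missing from the input are filled with an empty string.
        record = {name: row.get(name, "") for name in COLUMNS}
        record["price"] = parse_price(record["price"])
        selected.append(record)
    return selected


# Hypothetical one-row sample standing in for the real listings file.
sample_csv = io.StringIO(
    "id,room_type,price,neighbourhood_group_cleansed\n"
    '1,Private room,"$1,250.00",Manhattan\n'
)
listings = select_columns(csv.DictReader(sample_csv))
print(listings[0]["price"])
```

With the real file, the same function could be applied to a csv.DictReader opened on the downloaded listings CSV.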

Research Questions

Below, we have written the questions we will explore throughout this report.

  1. How does the number of Airbnb listings compare across geographic locations?
  2. How do the price distributions and median prices of Airbnb listings compare across different room types and boroughs in New York City? Is there an Airbnb price difference between different boroughs of New York?
  3. Is there a difference in review score ratings between superhost and non-superhost listings across various price points?
  4. What are the most frequently mentioned features, characteristics, or attributes in the descriptions of Airbnb listings for each major neighborhood in New York City?

With the answers to these questions, we hope to get a better understanding of the different factors behind each Airbnb listing in New York City. The first two questions focus on location, room types, and prices, which are features customers focus on when booking an Airbnb. The last two questions are more related to the hosts and how they advertise their listings.

Exploratory Data Analysis

Before we dive into our questions, we will do a quick exploratory data analysis. First, let’s look at a location density plot for all the listings in New York City.

This plot shows the density of listings throughout New York City, that is, where listings are most concentrated. The highest densities are found mostly in Manhattan, followed by Brooklyn. There are a fair number of listings in northern Brooklyn, with the density decreasing as we move south. We see some low densities in Queens, again in the part closer to Manhattan. There is no shading in the Bronx or Staten Island, which means that the proportion of listings in those boroughs was very low compared to the other locations.

Next, we will take a look at the number of listings of each room type in each borough. We omitted the hotel room listings as there are too few of them to explore.

Looking at the distribution, entire home/apartment and private room listings are the most popular, and the majority of them are in Manhattan and Brooklyn, followed by Queens.
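
The counts behind a plot like this can be sketched in a few lines of Python. This is an illustration under assumptions rather than our actual code: the function name and the sample records are hypothetical, and hotel rooms are dropped to mirror the omission described above.

```python
from collections import Counter


def count_by_borough_and_room_type(listings, exclude=("Hotel room",)):
    """Count listings per (borough, room type), skipping excluded room types."""
    counts = Counter()
    for listing in listings:
        if listing["room_type"] in exclude:
            continue
        key = (listing["neighbourhood_group_cleansed"], listing["room_type"])
        counts[key] += 1
    return counts


# Hypothetical records; the real data has one record per listing.
sample = [
    {"neighbourhood_group_cleansed": "Manhattan", "room_type": "Entire home/apt"},
    {"neighbourhood_group_cleansed": "Manhattan", "room_type": "Entire home/apt"},
    {"neighbourhood_group_cleansed": "Manhattan", "room_type": "Hotel room"},
    {"neighbourhood_group_cleansed": "Brooklyn", "room_type": "Private room"},
]
counts = count_by_borough_and_room_type(sample)
for (borough, room_type), n in sorted(counts.items()):
    print(borough, room_type, n)
```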

Now we can move onto our questions.

Question 1

First, we are interested in how the number of Airbnb listings compares across geographic locations. We can do this by looking at a choropleth graph and its corresponding dendrogram.

We will start off by taking a look at how the distribution differs between boroughs with a choropleth graph broken up by neighborhoods in New York City.

In the choropleth graph above, we see a relatively high number of listings in the neighborhoods of Crown Heights, Bedford-Stuyvesant, Bushwick, Williamsburg, Harlem, Upper West Side, Upper East Side, Hell’s Kitchen, Midtown, and East Village, all of which are in Brooklyn or Manhattan. The neighborhoods surrounding Central Park have a high number of listings, which makes sense since this is where most of the tourist attractions are. The Brooklyn neighborhoods with a high number of listings are those close to Manhattan. All the neighborhoods in Staten Island (bottom left) and the Bronx (everything above and to the right of Inwood, to the right of Washington Heights, and above East Harlem) have fewer than 100 listings, and some of them have no listings (NA). Queens has some neighborhoods shaded light blue, meaning there are at least 500 listings in those neighborhoods.

Next, we will use a statistical analysis to check whether the patterns we saw above are “statistically significant” in some sense. To do this, we will create a dendrogram with the neighborhoods as leaves, colored by borough: the Bronx in dark green, Staten Island in purple, Queens in orange, Brooklyn in red, and Manhattan in blue.

From the above dendrogram, it appears that the geographic clusters of boroughs (Bronx, Brooklyn, Manhattan, Queens, and Staten Island) are only somewhat associated with the clusters automatically detected by the dendrogram. There are some clusters on the left and in the center that consist mostly of neighborhoods from Manhattan and Brooklyn. Neighborhoods from Queens are often clustered with neighborhoods from Manhattan and Brooklyn. Few Bronx neighborhoods appear, and those that do are scattered, except for a cluster of four Bronx neighborhoods on the left. Staten Island is not included in this dendrogram at all. Since the neighborhoods with the most listings are in Manhattan and Brooklyn, it makes sense that those neighborhoods are clustered together. Since Queens has the next highest number of listings, it also makes sense for its neighborhoods to be clustered with Manhattan and Brooklyn neighborhoods that probably have about the same number of listings. Since the Bronx and Staten Island had the fewest listings in their neighborhoods, they had few or no neighborhoods in the dendrogram. In summary, there seems to be some association between the geographic clusters of boroughs and the clusters automatically detected with the dendrogram, but there is not a clear one-to-one relationship.

Question 2

Next, we are interested in how the price distributions and median prices of Airbnb listings compare across different room types and boroughs in New York City, as well as whether there is a significant price difference between boroughs. We can do this by looking at graphs of price distributions by room type and borough and by running pairwise Welch two-sample t-tests.

First, we used a log scale for the price variable to manage a wide range of values and to better visualize the price distribution, especially when dealing with data that can span several orders of magnitude. The box plots are segmented by room type—Entire home/apt, Hotel room, Private room, and Shared room—and are further divided by borough.
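
To illustrate why a log scale helps here, the short Python sketch below applies a base-10 logarithm to a few hypothetical prices. The base and the sample values are assumptions for illustration only; the point is that prices an order of magnitude apart end up equally spaced.

```python
import math


def log10_prices(prices):
    """Return log10 of each strictly positive price; non-positive prices are skipped."""
    return [math.log10(p) for p in prices if p > 0]


# Hypothetical prices spanning several orders of magnitude.
prices = [10, 100, 1000, 10000]
logged = log10_prices(prices)

# On the raw scale the gaps grow tenfold each step; on the log scale they are equal.
raw_gaps = [b - a for a, b in zip(prices, prices[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]
print(raw_gaps)
print(log_gaps)
```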

Interpreting the plot, we can see that entire homes/apartments and hotel rooms tend to have a higher price point, particularly in Manhattan, aligning with the borough’s high-demand real estate market. The presence of outliers, especially in Manhattan, suggests a significant variation in pricing, likely due to luxury listings. In contrast, shared and private rooms across all boroughs present a more compact distribution of lower prices, indicating a more affordable range of options for travelers.

This plot gives an overview of Airbnb’s pricing structure within New York City’s diverse housing market. It shows how price ranges differ between accommodation types and boroughs, which can help hosts price their properties competitively and help guests budget their stays. Additionally, the logarithmic transformation allows prices that differ by orders of magnitude to be compared on the same axis, making the plot easier to interpret.

Next, we can take a look at a heatmap representing the same data but this time shaded by median prices. This is important for understanding the landscape of rental costs in a diverse urban area, which can be influenced by factors such as location desirability, availability of space, and the type of accommodation provided.

The heatmap visually represents the median prices for each room type (entire homes/apartments, hotel rooms, private rooms, and shared rooms) across the five boroughs of NYC: Bronx, Brooklyn, Manhattan, Queens, and Staten Island. The color intensity correlates with the price level, with darker shades indicating higher prices. For example, the dark green tiles in the Manhattan row show that this borough has the highest median prices for entire homes/apartments and hotel rooms, a reflection of Manhattan’s status as the most expensive borough in NYC. In contrast, Staten Island and the Bronx show lighter colors, indicating lower median prices, which aligns with the general understanding that these boroughs are more affordable.
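
Each tile of such a heatmap is a median over one borough and room type combination. A minimal Python sketch of that grouping is shown below; the function name and the sample prices are hypothetical, and the snippet is not the code that produced our figure.

```python
from collections import defaultdict
from statistics import median


def median_price_by_group(listings):
    """Median price for each (borough, room type) pair."""
    groups = defaultdict(list)
    for listing in listings:
        key = (listing["neighbourhood_group_cleansed"], listing["room_type"])
        groups[key].append(listing["price"])
    return {key: median(values) for key, values in groups.items()}


# Hypothetical prices, chosen only to show the grouping.
sample = [
    {"neighbourhood_group_cleansed": "Manhattan", "room_type": "Entire home/apt", "price": 200},
    {"neighbourhood_group_cleansed": "Manhattan", "room_type": "Entire home/apt", "price": 400},
    {"neighbourhood_group_cleansed": "Manhattan", "room_type": "Entire home/apt", "price": 300},
    {"neighbourhood_group_cleansed": "Bronx", "room_type": "Private room", "price": 60},
    {"neighbourhood_group_cleansed": "Bronx", "room_type": "Private room", "price": 80},
]
medians = median_price_by_group(sample)
for key, value in sorted(medians.items()):
    print(key, value)
```

The median is less sensitive than the mean to the high-priced outliers noted in the box plots, which is one reason to shade the tiles by median.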

Next, we will focus on entire home/apt and private rooms when looking at prices of Airbnb listings in different boroughs since those are the most popular room types.

Looking at the density plot of Airbnb prices in the different boroughs of New York, we see that, other than for Manhattan, the densities generally look similar. The plot also suggests that the price of an entire home or apartment is much higher than that of a private room on average. We can run t-tests to check whether there is a difference in average prices between the boroughs of New York City.

## 
##  Welch Two Sample t-test
## 
## data:  price_Bronx and price_Queens
## t = -1.1942, df = 2132.4, p-value = 0.2326
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.929752  1.684383
## sample estimates:
## mean of x mean of y 
##  110.1876  112.8103
## 
##  Welch Two Sample t-test
## 
## data:  price_Bronx and price_Staten
## t = -1.3834, df = 606.58, p-value = 0.1671
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.382095   2.320816
## sample estimates:
## mean of x mean of y 
##  110.1876  115.7182
## 
##  Welch Two Sample t-test
## 
## data:  price_Queens and price_Staten
## t = -0.80085, df = 423.37, p-value = 0.4237
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.045153   4.229242
## sample estimates:
## mean of x mean of y 
##  112.8103  115.7182

After running 10 pairwise t-tests (and applying a Bonferroni correction), we do not have enough evidence of a difference in average Airbnb prices among the Bronx, Staten Island, and Queens. For the other pairwise tests, however, we found statistically significant differences in average prices. The tests suggest that average Airbnb prices in Brooklyn and Manhattan are higher than in the other boroughs. This isn’t too surprising since Manhattan and Brooklyn are both very densely populated.
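
The arithmetic behind this procedure can be sketched in Python as follows. The snippet computes Welch’s t statistic and the Welch-Satterthwaite degrees of freedom, enumerates the borough pairs, and derives the Bonferroni threshold. The overall significance level of 0.05 is an assumption, the two small samples are hypothetical, and the three p-values are copied from the test output printed above. Turning t and the degrees of freedom into a p-value needs a t-distribution, which the Python standard library does not provide, so that step is left out.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of the mean of x
    vy = variance(y) / len(y)  # squared standard error of the mean of y
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


boroughs = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
pairs = list(combinations(boroughs, 2))  # every unordered pair of boroughs
alpha = 0.05  # assumed overall significance level
bonferroni_alpha = alpha / len(pairs)  # per-test threshold after correction

# p-values taken from the three tests printed in this report.
reported_p_values = {
    ("Bronx", "Queens"): 0.2326,
    ("Bronx", "Staten Island"): 0.1671,
    ("Queens", "Staten Island"): 0.4237,
}
for pair, p in reported_p_values.items():
    print(pair, "significant" if p < bonferroni_alpha else "not significant")

# Hypothetical samples, only to show the function being called.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), round(df, 4))
```

Five boroughs give ten unordered pairs, which matches the ten tests mentioned above.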

Question 3

Next, we are interested in whether or not there is a difference in review score ratings between superhost and non-superhost listings across various price points. We will use a scatterplot to help us understand whether the superhost status correlates with higher review scores and how this might be influenced by the price.

Looking at the graph, we see that listings from superhosts (blue) seem to be generally clustered at the higher end of the review score ratings, suggesting that superhosts typically have better reviews. This visualization can help Airbnb hosts understand the importance of the superhost status and its possible effect on guests’ experiences and reviews. Additionally, it can guide guests in making informed decisions when selecting a listing.
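
One way to put numbers on this visual impression is to average the review scores by superhost status within price bins. The Python sketch below does this under assumptions: the bin edges, the function name, and the sample records are all hypothetical, and we did not run this calculation for the report.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_price_bin(listings, bin_edges):
    """Average review score for each (price bin, superhost status) pair."""
    groups = defaultdict(list)
    for listing in listings:
        price = listing["price"]
        rating = listing["review_scores_rating"]
        if price is None or rating is None:
            continue  # skip listings with a missing price or rating
        # Number of edges at or below the price, so bin 0 lies below the first edge.
        bin_index = sum(price >= edge for edge in bin_edges)
        groups[(bin_index, listing["host_is_superhost"])].append(rating)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical listings, used only to exercise the function.
sample = [
    {"price": 80.0, "review_scores_rating": 4.5, "host_is_superhost": False},
    {"price": 90.0, "review_scores_rating": 4.9, "host_is_superhost": True},
    {"price": 150.0, "review_scores_rating": 4.6, "host_is_superhost": False},
    {"price": 160.0, "review_scores_rating": 5.0, "host_is_superhost": True},
    {"price": 170.0, "review_scores_rating": 4.8, "host_is_superhost": True},
    {"price": None, "review_scores_rating": 4.0, "host_is_superhost": False},
]
averages = mean_rating_by_price_bin(sample, bin_edges=[100, 200])
for key in sorted(averages):
    print(key, round(averages[key], 2))
```

If superhosts average higher in every price bin of the real data, that would support the reading of the scatterplot given above.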

Question 4

Lastly, we will take a look at the most frequently mentioned features, characteristics, or attributes in the descriptions of Airbnb listings written by the host for each borough in New York City. To do this, we used a word cloud to offer a visual summary of the most emphasized features in Airbnb listings, reflecting what hosts believe are the attractive points of their neighborhoods.
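
A word cloud is built from word frequencies, so the underlying counts can be sketched in Python as below. This is an illustration, not our pipeline: the stopword list is a small hypothetical one, the tokenization rule (lowercase runs of letters and apostrophes) is an assumption, and the sample descriptions are made up.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "and", "to", "of", "is", "in", "from", "with"}


def top_words_by_borough(listings, n=3):
    """Return the n most frequent non-stopwords in the overviews of each borough."""
    counts = {}
    for listing in listings:
        text = listing.get("neighborhood_overview") or ""
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
        borough = listing["neighbourhood_group_cleansed"]
        counts.setdefault(borough, Counter()).update(words)
    return {borough: counter.most_common(n) for borough, counter in counts.items()}


# Made-up descriptions, used only to exercise the function.
sample = [
    {"neighbourhood_group_cleansed": "Bronx",
     "neighborhood_overview": "Close to the zoo and the botanical garden"},
    {"neighbourhood_group_cleansed": "Bronx",
     "neighborhood_overview": "Safe area near the zoo"},
    {"neighbourhood_group_cleansed": "Queens",
     "neighborhood_overview": "Quiet street, 10 mins from JFK"},
    {"neighbourhood_group_cleansed": "Queens",
     "neighborhood_overview": None},
]
top_words = top_words_by_borough(sample, n=2)
print(top_words["Bronx"])
```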

We will take a look at the most notable words for each borough below:

Manhattan: Words such as “central”, “river”, “city”, “west”, and “east” could be pointing to specific areas within Manhattan, like Central Park, East River, and the general centrality of the location within NYC. This suggests a focus on the central location and perhaps the vibrant city life.

Brooklyn: Words like “Prospect,” “Williamsburg,” “coffee,” “historic,” and “vibrant” can be associated with well-known areas in Brooklyn, and they emphasize a trendy, historical, and vibrant community atmosphere.

Queens: Words such as “quiet,” “JFK,” “Astoria,” and “mins” might refer to the quieter residential areas, proximity to JFK airport, and neighborhoods like Astoria, known for its cultural diversity.

Bronx: Words like “zoo,” “botanical,” “safe,” and “great” could be referencing attractions such as the Bronx Zoo and the New York Botanical Garden, and an overall positive sentiment about the area.

Our word cloud suggests that accessibility, local attractions, safety, and neighborhood ambiance are key selling points for these listings. This analysis could be useful for potential renters to get an impression of what to expect from different neighborhoods and for hosts to understand which aspects to emphasize in their listings. It also relates to our second question, in that it suggests why Manhattan has higher demand: it is a cultural and financial center with many major attractions. For example, “Central Park” and “Times Square” both appear in the word cloud.

Conclusion

We will summarize our main findings below:

Our first question asked how the number of Airbnb listings compares across geographic locations. From our choropleth map, we see that there are more listings in neighborhoods in Manhattan and Brooklyn than in the other three boroughs. This difference was “significant” in the sense that our dendrogram clustered Manhattan and Brooklyn neighborhoods together while the other boroughs’ neighborhoods were more randomly scattered. This suggests that there could be a relationship between the number of listings and geographic clusters (boroughs).

Our second question looked at prices across boroughs and room types. Our side-by-side boxplot and heatmap visualizations underscore that room type and borough are significant factors in the pricing of Airbnb accommodations in the city. Entire homes and apartments, particularly in Manhattan, show the highest prices and the most outliers, indicative of a luxury market. In contrast, private and shared rooms maintain lower, more uniform prices across all boroughs, with the Bronx and Staten Island being the most affordable options. This naturally led us to explore whether Airbnb prices differ between boroughs, and why. The price differences among Queens, Staten Island, and the Bronx are not statistically significant, while Manhattan and Brooklyn differ visibly from the other boroughs.

Our third question asked whether there is a difference in review score ratings between superhost and non-superhost listings across various price points. The scatterplot illustrates that there is a general clustering of higher review score ratings towards the lower price range for both superhosts and non-superhosts. However, superhost listings are represented more densely at the top range of review scores across a variety of price points, suggesting that superhosts consistently receive higher review scores compared to non-superhosts. Therefore, the conclusion is that superhost status is associated with higher review scores, regardless of the listing price, supporting the idea that superhosts provide a level of service that is recognized in customer reviews.

Our last question looked at the most frequently mentioned features, characteristics, or attributes in the descriptions of Airbnb listings for each major neighborhood in New York City. The word cloud analysis reveals that hosts frequently emphasize location convenience, proximity to public transport and parks, neighborhood ambiance, and local amenities like dining and shopping. The words for each area reflect its distinct character and perceived advantages. This information is valuable for guests in selecting a neighborhood that aligns with their preferences and for hosts looking to market their listings effectively.

Limitations and Further Research

Since “Inside Airbnb” is a website that is not associated with Airbnb, there are some limitations to using its datasets. The data from this website is a snapshot of the listings available at a particular time. In our case, it was what was available on November 1st, 2023. This means our analysis may not apply to people looking at listings today, as new listings may have been added and some of the listings in our data may have been deleted. In the future, we can do a similar analysis for other cities in the world. Due to time constraints, we focused only on New York City since it is one of the most touristy cities in the United States, but we could also look at other popular cities, or even more suburban cities, for comparison. We could potentially compare listing characteristics between different countries or continents. For example, how do the Airbnb listings in Paris compare to the listings in New York?