Data

Living in San Francisco and the surrounding “Bay Area” is notoriously expensive due to its growing technology industry. Along with this boom in innovation, we see a massive influx of educated tech workers to the area, resulting in some degree of gentrification via higher average rent. We will explore factors that could affect the listing price of rent in this area and examine whether there is any indication of gentrification in lower-income neighborhoods over time.

Our dataset features 200,796 observations of listings between the years of 2000 and 2018 in the San Francisco Bay Area. The dataset we use is from the Tidy Tuesday package and contains the following 18 variables:

post_id - Unique ID
date - Date posted in YYYYMMDD format.
- new_date - We constructed this variable to have listing date in YYYY-MM-DD format.
year - Year posted
nhood - The neighborhood the listing is in
city - The city the listing is in
county - The county the listing is in
price - The listing price
beds - The number of bedrooms the listing features
baths - The number of bathrooms the listing features
sqft - The number of square feet in the listing
room_in_apt - Room in apartment
address - The address street number, if listed
lat - Latitude of the address, if the address is listed
lon - Longitude of the address, if the address is listed
title - Title of listing
description - Description of the listing
details - Additional details, if any
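The new_date variable described above can be derived from date with a conversion along these lines. This is a minimal sketch in Python, not the code used for the analysis; the function name to_iso_date is hypothetical.

```python
from datetime import datetime


def to_iso_date(raw_date):
    """Convert a YYYYMMDD value (int or str) to a YYYY-MM-DD string."""
    return datetime.strptime(str(raw_date), "%Y%m%d").date().isoformat()


print(to_iso_date(20180115))  # 2018-01-15
```

Parsing through datetime, rather than slicing the string, also rejects impossible dates such as 20181345.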

There are a number of missing values in the dataset, in part because some of the variables appear to have been added over time. Filtering to only include complete cases removes over 99% of the data, so for each visualization we instead drop only the rows that are missing a value for the variables used in that visualization. This keeps as much data as possible and limits the bias that complete-case filtering would introduce.
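The difference between complete-case filtering and per-visualization filtering can be illustrated with a small sketch. The rows below are made-up toy data, and drop_missing is a hypothetical helper, not part of our analysis code.

```python
def drop_missing(rows, needed):
    """Keep only rows that have a non-missing value for every column in `needed`."""
    return [r for r in rows if all(r.get(col) is not None for col in needed)]


# Toy listings (invented values); None stands in for NA.
listings = [
    {"price": 1500, "beds": 1, "sqft": None, "lat": None},
    {"price": 2400, "beds": 2, "sqft": 900, "lat": None},
    {"price": 3100, "beds": None, "sqft": 1200, "lat": 37.77},
]

# Filtering on only the variables a plot needs keeps more rows...
price_beds = drop_missing(listings, ["price", "beds"])
# ...than requiring every variable to be present.
complete = drop_missing(listings, ["price", "beds", "sqft", "lat"])

print(len(price_beds), len(complete))  # 2 0
```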

Question 1: What are the demographic and geographic features of these areas?

We examine the geographic and demographic characteristics of specific areas in order to determine if they could be subject to gentrification over this time period. As we see from the plot below, there are 10 distinct counties in this dataset, but looking deeper, we see clusters around certain areas. Large numbers of listings are present in San Francisco County (San Francisco) and Santa Clara County (San Jose, Palo Alto, Cupertino), reflecting the large cities located there. We also generally see that counties such as Contra Costa, Santa Cruz, and Solano are less dense in terms of listings than other counties.

In order to better understand this area in terms of affordability, we take the average price of listings in each county and compare it over time. We see that larger counties such as San Francisco generally have higher rent prices, while smaller counties such as Contra Costa and Solano have lower rent prices on average. A few counties see localized upward trends in the cost of rent, such as Napa County in about 2008, but overall, rent prices appear to stagnate or decline in most counties.
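The county-by-year averaging described above amounts to a grouped mean. A minimal sketch, assuming each listing is a dictionary with county, year, and price keys; the rows are invented toy values and mean_price_by_county_year is a hypothetical name.

```python
from collections import defaultdict


def mean_price_by_county_year(rows):
    """Return {(county, year): mean price}, skipping rows with a missing field."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of prices, count]
    for r in rows:
        if r["county"] is None or r["year"] is None or r["price"] is None:
            continue
        key = (r["county"], r["year"])
        totals[key][0] += r["price"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Toy listings (invented values).
rows = [
    {"county": "san francisco", "year": 2012, "price": 3000},
    {"county": "san francisco", "year": 2012, "price": 3400},
    {"county": "solano", "year": 2012, "price": 1200},
    {"county": "solano", "year": 2013, "price": None},
]

averages = mean_price_by_county_year(rows)
print(averages[("san francisco", 2012)])  # 3200.0
```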

One thing we must consider is the size of the listing: $3000/month for a 1-bedroom apartment is not comparable to $3000/month for a 2-bedroom house. From our dataset, we know that one-bedroom listings are common in larger cities such as San Francisco and San Jose. San Francisco is a clear example of very high rents for relatively small listings.

Question 2: How has availability of rentals changed over time?

We next consider how availability changes over time in order to examine whether certain factors, like county or number of bedrooms, have an impact on availability. We will also consider the impact of the 2008 Great Recession on availability.

Rental Listings over Years

The graph shows that the number of listings has changed over the years, with a few years of relatively high listing counts followed by one or two years of lower counts. This may indicate that rentals follow a cycle dependent on rental lease terms. For example, rental leases may be on average about 1-3 years long, after which the rental may be re-listed as the renters move out. This would be consistent with the trend seen: while rentals are leased, they cannot be on the market, which would result in fewer listings during the leased years that follow.

The top four counties by listings have many more rentals listed overall than the lower six counties. This is likely due to demand stemming from commuting concerns, job availability, and types of neighborhoods. The top four counties include San Francisco, Santa Clara, and two of the closest adjacent counties to both. Both San Francisco and Santa Clara counties are huge hubs for tech companies that provide thousands of jobs. Therefore, there is high demand for rentals in these areas, as people desire housing within commuting distance of their jobs.

Another observation is that San Francisco County has had the most listings of all the counties. Since the city and county of San Francisco cover essentially the same area, the county is entirely urban and has more housing units per unit of area. Combined with the large number of jobs offered in the city, it makes sense that there are more listings: there are more housing units and more demand.

Rental Listings by Bedrooms over Years

The density curves for listed rentals over the years are similar regardless of the number of bedrooms, so whether a rental has 0, 1, 2, or 3 bedrooms does not seem to affect its relative availability. This implies that rental availability does not fluctuate very much across rentals with the most common numbers of bedrooms.

Impact of the Great Recession on Rentals

During the height of the Great Recession, late 2007 to 2009, there appears to be an extremely low number of rentals listed. This makes sense, as the Great Recession was triggered by a housing crisis that resulted in millions of people losing their homes to foreclosure. As many people struggled to keep their homes, far fewer owned property that they could offer for rent. We can see that prior to and after the height of the Great Recession, the number of rentals listed was higher. This reflects the healthier economic and housing situations of those years.

Question 3: What determines rent prices?

With our understanding of various trends in Bay Area rental properties, we now want to apply this information to the prices of rentals. We are concerned not just with predicting price, but with understanding what causes differences in prices between properties.

While we can expect certain variables to have fairly straightforward relationships with price, like beds and square footage, we want to investigate what else underlies how much people are willing to pay and how much landlords think they can get.

Exploring Variable Relationships

First, we will look into the relationships between prices and some variables we expect to be predictive.

Since the Bay Area is not homogeneous, as seen in our first research question, we expect rents to also differ substantially between areas. We can examine this with rentals mapped by location and colored by rental price.

We see that rent prices vary substantially across the Bay Area and do not follow any strictly increasing or decreasing patterns. Rent does, however, tend to be higher within cities, compared to the suburbs and further out.

In addition to area, we also expect size of the rental to have a large influence on the price. However, “size” is more than just square footage; it is also the potential occupancy of a unit. This means that size is currently measured by three variables in our data: beds, bathrooms, and square footage. For visualization purposes, we want to condense these into a single variable, for which we will use Principal Component Analysis.

Our first principal component explains 78.2% of the variance in size, which means it accounts for most of the size differences between rentals. Additionally, all three variables have a negative correlation with the first PC, meaning it decreases with size, so we flip its sign and use the negated PC instead, so that larger values correspond to larger rentals.
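The idea of collapsing beds, baths, and square footage into one size score can be sketched as follows. This is an illustration only: it assumes the three variables are standardized before the PCA (which may differ from our exact settings), uses power iteration in place of a library PCA routine, and runs on invented toy values. The function names are hypothetical.

```python
import math


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def first_principal_component(columns, iterations=500):
    """Return (loadings, explained_share, scores) for the first PC."""
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    # Correlation matrix of the standardized columns.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    # The sign of a PC is arbitrary; orient it so the score grows with size.
    if sum(v) < 0:
        v = [-x for x in v]
    scores = [sum(v[j] * z[j][k] for j in range(p)) for k in range(n)]
    # The trace of a correlation matrix is p, so the share is eigenvalue / p.
    return v, eigenvalue / p, scores


# Toy data (invented): beds, baths, square feet for eight listings.
beds = [0, 1, 1, 2, 2, 3, 3, 4]
baths = [1, 1, 1, 1, 2, 2, 2, 3]
sqft = [400, 600, 650, 900, 1000, 1400, 1500, 2200]

loadings, share, size_score = first_principal_component([beds, baths, sqft])
print([round(x, 3) for x in loadings], round(share, 3))
```

The sign flip inside the function plays the same role as negating the PC in our analysis: the direction of a principal component is arbitrary, so it is oriented to increase with size.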

Graphing this against rental prices, we see that rent price tends to increase with rental size, as expected. This is true for all counties. However, the degree to which rent increases with size appears to differ between counties. For example, the premium paid for a larger rental in Sonoma is much less than that paid in San Francisco.

Regression

The variables investigated so far appear to be quite predictive of rent prices. We also saw in our second research question that trends in rentals have changed over time, so we will include the listing date for each rental and run the following regression model:

$\hat{Price} = \beta_1 + \beta_2Beds + \beta_3Baths + \beta_4SqFt + \beta_5County + \beta_6ListingDate$

However, we saw an indication in our graph of size against price that the relationship between size and price differs by county, so we will include interactions between county and each of the three size variables.

$\hat{Price} = \beta_1 + \beta_2Beds + \beta_3Baths + \beta_4SqFt + \beta_5County + \beta_6ListingDate + \beta_7County*Beds + \beta_8County*Baths + \beta_9County*SqFt$

We can formally compare these models via an F-test, from which we conclude that the additional interaction terms in the second model substantially improve the model’s fit to the data. That is, the difference between counties in the relationship between size and price is statistically significant.

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
14412	10265954090	NA	NA	NA	NA
14385	9284286563	27	981667528	56.33288	0
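The F statistic in the table can be reproduced by hand from the residual sums of squares and residual degrees of freedom of the two models. The sketch below uses only the numbers reported above; the function name nested_f_statistic is hypothetical. The 27 extra parameters are consistent with three size variables each interacting with nine non-reference counties, assuming standard dummy coding of the ten counties.

```python
def nested_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for comparing a reduced model against a nested full model."""
    extra_terms = df_reduced - df_full  # parameters added by the full model
    numerator = (rss_reduced - rss_full) / extra_terms
    denominator = rss_full / df_full
    return numerator / denominator


# Values taken from the ANOVA table above.
f_stat = nested_f_statistic(
    rss_reduced=10265954090, df_reduced=14412,
    rss_full=9284286563, df_full=14385,
)
print(round(f_stat, 3))  # 56.333
```

With 27 and 14385 degrees of freedom, an F statistic of this size corresponds to a p-value too small to display, which is why the table shows 0.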

Text Analysis

In creating a regression model to predict rent, we have thus far focused on using the more accessible variables, leaving the free-text fields untouched. The title and description fields corresponding to each listing may have substantial information which is not revealed in the other variables, and which may tell us a lot about how rental prices are determined.

For ease of visualization, we created a categorical variable classifying rent prices as “low,” “medium,” or “high” (three equally sized groups), with cutoffs between groups of $1450 and $2250.
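The bracket assignment can be sketched as a simple threshold function using the two cutoffs above. How a price exactly equal to a cutoff is handled is an assumption here (it goes to the higher bracket), and rent_bracket is a hypothetical name.

```python
def rent_bracket(price, low_cutoff=1450, high_cutoff=2250):
    """Label a rent price as "low", "medium", or "high".

    Assumption: a price exactly at a cutoff falls in the higher bracket.
    """
    if price < low_cutoff:
        return "low"
    if price < high_cutoff:
        return "medium"
    return "high"


print([rent_bracket(p) for p in (900, 1450, 2249, 3000)])
# ['low', 'medium', 'medium', 'high']
```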

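To analyze the differences in words used for different rent brackets, we perform text analysis with term frequency-inverse document frequency (TF-IDF) weighting, which scores a word highly when it is frequent in one bracket but rare across the others.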
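A minimal sketch of TF-IDF, treating each rent bracket as one "document". It uses one common definition (term frequency as a share of the bracket's words, and the natural log of the number of brackets divided by the number of brackets containing the word); our analysis may differ in tokenization and other details. The word lists are invented toy data and tf_idf is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: {name: list of tokens}. Returns {name: {term: tf-idf score}}."""
    counts = {name: Counter(tokens) for name, tokens in documents.items()}
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[name] = {
            term: (k / total) * math.log(n_docs / doc_freq[term])
            for term, k in term_counts.items()
        }
    return scores


# Toy word lists per rent bracket (invented).
brackets = {
    "low": "room roommates shared parking".split(),
    "medium": "apartment parking laundry".split(),
    "high": "remodeled apartment parking charging".split(),
}

scores = tf_idf(brackets)
top_high = max(scores["high"], key=scores["high"].get)
print(top_high, scores["high"]["parking"])
```

A word that appears in every bracket, such as "parking" in this toy example, gets a score of zero, so only words that distinguish one bracket from the others rise to the top.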

Looking at the top words for each rent bracket, we see little indication of real explanatory value in the chosen words. Some words make sense - for example, “roommates” indicating low rent and “remodeled” indicating high rent. Some may just require more context, like “charging” in high rent, which could possibly refer to something like EV charging. However, we also have some nonsensical results: perks like “fitness center” in low rent, results that aren’t real words, and random names like Gavin and Hiawatha.

It seems that, in aggregate, the title and description of each listing are not particularly indicative of how rental prices are determined. While a person reading through each listing without knowledge of the price could potentially guess whether rent was high or not, the “bag of words” approach that we are using unfortunately does not reveal much about what features affect rent prices.

Conclusion

We set out to examine potential gentrification in the San Francisco Bay Area in the 2000s due to the tech boom in that region. First, we found that smaller apartments were more common in larger cities, such as San Francisco, which makes sense intuitively. In addition, rent tends to be higher in these places when compared to smaller areas. Next, we examined this issue of gentrification through an economic lens by looking at availability in specific regions over time. We found that the top four counties had more listings than the bottom six, likely due to demand stemming from more job opportunities or larger communities in the top counties. We also noticed the effects of the 2008 financial crisis on the availability of rentals, with the number of listings steadily increasing after 2010 or so. We then explored the relationship between many factors and price. We found that size (square footage, number of bedrooms, number of bathrooms) was a major contributing factor, and that its relationship with price differed between locations at a statistically significant level. We also examined text fields such as the title and description of the listings, although unfortunately these did not seem to tell us much about price directly.

We do have some limitations, including that our dataset features a lot of NA values, and when we remove those, our number of observations decreases drastically. While we would have liked to have a more robust dataset, we felt that we still had enough information to analyze how these various factors interact with price. We also understand that gentrification is a large and complex issue, and we are unable to confirm or deny with certainty that it was happening within the scope of this project. We examined only a few research questions that could suggest gentrification, and much more research is needed to establish whether this phenomenon is occurring in the Bay Area.

Our paper suggests that gentrification could be occurring, given the differences between counties and cities in how certain factors relate to price. While we expect cities to have high prices, cities have historically been a melting pot of different types of people, including people at different income levels, and these high prices could be forcing people of lower socioeconomic status out of the city. As the technology industry in San Francisco has boomed over the past twenty years, so has the city. Economic growth is good; however, the tech industry is notoriously high paying, and thus certain Bay Area counties may have become an impenetrable bubble for those below a certain income. The 2008 Financial Crisis coincided with a large decrease in listing availability, which we show in Research Question 2. If demand for rentals held steady, this decrease in supply would likely have put upward pressure on rents, and those who could not afford the increased prices may have been forced out of the city.

The implications of this project are very large - gentrification is a massive issue impacting many people throughout the US and the world. Communities are being pushed out of their homes and any research which could prove that this is occurring could help society to stop this from happening. This topic can always be further explored with more data from different cities, or different variables about specific communities, such as racial or religious communities and how they are specifically being impacted.

Bay Area Rental Market Analysis

Abigail Khieu, Amelia Boose, Logan Saito, Zoe Thorpe

2023-05-01