From the commercial real estate tycoon to the family looking to settle down, everyone is affected by the housing market. This market is complex, with prices and trends changing constantly over the years. In this report we aim to cut through that noise and give a clear picture of the housing market and how it has changed over time. To do this we examine data from 33,946 real estate sales collected from 2002 to 2018. Within this dataset we are interested in the following variables:

Property Type: A variable containing 4 possible property types - residential, commercial, condominium, and large apartment.

Exterior Wall Material: A variable containing possible materials for residential property exteriors: aluminium/vinyl, block, brick, fiber-cement, frame, masonry/frame, premium wood, stone, and stucco.

Style: A variable containing possible styles for properties.

Year Built: A variable containing the year each property was built. Ranges from 1835 to 2018.

Sale Date: A variable containing the year each property was sold. Ranges from 2002 to 2018.

Sale Price: A variable containing the price at which each property was sold.

Lot Size: Size of the lot the property is on, in square feet.

Finished Square Feet: Size of the finished property in square feet.

Full bath: Number of full bathrooms.

Half bath: Number of half bathrooms.

Number of Rooms: Number of rooms in the property.

Number of Stories: Height of the property in stories.

Bedrooms: Number of bedrooms.

With this data we seek to address 3 primary questions:

How have the characteristics of newly built houses changed over time?

How have market dynamics between properties marketed at single consumers (condominiums, residential) and properties marketed primarily at businesses and investors (large apartments, commercial properties) changed over time?

What combination of characteristics sells the best in the modern day?

Each of these questions helps to build upon the last, ultimately creating a better understanding of the housing market for the reader of this report. We proceed with the first graph.

We begin with a set of boxplots showing several key variables in the dataset, broken down by decade. This plot gives a good overview of the dataset itself while beginning to shed light on its trends. Its main purpose is to address two questions: “How have the characteristics of houses being built changed over time?” and “Which combination of architectural styles/characteristics sells the best for each time period?” Some variables, such as the numbers of full and half bathrooms, appear fairly constant across the decades. The variables that vary most from decade to decade are number of rooms, finished square feet, lot size, and units. These also have many outliers, which dominate their plots more than the boxes and whiskers themselves. Boxplots are effective here because they display the central tendency of each architectural feature in each decade, along with how much spread there is from one decade to the next. Most importantly, boxplots are well suited to comparing the distribution of a variable across groups. With facets or side-by-side boxplots, the central tendency and spread of each group can be compared easily, both within a single architectural feature and across features.
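To make concrete what each box summarizes, the following is a minimal sketch of the per-decade calculation, assuming the grouping is by decade built and that outliers are flagged with the conventional 1.5 × IQR rule. The records, the function names, and the choice of quantile method are illustrative assumptions, not values or code from our dataset.

```python
from collections import defaultdict
from statistics import quantiles


def decade_of(year):
    """Map a year such as 1957 to the start of its decade, 1950."""
    return (year // 10) * 10


def boxplot_stats(values):
    """Return the quartiles and the points outside the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical (year_built, number_of_rooms) records, not taken from the dataset.
records = [(1951, 5), (1953, 6), (1955, 6), (1957, 7), (1959, 20),
           (2001, 6), (2003, 7), (2005, 8), (2007, 8), (2009, 9)]

# Group the room counts by the decade in which each property was built.
by_decade = defaultdict(list)
for year, rooms in records:
    by_decade[decade_of(year)].append(rooms)

summary = {decade: boxplot_stats(vals) for decade, vals in sorted(by_decade.items())}
for decade, stats in summary.items():
    print(decade, stats)
```

In this toy data the single 20-room property in the 1950s falls outside the upper fence, which is how one extreme value can visually dominate a panel even when the box itself is narrow.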

Next we examine property types in more detail using a stacked bar plot. Among properties sold from 2002 to 2018, the plot shows the proportion of each property type by year built. This helps address two of our questions: “How have the characteristics of houses being built changed over time?” and “How have market dynamics between residential and commercial properties changed over time?” Residential property makes up the majority for most of the time period, but around the 1960s and the 2000s a large number of condominiums were built, and they account for a large fraction of the recently sold properties built in those years. Commercial properties are generally outnumbered by residential properties in overall volume but consistently hold a portion of the market, at times exceeding 25% and even 50% of the properties built.
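The heights of the stacked bars are within-year shares, which can be sketched as follows. This is an illustrative calculation on made-up records; the function name and the data are assumptions and do not reproduce our plotting code.

```python
from collections import Counter, defaultdict

# Hypothetical (year_built, property_type) records, not taken from the dataset.
sales = [(1960, "Residential"), (1960, "Condominium"), (1960, "Condominium"),
         (1960, "Commercial"), (2000, "Residential"), (2000, "Residential"),
         (2000, "Condominium"), (2000, "Lg Apartment")]


def type_shares(rows):
    """Return, for each year built, the share of sold properties of each type."""
    counts = defaultdict(Counter)
    for year_built, prop_type in rows:
        counts[year_built][prop_type] += 1
    shares = {}
    for year_built, counter in counts.items():
        total = sum(counter.values())
        # Shares within one year sum to 1, which is what a filled stacked bar shows.
        shares[year_built] = {t: n / total for t, n in counter.items()}
    return shares


shares = type_shares(sales)
for year_built in sorted(shares):
    print(year_built, shares[year_built])
```

Because each bar is normalized to 1, a year with very few sales can show a large share for one type; the shares say nothing about how many properties were built in that year.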

Looking deeper into these yearly trends, we plot rolling averages of both property size and sale price by year built, each broken down by property type. A rolling average gives a cleaner view of time series data by smoothing year-to-year noise, and it reflects the fact that neighboring years are related rather than independent. In these charts, the larger-scale properties (commercial and large apartment) balloon in size starting in the 1960s and continuing through the turn of the century. Residential and condominium properties, by contrast, remain relatively constant in size. Over the same period, sale prices rise for residential and condominium properties despite the lack of growth in size. Sale prices of large apartment and commercial properties also rise, though not in proportion to the increase seen in the size chart.
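For readers unfamiliar with the technique, a trailing rolling average can be sketched as below. The window length of 3 and the yearly values are assumptions chosen for illustration; this report does not state the window used for the charts.

```python
def rolling_mean(values, window):
    """Trailing moving average; the first window - 1 positions are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that has just left the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


# Hypothetical mean finished square feet by consecutive year built.
yearly_mean_sqft = [1000, 1300, 1000, 1600, 1300, 1900]
smoothed = rolling_mean(yearly_mean_sqft, window=3)
print(smoothed)  # [None, None, 1100.0, 1300.0, 1300.0, 1600.0]
```

The raw series zigzags from year to year, while the smoothed series rises steadily. A wider window smooths more but also delays and flattens genuine turning points.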

Next we look at the various architectural styles, breaking them down by the average number of rooms and bedrooms they hold, and hence their capacity. From the plot, we can see that the ‘Town House’ category has the highest average number of rooms, followed by ‘Old Style Duplex’ and ‘Colonial’. The other architectural styles have a lower average number of rooms, with several styles being quite similar in their average room count. The visual format makes it easy to see which styles tend to have more rooms and which have fewer, information that could be useful to investors, home buyers, urban planners, or architects interested in the characteristics of different housing styles.

This graph illustrates how the average sale price of a property varies with the decade in which it was built and with its style and type. A heat map makes it quick to see the “zones” in which the highest-valued properties are found (in green) as well as the lowest-priced bargain properties (in red). Only the 10 most frequent styles in the dataset were included, since the dataset has a very large number of styles (most of them with very low frequency), and years built were grouped into decades for easier visualization. We see that most styles had similar average prices regardless of when they were built. There appears to be a slightly higher sale price for colonial properties built at the turn of the 20th century, as well as higher prices for store properties built in the 2000s, followed by the 1950s. Lower prices can be seen in Milwaukee bungalows built at the turn of the 20th century, as well as apartments built in the 1950s. Moving to the property type charts, we see that large apartments built in the 2010s sell for the most, while other large properties hold relatively similar sale prices. Comparing residential properties and condominiums, we see condos built in the 2000s selling for the most, while condos built in the 1950s and 1970s sell for the least. Residential properties built in the 1830s and 1840s seem to sell for the least, while those from the 2000s and 2010s sell for the most.

This tool is powerful because it provides all of this information at a glance.
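The preparation behind the heat map (keeping the most frequent styles, grouping years into decades, and averaging sale price per cell) can be sketched as follows. The records, the function name, and the use of `top_n=2` are illustrative assumptions; the report's heat map keeps the top 10 styles.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (style, year_built, sale_price) records, not taken from the dataset.
sales = [("Colonial", 1905, 200000), ("Colonial", 1908, 240000),
         ("Colonial", 2004, 300000), ("Town House", 1955, 150000),
         ("Town House", 1958, 170000), ("Store", 2003, 500000)]


def heatmap_cells(rows, top_n):
    """Mean sale price per (style, decade built), restricted to the top_n styles."""
    freq = Counter(style for style, _, _ in rows)
    keep = {style for style, _ in freq.most_common(top_n)}
    cells = defaultdict(list)
    for style, year_built, price in rows:
        if style in keep:
            decade = (year_built // 10) * 10
            cells[(style, decade)].append(price)
    return {key: mean(prices) for key, prices in cells.items()}


cells = heatmap_cells(sales, top_n=2)
for key in sorted(cells):
    print(key, cells[key])
```

Each cell is a plain mean, so a cell backed by only one or two sales can take an extreme color. Showing the count behind each cell alongside the mean would guard against over-reading such cells.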

Observations: 33913
Dependent variable: log.sales
Type: OLS linear regression

F(8,33904): 1577.54
R²: 0.27
Adj. R²: 0.27

Term                     Est.    S.E.   t val.      p
(Intercept)              3.85    0.26    14.78   0.00
Year_Built               0.00    0.00    31.68   0.00
Fin_sqft                 0.00    0.00    32.76   0.00
PropTypeCondominium     -0.25    0.02   -14.62   0.00
PropTypeLg Apartment     0.19    0.02     7.54   0.00
PropTypeResidential     -0.45    0.01   -30.94   0.00
Units                   -0.10    0.01    -9.81   0.00
Lotsize                  0.00    0.00    23.27   0.00
Stories                  0.14    0.01    18.49   0.00

Standard errors: OLS

We conclude our report by fitting a regression on the variables in question to gain some predictive capability for making educated guesses about the future. We fit a linear model predicting log sale price from year built, finished square feet, property type, number of units, lot size, and number of stories. These predictors were chosen for the completeness of their data and through formal model selection. The regression line appears to track the data reasonably well, though the 95% confidence interval is quite wide and the adjusted R² of 0.27 leaves much of the variation in log sale price unexplained. There were some issues with normality, which may require further testing and exploration of transformations to fix. Using a global F test (F(8, 33904) = 1577.54, p < 0.001), we rejected the null hypothesis that the coefficients on year built, finished square feet, property type, units, lot size, and number of stories are all zero. We conclude that at least some of these variables are associated with sale price, and our model offers one way of quantifying that relationship. Further testing is necessary, but this provides a first step for any investor, future homeowner, or commercial real estate buyer looking to understand what factors are associated with higher or lower sale prices.
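Because the response is log sale price, the property-type coefficients are easier to read after converting them to percent differences. The sketch below does this for the estimates in the regression table, assuming that log.sales is the natural logarithm of sale price (the report does not state the base). Commercial is the omitted baseline category, and the function name is illustrative.

```python
import math

# Property-type coefficients copied from the regression table.
# The omitted category, Commercial, is the baseline for comparison.
coefs = {"Condominium": -0.25, "Lg Apartment": 0.19, "Residential": -0.45}


def pct_change(beta):
    """Percent difference in predicted price implied by a log-price coefficient.

    Assumes a natural-log response, so the multiplicative effect is exp(beta).
    """
    return (math.exp(beta) - 1) * 100


for name, beta in coefs.items():
    print(f"{name}: {pct_change(beta):+.1f}% relative to Commercial")
```

Under that assumption, and holding the other predictors fixed, condominiums sell for roughly 22% less than comparable commercial properties, large apartments for roughly 21% more, and residential properties for roughly 36% less. Coefficients shown as 0.00, such as those for Year_Built and Fin_sqft, are rounded in the table, so their per-unit effects cannot be recovered from it.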

Overall, this report is a first step in the longer line of analysis needed to fully understand this dataset and the questions posed. We offered some insight into how houses have changed over time and how those trends relate to the market dynamic between commercial and residential property types. We also offered some insight into what contributes to final sale price, creating a tool that shows historical hot spots at a glance and taking first steps toward a predictive tool. We found that the variables in question showed a statistically significant association with final sale price, so there is merit to this research. Further study is needed to make full use of the information in this rich dataset; we have only begun to scratch the surface. Other types of models should be tested with some form of cross-validation to create a valid and useful predictive tool for future property buyers. These graphs, though useful, could be expanded with more precise data targeting specific regions and property categories. This study is limited in that it treats the property market as a homogeneous entity, when in reality it varies from state to state, county to county, and even city to city. Our biggest limitation is this lack of granularity, and further studies are needed to remedy it.