library(tidyverse)
library(ggplot2)
library(dplyr)
library(factoextra)
library(ggseas)
housing_data <- read.csv("ames-housing.csv")
The housing landscape has changed dramatically over the last couple of years. With an increase in income, cost of living, and upgrades desired by homeowners, the architecture and characteristics of a house have been updated repeatedly to match the demand. Even in rural areas, houses have seen disproportionate price growth, especially between March 2020 and March 2023 (JCHS Harvard, 2024).
Usually for a house, the neighborhood, type of house, characteristics of the house, type of sale, and time are factored into the price. In this paper, we will try to understand these factors and how they influence the housing market in Ames, a rural city in Iowa. We will also try to understand how they may have affected the price of houses over time to see the drastic change that was explained above.
In this Ames housing data, we analyze a random sample of 2,930 houses described by 82 variables. Each row is one house, identified by its City Parcel Identification Number (PID), and each column is a feature of that house, with more information on the chosen variables below. Since we are interested in houses in Ames, Iowa, we will examine the types of houses in Ames, how the houses have been and are being purchased, and how the features of houses have changed the price over time. We summarize the variables we will use below:
The first few rows of the dataset look like the following:
head(housing_data)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## 3 3 526351010 20 RL 81 14267 Pave <NA>
## 4 4 526353030 20 RL 93 11160 Pave <NA>
## 5 5 527105010 60 RL 74 13830 Pave <NA>
## 6 6 527105030 60 RL 78 9978 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## 3 IR1 Lvl AllPub Corner Gtl NAmes
## 4 Reg Lvl AllPub Corner Gtl NAmes
## 5 IR1 Lvl AllPub Inside Gtl Gilbert
## 6 IR1 Lvl AllPub Inside Gtl Gilbert
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## 3 Norm Norm 1Fam 1Story 6 6
## 4 Norm Norm 1Fam 1Story 7 5
## 5 Norm Norm 1Fam 2Story 5 5
## 6 Norm Norm 1Fam 2Story 6 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## 3 1958 1958 Hip CompShg Wd Sdng Wd Sdng
## 4 1968 1968 Hip CompShg BrkFace BrkFace
## 5 1997 1998 Gable CompShg VinylSd VinylSd
## 6 1998 1998 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## 3 BrkFace 108 TA TA CBlock TA
## 4 None 0 Gd TA CBlock TA
## 5 None 0 TA TA PConc Gd
## 6 BrkFace 20 TA TA PConc TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## 3 TA No ALQ 923 Unf
## 4 TA No ALQ 1065 Unf
## 5 TA No GLQ 791 Unf
## 6 TA No GLQ 602 Unf
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## 3 0 406 1329 GasA TA Y
## 4 0 1045 2110 GasA Ex Y
## 5 0 137 928 GasA Gd Y
## 6 0 324 926 GasA Ex Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath
## 1 SBrkr 1656 0 0 1656 1
## 2 SBrkr 896 0 0 896 0
## 3 SBrkr 1329 0 0 1329 0
## 4 SBrkr 2110 0 0 2110 1
## 5 SBrkr 928 701 0 1629 0
## 6 SBrkr 926 678 0 1604 0
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## 1 0 1 0 3 1 TA
## 2 0 1 0 2 1 TA
## 3 0 1 1 3 1 Gd
## 4 0 2 1 3 1 Ex
## 5 0 2 1 3 1 TA
## 6 0 2 1 3 1 Gd
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt
## 1 7 Typ 2 Gd Attchd 1960
## 2 5 Typ 0 <NA> Attchd 1961
## 3 6 Typ 0 <NA> Attchd 1958
## 4 8 Typ 2 TA Attchd 1968
## 5 6 Typ 1 TA Attchd 1997
## 6 7 Typ 1 Gd Attchd 1998
## Garage.Finish Garage.Cars Garage.Area Garage.Qual Garage.Cond Paved.Drive
## 1 Fin 2 528 TA TA P
## 2 Unf 1 730 TA TA Y
## 3 Unf 1 312 TA TA Y
## 4 Fin 2 522 TA TA Y
## 5 Fin 2 482 TA TA Y
## 6 Fin 2 470 TA TA Y
## Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area
## 1 210 62 0 0 0 0
## 2 140 0 0 0 120 0
## 3 393 36 0 0 0 0
## 4 0 0 0 0 0 0
## 5 212 34 0 0 0 0
## 6 360 36 0 0 0 0
## Pool.QC Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> <NA> 0 5 2010 WD Normal
## 2 <NA> MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> <NA> 0 4 2010 WD Normal
## 5 <NA> MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
What is the housing market like in Ames, Iowa? Specifically, what kind of houses are in Ames, how do they vary by neighborhood, and how have amenities of houses changed over time?
How does the nature of a housing sale impact its sale price? Specifically, how do sale type and sale condition relate to sale price to show homeowner patterns?
How do the quality and condition of a house impact the price of a house? Also, how does the average sale price change over time in accordance with economic shifts?
First, we want to explore what kind of houses are being sold in Ames, Iowa to give us a better understanding of what the demographic looks like. This will require us to look into variables such as the house style, neighborhoods, and how amenities such as garages and porches have changed. By analyzing the types of houses in Ames, we will be able to further contextualize what the housing market is truly like and get a broader understanding of why specific trends regarding pricing may occur.
To start this analysis, let’s look at the distribution of houses to see if distinct clusters emerge by house style using multi-dimensional scaling:
table(housing_data$House.Style)
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 314 19 1481 8 24 873 83 128
haa <- housing_data %>%
select(Lot.Area, Mas.Vnr.Area, Total.Bsmt.SF, X1st.Flr.SF, X2nd.Flr.SF,
Gr.Liv.Area, Garage.Area, Wood.Deck.SF, Open.Porch.SF, Enclosed.Porch,
X3Ssn.Porch, Screen.Porch, Pool.Area)
haa[is.na(haa)] <- 0
haa <- haa %>%
apply(MARGIN = 2, FUN = function(x){x / sd(x)})
house_dist <- dist(haa)
housing_mds <- cmdscale(house_dist, k = 2)
housing_mds <- housing_data %>%
mutate(mds1 = housing_mds[,1], mds2 = housing_mds[,2])
housing_mds %>%
ggplot(aes(x = mds1, y = mds2)) +
geom_point(aes(color = House.Style), alpha = 0.3) +
theme_minimal() +
labs(title="MDS1 vs. MDS2 Colored by Style of the House",
x="MDS 1",
y="MDS 2")
In the plot above, we can see that three large clusters form: a blue cluster of houses with features similar to a typical 2-story home, a green cluster of houses with features similar to a typical 1-story home, and a slightly smaller red cluster of houses with features similar to a one and one-half story home where the second level has been finished. Besides showing similarities between houses, this plot also shows how common each house style is. Visually, there are far more 1-story, 2-story, and 1.5-story homes, with the other housing styles much less prevalent.
The fact that three major clusters emerge highlights that the quantitative variables in the housing dataset capture feature differences that distinguish these housing styles from one another. Intuitively, this makes sense because, on average, in terms of pricing alone, 1-story homes will be less expensive than 2-story homes.
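One way to judge how much of the structure a two-dimensional MDS plot retains is the goodness-of-fit value that `cmdscale()` reports when `eig = TRUE`. The following is a minimal sketch on simulated data, not the Ames data; the column names `area_a`, `area_b`, and `area_c` are made up for illustration, and the scaling step mirrors the division by the standard deviation used above.

```r
# Simulated stand-in for the scaled area features (NOT the Ames data)
set.seed(1)
toy <- data.frame(
  area_a = c(rnorm(20, 1000, 100), rnorm(20, 2000, 100)),
  area_b = c(rnorm(20, 100, 10), rnorm(20, 800, 50)),
  area_c = rnorm(40, 500, 80)
)

# Put every column on a comparable scale, as in the analysis above
toy_scaled <- apply(toy, MARGIN = 2, FUN = function(x) x / sd(x))

# Classical MDS; eig = TRUE also returns eigenvalues and goodness of fit
toy_mds <- cmdscale(dist(toy_scaled), k = 2, eig = TRUE)

dim(toy_mds$points)  # one row per observation, two MDS coordinates
toy_mds$GOF          # share of the distance structure kept by 2 dimensions
```

A goodness-of-fit value close to 1 would suggest that the two plotted dimensions preserve most of the pairwise distances; a low value would be a reason to read the clusters with more caution.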
Another question that comes to mind is how these housing styles are spread across the city of Ames. In our dataset, there are over 25 neighborhoods in which sales were recorded. This leads us to ask how housing styles differ by neighborhood. Are some neighborhoods more affluent than others? How does this relate to pricing? To answer these questions, we create the following stacked bar chart (keeping only 1-story, 2-story, and 1.5-story houses for readability):
# %in% tests membership; == would recycle the vector and silently drop rows
housing_data %>% filter(House.Style %in% c("1.5Fin", "1Story", "2Story")) %>%
ggplot(aes(x = Neighborhood, fill = House.Style)) +
geom_bar() +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Distribution of House Style per Neighborhood", y = "Frequency")
From this chart, several things become apparent. First, the number of houses sold was not uniform across neighborhoods. A few neighborhoods far exceed the mean number of houses sold per neighborhood: North Ames, College Creek, and Old Town.
Beyond that, the distribution of house style within each neighborhood shows that some areas have the space to build 2-story houses, which crowd out 1-story and 1.5-story houses. Based on the ratio of 2-story homes to non-2-story homes, the more affluent neighborhoods in Ames, Iowa appear to be Gilbert, Somerset, and Northridge. As a check, we searched online and found that these three neighborhoods are regarded as some of the safest, most welcoming neighborhoods within Ames. According to niche.com, Gilbert specifically is ranked the number one best place to raise a family within the county.
Among these neighborhoods, fewer than half feature houses with 1.5 stories. Based on further internet research, 1.5-story houses are not primarily associated with low-income neighborhoods. Thus, we can only claim that neighborhoods with a high percentage of 2-story homes are more affluent, even though they typically contain fewer houses.
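The comparison of 2-story shares across neighborhoods can be made explicit with row proportions. The sketch below uses a six-row made-up data frame (neighborhoods "A" and "B" are hypothetical, not Ames neighborhoods). It also illustrates why a subset on several styles is usually written with `%in%`: comparing with `==` against a vector recycles that vector position by position instead of testing membership.

```r
# Made-up example values, not taken from the Ames data
styles <- c("1Story", "2Story", "1.5Fin", "SLvl", "2Story", "1Story")
keep   <- c("1.5Fin", "1Story", "2Story")

by_equal <- styles == keep    # keep is recycled and compared element by element
by_in    <- styles %in% keep  # TRUE wherever the style is one of the three

sum(by_equal)  # 0 rows kept here, although five rows have a wanted style
sum(by_in)     # 5 rows kept

toy <- data.frame(Neighborhood = c("A", "A", "A", "B", "B", "B"),
                  House.Style  = styles)
toy <- toy[toy$House.Style %in% keep, ]

# Share of each style within a neighborhood (each row sums to 1)
shares <- prop.table(table(toy$Neighborhood, toy$House.Style), margin = 1)
shares
```

Ranking neighborhoods by the `2Story` column of such a table would give a numeric version of the visual comparison made from the bar chart.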
Lastly, when it comes to understanding what the housing market looks like, one important thing to track is the amenities a house has. Intuitively, we expect houses with additional benefits, such as larger garages or basements, to become more common as homes modernize. With this in mind, how did the prevalence of home luxuries increase over time? Are any amenities more important than others? What is the general trend of each amenity? We explore this by creating the following time-series plot.
amenities <- housing_data %>%
group_by(Year.Built) %>%
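# na.rm = TRUE so a single missing area does not turn a whole year's mean into NA
summarize(avg_garage = mean(Garage.Area, na.rm = TRUE),
avg_basement = mean(Total.Bsmt.SF, na.rm = TRUE),
avg_porch = mean(Open.Porch.SF, na.rm = TRUE))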
amenities_plot <- amenities %>%
pivot_longer(cols=c(avg_garage,avg_basement, avg_porch),
names_to="Amenities", values_to="Average_Value")
ggplot(amenities_plot, aes(x=Year.Built,y=Average_Value, color=Amenities)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE, aes(group = Amenities)) +
labs(title="Average of Amenities Over Time",
x="Year Built",
y="Average Value")
This time-series plot shows the year built on the x-axis and the average area of each amenity (in square feet) on the y-axis. Each amenity has its own colored line so we can track its movement over time. For basement and garage space, the slope of the regression line is positive, so over time there has been heavier emphasis on these amenities in Iowan homes. In contrast, porch size shows a downward trend, implying that porches are less of a priority for Iowan homeowners. The plot also shows that basements are, on average, larger than garages or porches. Lastly, the slope of the basement line is steeper than the slopes of the garage and porch lines, which implies that basement size has grown the most over the last century. However, it is important to note that this trend could depend on the Iowan landscape: basements can keep growing where there is space to work with, while garages typically only need to accommodate one or two vehicles.
For further quantitative analysis, we ran a linear regression of Sale Price on Year.Built, Garage Area, Total Basement Area, and Open Porch Area. The model is shown below:
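The per-year averaging and the trend slope behind a plot like this can be sketched in base R. The data frame below is made up for illustration and is not the Ames data. It includes one missing garage area to show that a plain `mean()` returns `NA` when any value is missing, while the formula interface of `aggregate()` drops incomplete rows by default.

```r
# Made-up values, not taken from the Ames data
toy <- data.frame(
  Year.Built  = c(1950, 1950, 1980, 1980, 2005, 2005),
  Garage.Area = c(200, NA, 400, 440, 520, 600)
)

mean(toy$Garage.Area)                # NA: one missing value poisons the mean
mean(toy$Garage.Area, na.rm = TRUE)  # mean of the five observed values

# Average garage area per construction year (rows with NA are dropped)
yearly <- aggregate(Garage.Area ~ Year.Built, data = toy, FUN = mean)
yearly

# Slope of the linear trend: change in average area per year built
trend <- lm(Garage.Area ~ Year.Built, data = yearly)
slope <- coef(trend)[["Year.Built"]]
slope
```

Comparing such slopes across amenities is only meaningful when the amenities are measured in the same unit, which holds here because all three are areas in square feet.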
model <- lm(SalePrice~Year.Built + Garage.Area + Total.Bsmt.SF + Open.Porch.SF, data = housing_data)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ Year.Built + Garage.Area + Total.Bsmt.SF +
## Open.Porch.SF, data = housing_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -510568 -28928 -6537 21372 427162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.177e+06 7.022e+04 -16.760 <2e-16 ***
## Year.Built 6.223e+02 3.635e+01 17.120 <2e-16 ***
## Garage.Area 1.240e+02 5.355e+00 23.164 <2e-16 ***
## Total.Bsmt.SF 6.326e+01 2.521e+00 25.094 <2e-16 ***
## Open.Porch.SF 1.215e+02 1.448e+01 8.392 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50690 on 2923 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.598, Adjusted R-squared: 0.5975
## F-statistic: 1087 on 4 and 2923 DF, p-value: < 2.2e-16
All four predictors are statistically significant, each with a very low p-value. The R^2 value of 0.598 means the model explains about 60% of the variation in sale price, which is reasonable for a linear model with four predictors. Total basement area has the largest t-statistic (25.09), so it is the most precisely estimated association, although garage area (about $124 per square foot) and open porch area (about $122 per square foot) have larger coefficients than basement area (about $63 per square foot). Overall, the linear regression model gives a reasonable estimate of the sale price; however, we have not examined model diagnostics, and omitted variable bias is likely, so both should be looked into in the future.
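To make the coefficients concrete, the arithmetic below converts the reported estimates into the predicted price change for an extra 100 square feet of each amenity, holding the other predictors fixed, and recomputes the t-statistics as estimate divided by standard error. The numbers are copied from the summary output above; the choice of 100 square feet is only an illustrative assumption.

```r
# Estimates and standard errors copied from the model summary above
coefs <- c(Garage.Area = 124.0, Total.Bsmt.SF = 63.26, Open.Porch.SF = 121.5)
se    <- c(Garage.Area = 5.355, Total.Bsmt.SF = 2.521, Open.Porch.SF = 14.48)

# Predicted change in sale price (dollars) for 100 extra square feet
extra_100 <- coefs * 100
extra_100

# t-statistic = estimate / standard error (matches the summary up to rounding)
t_vals <- coefs / se
round(t_vals, 2)

names(which.max(coefs))   # largest per-square-foot coefficient
names(which.max(t_vals))  # largest t-statistic
```

The two rankings differ: garage area has the largest coefficient per square foot, while total basement area has the largest t-statistic because its standard error is the smallest.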
Following our understanding of the housing demographic in Ames, we will now look into how the houses are being purchased by homeowners and the relationship between these methods and the price of the sale. This will require us to look into the sale type of the house, the sale condition of the house, and the sale price. This will benefit us because understanding the nature of how sales are conducted in Ames, Iowa can help us gain a better understanding of market dynamics, property valuation, and other areas of interest in real estate research.
ggplot(data = housing_data, aes(y = Sale.Type, x = SalePrice)) +
geom_boxplot(outlier.color = "red") +
theme_minimal() +
labs( y = "Type of sale",
x = "Price of sale",
title = "Boxplot of sale price vs. sale type")
housing_data %>%
group_by(Sale.Type) %>%
summarise(Median_SalePrice = median(SalePrice))
## # A tibble: 10 × 2
## Sale.Type Median_SalePrice
## <chr> <dbl>
## 1 "COD" 127500
## 2 "CWD" 160750
## 3 "Con" 215200
## 4 "ConLD" 127500
## 5 "ConLI" 119000
## 6 "ConLw" 92500
## 7 "New" 250580
## 8 "Oth" 116050
## 9 "VWD" 137000
## 10 "WD " 157000
New home sales have the highest median sale price of all sale types, at around $250,000. This logically makes sense, as customers are generally willing to pay a premium for newly built homes.
Contracts with a 15% down payment and regular terms have the second highest median sale price. Although this sale type has a wider interquartile range than the others, it has no outliers. This is likely because these are more traditional sales with standard down payments (15% is the American median). The buyers are likely to be middle-income Americans, which is consistent with a median sale price that sits between new home sales and distressed property sales.
Conventional warranty deeds have a middling median sale price ($157,000) with a high number of outliers. This could indicate that a number of high-valued properties, like estates, manors, or other more expensive properties, are being sold through this method.
Court officer deeds and contracts with low down payments, low interest, or both have the lowest median sale prices of the sale types, along with the Other category. This makes sense, as court officer deeds are judicially mandated sales of properties that are sold expeditiously due to a foreclosure or a judicial proceeding. Properties sold with low down payments and low interest rates would logically sell for less, as these are properties on the lower end of the market to begin with.
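Two details of the summary above can be checked with a small sketch on made-up prices (in thousands of dollars, not the Ames data). First, the `"WD "` label in the tibble carries a trailing space, so trimming whitespace before grouping avoids accidental mismatches when filtering by name. Second, the points a boxplot draws as outliers follow the 1.5 × IQR rule, which `boxplot.stats()` applies directly.

```r
# Made-up prices in thousands of dollars, not taken from the Ames data
toy <- data.frame(
  Sale.Type = c("WD ", "WD ", "WD ", "WD ", "WD ", "WD ", "WD ",
                "New", "New", "COD"),
  SalePrice = c(100, 120, 130, 140, 150, 160, 400, 240, 260, 120)
)

# Remove stray whitespace so "WD " and "WD" are the same category
toy$Sale.Type <- trimws(toy$Sale.Type)

# Median price per sale type as a named numeric vector
medians <- sapply(split(toy$SalePrice, toy$Sale.Type), median)
sort(medians, decreasing = TRUE)

# Values beyond 1.5 * IQR from the hinges, i.e. what a boxplot marks as outliers
wd_out <- boxplot.stats(toy$SalePrice[toy$Sale.Type == "WD"])$out
wd_out
```

In this toy example the single very expensive warranty deed sale is flagged as an outlier but barely moves the median, which is why medians are used to compare sale types here.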
ggplot(data = housing_data, aes(y = Sale.Condition, x = SalePrice)) +
geom_violin() +
geom_boxplot(width = 0.1, outlier.color = "red") +
theme_minimal() +
labs(x = "Price of sale",
y = "Condition of sale",
title = "Violin + box plot of sale condition vs sale price")
housing_data %>%
group_by(Sale.Condition) %>%
summarise(Median_SalePrice = median(SalePrice))
## # A tibble: 6 × 2
## Sale.Condition Median_SalePrice
## <chr> <dbl>
## 1 Abnorml 129450
## 2 AdjLand 110000
## 3 Alloca 149617
## 4 Family 144400
## 5 Normal 159000
## 6 Partial 250000
Partial construction sale conditions have the highest median sale price of any condition of sale. This could be due to the sale of partially completed high-value properties. These properties could also sit on particularly valuable land, meaning that the value of the land, rather than the construction itself, was the reason behind the sale. The timeframe of the dataset (2006-2010) spans the periods before, during, and after the global financial crisis. As such, it’s possible that sales of high-value partially completed properties were speculative and part of the general housing price bubble at the time.
Normal sale conditions had the second highest median sale price, along with a high number of outliers. These outliers could also be speculative sales made before the housing crisis, with the buyer’s assumption being that the high price was justified since housing prices (at the time) kept rising.
Intra-family and two-linked-property sale conditions have around the same median sale price, slightly lower than normal sale conditions. This could be because family sales are discounted thanks to the generally non-competitive dynamics within a family, and because linked property sales tend to involve smaller units.
Adjoining land purchases and abnormal sale conditions had the lowest median sale prices. Adjoining land purchases having a low median sale price makes sense, as these are purchases of land immediately next to a property, i.e., land-only transactions. With no house to purchase, the sale price would logically be lower. Abnormal sales occur under foreclosures or short sales, which are sold below market value to attract buyers quickly, so their low median sale price (likely exacerbated by the financial crisis) makes sense. Abnormal sale conditions also had several high outliers, which could be due to speculative purchasing and price spikes, much like normal sale conditions.
In conclusion, taking into consideration the historical climate at the time, new home sales, contracts with a 15% down payment, partial construction, and normal sale conditions had the highest median sale prices. Court officer deeds, contracts with low interest rates, low down payments, or both, adjoining land purchases, and abnormal sale conditions had the lowest median sale prices, with intra-family sales falling slightly below normal sales.
Finally, we want to understand how features of a house like the overall condition and quality differ among houses in Ames, and whether they affect the sale price of a house. We will also see how average sale prices have changed over time to provide insight into possible spikes and dips during critical time periods like the Great Depression and the financial crisis. We will use Year.Built, SalePrice, Overall.Qual, and Overall.Cond as the variables to build our visualizations.
To start off, we will examine the density of houses in Ames with various quality and condition ratings. This will give us a better idea of the overall housing market in Ames and specifically what kind of houses exist in Ames. Below is a heat map of Overall Quality by Overall Condition of houses.
ggplot(housing_data, aes(x=Overall.Qual, y=Overall.Cond)) +
stat_density_2d(aes(fill=after_stat(density)),
geom = "tile",
contour=FALSE) +
geom_point(alpha=0.2) +
coord_fixed() +
scale_fill_gradient(low="white",
high="red") +
theme_bw() +
labs(title="Heat Map of Overall Quality by Overall Condition of Houses",
x="Overall Quality",
y="Overall Condition")
table(housing_data$Overall.Qual)
##
## 1 2 3 4 5 6 7 8 9 10
## 4 13 40 226 825 732 602 350 107 31
table(housing_data$Overall.Cond)
##
## 1 2 3 4 5 6 7 8 9
## 7 10 50 101 1654 533 390 144 41
Combinations of quality and condition shared by many houses are shaded a deeper red, and the color fades as the density of such houses decreases. In the plot above, we can see that houses with an overall quality of around 7 and an overall condition of 5 are the densest in Ames, while the next densest are houses with an overall condition of 5 and overall qualities of 5, 6, and 8. This is interesting because it shows that even though the overall condition of the houses is roughly constant at Average, the houses’ quality varies between Average, Above Average, Good, and Very Good. It is therefore worthwhile to look more closely at the distribution of Overall Quality. Specifically, we will look at how the year a house was built relates to its sale price while coloring the points by Overall Quality. This will help us understand how the overall quality of houses in Ames has changed over time and whether there has been a prolonged period in which only houses with overall qualities of Average, Above Average, Good, and Very Good were built. We will also look at how different qualities result in different prices to better understand buying patterns. Below is a plot of Sale Price by Year Built, colored by Overall Quality.
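Because both ratings are integers, the densest combination in a heat map like this can be cross-checked with a plain cross-tabulation. The sketch below uses eight made-up rating pairs, not the Ames data, to show how the most common quality and condition cell can be read off a two-way table.

```r
# Made-up ratings, not taken from the Ames data
qual <- c(5, 5, 6, 7, 7, 7, 8, 5)
cond <- c(5, 5, 5, 5, 5, 5, 5, 7)

# Two-way table of counts: rows are quality, columns are condition
counts <- table(Quality = qual, Condition = cond)
counts

# Long format makes it easy to pick the cell with the highest count
cells <- as.data.frame(counts)
top   <- cells[which.max(cells$Freq), ]
top
```

Applying the same idea to `Overall.Qual` and `Overall.Cond` would give exact counts to compare against the smoothed density estimate, which can shift the apparent peak slightly when ratings are discrete.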
ggplot(housing_data, aes(x=Year.Built, y=SalePrice, color=as.factor(Overall.Qual))) +
geom_point(alpha=0.5) +
labs(title="Year Built by Sale Price Separated by Overall Quality",
x="Year Built",
y="Sale Price")
The plot shows slight exponential growth in sale price over time. Sale prices remained fairly constant until around 1980, when they slowly started to increase, and peaked around 2000. The overall quality of the houses has also changed drastically, as we can see from the colors of the points. Most of the houses from the early 1900s onward have an overall quality of around 5 (Average), 6 (Above Average), or 7 (Good). Only from around 1980 do we see houses with quality ratings of around 8 (Very Good), 9 (Excellent), and 10 (Very Excellent). This shows that houses with overall qualities of 5, 6, and 7 were built over a longer time period than houses of other qualities, which aligns with the heat map we saw before. Perhaps more houses of roughly average quality were built to match the rural environment of Ames because of people’s cost of living and housing expectations. This is also probably why the sale price of houses stayed about the same for a long time: the quality of houses did not change until around 1980. The sale price tripled from 1980 to the early 2000s, which indicates that something happened during that period. To better understand the trend during this period, we will make another plot that shows the moving average of prices over time.
average_sale_price <- housing_data %>%
group_by(Year.Built) %>%
summarize(Avg_Sale_Price = mean(SalePrice))
ggplot(average_sale_price, aes(x=Year.Built,y=Avg_Sale_Price)) +
geom_line(color="purple") +
stat_rollapplyr(width=2, align="left") +
labs(x="Year Built", y="Average Sale Price", title="Moving Average of House Sale Prices")
The graph above shows that the moving average of sale prices increases over time, with dramatic increases and decreases in specific years, possibly due to recessions and financial crises, specifically around 1890 and 1940. Some research tells us that the Panic of 1893 led to unemployment and bank failures. This could have led to a sharp decline in house prices because the demand for housing went down. Around 1930-1940, the Great Depression occurred, which could also have decreased sale prices because demand fell during that period too. After that, housing prices seem to have risen steadily with economic activity, which has in turn led to higher costs of living.
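To show what a moving average with a window of 2 computes, here is a minimal base R sketch on four made-up yearly averages (not the Ames data). It uses `stats::filter()` as a stand-in and is not the internal implementation of `stat_rollapplyr()`; the function is called with its namespace because dplyr also exports a `filter()`.

```r
# Made-up yearly average prices in thousands of dollars
avg_price <- c(100, 120, 90, 150)

# Trailing moving average of width 2: each value is the mean of the
# current and the previous observation; the first has no predecessor
ma2 <- stats::filter(avg_price, filter = rep(1 / 2, 2), sides = 1)
as.numeric(ma2)  # NA 110 105 120
```

With a window this narrow the smoothed line stays close to the raw series, so year-to-year spikes remain visible; a wider window would smooth them out at the cost of hiding short-lived dips.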
In conclusion, these graphs show that the disproportionate growth of housing prices and housing in general is not only limited to the urban areas but has also affected rural areas. Ames has been average in housing for a while in terms of both quality and condition, but recently, has increased both due to cost of living and economic booms, resulting in higher sale prices.
In this analysis, we learned that the housing landscape in Ames, Iowa can be explored in many different ways to see how houses have changed.
First, we looked at what the overall housing market looks like in Ames to get a better understanding of what kinds of houses or features are most prevalent. We did this by looking at the style of houses, how the most prominent types of houses vary by neighborhood, and how some amenities have changed over time. We ran multi-dimensional scaling on house size variables (square feet) and colored the resulting plot by house style to understand which styles were most prevalent. We saw that one-story, one-and-a-half-story, and two-story houses were most prevalent. We also made a stacked bar chart of those house styles by neighborhood to explore which houses were most common in different parts of Ames. We saw that neighborhoods such as North Ames and College Creek had more houses than other places and that affluent neighborhoods like Gilbert had larger houses. Lastly, we made a time-series plot and ran a regression analysis to see how amenities such as garage area and open porch square footage changed over time and how they may have affected sale price. We saw that basement and garage space increased over time and that basement size had the strongest statistical association with sale price (the largest t-statistic) in the linear regression. All the listed amenities were significant, but as mentioned, there may be some limitations that should be addressed in future analysis.
Second, we looked at how houses are being purchased and the relationship between buying methods and the house sale price. We did this by making a boxplot of sale price by type of sale and a violin plot of sale price by condition of sale. We saw that new home sales, contracts with a 15% down payment, partial construction, and normal sale conditions had the highest median sale prices, while court officer deeds and contracts with low interest rates, low down payments, or both had the lowest median sale prices. This is all useful in understanding which methods and conditions homeowners in rural areas are using even as sale prices change.
Last, we looked at how the overall quality and condition of houses impact sale price and how sale prices have changed over time. We did this by making two plots using the year the houses were built, sale prices, and the overall quality and condition of the houses. We saw that houses with higher overall quality and condition had higher sale prices, but the majority of the houses in Ames were around average in both quality and condition. Only recently was there an increase in houses with greater quality and condition, perhaps related to economic booms and demand driven by cost of living. This ties in with the research described in the motivation section, which found a disproportionate increase in sale prices, even in rural areas, due to demand and cost of living.
Overall, the data analysis on the housing environment in Ames, Iowa portrays the ever-growing housing industry, especially in rural areas. It provides important insight into possible factors behind sale price growth and how the houses themselves have changed in characteristics and features to meet the modern age of housing. One question left unanswered due to time and data constraints is how consumers’ incomes have affected buying habits for houses in Ames. This could provide important insight into how income has shifted perspectives on sales and demand during economic fluctuations. However, the dataset does not list household income as a variable. Also, even though the data is observational, it might be useful to apply econometric techniques such as regression discontinuity or difference-in-differences to identify causal relationships between variables. For regression discontinuity, we would need a policy or an event that lets us compare homeowners before and after to identify potential causal relationships. For difference-in-differences, we would also need a policy change or program implementation to define a control group and a treatment group, which would reveal causal relationships. These are more advanced data analysis techniques and would require more data, but they would be interesting to investigate. We look forward to continuing this analysis as we gather more data and learn more advanced statistical techniques.