Introduction

Estimating the value of a house is difficult because many factors are involved, such as its size, amenities, and proximity to certain roads or parks, along with others that are hard to quantify. In this report, we analyze houses sold in Ames, Iowa from 2006 to 2010 to understand how different features of a house affect its price. We focus on the following research questions:

Which features best determine house prices in Ames, Iowa?
What role do different neighborhoods play in determining house prices?
Are there any seasonal trends in volume and price of sales?
How are age and remodeling of homes related to their prices?

Dataset

The dataset we used is a collection of housing data from Ames, Iowa, obtained from the CMU Statistics and Data Science Repository. It consists of 2930 rows and 82 columns. Each row represents a house sold in Ames, Iowa from 2006 to 2010, and each column represents a feature of the house, such as the number of rooms, the year it was built, and its quality. There are 31 quantitative and 51 categorical variables; more detailed descriptions of each variable are available in the dataset’s documentation.

Our main variable of interest is SalePrice, which indicates how much the house was sold for. Throughout this report, we conduct multiple analyses through graphs and statistical tests to determine which other variables will be useful in answering our research questions.

Question 1

We begin by examining how the variables in the dataset can help us predict home sale prices in Ames, Iowa. To do so, we perform a principal component analysis (PCA) to reduce the dimensionality of the data. Features like the size of the house will clearly affect its price, but we also want to find which other variables help determine prices. Of the 82 original features, 31 are quantitative and 51 are categorical. We run the PCA on the 31 quantitative variables to see how well the principal components explain the variation in the data.

The scree plot above shows that the first two principal components explain about a third of the variation in the dataset, and that the plot starts to flatten out afterwards. For simplicity’s sake, we continue our analysis based on these two principal components only. Next, to see which variables are associated with the two principal components, we examine the eigenvector matrix presented below.

##                          PC1          PC2
## Lot.Frontage     0.095070268  0.008249272
## Lot.Area         0.131267274  0.062279601
## Overall.Qual     0.294347589  0.041534662
## Overall.Cond    -0.076064876  0.001375353
## Mas.Vnr.Area     0.209839637  0.071726057
## BsmtFin.SF.1     0.160793939  0.363351229
## BsmtFin.SF.2    -0.001891817  0.114321054
## Bsmt.Unf.SF      0.101760738 -0.182254186
## Total.Bsmt.SF    0.266830071  0.237582514
## X1st.Flr.SF      0.273871161  0.204541003
## X2nd.Flr.SF      0.134429639 -0.420443588
## Low.Qual.Fin.SF -0.005694894 -0.074811771
## Gr.Liv.Area      0.325717240 -0.204591948
## Bsmt.Full.Bath   0.092599315  0.336224819
## Bsmt.Half.Bath  -0.014092308  0.047079633
## Full.Bath        0.251077809 -0.170704992
## Half.Bath        0.128879548 -0.241095558
## Bedroom.AbvGr    0.126896863 -0.357309935
## Kitchen.AbvGr    0.004630013 -0.164824215
## TotRms.AbvGrd    0.257415915 -0.298215020
## Fireplaces       0.205618447  0.069350129
## Garage.Cars      0.279799254  0.051174571
## Garage.Area      0.278134975  0.091178179
## Wood.Deck.SF     0.135812339  0.092074901
## Open.Porch.SF    0.149633878 -0.028947709
## Enclosed.Porch  -0.047773803 -0.077082035
## X3Ssn.Porch      0.008981277  0.047759259
## Screen.Porch     0.040992427  0.053289668
## Pool.Area        0.044957365  0.020199900
## Misc.Val         0.022050532  0.021873987
## SalePrice        0.338432685  0.071096603

Each row corresponds to one of the original variables, and each column gives the weights (loadings) of the linear combination that defines the principal component. The matrix shows that higher values of PC1 are associated with higher values of variables like Gr.Liv.Area, Overall.Qual, Garage.Cars, and Garage.Area. Since SalePrice also loads strongly and positively on PC1, more valuable homes are associated with larger above-ground living area, higher overall quality, larger garages, and more. PC1 is therefore associated with the size, garage, and overall quality of the house. Similarly, higher values of PC2 are associated with higher values of BsmtFin.SF.1, Bsmt.Full.Bath, Total.Bsmt.SF, and X1st.Flr.SF. SalePrice is also positively associated with PC2, although only weakly (a loading of about 0.07), which suggests that more valuable homes tend to have larger basements with more bathrooms, along with larger first-floor area. Unlike PC1, PC2 is mainly associated with features related to the basement.
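One way to read a loading matrix like this is to rank the variables by the absolute value of their loadings on each component. The following is a minimal Python sketch of that idea, not the code used to produce the report. The loadings are copied from a subset of the rows above, and the helper name `top_loadings` is an assumption made for illustration.

```python
# Loadings (PC1, PC2) copied from a subset of the rows of the matrix above.
loadings = {
    "Overall.Qual":   (0.294347589,  0.041534662),
    "BsmtFin.SF.1":   (0.160793939,  0.363351229),
    "Total.Bsmt.SF":  (0.266830071,  0.237582514),
    "X1st.Flr.SF":    (0.273871161,  0.204541003),
    "X2nd.Flr.SF":    (0.134429639, -0.420443588),
    "Gr.Liv.Area":    (0.325717240, -0.204591948),
    "Bsmt.Full.Bath": (0.092599315,  0.336224819),
    "Bedroom.AbvGr":  (0.126896863, -0.357309935),
    "Garage.Cars":    (0.279799254,  0.051174571),
    "Garage.Area":    (0.278134975,  0.091178179),
    "SalePrice":      (0.338432685,  0.071096603),
}


def top_loadings(component, k=3):
    """Return the k variables with the largest absolute loading.

    component is 0 for PC1 and 1 for PC2 (hypothetical helper).
    """
    ranked = sorted(
        loadings.items(),
        key=lambda item: abs(item[1][component]),
        reverse=True,
    )
    return [(name, values[component]) for name, values in ranked[:k]]


pc1_top = top_loadings(0)
pc2_top = top_loadings(1)
print(pc1_top)
print(pc2_top)
```

Ranking by absolute value also surfaces the large negative loadings on PC2 (X2nd.Flr.SF and Bedroom.AbvGr), which suggests that PC2 may contrast basement and first-floor space with second-floor space.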

To better understand how the variables determine sale prices, we also take a look at the correlation matrix heatmap of each quantitative variable.

Since we are specifically interested in how the variables are associated with the price of the house, we examine the column (or row) that corresponds to SalePrice. Most of the squares representing each variable appear red, which indicates that they are positively correlated with SalePrice. In particular, we see that the variables we’ve seen from the PCA such as Gr.Liv.Area, Total.Bsmt.SF, X1st.Flr.SF, Garage.Cars, Garage.Area, and Overall.Qual are more strongly associated with the sale price, as shown by darker red colors.

From the PCA and the correlation matrix heatmap, we can see that home prices may be best determined by the sizes of the living area, basement, and the garage, along with the overall quality, which is determined by the material and finish of the house. We will examine these variables again later in this report as we perform further analyses.

Question 2

Location is another deciding factor for house prices. For this research question, we seek to find what role a neighborhood plays in determining house prices in Ames. We will first examine how prices differ across neighborhoods. Then we take a look at how the relationship between price and the size of the house might differ across different neighborhoods.

2.1. House prices are different across neighborhoods.

The dataset contains a variable Neighborhood that indicates which neighborhood the house is located in. There are 28 different neighborhoods given, and we expect house prices to differ across different areas. We visualize this through a bar plot as shown below.

We see that average sale prices differ drastically across neighborhoods. Some neighborhoods exhibit notably higher average sale prices, which may suggest that they are more desirable or that their houses are newly built or renovated. The average sale price is highest in Northridge, where it is more than three times that of the cheapest neighborhood, Meadow Village.

To see if the different house prices can be actually explained by their neighborhoods, we create a dendrogram and examine if the houses in the same neighborhood form a cluster.

To create this dendrogram, we first select just the SalePrice variable, since we are only concerned with how similar the prices are for houses in the same neighborhood. After creating the clusters, we display the leaves and color them by neighborhood. Although it is very hard to read the individual neighborhoods because of the large number of observations, we only need to examine whether the same colors are grouped together. We see that the same colors tend to cluster together: light blue on the left, red in the middle, blue towards the right, and green on the right. This indicates that houses within each neighborhood are similar to each other in terms of their sale prices.

2.2. Relationships between house size and sale price may differ across neighborhoods.

In addition to observing how different house prices are based on location, we wanted to see if the relationship between size of the house and its sale price is different across neighborhoods as well. We can expect that bigger houses will cost more, but is this increase in price the same for different areas?

We expect this relationship to differ across neighborhoods, as some areas are more favorable due to factors such as transportation, safety, and education. We first visualize this idea through a scatter plot. We chose the above-ground living area (sq. ft.) to represent the size of the house.

As expected, we see that bigger houses are associated with higher sale prices. However, sale prices increase at different rates in different neighborhoods. For example, the pink points near the top tend to increase more quickly than the yellow points in the middle of the graph. This indicates that there may be a premium for an additional square foot of living area in certain neighborhoods and, most importantly, that this premium differs across areas.

To formally analyze this idea, we perform a partial F-test. First, we fit the following two linear regression models:

Full model:

\[\begin{align*} & SalePrice = \beta_0 + \beta_1Total.Bsmt.SF + \beta_2X1st.Flr.SF + \beta_3Garage.Cars + \beta_4Garage.Area \\ & + \beta_5Overall.Qual + \beta_6Gr.Liv.Area + \sum_{i=7}^{33} \beta_i \mathbf{1}_{\{Neighborhood = N_i\}} \\ & + \sum_{i=34}^{60} \beta_i Gr.Liv.Area \times \mathbf{1}_{\{Neighborhood = N_i\}} + \epsilon \end{align*}\]

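Reduced model:

\[\begin{align*} & SalePrice = \beta_0 + \beta_1Total.Bsmt.SF + \beta_2X1st.Flr.SF + \beta_3Garage.Cars + \beta_4Garage.Area \\ & + \beta_5Overall.Qual + \beta_6Gr.Liv.Area + \sum_{i=7}^{33} \beta_i \mathbf{1}_{\{Neighborhood = N_i\}} + \epsilon \end{align*}\]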

In fitting these models, we chose the variables that appeared most often in research question 1, along with Neighborhood. The reduced model contains all of these variables as predictors and sale price as the response. Here, \(\mathbf{1}_{\{Neighborhood = N_i\}}\) is an indicator function that equals 1 if the house is located in neighborhood \(N_i\) and 0 otherwise. Since there are 28 neighborhoods, 27 indicator terms are added, with one neighborhood serving as the reference group. We have shortened the model equation using summation notation because there are many terms. The full model contains all the variables of the reduced model plus the interactions between Gr.Liv.Area and each neighborhood. Again, since there are 28 neighborhoods, there are 27 interaction terms, with one neighborhood omitted as the reference.
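To make the indicator and interaction terms concrete, here is a minimal Python sketch of how they could be built for a single house. It is an illustration only: the function name `design_terms`, the three-level neighborhood list, and the living area of 1500 sq. ft. are all hypothetical, and the report’s models use 28 neighborhoods.

```python
def design_terms(gr_liv_area, neighborhood, levels):
    """Build indicator and interaction terms for one house.

    levels[0] is treated as the reference neighborhood and gets no column,
    so k levels produce k - 1 indicators and k - 1 interactions.
    """
    indicators = [1.0 if neighborhood == level else 0.0 for level in levels[1:]]
    interactions = [gr_liv_area * flag for flag in indicators]
    return indicators, interactions


# Hypothetical three-level example (not the report's 28 neighborhoods).
levels = ["Reference", "Northridge", "Meadow Village"]

indicators, interactions = design_terms(1500.0, "Northridge", levels)
print(indicators)    # indicator terms for the non-reference levels
print(interactions)  # Gr.Liv.Area multiplied by each indicator

# A house in the reference neighborhood has all terms equal to zero.
ref_indicators, ref_interactions = design_terms(1500.0, "Reference", levels)
```

With 28 levels, the same construction gives 27 indicator terms and 27 interaction terms, matching the index ranges in the model equation.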

Finally, we perform the partial F-test to see if the relationship between size of the house (represented as living area above ground in sq. ft.) and its sale price depends on the neighborhood. We test the null hypothesis \(H_0: \beta_{34} = \ldots = \beta_{60} = 0\) and the alternative hypothesis \(H_a: \text{at least one of } \beta_{34},\ldots,\beta_{60} \neq 0\) through ANOVA.

## Analysis of Variance Table
## 
## Model 1: SalePrice ~ Total.Bsmt.SF + X1st.Flr.SF + Garage.Cars + Garage.Area + 
##     Overall.Qual + Gr.Liv.Area + Neighborhood
## Model 2: SalePrice ~ Total.Bsmt.SF + X1st.Flr.SF + Garage.Cars + Garage.Area + 
##     Overall.Qual + Gr.Liv.Area * Neighborhood
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   2894 3.2838e+12                                   
## 2   2868 2.3277e+12 26 9.5617e+11 45.313 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA output, the test statistic is F = 45.313 with 26 numerator degrees of freedom and 2868 denominator degrees of freedom. The p-value is less than 2.2e-16, far below \(\alpha=0.05\). Therefore, we reject the null hypothesis and have evidence that the relationship between the size of the house and its sale price does depend on the neighborhood. The test does not tell us which neighborhoods differ, only that at least one of them changes the relationship between size and sale price. Note that the test has 26 rather than 27 numerator degrees of freedom, which suggests that one of the interaction coefficients could not be estimated from the data, possibly because a neighborhood has too few sales.
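The F statistic can be recomputed from the quantities printed in the ANOVA table. The sketch below is a Python illustration rather than the code behind the report. It uses the rounded values from the table, so the result agrees with the printed statistic only up to rounding. The function name `partial_f` is an assumption.

```python
# Values copied from the ANOVA table above (rounded as printed).
rss_full, df_full = 2.3277e12, 2868
df_reduced = 2894
sum_of_sq = 9.5617e11  # reduction in RSS from adding the interaction terms


def partial_f(sum_sq_diff, df_diff, rss_full, df_full):
    """Partial F statistic: (drop in RSS per added term) / (full-model MSE)."""
    return (sum_sq_diff / df_diff) / (rss_full / df_full)


df_diff = df_reduced - df_full  # numerator degrees of freedom
f_stat = partial_f(sum_of_sq, df_diff, rss_full, df_full)
print(df_diff, round(f_stat, 2))
```

The numerator measures how much residual variation each added interaction term removes on average, and the denominator is the residual variance of the full model.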

Question 3

So far, we have examined features that are specific to each house, such as its size and which neighborhood it is located in. In this research question, we take a look at the housing market of Ames as a whole through time series plots. We are mainly concerned with how volume and price of sales have changed over the time period of the dataset (2006-2010). Firstly, we observe how the sales volume has changed over time.

The plot above shows the number of sales per month and its exponential moving average (EMA), represented by the orange and blue lines, respectively. This plot gives a clear indication of seasonal trends in house sales: the number of sales per month peaks in the middle of each year and drops drastically in between. Next, we look at how prices have changed over time.

The average monthly sale price time series is generated using the same method as the first plot, with the orange and blue lines representing the raw series and the EMA, respectively. Because the dataset records only the month and year of each sale rather than the exact date, we averaged the sale prices within each month and plotted these averages over the given time period. The EMA was calculated with the weight \(\alpha = \frac{2}{12+1}\), which corresponds to a 12-period span for monthly data. From the plot we can see that average sale prices in Ames decline slowly over the period, although there are fluctuations. The EMA line smooths these fluctuations, revealing the underlying trend without the short-term volatility. However, it should be noted that the average sale price peaks at regular intervals, and these peaks appear to correspond to the peaks in the sales volume time series. To better understand these trends, we take a look at the seasonal decomposition of the average monthly sale price and compare it to the sales volume time series.
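The monthly averaging and the EMA described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report’s code: the sales records are hypothetical, the function names are made up for this example, and seeding the EMA with the first observation is an assumption, since the report does not say how the series was initialized.

```python
from collections import defaultdict


def monthly_means(sales):
    """Average sale price per (year, month).

    `sales` is a list of (year, month, price) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for year, month, price in sales:
        totals[(year, month)][0] += price
        totals[(year, month)][1] += 1
    return [(key, total / count) for key, (total, count) in sorted(totals.items())]


def ema(values, span=12):
    """Exponential moving average with weight alpha = 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    smoothed = [values[0]]  # assumption: seed with the first observation
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical sales records, not taken from the dataset.
sales = [
    (2006, 1, 150000),
    (2006, 1, 170000),
    (2006, 2, 180000),
    (2006, 3, 200000),
]
means = [mean for _, mean in monthly_means(sales)]
smoothed = ema(means)
print(means)
print(smoothed)
```

With a span of 12, each new month receives a weight of 2/13 (about 0.15), so the smoothed line reacts slowly to any single month.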

The “observed” plot is the same as the raw time series, and the “trend” plot shows the declining average prices that we noted in the average sale price time series. The most defining feature of the “seasonal” plot is the set of peaks that we have mentioned, which closely match the peaks in the sales volume time series. One possible explanation is that people prefer to buy houses and move in during warmer weather, so the increase in demand may push prices higher in the summer. Lastly, there aren’t any clear patterns in the “irregular” plot other than the peaks, and it looks much like the raw time series.

This suggests that timing the market may lead to prices that are favorable to either the buyer or the seller. More specifically, buyers may be able to buy houses at a lower price in the winter, and sellers may be able to sell for higher prices during the summer, at least in Ames.

Question 4

One last feature we would like to study is the age of the house. Generally, older houses tend to be less appealing because of the wear they have accumulated, which may lead to lower prices. However, homeowners often renovate or remodel their houses before listing them on the market to attract more buyers and ultimately increase the value of their homes. To answer this research question, we first look at how the age of a house is related to its sale price through a scatter plot.

The graph above plots the year in which each house was built on the horizontal axis and its sale price on the vertical axis, with different colors indicating whether the house was remodeled. The dataset contains a quantitative variable named Year.Remod.Add, which represents the remodel year of the house. If there was no remodeling or addition, then Year.Remod.Add is the same as the construction year, Year.Built. Using these two variables, we created a new categorical variable called remod with two values: “Yes” if Year.Remod.Add \(\neq\) Year.Built, and “No” otherwise.
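The rule for the remod variable can be written as a one-line comparison. The sketch below is a Python illustration with hypothetical rows. The helper name `remod_flag` is an assumption, and only the column names Year.Built and Year.Remod.Add come from the dataset.

```python
def remod_flag(year_built, year_remod_add):
    """Return "Yes" if the remodel year differs from the construction year."""
    return "Yes" if year_remod_add != year_built else "No"


# Hypothetical rows, not taken from the dataset.
houses = [
    {"Year.Built": 1940, "Year.Remod.Add": 1995},
    {"Year.Built": 2004, "Year.Remod.Add": 2004},
]
for house in houses:
    house["remod"] = remod_flag(house["Year.Built"], house["Year.Remod.Add"])

print([house["remod"] for house in houses])
```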

The scatter plot shows that newer houses are generally associated with higher prices, but this relationship is far from deterministic, as we can see some older houses with relatively high prices and vice versa. The most interesting observation from this plot is that houses built before 1950 are all remodeled. However, these remodeled “old” houses still seem to sell for lower prices. On the other hand, houses built after 1950, whether remodeled or not, seem to have a very similar relationship between year built and sale price. This may indicate that for houses that are not too “old”, remodeling may not have a drastic impact on sale price.

The best way to formally test this theory would be to compare the price before and after remodeling the house. However, the dataset only contained the final sale price of the home, and thus we do not have information regarding pre-renovation value.

Therefore, we decided to simply use the final sale prices instead. We conducted a two-sample t-test to see if there is a significant difference between the average sale prices of remodeled and non-remodeled houses built after 1950. We first extract the houses built after 1950 and subset them based on whether they were remodeled or not. Then we test the null hypothesis \(H_0: \mu_1 = \mu_2\) and the alternative hypothesis \(H_a: \mu_1 \neq \mu_2\), where \(\mu_1\) and \(\mu_2\) each represent the average sale price of remodeled and non-remodeled homes built after 1950, respectively. We use the Welch t-test as we do not know if the two groups have equal variances.

## 
##  Welch Two Sample t-test
## 
## data:  remod and no_remod
## t = 9.1247, df = 1238.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  26883.75 41610.48
## sample estimates:
## mean of x mean of y 
##  218559.9  184312.8

As shown above, the test statistic is t = 9.1247 with 1238.2 degrees of freedom. The p-value is less than 2.2e-16, far below \(\alpha=0.05\). Therefore, we reject the null hypothesis that the average sale prices of remodeled and non-remodeled houses built after 1950 are equal. In other words, there is a significant difference between the two groups: remodeled houses sold for about 34,000 dollars more on average (95% confidence interval: 26,884 to 41,610 dollars). Although this does not tell us whether remodeling itself raises a house’s price, we can at least say that remodeled houses are generally more expensive than non-remodeled houses, specifically among those built after 1950.
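For reference, the Welch statistic and its degrees of freedom can be computed from two samples as in the following Python sketch. The function name `welch_t` and the toy samples are assumptions for illustration and are unrelated to the housing data. The last lines only check that the printed means and confidence interval above are consistent with each other.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Toy samples, not taken from the dataset.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), round(df, 4))

# Consistency check on the printed output: the difference in sample means
# should sit at the midpoint of the 95 percent confidence interval.
diff = 218559.9 - 184312.8
ci_mid = (26883.75 + 41610.48) / 2
std_error = diff / 9.1247  # implied standard error of the difference
print(round(diff, 1), round(ci_mid, 1), round(std_error, 1))
```

Unlike the pooled two-sample t-test, this version estimates each group’s variance separately, which is why the degrees of freedom are generally not an integer.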

Conclusion

Our analysis of residential property sales in Ames, Iowa, from 2006 to 2010 has led to several conclusions. We determined that key factors influencing house prices include the size of living areas, basements, and garages, and the overall quality of materials and finishes. Geographic features, particularly neighborhood characteristics, also play a significant role in determining home sale prices. Seasonal trends were observed in both the volume and price of home sales, with sales and average prices peaking around the middle of each year. Finally, among houses built after 1950, remodeled homes sold for more on average than non-remodeled ones.

We do have some limitations, however. Firstly, it should be noted that we have skipped diagnostics for our linear regression models in Question 2.2. To ensure our hypothesis tests are accurate, we need to make sure that assumptions for linear regression hold, such as equal variance, zero conditional mean, and normality of residuals. These could be checked using residual plots and Q-Q plots, but we have omitted them for simplicity’s sake.

In addition, some aspects remain unexplored due to limitations in the data. For instance, the impact of remodeling on sales prices could not be conclusively determined as we only had access to post-renovation sale prices. Similarly, the exact nature of the relationship between house age and sale price requires further investigation to determine causation, which could be accomplished with more data or different statistical methods/modeling. Also, our report focused on analyses with mostly quantitative variables. There are many unexplored categorical variables, which may be helpful in predicting sale price as well.

Future research could look deeper into these unresolved questions. Examining pre- and post-renovation values of properties, conducting a study across different economic periods, or incorporating unused variables in the dataset into our analyses could provide more comprehensive insights into the housing market dynamics of Ames, Iowa.

36-315 Final Project

Analyzing the Housing Market in Ames, Iowa to Predict House Prices

Chong Lee, David Ng, Celine Park, Andrew Wang

12/9/2023