Description of Dataset
The sale of a house is determined by a variety of factors. The
dataset contains a set of property sales in Melbourne, Victoria,
Australia. There are 18396 rows and 21 columns in total,
including details such as the address, number of rooms, property type,
price, method of sale, and seller information, as well as information on
the property’s location and characteristics.
We are interested in exploring the weight of the various factors that
contribute to house prices from the seller’s perspective. In
addition, we hope to predict the price of a house using a regression
model. Such a model could be built into websites as a resource that
helps real-estate agents better understand the housing market.
We consider the following variables in the dataset (obtained from source
description):
- Suburb: Name of the suburb where the property is located
- Rooms: Number of rooms in the property (excluding bathrooms and other
non-living spaces)
- Price: Sale price of the property in Australian dollars (AUD)
- Type: Type of property (e.g., h = house, t = townhouse, u =
unit/apartment)
- Date: Date of the sale
- Distance: Distance from the property to Melbourne central business
district (CBD) in kilometers
- Regionname: Name of the region where the property is located (e.g.,
Eastern Metropolitan, Northern Metropolitan, Southern Metropolitan,
Western Metropolitan)
- Bathroom: Number of bathrooms in the property
- BuildingArea: Total building area of the property in square meters
- Latitude: Latitude of the property
- Longitude: Longitude of the property
Project Overview
Before conducting statistical analysis on how the different variables
are correlated with sale price, we make a guess about which predictor
variables are likely to be important. We hypothesize that
the Sale Price per Unit of Building Area of the
property will be influenced more by its location. On the other hand, the
Total Sale Price of the property will depend more on
the type of the property, its interior structure (number of bedrooms,
bathrooms, etc.), and its building area. With these preliminary
assumptions, we proceed with our analysis to check to what extent our
guesses reflect the actual situation in the housing market.
First, to validate our hypothesis on the sale price per unit
of building area (referred to as cost per unit in the rest of the
report), we consider the relationship between the cost per unit of
properties and the regions in which they are located. Second, we verify
our assumptions on the correlation between total sale price and other
relevant predictor variables by examining the relationship between
Price and specific features related to property type. Finally, we
choose the most salient variables and construct different
linear regression models to compare their performance, with the aim of
predicting property sale prices from the significant covariates
identified.
Question 1: Are there any differences between the prices of properties across the eight regions of Melbourne?
a) EDA
First, we perform EDA on the variables Price, Regionname, and Distance
to see the distribution of properties in different regions across the
Melbourne area.
As shown in the histogram, the distribution of properties by
distance is unimodal and skewed to the right. Most properties lie
around 3 to 15 km from the Melbourne CBD. In particular,
properties in Southern Metropolitan, Western Metropolitan, and Northern
Metropolitan appear most frequently in this distance range, while
properties in South-Eastern Metropolitan are completely absent from
it. This indicates that the regions are not at similar distances from
the center of Melbourne. To display the regional differences in price
due to distance more visibly, we map the data by their coordinates.
b) Map
This map clearly shows the difference in location between the regions.
Southern Metropolitan, Western Metropolitan, and Northern
Metropolitan are the closest to the CBD. Even with the point
transparency set to alpha = 0.1, the pink color indicating property
price is still clearly visible, which demonstrates the high
concentration and large number of properties in these regions. Moreover,
we can see a few blue data points in these regions, indicating
properties with considerably higher prices. On the other hand, Eastern
Victoria, Northern Victoria, and Western Victoria have very few data
points and lower color intensity, demonstrating the sparse distribution
of properties in areas far from the Melbourne CBD.
c) Statistical Tests
After an overview of the general distributions, we would like to
select a specific measure of property price and examine whether
it differs between the regions. To do so, we perform statistical tests
on the target variables Price and Regionname.
Since property prices are obviously affected by building area,
we would like to remove the effect of this variable. Thus, we create a
new variable: costperunit = Price / BuildingArea, measured in AUD per
square meter. Our target variables are now costperunit and Regionname.
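As a minimal sketch of this step, the new variable can be derived in base R as below. The data frame `sales` and its values are made up for illustration, and the handling of zero or missing building areas is an assumption rather than something the report specifies.

```r
# Toy data frame standing in for the real sales data (values are made up).
sales <- data.frame(
  Price        = c(900000, 1200000, 450000, 700000),
  BuildingArea = c(150, 200, 0, NA)
)

# Price per square meter. Assumption: rows with a zero or missing
# BuildingArea are set to NA instead of producing Inf or NaN.
valid_area <- !is.na(sales$BuildingArea) & sales$BuildingArea > 0
sales$costperunit <- ifelse(valid_area,
                            sales$Price / sales$BuildingArea,
                            NA_real_)

print(sales)
```

Rows with an undefined cost per unit would then need to be dropped or handled before the regional comparison.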
By limiting the x-axis to the range 0 to 2000, we are able to see the
density of costperunit across the different regions.
Given the large amount of data, the central limit theorem suggests that
the regional means of costperunit are approximately normally
distributed. The densities appear to have similar spread across
regions, but because this is only a visual impression, we check the
equal-variance assumption formally with Bartlett's test. Finally, each
property belongs to only one region, so the groups do not overlap. With
these considerations in place, we proceed to the statistical tests.
##
## Bartlett test of homogeneity of variances
##
## data: costperunit by Regionname
## Bartlett's K-squared = 4954.7, df = 7, p-value < 2.2e-16
Bartlett’s test gives a p-value below 2.2e-16, far smaller than
alpha = 0.05. Thus, we have sufficient evidence to reject the
null hypothesis of equal variances and conclude that the variance of
property price per square meter differs significantly between the
eight regions of Melbourne. Because the equal-variance assumption does
not hold, we compare the regional means with a one-way test that does
not assume equal variances.
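For readers who want to reproduce this kind of check, the following sketch runs Bartlett's test with base R's `bartlett.test()`. The data frame `toy`, its three region labels, and its values are synthetic assumptions, so the statistics will not match the output above.

```r
set.seed(1)

# Synthetic stand-in for the real data: three hypothetical regions with
# the same mean but clearly different spreads.
toy <- data.frame(
  Regionname  = factor(rep(c("A", "B", "C"), each = 200)),
  costperunit = c(rnorm(200, mean = 6000, sd = 500),
                  rnorm(200, mean = 6000, sd = 1500),
                  rnorm(200, mean = 6000, sd = 3000))
)

# Null hypothesis: all groups share the same variance.
bt <- bartlett.test(costperunit ~ Regionname, data = toy)
print(bt)
```

The degrees of freedom equal the number of groups minus one, which is why the report's output shows df = 7 for eight regions.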
##
## One-way analysis of means (not assuming equal variances)
##
## data: costperunit and Regionname
## F = 19.244, num df = 7.00, denom df = 245.45, p-value < 2.2e-16
The one-way ANOVA not assuming equal variances (Welch's ANOVA) gives a
p-value below 2.2e-16, far smaller than alpha = 0.05. Thus, we have
sufficient evidence to reject the null hypothesis of equal means and
conclude that the mean property price per square meter differs
significantly between the eight regions of Melbourne.
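The test above corresponds to base R's `oneway.test()` with `var.equal = FALSE`. The sketch below shows the call on synthetic data; the data frame `toy` and its values are assumptions for illustration only.

```r
set.seed(2)

# Synthetic stand-in: three hypothetical regions with different means
# and different variances.
toy <- data.frame(
  Regionname  = factor(rep(c("A", "B", "C"), each = 200)),
  costperunit = c(rnorm(200, mean = 5000, sd = 500),
                  rnorm(200, mean = 6000, sd = 1500),
                  rnorm(200, mean = 7000, sd = 3000))
)

# Welch's one-way test: compares group means without assuming that the
# groups share a common variance.
welch <- oneway.test(costperunit ~ Regionname, data = toy,
                     var.equal = FALSE)
print(welch)
```

The denominator degrees of freedom are estimated from the group variances, which is why they are usually fractional (245.45 in the output above).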
d) Conclusion
Therefore, we conclude that both the variance and the mean of
property price per square meter differ significantly across the eight
regions.
Question 2: What are some typical types of properties sold?
After discussing the price per square meter across regions,
we switch our focus to the total price of a property in relation to a
few carefully selected variables. Properties are categorized into low
($0-600k), middle ($600k-3m), and high ($3m-10m) price levels.
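One way to build these price levels is base R's `cut()`, sketched below on made-up prices. The report does not say which level a price exactly on a boundary belongs to, so assigning boundary values to the higher level (`right = FALSE`) is an assumption.

```r
# Made-up prices in AUD, including the two boundary values.
price <- c(350000, 600000, 1250000, 3000000, 4500000)

# Bins: [0, 600k), [600k, 3m), [3m, 10m]. Assumption: a price exactly on
# a boundary falls into the higher level.
level <- cut(price,
             breaks = c(0, 600000, 3000000, 10000000),
             labels = c("low", "middle", "high"),
             right = FALSE,
             include.lowest = TRUE)

print(data.frame(price, level))
```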
a) Mosaic Plot
We first draw mosaic plots of price level against the number of
rooms, the type of property (unit/apartment, house, or townhouse), and
the month sold.
Low-priced properties (below 600k AUD) are more
likely to be units/apartments, to have 1 to 2 rooms, and to have been
sold in July. Middle-priced properties (600k-3m AUD) commonly have
3-4 rooms and, as indicated by the blue shaded area, seem less likely
to have been sold in July than lower-priced ones. More
expensive properties, in contrast, are more likely to
be houses with 4 or more rooms, and there is a slight
tendency for properties in the high price level (3m-10m AUD) to have
been sold in November.
These observations are read only from the deviations from independence
shown in the mosaic plots. They are a good starting point; further
justification and refinements are presented in the following
paragraphs.
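To illustrate how such deviations from independence can be quantified, the sketch below computes Pearson residuals for a small contingency table. The counts are invented, and it is an assumption that the report's shading is residual-based, as it is in base R's `mosaicplot(..., shade = TRUE)`.

```r
# Hypothetical counts (made up): property type by price level.
tab <- matrix(
  c(30, 120, 20,   # h: low, middle, high
    20,  40,  5,   # t
    90,  35,  5),  # u
  nrow = 3, byrow = TRUE,
  dimnames = list(Type  = c("h", "t", "u"),
                  Level = c("low", "middle", "high"))
)

# Pearson residuals under independence: (observed - expected) / sqrt(expected).
indep <- chisq.test(tab)
print(round(indep$residuals, 2))

# Calling mosaicplot(tab, shade = TRUE) would draw the same table with
# tiles shaded according to these residuals.
```

Residuals beyond roughly plus or minus 2 mark cells that occur noticeably more or less often than independence would predict; in this toy table, low-priced units are over-represented.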
b) MDS Analysis
We then explore properties at different price levels using
multidimensional scaling (MDS). This helps to check whether the
previous conclusions hold.
In the MDS plot colored by number of rooms, only properties with fewer
than 2 rooms share very similar features. As the number of rooms
increases, the variation becomes more obvious, since the points scatter
further apart from each other. The color transition in MDS 2 points
to the same conclusion: lower-priced properties tend to share
more similarities and form a cluster, while the higher the price, the
more the features differ. This adds to the previous
conclusion, showing that the earlier description is more likely to hold
for low-priced properties, which are typically
units/apartments, have 1-2 rooms, and were sold in July. For
higher-priced properties, more factors contribute to the high price,
so the earlier summary (houses, 4 or more rooms, sold
in November) is less likely to describe the general case.
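The report does not state which features or distance measure went into the MDS. As a rough sketch under the assumption of classical MDS on standardized numeric features, base R's `dist()` and `cmdscale()` can be combined as follows; the feature table `feat` is synthetic.

```r
set.seed(3)

# Synthetic feature table (made up): rooms, bathrooms, distance to CBD in km.
feat <- data.frame(
  Rooms    = sample(1:5, 50, replace = TRUE),
  Bathroom = sample(1:3, 50, replace = TRUE),
  Distance = runif(50, min = 1, max = 30)
)

# Standardize the columns so that no single unit dominates, then compute
# pairwise Euclidean distances between properties.
d <- dist(scale(feat))

# Classical MDS: a 2-D configuration whose distances approximate d.
coords <- cmdscale(d, k = 2)
colnames(coords) <- c("MDS1", "MDS2")
print(head(coords))
```

Plotting `coords` with points colored by `feat$Rooms` or by price level would give plots of the kind discussed above.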
c) Conclusion
- Low-priced properties share many similarities: (i) they generally have
1-2 rooms, (ii) they are apartments/units, and (iii) more of them were
sold in July than in other months.
- Higher-priced properties, however, show greater variation in their
features.
- The number of rooms plays a significant role in the price of a
property, as shown in the mosaic plots and MDS: the more rooms, the
greater the likelihood of a higher price. This conclusion is not only
intuitive; its verification also supports the following investigation
using a regression model, where the number of rooms is an important
factor.
Question 3: How can we predict the sale price for any property in Melbourne?
a) More EDA on Price
To answer this question, we look at several different regression
models, both linear and non-linear, to predict the Price of
properties. Following our conclusions from the two previous questions,
we have identified Region Name and Number of
Rooms as important predictor variables for our response
variable Price. We also examined the relationship between
some quantitative variables and the response variable Price. Thus, we
need just one more step: another EDA to investigate the potential
correlation between Price and some qualitative variables. We choose
Type and Regionname here.
Plotting a faceted histogram of Price by the qualitative
variables Type (the type of the property) and Regionname shows that
Price is potentially correlated with the type of the
property: houses have a larger price range than
units/apartments and townhouses. In addition to the conclusion obtained
from Question 1, we can conclude from this graph that Region Name is
correlated with both the cost per unit and the total sale price of a
property.
b) Regression Model on Types of Property
To take a closer look at more regression models, we follow our
conclusion from the previous question, which identifies Rooms as the
most significant covariate for predicting Price. We then plot Price
against Number of Rooms, grouped by Type of Property and by
Regionname, as we have confirmed both as significant for our prediction.
First, we plot Price against Number of Rooms by Type of Property with a
linear regression line of best fit added, and display a summary of the
regression model alongside:
##
## Call:
## lm(formula = Price ~ Rooms * Type, data = sale_combined)
##
## Residuals:
##       Min       1Q   Median       3Q      Max
##  -1955293  -313907   -60491   190955  7839509
##
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)   239011      33615   7.110     0.00000000000128 ***
## Rooms         307160       9777  31.418 < 0.0000000000000002 ***
## Typet         -23237     100941  -0.230             0.817938
## Typeu         -63728      58680  -1.086             0.277506
## Rooms:Typet   -66435      34016  -1.953             0.050854 .
## Rooms:Typeu   -95906      25380  -3.779             0.000159 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 566600 on 6824 degrees of freedom
## Multiple R-squared: 0.2923, Adjusted R-squared: 0.2918
## F-statistic: 563.7 on 5 and 6824 DF, p-value: < 0.00000000000000022
Notice that the lines do not fit the data points very well, but the
regression lines for sale Price clearly differ across Types of
Property, which indicates that the interaction may be significant.
Checking this against the model summary, we notice that half of the
coefficients (both Type main effects and the Rooms:Typet interaction)
are not significant at the 0.05 level, and that the model explains only
about 29% of the variation in Price (R-squared = 0.2923), so it is a
poor regression model for predicting sale price. However, the
interaction between Number of Rooms and Type of Property is significant
for properties of type Unit/Apartment. This implies that we may use
this linear regression model to predict the sale price of
Unit/Apartment properties from Number of Rooms.
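To make the interaction terms concrete, the per-type slopes can be worked out directly from the coefficients in the summary above. The sketch below only copies those estimates; the variable names are ours, and the 2-room unit is an arbitrary example.

```r
# Estimates copied from the summary of lm(Price ~ Rooms * Type) above.
# House (h) is the baseline level, since only Typet and Typeu appear.
b0        <- 239011   # (Intercept)
b_rooms   <- 307160   # Rooms: price change per extra room for houses
b_u       <- -63728   # Typeu
b_rooms_t <- -66435   # Rooms:Typet
b_rooms_u <- -95906   # Rooms:Typeu

# Slope per extra room for each type = baseline slope + interaction term.
slopes <- c(h = b_rooms,
            t = b_rooms + b_rooms_t,
            u = b_rooms + b_rooms_u)
print(slopes)

# Fitted price of a 2-room unit/apartment under this model.
price_unit_2 <- (b0 + b_u) + 2 * slopes[["u"]]
print(price_unit_2)
```

Under this model each extra room adds about 307k AUD for a house, 241k AUD for a townhouse, and 211k AUD for a unit, and the fitted price of a 2-room unit is about 598k AUD.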
We then fit a non-linear regression model to the same data and compare
it with the linear regression model to see whether it produces a better
result:
This regression curve seems to fit the data points
better than the previous one, indicating that the
non-parametric approach, which uses a local regression model to predict
price from the number of rooms for each type of property, is
preferable.
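The report does not show how the non-linear fit was produced. As one possible sketch, assuming a local regression via base R's `loess()`, a linear and a local fit can be compared by their residual error. The data here are synthetic, with a continuous predictor `x` standing in for the real covariate.

```r
set.seed(4)

# Synthetic, deliberately curved relationship (not the real sales data).
x     <- runif(400, min = 1, max = 8)
price <- 200000 + 20000 * x^2 + rnorm(400, sd = 50000)
toy   <- data.frame(x, price)

fit_lm    <- lm(price ~ x, data = toy)      # straight line
fit_loess <- loess(price ~ x, data = toy)   # local regression, default span

# In-sample root mean squared error of each fit.
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
print(c(linear = rmse(fit_lm), loess = rmse(fit_loess)))
```

A flexible fit will generally look better in-sample, so comparing the two models on held-out data would be a fairer test than the visual or in-sample comparison alone.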
c) Regression Model on Region Name
Similarly, we plot Price against Number of Rooms again, but
this time group the data points by Regionname, with a linear regression
line of best fit added and a summary of the regression model displayed
alongside:
##
## Call:
## lm(formula = Price ~ Rooms * Regionname, data = sale_combined)
##
## Residuals:
##       Min       1Q   Median       3Q      Max
##  -2107018  -276212   -82151   194454  8127840
##
## Coefficients:
##                                            Estimate Std. Error t value             Pr(>|t|)
## (Intercept)                                  226283      73168   3.093              0.00199 **
## Rooms                                        262965      20850  12.612 < 0.0000000000000002 ***
## RegionnameEastern Victoria                    16726     523919   0.032              0.97453
## RegionnameNorthern Metropolitan               77647      81186   0.956              0.33890
## RegionnameNorthern Victoria                  243876     497858   0.490              0.62426
## RegionnameSouth-Eastern Metropolitan         144968     162107   0.894              0.37121
## RegionnameSouthern Metropolitan             -509813      79005  -6.453       0.000000000117 ***
## RegionnameWestern Metropolitan              -103526      85953  -1.204              0.22846
## RegionnameWestern Victoria                  -134747     670098  -0.201              0.84064
## Rooms:RegionnameEastern Victoria            -136432     148835  -0.917              0.35935
## Rooms:RegionnameNorthern Metropolitan        -55653      24136  -2.306              0.02115 *
## Rooms:RegionnameNorthern Victoria           -232285     139584  -1.664              0.09613 .
## Rooms:RegionnameSouth-Eastern Metropolitan   -95995      46307  -2.073              0.03821 *
## Rooms:RegionnameSouthern Metropolitan        313113      22932  13.654 < 0.0000000000000002 ***
## Rooms:RegionnameWestern Metropolitan         -18738      25184  -0.744              0.45688
## Rooms:RegionnameWestern Victoria            -173778     186268  -0.933              0.35088
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 488000 on 6814 degrees of freedom
## Multiple R-squared: 0.4759, Adjusted R-squared: 0.4748
## F-statistic: 412.5 on 15 and 6814 DF, p-value: < 0.00000000000000022
The graph shows that properties in different regions clearly have
different regression lines, indicating a potential interaction
between the covariates Number of Rooms and
Regionname. More specifically, the linear regression
line for properties in Southern Metropolitan (the blue line) appears to
fit the data points well. We double-check this against the
summary output for the corresponding regression model. The
coefficient for Regionname = Southern Metropolitan is significant, which
is consistent with our previous observation. We also notice that the
interactions between Number of Rooms and Regionname = Southern
Metropolitan, South-Eastern Metropolitan, and Northern Metropolitan are
significant, suggesting that this model is suited to predicting the
sale price of properties from the number of rooms in Southern
Metropolitan especially, and potentially also in
South-Eastern Metropolitan and Northern Metropolitan.
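As a worked example of how this model turns into a prediction, the sketch below combines the coefficients from the summary above for two regions. The helper `predict_price` is hypothetical, and treating Eastern Metropolitan as the baseline is an inference from it being the only region without its own coefficient.

```r
# Estimates copied from the summary of lm(Price ~ Rooms * Regionname) above.
b0      <- 226283   # (Intercept), baseline region
b_rooms <- 262965   # Rooms, baseline region

# Region-specific shifts; the baseline region has no extra terms.
shift <- list(
  "Eastern Metropolitan"  = c(intercept = 0,       slope = 0),
  "Southern Metropolitan" = c(intercept = -509813, slope = 313113)
)

# Hypothetical helper: fitted price for a given room count and region.
predict_price <- function(rooms, region) {
  s <- shift[[region]]
  (b0 + s[["intercept"]]) + (b_rooms + s[["slope"]]) * rooms
}

rooms <- 1:4
print(data.frame(
  rooms    = rooms,
  eastern  = predict_price(rooms, "Eastern Metropolitan"),
  southern = predict_price(rooms, "Southern Metropolitan")
))
```

The negative intercept shift for Southern Metropolitan is outweighed by its steeper slope from 2 rooms onward, so under this model the fitted price there is lower for 1-room properties but higher for larger ones.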
d) Conclusion
Overall, we examined different candidate regression models to predict
the Price of properties in Melbourne, using
Number of Rooms as the main predictor variable and taking into
account its interaction with the Type and
Regionname of the property. We can
then predict the price of a property in Melbourne from its Type or
Region Name using the most suitable regression model.
Main Takeaways & Future Directions
Main Takeaways
- Both the variance and the mean of property price per square meter
differ significantly across the eight regions.
- The number of rooms plays a significant role in the price of a
property.
- We can predict the Price of properties in Melbourne
using Number of Rooms as the main predictor variable,
choosing the most suitable regression model for each
Type of Property or Region Name.
Future Directions
This data frame contains only properties that were sold. With
data about unsold properties, we could gain more insight into
the factors that decide whether a sale succeeds. Additionally, variables
describing the quality of furnishing and neighborhood safety could help
us understand sales better.
Another future improvement for this project is to revisit Question
3 and improve the regression model employed, which would require more
nuanced statistical techniques and knowledge of modern regression
models. The regression model analyzed so far is relatively simple. A
more complex model that incorporates more covariates, or approaches
such as logistic regression, polynomial regression, and nonparametric
regression, could be adopted to improve the accuracy of the sale price
prediction.