Description of Dataset

The sale price of a house is determined by a variety of factors. The dataset contains a set of property sales across Melbourne, Victoria, Australia. There are 18,396 rows and 21 columns in total, including details such as the address, number of rooms, property type, price, method of sale, and seller information, as well as information on the property’s location and characteristics.

We are interested in exploring the weight of the various factors that contribute to house prices from the seller’s perspective. In addition, we hope to predict the price of a house using a regression model; such a model could be built into websites as a resource that helps real-estate agents better understand the housing market.

We consider the following variables in the dataset (obtained from source description):

Suburb: Name of the suburb where the property is located

Rooms: Number of rooms in the property (excluding bathrooms and other non-living spaces)

Price: Sale price of the property in Australian dollars (AUD)

Type: Type of property (e.g., h = house, t = townhouse, u = unit/apartment)

Date: Date of the sale

Distance: Distance from the property to Melbourne central business district (CBD) in kilometers

Regionname: Name of the region where the property is located (e.g., Eastern Metropolitan, Northern Metropolitan, Southern Metropolitan, Western Metropolitan)

Bathroom: Number of bathrooms in the property

BuildingArea: Total building area of the property in square meters

Latitude: Latitude of the property

Longitude: Longitude of the property

Project Overview

Before conducting any statistical analysis of how the variables relate to sale price, we make a guess about which predictor variables are likely to be important. We hypothesize that the sale price per unit of building area is influenced mainly by the property's location. The total sale price, on the other hand, should depend more on the type of property, its interior structure (number of bedrooms, bathrooms, etc.), and its building area. With these preliminary assumptions, we proceed with our analysis to check to what extent our guesses reflect the actual situation in the housing market.

First, to validate our hypothesis about the sale price per unit of building area (referred to as cost per unit in the rest of the report), we consider the relationship between the cost per unit of properties and the regions in which they are located. Second, we verify our assumptions about total sale price by examining the relationship between Price and specific features related to the type of property. Finally, we choose the most salient variables and construct several linear regression models, comparing their performance in order to best predict sale prices from the significant covariates identified.

Question 1: Are there any differences between the prices of properties across the eight regions of Melbourne?

a) EDA

First, we perform exploratory data analysis (EDA) on the variables Price, Regionname, and Distance to see how properties are distributed across the regions of Melbourne.


As the histogram shows, the distribution of properties by distance is unimodal and skewed to the right. Most properties lie around 3 to 15 km from the Melbourne CBD. In particular, properties in Southern Metropolitan, Western Metropolitan, and Northern Metropolitan appear most frequently in this distance range, while properties in South-Eastern Metropolitan are completely absent from it. This indicates that the regions are not at similar distances from the center of Melbourne. To display the regional differences in price more clearly, we map the data by coordinates.

b) Map


This graph clearly shows how the regions differ in location. Southern Metropolitan, Western Metropolitan, and Northern Metropolitan are the closest to the CBD. Even at a transparency (alpha) of 0.1, the pink color indicating property price is still clearly visible, which demonstrates the high concentration and large number of properties in these regions. Moreover, we can see a few blue data points in these regions, indicating properties with considerably higher prices. In contrast, Eastern Victoria, Northern Victoria, and Western Victoria have very few data points, shown with lower color intensity, which demonstrates the sparse distribution of properties in areas far from the Melbourne CBD.

c) Statistical Tests

After this overview of the general distributions, we select a specific measure of property price and examine whether it differs between the regions. To do so, we perform statistical tests on the target variables Price and Regionname.

Since property prices are clearly affected by building area, we would like to remove the effect of this variable. Thus, we create a new variable: costperunit = Price / BuildingArea, measured in AUD per square meter. Our target variables are now costperunit and Regionname.
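The derived variable can be sketched as follows. This is a minimal Python illustration, not the code used for the report (whose output appears to come from R). The sample rows are made up, and dropping rows with a missing or non-positive Price or BuildingArea is an assumption, since the report does not say how such rows were handled.

```python
# Hypothetical sample rows; the values are invented for illustration only.
sales = [
    {"Regionname": "Northern Metropolitan", "Price": 1_035_000, "BuildingArea": 79.0},
    {"Regionname": "Southern Metropolitan", "Price": 1_600_000, "BuildingArea": 160.0},
    {"Regionname": "Western Metropolitan", "Price": 640_000, "BuildingArea": None},
    {"Regionname": "Eastern Metropolitan", "Price": 900_000, "BuildingArea": 0.0},
]


def add_cost_per_unit(rows):
    """Return copies of the rows with costperunit = Price / BuildingArea.

    The unit is AUD per square meter. Rows with a missing or non-positive
    Price or BuildingArea are dropped (an assumption), because the ratio
    is undefined or meaningless for them.
    """
    result = []
    for row in rows:
        price = row.get("Price")
        area = row.get("BuildingArea")
        if price is None or area is None or price <= 0 or area <= 0:
            continue
        result.append({**row, "costperunit": price / area})
    return result


with_cost = add_cost_per_unit(sales)
for row in with_cost:
    print(row["Regionname"], round(row["costperunit"], 2))
```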


By limiting the x-axis to the range 0 to 2000, we can see the density of costperunit across the different regions.

Given the large sample size, the sampling distribution of the regional means of costperunit is approximately Normal by the central limit theorem. The density curves appear to have broadly similar spread across regions, but we check the equal-variance assumption formally with Bartlett’s test. Finally, each property belongs to only one region, so the groups are independent. We therefore proceed to the statistical tests.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  costperunit by Regionname
## Bartlett's K-squared = 4954.7, df = 7, p-value < 2.2e-16


Bartlett’s test yields a p-value below 2.2e-16, smaller than alpha = 0.05. Thus, we have sufficient evidence to reject the null hypothesis of equal variances and conclude that the variance of property price per square meter differs significantly between the eight regions of Melbourne. Because the equal-variance assumption does not hold, we compare the regional means with a one-way test that does not assume equal variances.
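For readers who want to see what the K-squared statistic measures, here is a minimal sketch of Bartlett's statistic in standard-library Python. It is an illustration under stated assumptions, not the report's R call: the function name `bartlett_statistic` and the toy groups are hypothetical, and the p-value (obtained by comparing the statistic with a chi-squared distribution on k - 1 degrees of freedom) is not computed here.

```python
import math
from statistics import variance


def bartlett_statistic(groups):
    """Return (K-squared, degrees of freedom) for Bartlett's test.

    `groups` is a list of samples, one per group. Each sample needs at
    least two values and a nonzero sample variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]

    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)

    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (
        sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)
    ) / (3 * (k - 1))
    return numerator / correction, k - 1


# Toy data: the second group is far more spread out than the first.
stat, df = bartlett_statistic([[1, 2, 3], [0, 10, 20]])
print(round(stat, 3), df)
```

The statistic is zero when all group variances are equal and grows as they diverge, which is why a value as large as the one reported leads to rejecting equal variances.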

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  costperunit and Regionname
## F = 19.244, num df = 7.00, denom df = 245.45, p-value < 2.2e-16


The one-way ANOVA that does not assume equal variances yields a p-value below 2.2e-16, smaller than alpha = 0.05. Thus, we have sufficient evidence to reject the null hypothesis of equal means and conclude that the mean property price per square meter differs significantly between the eight regions of Melbourne.
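The F statistic and the fractional denominator degrees of freedom in the output above follow Welch's formulation of one-way ANOVA. The sketch below shows the computation in standard-library Python; the function name `welch_anova` and the toy data are hypothetical, and the p-value (from an F distribution with the returned degrees of freedom) is left out.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, numerator df, denominator df) for Welch's one-way ANOVA.

    Each group is weighted by n / s^2, so groups with noisy or small
    samples contribute less to the comparison of means.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)

    # Variance-weighted grand mean.
    grand = sum(w * m for w, m in zip(weights, means)) / total_w

    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    between = sum(w * (m - grand) ** 2 for w, m in zip(weights, means)) / (k - 1)

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df_denominator = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df_denominator


# Toy data with two groups: here F equals the square of Welch's t statistic.
f_stat, df1, df2 = welch_anova([[1, 2, 3], [4, 6, 8]])
print(round(f_stat, 3), df1, round(df2, 3))
```

With eight regions the numerator degrees of freedom are 7, matching the output above; the denominator value of 245.45 is not an integer because it is estimated from the group variances.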

d) Conclusion

Therefore, we conclude that both the variance and the mean of property price per square meter differ significantly across the eight regions.

Question 2: What are some typical types of properties sold?

After discussing the price per square meter across regions, we switch our focus to the total price of a property in relation to a few carefully selected variables. Properties are categorized into low ($0-600k), middle ($600k-3m), and high ($3m-10m) price levels.
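The categorization can be sketched as a simple binning function. This is an illustrative Python sketch with made-up prices. How the boundary values are assigned is an assumption (a price of exactly 600k is treated as middle and exactly 3m as high), as is labelling any price above 10m as high.

```python
from collections import Counter

LOW_UPPER = 600_000       # upper bound of the low band (AUD)
MIDDLE_UPPER = 3_000_000  # upper bound of the middle band (AUD)


def price_level(price):
    """Map a sale price in AUD to 'low', 'middle', or 'high'.

    Assumption: lower bounds are inclusive, so exactly 600k is 'middle'
    and exactly 3m is 'high'.
    """
    if price < 0:
        raise ValueError("price must be non-negative")
    if price < LOW_UPPER:
        return "low"
    if price < MIDDLE_UPPER:
        return "middle"
    return "high"


# Hypothetical prices, for illustration only.
prices = [450_000, 600_000, 1_250_000, 3_500_000, 580_000]
levels = [price_level(p) for p in prices]
counts = Counter(levels)
print(counts)
```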

a) Mosaic Plot

We first graph mosaic plots of price levels across the number of rooms, types of properties (unit/apartment, house, or townhouse), and month sold.


Low-priced properties (below 600k AUD) are more likely to be units/apartments, to have 1 to 2 rooms, and to have been sold in July. Middle-priced properties (600k-3m AUD) commonly have 3 to 4 rooms and, as the blue shaded area indicates, seem less likely than lower-priced ones to have been sold in July. High-priced properties, in contrast, are more likely to be houses with 4 or more rooms, and there is a slight tendency for the most expensive properties to have been sold in November.

These observations are read only from the departures from independence shown in the mosaic plots. They are a good starting point; further justification and refinement follow in the paragraphs below.

b) MDS Analysis

We then explore properties of different price levels using multidimensional scaling (MDS). This helps to check whether the previous conclusions are accurate.


In the MDS plot colored by number of rooms, only properties with fewer than 2 rooms share very similar features. As the number of rooms increases, the variation becomes more obvious, since the points are scattered further apart. The color transition in the second MDS plot leads to the same conclusion: lower-priced properties tend to share more similarities and form a cluster, while higher-priced properties have increasingly varied features. This refines the previous conclusion. The profile described earlier is most reliable for low-priced properties, which are typically units/apartments, have 1-2 rooms, and were sold in July. For higher-priced properties, more factors contribute to the price, so the earlier summary (houses, 4 or more rooms, sold in November) is less likely to hold in general.

c) Conclusion

  • Low-priced properties share many similarities: (i) they generally have 1-2 rooms, (ii) they are apartments/units, and (iii) more of them were sold in July than in other months.
  • Higher-priced properties, however, show greater variation in their features.
  • The number of rooms plays a significant role in property prices, as shown in the mosaic plots and MDS. The more rooms a property has, the more likely it is to have a higher price. This conclusion is not only intuitive; its verification also supports the following investigation using a regression model, where the number of rooms is an important factor.

Question 3: How can we predict the sale price for any property in Melbourne?

a) More EDA on Price

To answer this question, we look at several regression models, both linear and non-linear, for predicting the Price of properties. Following our conclusions from the two previous questions, we have identified Regionname and Rooms as important predictor variables for our response variable Price. We have also examined the relationship between some quantitative variables and Price. Thus, we need just one more step: another EDA to investigate the potential association between Price and some qualitative variables. We choose Type and Regionname here.


Plotting a faceted histogram of Price against the qualitative variables Type (the type of property) and Regionname shows that Price is potentially related to the type of property: houses have a larger price range than units/apartments, which in turn have a larger range than townhouses. Together with the conclusion from Question 1, the graph suggests that Regionname is related to both the cost per unit and the total sale price of a property.

b) Regression Model on Types of Property

To look more closely at regression models, we follow the conclusion from the previous question, which identified Rooms as the most significant covariate for predicting Price. We also consider plotting Price against Rooms grouped by Type and by Regionname, since we have confirmed that both are relevant to our prediction. First, we plot Price against Rooms by Type with a linear regression line of best fit added, and display a summary of the regression model alongside:

## 
## Call:
## lm(formula = Price ~ Rooms * Type, data = sale_combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1955293  -313907   -60491   190955  7839509 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   239011      33615   7.110     0.00000000000128 ***
## Rooms         307160       9777  31.418 < 0.0000000000000002 ***
## Typet         -23237     100941  -0.230             0.817938    
## Typeu         -63728      58680  -1.086             0.277506    
## Rooms:Typet   -66435      34016  -1.953             0.050854 .  
## Rooms:Typeu   -95906      25380  -3.779             0.000159 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566600 on 6824 degrees of freedom
## Multiple R-squared:  0.2923, Adjusted R-squared:  0.2918 
## F-statistic: 563.7 on 5 and 6824 DF,  p-value: < 0.00000000000000022


Notice that the lines do not fit the data points very well, but the regression lines clearly differ across property types, which indicates a potentially significant interaction. Checking this against the model summary, we see that three of the six coefficients are not significant at the 0.05 level and that the model explains only about 29% of the variance in Price (multiple R-squared = 0.2923), so this is a weak model for predicting sale price. However, the interaction between Rooms and Type is significant for units/apartments (p = 0.000159): the estimated price increase per additional room is about 95,906 AUD smaller for a unit/apartment than for a house, the baseline type. This implies that we may use this linear regression model to predict the sale price of units/apartments from the number of rooms.
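To make the interaction concrete, the sketch below turns the coefficient estimates from the summary above into predictions. It is a Python illustration only; the function name `predict_price` is hypothetical. House (h) is treated as the baseline type because it has no coefficient row of its own.

```python
# Estimates copied from the lm(Price ~ Rooms * Type) summary above.
COEF = {
    "(Intercept)": 239011,
    "Rooms": 307160,
    "Typet": -23237,
    "Typeu": -63728,
    "Rooms:Typet": -66435,
    "Rooms:Typeu": -95906,
}


def predict_price(rooms, prop_type):
    """Predicted sale price in AUD for a given room count and type.

    prop_type is 'h' (house, the baseline), 't' (townhouse), or
    'u' (unit/apartment).
    """
    if prop_type not in ("h", "t", "u"):
        raise ValueError("prop_type must be 'h', 't', or 'u'")
    price = COEF["(Intercept)"] + COEF["Rooms"] * rooms
    if prop_type != "h":
        # Non-baseline types shift both the intercept and the slope.
        price += COEF["Type" + prop_type]
        price += COEF["Rooms:Type" + prop_type] * rooms
    return price


house_3 = predict_price(3, "h")
unit_3 = predict_price(3, "u")
print(house_3, unit_3)  # 1160491 809045
```

The gap between the two predictions widens with each additional room, which is what the negative Rooms:Typeu estimate expresses. Given the residual standard error of about 566,600 AUD, individual predictions remain very uncertain.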

We then fit a non-linear regression model to the same data and compare it with the linear regression model to see whether it produces a better result:


This regression curve appears to fit the data points better than the previous lines, indicating that the non-parametric regression approach, which predicts price from the number of rooms for each type of property, is preferable.

c) Regression Model on Region Name

Similarly, we plot Price against Rooms again, this time grouping the data points by Regionname, with a linear regression line of best fit added and a summary of the regression model displayed alongside:

## 
## Call:
## lm(formula = Price ~ Rooms * Regionname, data = sale_combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2107018  -276212   -82151   194454  8127840 
## 
## Coefficients:
##                                            Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                                  226283      73168   3.093              0.00199 ** 
## Rooms                                        262965      20850  12.612 < 0.0000000000000002 ***
## RegionnameEastern Victoria                    16726     523919   0.032              0.97453    
## RegionnameNorthern Metropolitan               77647      81186   0.956              0.33890    
## RegionnameNorthern Victoria                  243876     497858   0.490              0.62426    
## RegionnameSouth-Eastern Metropolitan         144968     162107   0.894              0.37121    
## RegionnameSouthern Metropolitan             -509813      79005  -6.453       0.000000000117 ***
## RegionnameWestern Metropolitan              -103526      85953  -1.204              0.22846    
## RegionnameWestern Victoria                  -134747     670098  -0.201              0.84064    
## Rooms:RegionnameEastern Victoria            -136432     148835  -0.917              0.35935    
## Rooms:RegionnameNorthern Metropolitan        -55653      24136  -2.306              0.02115 *  
## Rooms:RegionnameNorthern Victoria           -232285     139584  -1.664              0.09613 .  
## Rooms:RegionnameSouth-Eastern Metropolitan   -95995      46307  -2.073              0.03821 *  
## Rooms:RegionnameSouthern Metropolitan        313113      22932  13.654 < 0.0000000000000002 ***
## Rooms:RegionnameWestern Metropolitan         -18738      25184  -0.744              0.45688    
## Rooms:RegionnameWestern Victoria            -173778     186268  -0.933              0.35088    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 488000 on 6814 degrees of freedom
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4748 
## F-statistic: 412.5 on 15 and 6814 DF,  p-value: < 0.00000000000000022


The graph shows that the regression lines differ across regions, indicating a potential interaction between the covariates Rooms and Regionname. More specifically, the linear regression line for properties in Southern Metropolitan (the blue line) appears to fit the data points well. We check this against the summary output of the corresponding regression model. The coefficient for Regionname = Southern Metropolitan is significant, which is consistent with this observation. The interactions between Rooms and Regionname = Southern Metropolitan, South-Eastern Metropolitan, and Northern Metropolitan are also significant at the 0.05 level, suggesting that the model is useful for predicting the sale price of properties in Southern Metropolitan especially (and potentially also in South-Eastern Metropolitan and Northern Metropolitan) from the number of rooms. Overall, this model explains about 48% of the variance in Price (multiple R-squared = 0.4759), compared with about 29% for the model based on Type.
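The region-specific regression lines implied by the summary above can be recovered by adding each region's offsets to the baseline intercept and slope. The Python sketch below does this; the names `region_line` and `predict_price_by_region` are hypothetical. Eastern Metropolitan is treated as the baseline because it is the one region without its own coefficient rows.

```python
# Estimates copied from the lm(Price ~ Rooms * Regionname) summary above.
BASE_INTERCEPT = 226283  # (Intercept)
BASE_SLOPE = 262965      # Rooms

# region -> (intercept offset, slope offset), relative to the baseline.
OFFSETS = {
    "Eastern Metropolitan": (0, 0),  # baseline region
    "Eastern Victoria": (16726, -136432),
    "Northern Metropolitan": (77647, -55653),
    "Northern Victoria": (243876, -232285),
    "South-Eastern Metropolitan": (144968, -95995),
    "Southern Metropolitan": (-509813, 313113),
    "Western Metropolitan": (-103526, -18738),
    "Western Victoria": (-134747, -173778),
}


def region_line(region):
    """Return (intercept, slope per room) of the fitted line for a region."""
    d_intercept, d_slope = OFFSETS[region]  # KeyError for unknown regions
    return BASE_INTERCEPT + d_intercept, BASE_SLOPE + d_slope


def predict_price_by_region(rooms, region):
    """Predicted sale price in AUD from room count and region."""
    intercept, slope = region_line(region)
    return intercept + slope * rooms


for name in OFFSETS:
    print(name, region_line(name))
```

For Southern Metropolitan this gives an estimated increase of 576,078 AUD per additional room, more than twice the baseline slope, which is why its line stands out in the plot.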

d) Conclusion

Overall, we examined several candidate regression models for predicting the Price of properties in Melbourne, using Rooms as the main predictor variable and taking its interaction with Type and Regionname into account. We can then predict the price of a property in Melbourne from its Type or Regionname using the most suitable regression model.

Main Takeaways & Future Directions

Main Takeaways

  • Both the variance and the mean of property price per square meter differ significantly across the eight regions.
  • The number of rooms plays a significant role in property prices.
  • We can predict the Price of properties in Melbourne using Rooms as the main predictor variable, choosing the most suitable regression model for each Type of property or Regionname.

Future Directions

This data frame only contains properties that were sold. With data about unsold properties, we could gain more insight into the factors that determine a successful sale. Additionally, variables describing the quality of furnishing and neighborhood safety could help us understand sales better.

Another future improvement for this project is to revisit Question 3 and improve the regression model, which would require more nuanced statistical techniques and knowledge of modern regression methods. The regression models analyzed so far are relatively simple. More complex models that incorporate more covariates, or approaches such as logistic regression, polynomial regression, and nonparametric regression, could be adopted to improve the accuracy of sale price predictions.