Introduction

This project is based on a dataset describing the Moscow real estate market. The recent Russian-Ukrainian conflict has had detrimental effects on Russia’s economy, and this dataset offers potentially interesting data on Russia’s largest real estate market. We have three main research questions. First, we would like to understand how ease of access to metro stations affects the price of a property, by exploring the conditional distribution of price given minutes as well as a regression of price on minutes. Access to public transportation is often a key consideration for potential home buyers, and we would like to see the extent of that effect on property prices in Moscow. Second, we would like to explore which kinds of listings receive the most interest from potential buyers, and whether that interest correlates with higher prices. Which characteristics (such as larger area, ease of access to the metro, or the provider of the listing) make a property most attractive to buyers? Lastly, we would like to explore the relationship between the price of a property and property-specific characteristics such as number of rooms, floor number, high-rise versus low-rise building, kitchen area, living area, and total area, and to what extent these characteristics affect price. This tells us about the preferences of property buyers in Moscow and which characteristics make properties sell for more.

This dataset has 137418 entries, each representing a different property, with 13 variables. The main response variable is Price (in rubles) of the property at the time of sale. The other 12 variables are: X (the index number of the property), Metro (a categorical variable describing whether or not the property is within the Metro area), Minutes (how far away, in minutes, the nearest Metro station is), Way (a categorical variable describing how to get to the closest Metro station), Provider (a categorical variable describing who sold the property), Views (the number of views the advertisement for the property received), Storey (the storey on which the property is located), Storeys (the number of storeys in that block of flats), Rooms (the number of rooms the property has), Total Area (the area of the property in square meters), Living Area (the area of the property that is dedicated to living, e.g. living room and bedrooms) and Kitchen Area (the area of the property that is dedicated to the kitchen).

The heatmap above shows the correlations between each pair of quantitative variables in the Moscow dataset. As we can see, Price and Rooms are highly correlated, as are Price and Total/Living Area (which makes sense, as larger properties tend to cost more). Views and Minutes are also highly correlated, which implies that people are more interested in properties closer to a Metro station. Views are also highly correlated with Rooms and Total Area, implying that people are interested in larger properties.
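As a minimal sketch of how a correlation heatmap of this kind can be produced in base R: the data frame `toy` below is simulated purely for illustration (only the column names follow the report), and the plotting choices are assumptions rather than the exact code behind the figure.

```r
# Toy stand-in for the Moscow data; column names follow the report,
# but every value is simulated for illustration only.
set.seed(1)
n <- 200
toy <- data.frame(
  total_area = runif(n, 20, 150),
  minutes    = sample(1:30, n, replace = TRUE)
)
toy$rooms <- pmax(1, round(toy$total_area / 30))
toy$price <- 2e5 * toy$total_area + rnorm(n, sd = 2e6)
toy$views <- rpois(n, lambda = 100)

# Keep only the quantitative columns and compute pairwise Pearson correlations.
num_cols <- sapply(toy, is.numeric)
cor_mat  <- cor(toy[, num_cols])

# Base-R heatmap of the correlation matrix, with no clustering or reordering.
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none")
```

Reading the sign of each entry in `cor_mat` alongside the colours helps separate strong positive correlations from strong negative ones, which a heatmap of magnitudes alone can hide.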

The mosaic plot above shows the distributions of the categorical variables of the Moscow dataset. For Metro, Outskirts makes up the majority of properties, with Metro being in a slight minority.

For properties in the Metro, more people walk than take their own transportation which makes sense as the Metro stations provide transport. Metro properties where people take transport to their closest Metro are mainly sold by developers and individual owners, with real estate agencies and realtors being the least common providers. Metro properties where people walk to their closest Metro are sold by developers and agencies, with owners and realtors being much less common.

For properties outside the Metro, more people take their own transportation than walk which makes sense as for most properties walking would be too far. Outskirt properties are mainly sold by agencies and developers, with owners and realtors being the least common providers.

Research Question 1

The first question we wanted to explore was whether or not there is a relationship between the ease of access to a metro station and the price of a listing. To explore this, we will look at how the variables metro and minutes relate to price. The variable metro is a binary variable that indicates whether or not a listing is close to a metro station. The variable minutes is a quantitative variable that represents the number of minutes it takes to get to the metro station from a listing. Note that we look at both of these variables because the dataset does not include minutes values for listings that the metro variable indicates to be far from a metro station.

We first remove price outliers from the dataset. We do this because these outliers would make the scales of our graphs much too large and, in turn, make it difficult to visualize trends in the majority of the data.
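The report does not state which rule was used to identify outliers. As one possible sketch, assuming the common 1.5 × IQR fence rule, the step could look like the following; the helper name `remove_price_outliers` and the simulated prices are hypothetical.

```r
# Hypothetical helper (not from the report): drop rows whose price lies
# outside the 1.5 * IQR fences. The choice of rule is an assumption.
remove_price_outliers <- function(data, col = "price", k = 1.5) {
  q     <- quantile(data[[col]], probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  keep  <- data[[col]] >= q[1] - fence & data[[col]] <= q[2] + fence
  data[keep & !is.na(keep), , drop = FALSE]
}

# Simulated prices with two extreme values, for illustration only.
set.seed(2)
toy <- data.frame(price = c(rlnorm(500, meanlog = 16, sdlog = 0.4), 5e8, 9e8))

trimmed <- remove_price_outliers(toy)
nrow(toy) - nrow(trimmed)  # number of listings dropped
```

Because the rule is applied only for visualization, it is worth reporting how many listings were dropped so readers can judge how much of the market the later plots cover.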

In this graph we look to examine the relationship between the minutes variable and the price variable. To do this we create a scatterplot with minutes on the x-axis and price on the y-axis and overlay a linear regression line.
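A minimal base-R sketch of a scatterplot with an overlaid regression line is shown here. The data are simulated and the plotting details (base graphics, point transparency) are assumptions; only the variable names `price` and `minutes` come from the report.

```r
# Simulated listings, for illustration only.
set.seed(3)
toy <- data.frame(minutes = sample(1:40, 300, replace = TRUE))
toy$price <- 1.5e7 - 3e4 * toy$minutes + rnorm(300, sd = 5e6)

# Simple linear regression of price on minutes.
fit <- lm(price ~ minutes, data = toy)

# Scatterplot with semi-transparent points, then the fitted line on top.
plot(toy$minutes, toy$price, pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "Minutes to nearest metro station", ylab = "Price (rubles)")
abline(fit, col = "red", lwd = 2)
```

Semi-transparent points are used because many listings share the same integer number of minutes, so opaque points would hide how dense each column is.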

This graph suggests that, for listings considered close to a metro station, as the number of minutes it takes to get to a metro station increases, the price of the listing decreases. It is important to note, however, that there is a lot of variance in price for listings that are closer to a metro station (i.e. where the metro station can be reached in 20 minutes or less). This means that there are many high-priced and many low-priced listings in this region of 20 minutes or less, but overall the majority of these listings have higher prices than listings that are further away. This negative association between listing price and minutes away from a metro station could be due to the convenience of being close to a station raising listing prices. Now we will test whether the linear relationship shown in the above graph is statistically significant.

## 
## Call:
## lm(formula = price ~ minutes, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -15333928  -6538928  -1972356   5264918  29128595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15438928      28596 539.902   <2e-16 ***
## minutes       -28728       3048  -9.425   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8885000 on 123387 degrees of freedom
## Multiple R-squared:  0.0007195,  Adjusted R-squared:  0.0007114 
## F-statistic: 88.84 on 1 and 123387 DF,  p-value: < 2.2e-16

From this output, the estimated price of a listing that is zero minutes away from a metro station is 15,438,928 rubles, and each additional minute of travel time to the metro station is associated with an estimated decrease in price of 28,728 rubles. Both the intercept and the slope estimates are statistically significant at the \(\alpha = .001\) level, so there is evidence of a negative linear relationship between minutes and price. Note, however, that the R-squared of about 0.0007 means that minutes alone explains less than 0.1% of the variation in price.
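As a worked illustration of the fitted line, using only the coefficient estimates reported in the output above (the 10-minute listing is a hypothetical example, not a row from the dataset):

\[
\widehat{\text{price}} = 15{,}438{,}928 - 28{,}728 \times \text{minutes}
\]

\[
\widehat{\text{price}}\,\Big|_{\text{minutes}=10} = 15{,}438{,}928 - 28{,}728 \times 10 = 15{,}151{,}648 \text{ rubles}
\]

A ten-minute difference in travel time therefore corresponds to an estimated difference of 287,280 rubles, which is under 2% of the intercept.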

In this graph we look to examine the relationship between the price variable and the metro variable (recall that the metro variable is a binary indicator of whether or not a listing is close to a metro station). Note that it is important that we examine this relationship in addition to the relationship between price and minutes because the dataset only records minutes values for listings that are considered close to a metro station. To examine this relationship, we first sort the prices into three equally sized bins and then create a mosaic plot comparing price and whether or not a listing is close to the metro.
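One way to sketch the binning and mosaic plot in base R is shown here. It assumes "equally sized" means bins with equal numbers of listings (terciles) rather than equal-width price ranges; the data are simulated for illustration.

```r
# Simulated listings, for illustration only.
set.seed(4)
n <- 600
toy <- data.frame(
  price = rlnorm(n, meanlog = 16, sdlog = 0.5),
  metro = factor(sample(c("Metro", "Outskirts"), n, replace = TRUE))
)

# Tercile cut points give three bins with (roughly) equal numbers of listings.
breaks <- quantile(toy$price, probs = c(0, 1/3, 2/3, 1))
toy$price_bin <- cut(toy$price, breaks = breaks,
                     labels = c("Low", "Medium", "High"),
                     include.lowest = TRUE)

# Cross-tabulate price bin against metro proximity and draw the mosaic plot.
tab <- table(price_bin = toy$price_bin, metro = toy$metro)
mosaicplot(tab, main = "Price bin by metro proximity", color = TRUE)
```

Setting `include.lowest = TRUE` matters here: without it, the cheapest listing falls outside the first interval and becomes `NA`.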

This graph suggests that the proportion of listings that are close to a metro station increases as we go from low price to medium price, but then decreases as we go from medium price to high price. An explanation for this could be that for low to average price listings, being closer to the metro station increases the price, as it provides convenience to low and middle class people. However, the people buying high-priced listings likely do not take the metro and travel by car instead, so being close to the metro station is less of a benefit for these listings. Another explanation for why the proportion of listings close to a metro station decreases for higher priced listings could be that higher priced listings are often too large to be located in city areas, so they are more likely to be far from a metro station.

Research Question 2

Research Question 2: Is there a relationship between views and information about the real estate? First we will fit a regression model of views on all other variables to see which predictors are statistically significant, and then a reduced model using only the predictors we expected to matter most.

## 
## Call:
## lm(formula = views ~ ., data = moscow)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -812.8  -23.0   -1.7   22.7 7295.2 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        2.352e+02  1.952e+00  120.495  < 2e-16 ***
## metroOutskirts    -1.180e+02  1.196e+00  -98.591  < 2e-16 ***
## price             -1.053e-08  4.035e-09   -2.610 0.009049 ** 
## minutes           -3.132e+00  5.514e-02  -56.795  < 2e-16 ***
## waywalk           -8.197e+01  8.521e-01  -96.200  < 2e-16 ***
## providerdeveloper  1.491e+01  6.238e-01   23.908  < 2e-16 ***
## providerowner      2.507e+02  1.243e+00  201.730  < 2e-16 ***
## providerrealtor   -1.237e+02  2.544e+00  -48.606  < 2e-16 ***
## storey            -6.575e-02  1.729e-02   -3.803 0.000143 ***
## storeys            6.908e-04  2.306e-04    2.995 0.002742 ** 
## rooms              6.695e+00  2.876e-01   23.279  < 2e-16 ***
## total_area         4.925e-01  2.406e-02   20.471  < 2e-16 ***
## living_area        1.044e+00  4.214e-02   24.774  < 2e-16 ***
## kitchen_area      -3.557e+00  2.606e-02 -136.522  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.3 on 137404 degrees of freedom
## Multiple R-squared:  0.5498, Adjusted R-squared:  0.5497 
## F-statistic: 1.291e+04 on 13 and 137404 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = views ~ price + provider + metro, data = moscow)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -916.5  -29.1   -1.8   20.5 7406.9 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.112e+01  7.566e-01  120.42   <2e-16 ***
## price              6.736e-08  4.100e-09   16.43   <2e-16 ***
## providerdeveloper -7.647e+00  6.730e-01  -11.36   <2e-16 ***
## providerowner      3.457e+02  1.266e+00  273.00   <2e-16 ***
## providerrealtor   -4.929e+01  2.670e+00  -18.46   <2e-16 ***
## metroOutskirts    -6.909e+01  7.034e-01  -98.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113.3 on 137412 degrees of freedom
## Multiple R-squared:  0.4363, Adjusted R-squared:  0.4363 
## F-statistic: 2.127e+04 on 5 and 137412 DF,  p-value: < 2.2e-16

It looks like every variable in the full regression model is significant. The adjusted R^2 is 0.55. When we created a model with the three variables that we thought were important, we got an adjusted R^2 of 0.44. In order to get a better understanding of the predictors, we made a PCA graph.
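The comparison between a full and a reduced model can be sketched as follows. The data are simulated, and the partial F-test via `anova()` is an addition for illustration; the report itself only compares adjusted R^2 values.

```r
# Simulated listings, for illustration only.
set.seed(7)
n <- 400
toy <- data.frame(
  price    = rlnorm(n, 16, 0.5),
  minutes  = sample(1:30, n, replace = TRUE),
  provider = factor(sample(c("agency", "developer", "owner", "realtor"),
                           n, replace = TRUE)),
  metro    = factor(sample(c("Metro", "Outskirts"), n, replace = TRUE))
)
toy$views <- 100 + 200 * (toy$provider == "owner") - 2 * toy$minutes +
  rnorm(n, sd = 50)

# Full model (all predictors) versus a reduced model with three predictors.
full    <- lm(views ~ ., data = toy)
reduced <- lm(views ~ price + provider + metro, data = toy)

# Adjusted R^2 penalizes extra predictors, so it is comparable across models.
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)

# Partial F-test: do the extra predictors in the full model add explanatory power?
anova(reduced, full)
```

Because the reduced model is nested inside the full one, the F-test gives a formal check that complements the informal comparison of adjusted R^2.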

In the PCA graph, the loading vectors for metroOutskirts, minutes, and providerowner point in the same direction, as do those for agency and transport. Thus PC1 appears to capture whether the real estate is local or not. On the vertical axis, price and living area point in the same direction, so PC2 appears to capture the size of the property. However, the first two principal components account for only around 40% of the variation in the data, which means that estimating the number of views could be difficult.
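A minimal sketch of a PCA on mixed numeric and categorical predictors is given below. It assumes the categorical variables were expanded into 0/1 indicator columns (suggested by names such as metroOutskirts and providerowner) and that the columns were standardized; both are assumptions, and the data are simulated.

```r
# Simulated listings, for illustration only.
set.seed(5)
n <- 300
toy <- data.frame(
  price      = rlnorm(n, 16, 0.5),
  minutes    = sample(1:30, n, replace = TRUE),
  total_area = runif(n, 20, 150),
  metro      = factor(sample(c("Metro", "Outskirts"), n, replace = TRUE)),
  provider   = factor(sample(c("agency", "developer", "owner", "realtor"),
                             n, replace = TRUE))
)

# Expand factors into 0/1 indicator columns and drop the intercept column.
X <- model.matrix(~ ., data = toy)[, -1]

# Standardize the columns so that price (in rubles) does not dominate.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)[1:2]  # cumulative share after PC1 and after PC2

# Biplot: points are listings, arrows are variable loadings.
biplot(pca, cex = 0.6)
```

The second element of `cumsum(var_explained)[1:2]` is the quantity to compare with the "around 40%" figure quoted above.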

On average, metro real estate seems to get more views, with some significant outliers at cheap prices. Otherwise, neither price nor metro status seems to have great predictive value for views. The number of views mostly resembles a uniform distribution across prices, with some outliers at lower prices. Metro seems to have a slightly higher mean than Outskirts does.

Research Question 3

The third question we wanted to explore was whether the price of a listing is related to characteristics of the apartment such as number of rooms, floor number, total number of floors in the building, kitchen area, living area, and total area. To explore this relationship we will look at the variables price, storey, storeys, rooms, total_area, living_area, and kitchen_area. The variable price is a quantitative variable that represents the price of the listing. The variable storey is a quantitative variable that represents the storey that the listing is on. The variable storeys is a quantitative variable that represents the total number of storeys in the building where the listing is located. The variable total_area is a quantitative variable that represents the total area of the listing. The variable living_area is a quantitative variable that represents the total living area of the listing. The variable kitchen_area is a quantitative variable that represents the total kitchen area of the listing.

First we will remove the price outliers from the dataset, because these outliers cause the scales of the graph to be too large and therefore make it difficult to visualize trends in the data.

In this graph we examine the relationship between price as the response variable and storey as the predictor variable. We created a scatterplot with price on the y-axis and storey on the x-axis, and colored the points by the number of rooms the listing has. We also overlaid a linear regression line (regressing price on storey).

This graph suggests that listings located on a higher storey have a higher price. We also note from the color of the points that listings with more rooms have higher prices: at lower prices the points are darker and more purple, which corresponds to a lower number of rooms, while at higher prices the points are lighter and more orange, which corresponds to a higher number of rooms. However, it is important to note that there is a lot of variance in price for the listings. These associations seem to make sense, as you would expect a listing on a higher storey or with more rooms to be more expensive and, on the other hand, a listing on a lower storey or with fewer rooms to be less expensive. Now we will test whether the linear relationship between price and storey shown in the graph above is statistically significant.

## 
## Call:
## lm(formula = price ~ storey, data = mosc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -34569386  -6600974  -2020716   5264542  28840186 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.522e+07  2.593e+04  586.91   <2e-16 ***
## storey      9.510e+03  5.766e+02   16.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8878000 on 123387 degrees of freedom
## Multiple R-squared:  0.0022, Adjusted R-squared:  0.002192 
## F-statistic: 272.1 on 1 and 123387 DF,  p-value: < 2.2e-16

From this output, the estimated price of a listing on storey 0 is 1.522e+07 rubles (about 15.2 million), and each additional storey is associated with an estimated increase in price of approximately 9.510e+03 rubles (about 9,510). Both the slope and the intercept are statistically significant, as both have p-values below 2e-16, which is effectively 0 and much less than 0.05.

Next we will create a new categorical variable, price_bin, that divides price into 3 bins labeled Low, Medium, and High.

We will then create an MDS plot to visualize the distances between listings, coloring each listing by its price bin (Low, Medium, or High) to see whether listings in the same bin lie close together and therefore have similar characteristics.

We are only using the first 2500 datapoints, because the dataset is too large at 137418 points and our computer is not fast enough to calculate the distance matrix. We are also using columns 10-16 which represent storey, storeys, rooms, total_area, living_area, and kitchen_area.

We are using Euclidean distance for the distance matrix and k = 2 dimensions for the MDS plot.
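The MDS step described above can be sketched with base R's `dist()` and `cmdscale()`. The data are simulated, only a few of the feature columns are included, and the colours (red for Low, green for Medium, blue for High) follow the description in the text; everything else is an assumption.

```r
# Simulated listings, for illustration only.
set.seed(6)
n <- 100
toy <- data.frame(
  storey     = sample(1:25, n, replace = TRUE),
  storeys    = sample(5:30, n, replace = TRUE),
  rooms      = sample(1:5, n, replace = TRUE),
  total_area = runif(n, 20, 150)
)
toy$price <- 2e5 * toy$total_area + rnorm(n, sd = 3e6)

# Three price bins with roughly equal counts (terciles).
toy$price_bin <- cut(toy$price,
                     breaks = quantile(toy$price, probs = c(0, 1/3, 2/3, 1)),
                     labels = c("Low", "Medium", "High"),
                     include.lowest = TRUE)

# Euclidean distances between listings on the feature columns only.
feats <- toy[, c("storey", "storeys", "rooms", "total_area")]
d     <- dist(feats, method = "euclidean")

# Classical (metric) MDS down to k = 2 dimensions.
mds <- cmdscale(d, k = 2)

cols <- c(Low = "red", Medium = "green", High = "blue")
plot(mds[, 1], mds[, 2], pch = 16, col = cols[as.character(toy$price_bin)],
     xlab = "MDS coordinate 1", ylab = "MDS coordinate 2")
legend("topright", legend = names(cols), col = cols, pch = 16)
```

With unscaled features, the column with the largest numeric range (here `total_area`) dominates the Euclidean distance; wrapping `feats` in `scale()` before `dist()` is one option if every characteristic should contribute equally.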

After looking at the MDS plot we are able to see that listings in the same price bin have similar characteristics. Medium priced listings, which are colored green, are clustered together from around X = -100 to X = 10 and from Y = -100 to Y = 25. Low priced listings, which are colored red, are clustered together from around X = -25 to X = 60 and from Y = -25 to Y = 70. High priced listings, which are colored blue, are clustered together from around X = -50 to X = 0 and from Y = -25 to Y = 40. From the MDS plot we conclude that listings with similar characteristics tend to be priced similarly, although the ranges above show that the three clusters overlap.

Conclusion

For our first research question, which asks if there is any relationship between listing price and the ease of access to the metro from the listing, we can draw a couple of conclusions. From the scatterplot and linear regression, we can see that for listings close to a metro station, the exact time it takes to get to the station has a statistically significant negative association with the price of the listing. We can also conclude from the mosaic plot that ease of access to a metro station is a more accurate predictor of price for low to medium priced listings than for high priced listings.

For our second research question, we found that a regression on the rest of the dataset explained about 55% of the variation in the number of views. The first principal component was based on the location of the real estate, for example whether it is in the metropolitan area or the outskirts. The second principal component was based on the size of the real estate, which is related to price. From a scatterplot we showed that whether the real estate was in a metro area or on the outskirts of Moscow was a decent predictor of views, but that, visually, price was not a very good predictor of views.

For our last research question, we found that there is a statistically significant positive relationship between the price variable and the storey variable. This intuitively makes sense, as you would expect a listing on a higher storey to be more expensive because it may have a better view, one that is not blocked by other buildings. We also observed a positive relationship between number of rooms and price, which would also make sense, as you would expect a listing with more rooms to have a higher price. We also saw from our MDS plot that listings with similar characteristics are priced similarly. Our MDS plot took into account storey, storeys, rooms, total_area, living_area, and kitchen_area, which are all characteristics of the listing.

Future Research

Our dataset represents a week’s worth of real estate data in Moscow, so additional weeks of data could be used to test the validity of our conclusions or to generate new insights into how prices and types of listings change with the seasons throughout the year. More data could also give further insight into the effect of Russia’s conflict with Ukraine, especially as the conflict is still developing. Our regressions are also limited because we did not examine regression diagnostics, and there could be omitted variable bias. Our dataset also does not include information that could be useful in predicting price, such as the age of the real estate, the crime rate in the area, the security of the apartment, and other variables. One question that was not explored in our report and could be explored in future research is whether external factors (such as the demographics or size of the neighborhood) affect the price of these listings.