class: center, middle, inverse, title-slide

# Supervised Learning
## Linear regression
### Ron Yurko
### 06/16/2020

---

## The setting

We have a collection of `\(p\)` variables for each of `\(n\)` objects `\(X_1,\dots,X_n\)`, and for observation `\(i\)` (e.g. a person):

`$$X_{i,1},\dots,X_{i,p} \sim P$$`

- `\(P\)` is a `\(p\)`-dimensional distribution that we might not know much about *a priori*

- pretend `\(p = 2\)`, variable 1 is height, variable 2 is weight

--

For each observation we have an additional measurement, the __response__ `\(Y_i \sim Q\)`

- where `\(Q\)` is a univariate distribution (e.g., normal distribution)

- e.g. how far a person has thrown a football in an experiment

--

In linear regression, we assume that `\(Y\)` is related to the variables `\(X\)` via the model

$$
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon
$$

- where `\(\epsilon\)` represents the scatter of the data around the regression line

- `\(\epsilon\)` is a __random variable__ assumed to be normally distributed with mean zero and constant variance `\(\sigma^2\)`

---

## Linear regression: why use it?

- It is __inflexible__, but readily __interpretable__

  - e.g. if a person's height changes by one unit, the distance that they threw a football changes by `\(\beta_1\)` units, on average

  - it is not necessarily the case that there is an *a priori* belief that the `\(X\)`'s and `\(Y\)` are exactly linearly related

--

- It is the __best linear unbiased estimator (BLUE)__, assuming `\(X\)` is fixed:

`$$\mathbb{E}[\hat{\beta} | X = x] = \beta$$`

--

- It is a __fast model to learn__

  - the `\(\beta\)`'s can be computed via an exact formula, as opposed to slower numerical optimization (see the sketch on the next slide)

  - the __Ordinary Least Squares (OLS)__ estimator in matrix notation (where `\(\boldsymbol{X}\)` is an `\(n \times p\)` matrix):

`$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$$`

  - it minimizes the __residual sum of squares (RSS)__:

`$$\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n \Big(Y_i - \sum_{j=1}^p \beta_j X_{ij}\Big)^2$$`
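---

## Linear regression: OLS "by hand"

To make the "exact formula" point concrete, here is a minimal, self-contained sketch computing the OLS estimate with matrix algebra and checking it against `lm()` (the toy data and the names `x1`, `x2`, `y_toy`, `X_mat` below are purely illustrative, not part of the example used later in these slides):

```r
# Simulate a small toy dataset (purely for illustration)
set.seed(101)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y_toy <- 2 + 3 * x1 - 1 * x2 + rnorm(n)

# Closed-form OLS: beta-hat = (X^T X)^{-1} X^T y
X_mat <- cbind(1, x1, x2)  # design matrix with an intercept column
solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% y_toy

# Same estimates via lm()
coef(lm(y_toy ~ x1 + x2))
```

The two sets of estimates should agree (up to printing precision), which is exactly why fitting a linear regression is so fast.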
---

## Linear regression: basic points to remember

1. It estimates a __conditional mean__:

$$
\mathbb{E}[Y \vert X_1,\cdots,X_p]
$$

i.e., the average value of `\(Y\)` *given* the values `\(X_1,\cdots,X_p\)`.

--

2. If the `\(X\)`'s are measured with uncertainty, then the estimates of the `\(\beta\)`'s become *biased*

  - i.e. they differ on average from the true values, and approach zero as the uncertainties increase

--

<img src="https://www.explainxkcd.com/wiki/images/9/91/linear_regression.png" width="40%" style="display: block; margin: auto;" />

---

## Linear regression: what's the R code?

- Create a fake variable `x` where the true relationship is `y = x`...

```r
x <- 1:10
```

--

- But we observe `y` with noise, so we need to simulate the values for `\(\epsilon\)`

--

- Can simulate data from different distributions easily

  - generate normal noise `\(\epsilon\)` with `rnorm()`

```r
set.seed(303)
*y <- x + rnorm(10, sd = 0.5)
```

--

- We fit linear regression models using `lm()`

  - the formula is input as: `response ~ predictor1 + predictor2 + `...

```r
*init_lm <- lm(y ~ x)
```

---

## Linear regression: display the model summary

Display the model output with `summary()`:

```r
summary(init_lm)
```

```
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59256 -0.25274 -0.00479  0.20039  0.69090 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01069    0.27092  -0.039    0.969    
## x            0.99612    0.04366  22.814 1.44e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3966 on 8 degrees of freedom
## Multiple R-squared:  0.9849,	Adjusted R-squared:  0.983 
## F-statistic: 520.5 on 1 and 8 DF,  p-value: 1.445e-08
```

---

## Linear regression: what's in the summary?

.pull-left[

```
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59256 -0.25274 -0.00479  0.20039  0.69090 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01069    0.27092  -0.039    0.969    
## x            0.99612    0.04366  22.814 1.44e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3966 on 8 degrees of freedom
## Multiple R-squared:  0.9849,	Adjusted R-squared:  0.983 
## F-statistic: 520.5 on 1 and 8 DF,  p-value: 1.445e-08
```

]

.pull-right[

- coefficient estimates, the `\(\beta\)`'s

  - __intercept__ `\(\beta_0 \approx\)` -0.011

  - coefficient for `\(x \approx\)` 0.996

- estimates of uncertainty for the `\(\beta\)`'s

  - __standard error__ for `\(x \approx\)` 0.044

- p-values (i.e. `Pr(>|t|)`)

  - __estimated probability__ of observing an estimate as extreme as 0.996 (i.e. 0.996 or larger, or -0.996 or smaller) *if the true coefficient were zero* is `\(1.44 \times 10^{-8}\)`

  - since this p-value is `\(<\)` the conventional threshold of 0.05, __we reject the null hypothesis that the coefficient is zero__

  - i.e., there is a __significant__ association between `\(x\)` and `\(y\)`

]

---

## Be careful!

__Caveats to keep in mind regarding p-values:__

--

If the true value of a coefficient is `\(\beta = 0\)`, then the p-value is sampled from a Uniform(0, 1) distribution

- i.e. it is just as likely to have the value 0.45 as 0.16 or 0.84

--

- `\(\Rightarrow\)` at the 0.05 threshold, there's a 5% chance that you'll conclude there's a significant association between `\(x\)` and `\(y\)` __even when there is none__ (see the simulation sketch on the next slide)

--

As the sample size `\(n\)` gets large, __standard errors go to zero__, and *all* predictors are eventually deemed significant

- `\(\Rightarrow\)` while the p-values might be informative, we will explore other approaches to determine which subset of predictors to include
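---

## Aside: p-values under the null

A quick simulation sketch of the first caveat (the sample size, number of replications, and seed below are arbitrary choices for illustration): when the true coefficient is zero, the p-values from repeated experiments look roughly Uniform(0, 1), and about 5% of them fall below 0.05.

```r
# Repeatedly simulate data where y is unrelated to x,
# and record the p-value for the x coefficient each time
set.seed(42)
null_pvals <- replicate(1000, {
  x_sim <- rnorm(25)
  y_sim <- rnorm(25)   # true coefficient on x_sim is 0
  summary(lm(y_sim ~ x_sim))$coefficients["x_sim", "Pr(>|t|)"]
})

hist(null_pvals)        # roughly flat between 0 and 1
mean(null_pvals < 0.05) # close to 0.05, i.e. a ~5% false positive rate
```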
---

## Linear regression: how to predict?

We can use the `predict()` function either to get the fitted values of the regression:

```r
train_preds <- predict(init_lm)
train_preds
```

```
##         1         2         3         4         5         6         7         8         9 
## 0.9854296 1.9815524 2.9776751 3.9737978 4.9699205 5.9660432 6.9621659 7.9582886 8.9544113 
##        10 
## 9.9505341
```

which is equivalent to using:

```r
init_lm$fitted.values
```

```
##         1         2         3         4         5         6         7         8         9 
## 0.9854296 1.9815524 2.9776751 3.9737978 4.9699205 5.9660432 6.9621659 7.9582886 8.9544113 
##        10 
## 9.9505341
```

---

## How do we predict and plot?

... or we can provide new data via the `newdata` argument:

```r
predict(init_lm, newdata = tibble(x = c(1.5, 2.5, 3.5)))
```

```
##        1        2        3 
## 1.483491 2.479614 3.475736
```

--

.pull-left[

Display this with `ggplot2`:

```r
tibble(x = x, y = y,
       preds = predict(init_lm)) %>%
  ggplot(aes(x)) +
  geom_point(aes(y = y)) +
  geom_point(aes(y = preds), color = "red") +
  geom_line(aes(y = preds),
            linetype = "dashed", color = "red") +
  theme_bw()
```

]

.pull-right[

<img src="09-linear-reg_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />

]

---

## What's another way to display the regression fit?

.pull-left[

```r
tibble(x = x, y = y) %>%
  ggplot(aes(x, y)) +
  geom_point() +
* stat_smooth(method = "lm") +
  theme_bw()
```

If the gray error band includes a perfectly horizontal line, then we would __fail to reject__ the null hypothesis that there is no association between `x` and `y`

- clearly that is not the case for this example

]

.pull-right[

<img src="09-linear-reg_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />

]

---

## Back to the output: `\(R^2\)`

In the model summary you will see values for `Multiple R-squared` and `Adjusted R-squared`

--

- `\(R^2\)` estimates the __proportion of the variance__ of the response `y` explained by the linear regression model (a value between 0 and 1)

- `\(R^2\)` is just a ratio: variance of the predictions / variance of `y`

```r
var(predict(init_lm)) / var(y)
```

```
## [1] 0.9848623
```

--

- Adjusted `\(R^2\)` __adjusts__ for the number of variables in the model (always use this instead of the raw `\(R^2\)`)

- Adjusted `\(R^2\)` provides intuition for how well a linear model fits the data, as opposed to the mean squared error

---

## Linear regression: other points to keep in mind

A proper discussion of each of the points given below could span one or more lectures in a conventional regression class! Here we just mention them as caveats.

- You may find it useful to transform your variables so that they are distributed more normally. This is less critical for the predictors (though it may be helpful) and more critical for the response variable (otherwise the assumptions of linear regression may be violated). Keep this in mind as you perform EDA and think about the statistical learning that you will perform in your analyses.

--

- Outliers may adversely affect your regression estimates. In a linear regression setting, outliers may be identified via [__Cook's distance__](https://en.wikipedia.org/wiki/Cook%27s_distance). We offer no general heuristic regarding how to deal with outliers, other than that you should scrupulously document how you deal with them!

--

- Beware collinearity! Collinearity occurs when one predictor variable is linearly associated with another. Collinearity is not necessarily harmful outside a linear regression setting, but it must be dealt with in linear regression analyses. The general process is to perform the linear regression fit, compute the "variance inflation factor" via the `vif()` function in the `car` package, remove a variable if its VIF is greater than 5, and repeat the fitting process until no more variables are to be removed (see the sketch on the next slide).
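---

## Checking collinearity with `vif()`

A minimal sketch of the VIF workflow described on the previous slide, using the `mtcars` data that ships with R (the choice of dataset and predictors here is purely illustrative; the `car` package must be installed):

```r
library(car)   # provides vif()

# Fit a multiple regression with predictors that tend to be correlated
cars_lm <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Variance inflation factors: values above ~5 flag problematic collinearity
vif(cars_lm)
```

If any VIF exceeds the threshold, drop that predictor, refit the model, and recompute the VIFs until none remain above it.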
---

## Last Point: do I really need to split my data?

If your entire analysis workflow involves fitting one linear regression model to your data, there is no need to split your data into training and test datasets. Just learn the model, and interpret it.

However, if you intend to learn multiple models (e.g., linear regression, a regression tree, random forest), then you should estimate the `\(\beta\)`'s using only the training data, and you should generate predictions and compute the MSE using the test data.

---

## Model diagnostics

MSE is unit-dependent `\(\Rightarrow\)` you cannot use its value alone to determine the quality of the underlying model

--

A useful diagnostic (for *any* regression analysis, not just a linear regression analysis!) is to plot the predicted responses (for the test set data!) versus the observed responses:

.pull-left[

Key points:

- If the data are completely uninformative, the points will lie along a horizontal locus: every input will generate the same prediction, the average observed value

- If the model is "perfect," the points will lie along the diagonal line

- Real-life models will generate plots with behaviors between these two extremes, with additional intrinsic scatter

]

.pull-right[

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_Diagnostic_I.png" width="80%" style="display: block; margin: auto;" />

]

---

## Model diagnostics

A variation on the diagnostic presented on the previous slide is to plot the model residuals (observed response minus predicted response) versus the predicted responses:

.pull-left[

Conditional on the value of the predicted response, __the residuals should have zero mean__

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_Diagnostic_II.png" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

Standardized residuals should be normally distributed with variance 1 (normality may be checked using `qqnorm()`)

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_Diagnostic_III.png" width="70%" style="display: block; margin: auto;" />

]
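---

## Model diagnostics: an R sketch

As a rough sketch of these diagnostics in code, here is the residuals-versus-predictions plot and the normality check for the simple `init_lm` fit from earlier (shown on the training data only to illustrate the mechanics; with multiple candidate models you would make these plots using held-out test predictions):

```r
library(tidyverse)

# Residuals vs. predicted responses: should scatter around zero
tibble(observed = y,
       predicted = predict(init_lm)) %>%
  mutate(residual = observed - predicted) %>%
  ggplot(aes(x = predicted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw()

# Standardized residuals: check normality with a QQ plot
qqnorm(rstandard(init_lm))
qqline(rstandard(init_lm))
```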