class: center, middle, inverse, title-slide # Supervised Learning ## Linear regression ### June 23rd, 2022 --- ## Simple linear regression We assume a __linear relationship__ for `\(Y = f(X)\)`: `$$Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n$$` - `\(Y_i\)` is the `\(i\)`th value for the __response__ variable - `\(X_i\)` is the `\(i\)`th value for the __predictor__ variable -- - `\(\beta_0\)` is an _unknown_, constant __intercept__: average value for `\(Y\)` if `\(X = 0\)` - `\(\beta_1\)` is an _unknown_, constant __slope__: increase in average value for `\(Y\)` for each one-unit increase in `\(X\)` -- - `\(\epsilon_i\)` is the __random__ noise: assume __independent, identically distributed__ (_iid_) from Normal distribution `$$\epsilon_i \overset{iid}{\sim}N(0, \sigma^2) \quad \text{ with constant variance } \sigma^2$$` --- ## Simple linear regression estimation We are estimating the __conditional expection (mean)__ for `\(Y\)`: `$$\mathbb{E}[Y_i| X_i] = \beta_0 + \beta_1X_i$$` - average value for `\(Y\)` given the value for `\(X\)` -- How do we estimate the __best fitting__ line? -- __Ordinary least squares (OLS)__ - by minimizing the __residual sum of squares (RSS)__ `$$R S S\left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}$$` -- `$$\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} \quad \text{ and } \quad \widehat{\beta}_{0}=\bar{Y}-\widehat{\beta}_{1} \bar{X}$$` where `\(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\)` and `\(\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\)` --- ## Connection to covariance and correlation [__Covariance__](https://en.wikipedia.org/wiki/Covariance) describes the __joint variability of two variables__ `$$\text{Cov}(X, Y) = \sigma_{X,Y} = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$$` -- We compute the __sample covariance__ (use `\(n - 1\)` since we are using the means and want [__unbiased estimates__](https://lazyprogrammer.me/covariance-matrix-divide-by-n-or-n-1/)) `$$\hat{\sigma}_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)$$` -- __Correlation__ is a _normalized_ form of covariance, ranges from -1 to 1 `$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$` __Sample correlation__ uses the sample covariance and standard deviations, e.g. `\(s_X^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2\)` `$$r_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}$$` --- ## Connection to covariance and correlation So we have the following: `$$\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} \quad \text{ compared to } \quad r_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}$$` -- `\(\Rightarrow\)` Can rewrite `\(\hat{\beta}_1\)` as: `$$\widehat{\beta}_{1} = r_{X,Y} \cdot \frac{s_Y}{s_X}$$` `\(\Rightarrow\)` Can rewrite `\(r_{X,Y}\)` as: `$$r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y}$$` -- Can think of `\(\widehat{\beta}_{1}\)` weighting the ratio of variance between `\(X\)` and `\(Y\)`... --- ## Gapminder data Health and income outcomes for 184 countries from 1960 to 2016 from the famous [Gapminder project](https://www.gapminder.org/data) ```r library(tidyverse) library(dslabs) gapminder <- as_tibble(gapminder) clean_gapminder <- gapminder %>% filter(year == 2011, !is.na(gdp)) %>% mutate(log_gdp = log(gdp)) clean_gapminder ``` ``` ## # A tibble: 168 x 10 ## country year infant_mortality life_expectancy fertility population gdp ## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Albania 2011 14.3 77.4 1.75 2886010 6.32e 9 ## 2 Algeria 2011 22.8 76.1 2.83 36717132 8.11e10 ## 3 Angola 2011 107. 58.1 6.1 21942296 2.70e10 ## 4 Antigua and… 2011 7.2 75.9 2.12 88152 8.02e 8 ## 5 Argentina 2011 12.7 76 2.2 41655616 4.73e11 ## 6 Armenia 2011 15.3 73.5 1.5 2967984 4.29e 9 ## 7 Australia 2011 3.8 82.2 1.88 22542371 5.73e11 ## 8 Austria 2011 3.4 80.7 1.44 8423559 2.31e11 ## 9 Azerbaijan 2011 32.5 70.8 1.96 9227512 2.14e10 ## 10 Bahamas 2011 11.1 72.6 1.9 366711 6.76e 9 ## # … with 158 more rows, and 3 more variables: continent <fct>, region <fct>, ## # log_gdp <dbl> ``` --- ## Modeling life expectancy .pull-left[ Interested in modeling a country's __life expectancy__ ```r clean_gapminder %>% ggplot(aes(x = life_expectancy)) + geom_histogram(color = "black", fill = "darkblue", alpha = 0.3) + theme_bw() + labs(x = "Life expectancy") ``` ] .pull-right[ <img src="11-Linear-reg_files/figure-html/unnamed-chunk-1-1.png" width="504" /> ] --- ### Relationship between life expectancy and log(GDP) .pull-left[ ```r gdp_plot <- clean_gapminder %>% * ggplot(aes(x = log_gdp, y = life_expectancy)) + geom_point(alpha = 0.5) + theme_bw() + labs(x = "log(GDP)", y = "Life expectancy") gdp_plot ``` We fit linear regression models using `lm()`, formula is input as: `response ~ predictor` ```r *init_lm <- lm(life_expectancy ~ log_gdp, * data = clean_gapminder) ``` ] .pull-right[ <img src="11-Linear-reg_files/figure-html/unnamed-chunk-2-1.png" width="504" /> ] --- ## View the model `summary()` ```r summary(init_lm) ``` ``` ## ## Call: ## lm(formula = life_expectancy ~ log_gdp, data = clean_gapminder) ## ## Residuals: ## Min 1Q Median 3Q Max ## -18.901 -4.781 1.879 5.335 13.962 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 24.174 5.758 4.198 4.38e-05 *** ## log_gdp 1.975 0.242 8.161 7.87e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.216 on 166 degrees of freedom ## Multiple R-squared: 0.2864, Adjusted R-squared: 0.2821 ## F-statistic: 66.61 on 1 and 166 DF, p-value: 7.865e-14 ``` --- ## Inference with OLS Reports the intercept and coefficient estimates: `\(\quad \hat{\beta}_0 \approx 24.174 \quad, \quad \hat{\beta}_1 \approx 1.975\)` -- Estimates of uncertainty for `\(\beta\)`s via __standard errors__: `\(\quad \widehat{SE}(\hat{\beta}_0) \approx 5.758 \quad, \quad \widehat{SE}(\hat{\beta}_1) \approx 0.242\)` -- `\(t\)`-statistics are coefficients `Estimates` / `Std. Error`, i.e., number of standard deviations from 0 - _p-values_ (i.e., `Pr(>|t|)`): estimated probability observing value as extreme as |`t value`| __given the null hypothesis__ `\(\beta = 0\)` - p-value `\(<\)` conventional threshold of `\(\alpha = 0.05\)`, __sufficient evidence to reject the null hypothesis that the coefficient is zero__, - Typically |`t values`| `\(> 2\)` indicate __significant__ relationship at `\(\alpha = 0.05\)` - i.e., there is a __significant__ association between `life_expectancy` and `log_gdp` --- ## Be careful! __Caveats to keep in mind regarding p-values:__ -- If the true value of a coefficient `\(\beta = 0\)`, then the p-value is sampled from a [Uniform(0,1) distribution](https://en.wikipedia.org/wiki/Continuous_uniform_distribution) - i.e., it is just as likely to have value 0.45 as 0.16 or 0.84 or 0.9999 or 0.00001... -- `\(\Rightarrow\)` Hence why we typically only reject for low `\(\alpha\)` values like 0.05 - Controlling the Type 1 error rate at `\(\alpha = 0.05\)`, i.e., the probability of a __false positive__ mistake - 5% chance that you'll conclude there's a significant association between `\(x\)` and `\(y\)` __even when there is none__ -- Remember what a standard error is? `\(SE = \frac{\sigma}{\sqrt{n}}\)` - `\(\Rightarrow\)` As `\(n\)` gets large __standard error goes to zero__, and *all* predictors are eventually deemed significant - While the p-values might be informative, we will explore other approaches to determine which subset of predictors to include (e.g., holdout performance) --- ## Back to the model summary: `Multiple R-squared` Back to the connection between the coefficient and correlation: `$$r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y} \quad \Rightarrow \quad r^2_{X,Y} = \widehat{\beta}_{1}^2\cdot \frac{s_X^2}{s_Y^2}$$` Compute the correlation with `cor()`: ```r *with(clean_gapminder, cor(log_gdp, life_expectancy)) ``` ``` ## [1] 0.5351189 ``` -- The squared `cor` matches the reported `Multiple R-squared` ```r *with(clean_gapminder, cor(log_gdp, life_expectancy))^2 ``` ``` ## [1] 0.2863522 ``` --- ## Back to the model summary: `Multiple R-squared` Back to the connection between the coefficient and correlation: `$$r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y} \quad \Rightarrow \quad r^2_{X,Y} = \widehat{\beta}_{1}^2\cdot \frac{s_X^2}{s_Y^2}$$` `\(r^2\)` (or also `\(R^2\)`) estimates the __proportion of the variance__ of `\(Y\)` explained by `\(X\)` - More generally: variance of model predictions / variance of `\(Y\)` ```r *var(predict(init_lm)) / var(clean_gapminder$life_expectancy) ``` ``` ## [1] 0.2863522 ``` --- ## Generating predictions We can use the `predict()` function to either get the fitted values of the regression: ```r train_preds <- predict(init_lm) head(train_preds) ``` ``` ## 1 2 3 4 5 6 ## 68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 ``` Which is equivalent to using: ```r head(init_lm$fitted.values) ``` ``` ## 1 2 3 4 5 6 ## 68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 ``` --- ## Predictions for new data Or we can provide it `newdata` which __must contain the explanatory variables__: .pull-left[ ```r us_data <- clean_gapminder %>% filter(country == "United States") new_us_data <- us_data %>% dplyr::select(country, gdp) %>% * slice(rep(1, 3)) %>% mutate(adj_factor = c(0.25, 0.5, 0.75), log_gdp = log(gdp * adj_factor)) new_us_data$pred_life_exp <- * predict(init_lm, newdata = new_us_data) gdp_plot + geom_point(data = new_us_data, aes(x = log_gdp, y = pred_life_exp), color = "darkred", size = 5) ``` ] .pull-right[ <img src="11-Linear-reg_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" /> ] --- ### Plot observed values against predictions Useful diagnostic (for __any type of model__, not just linear regression!) .pull-left[ ```r clean_gapminder %>% * mutate(pred_vals = predict(init_lm)) %>% ggplot(aes(x = pred_vals, y = life_expectancy)) + geom_point(alpha = 0.5) + * geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 2) + theme_bw() ``` - "Perfect" model will follow __diagonal__ ] .pull-right[ <img src="11-Linear-reg_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" /> ] --- ### Plot observed values against predictions Can augment the data with model output using the [`broom` package](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) .pull-left[ ```r clean_gapminder <- * broom::augment(init_lm, clean_gapminder) clean_gapminder %>% ggplot(aes(x = .fitted, * y = life_expectancy)) + geom_point(alpha = 0.5) + * geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 2) + theme_bw() ``` - Adds various columns from model fit we can use in plotting for model diagnostics ] .pull-right[ <img src="11-Linear-reg_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## Plot residuals against predicted values .pull-left[ - Residuals = observed - predicted - Conditional on the predicted values, the __residuals should have a mean of zero__ ```r clean_gapminder %>% ggplot(aes(x = .fitted, * y = .resid)) + geom_point(alpha = 0.5) + * geom_hline(yintercept = 0, linetype = "dashed", color = "red", size = 2) + # To plot the residual mean * geom_smooth(se = FALSE) + theme_bw() ``` - Residuals __should NOT display any pattern__ ] .pull-right[ <img src="11-Linear-reg_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## Multiple regression We can include as many variables as we want (assuming `\(n > p\)`!) `$$Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon$$` OLS estimates in matrix notation ( `\(\boldsymbol{X}\)` is a `\(n \times p\)` matrix): `$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X} ^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$$` -- Can just add more variables to the formula in `R` ```r *multiple_lm <- lm(life_expectancy ~ log_gdp + fertility, data = clean_gapminder) ``` - Use the `Adjusted R-squared` when including multiple variables `\(= 1 - \frac{(1 - R^2)(n - 1)}{(n - p - 1)}\)` - Adjusts for the number of variables in the model `\(p\)` - Adding more variables __will always increase__ `Multiple R-squared` --- ### What about the Normal distribution assumption??? `$$Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon$$` - `\(\epsilon_i\)` is the __random__ noise: assume __independent, identically distributed__ (_iid_) from Normal distribution `$$\epsilon_i \overset{iid}{\sim}N(0, \sigma^2) \quad \text{ with constant variance } \sigma^2$$` -- __OLS doesn't care about this assumption__, it's just estimating coefficients! -- In order to perform inference, __we need to impose additional assumptions__ By assuming `\(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\)`, what we really mean is: `$$Y \overset{iid}{\sim}N(\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}, \sigma^2)$$` -- So we're estimating the mean `\(\mu\)` of this conditional distribution, but what about `\(\sigma^2\)`? -- [Unbiased estimate](https://bradleyboehmke.github.io/HOML/linear-regression.html#simple-linear-regression) `\(\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}\)`, its square root is the `Residual standard error` - __Degrees of freedom__: `\(n - (p + 1)\)`, data supplies us with `\(n\)` "degrees of freedom" and we used up `\(p + 1\)` --- ### Check the assumptions about normality with [`ggfortify`](https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_lm.html) ```r library(ggfortify) *autoplot(multiple_lm, ncol = 4) + theme_bw() ``` <img src="11-Linear-reg_files/figure-html/more-resid-plots-1.png" width="1080" style="display: block; margin: auto;" /> - Standardized residuals = residuals `/ sd(`residuals`)` (see also `.std.resid` from `augment`)