Supervised Learning

class: center, middle, inverse, title-slide

# Supervised Learning
## Linear regression
### June 23rd, 2022

---

## Simple linear regression

We assume a __linear relationship__ for `$Y = f(X)$`:

`$$Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n$$`

- `$Y_i$` is the `$i$`th value for the __response__ variable
  
- `$X_i$` is the `$i$`th value for the __predictor__ variable

--
  
- `$\beta_0$` is an _unknown_, constant __intercept__: average value for `$Y$` if `$X = 0$`
  
- `$\beta_1$` is an _unknown_, constant __slope__: increase in average value for `$Y$` for each one-unit increase in `$X$`
  
--
  
- `$\epsilon_i$` is the __random__ noise: assume __independent, identically distributed__ (_iid_) from Normal distribution

`$$\epsilon_i \overset{iid}{\sim}N(0, \sigma^2) \quad \text{ with constant variance } \sigma^2$$`

---

## Simple linear regression estimation

We are estimating the __conditional expection (mean)__ for `$Y$`:

`$$\mathbb{E}[Y_i| X_i] = \beta_0 + \beta_1X_i$$`

- average value for `$Y$` given the value for `$X$`

How do we estimate the __best fitting__ line?

__Ordinary least squares (OLS)__ - by minimizing the __residual sum of squares (RSS)__

`$$R S S\left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}$$`

where `$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$` and `$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$`

---

## Connection to covariance and correlation

[__Covariance__](https://en.wikipedia.org/wiki/Covariance) describes the __joint variability of two variables__

`$$\text{Cov}(X, Y) = \sigma_{X,Y} = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$$`

We compute the __sample covariance__ (use `$n - 1$` since we are using the means and want [__unbiased estimates__](https://lazyprogrammer.me/covariance-matrix-divide-by-n-or-n-1/))

`$$\hat{\sigma}_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)$$`

__Correlation__ is a _normalized_ form of covariance, ranges from -1 to 1

`$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$`

__Sample correlation__ uses the sample covariance and standard deviations, e.g. `$s_X^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2$`

`$$r_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}$$`

---

## Connection to covariance and correlation

So we have the following:

`$$\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} \quad \text{ compared to } \quad r_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}$$`

`$\Rightarrow$` Can rewrite `$\hat{\beta}_1$` as:

`$$\widehat{\beta}_{1} = r_{X,Y} \cdot \frac{s_Y}{s_X}$$`

`$\Rightarrow$` Can rewrite `$r_{X,Y}$` as:

`$$r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y}$$`

Can think of `$\widehat{\beta}_{1}$` weighting the ratio of variance between `$X$` and `$Y$`...

---

## Gapminder data

Health and income outcomes for 184 countries from 1960 to 2016 from the famous [Gapminder project](https://www.gapminder.org/data)

```r
library(tidyverse)
library(dslabs)
gapminder <- as_tibble(gapminder)
clean_gapminder <- gapminder %>%
  filter(year == 2011, !is.na(gdp)) %>%
  mutate(log_gdp = log(gdp))
clean_gapminder
```

```
## # A tibble: 168 x 10
##    country       year infant_mortality life_expectancy fertility population       gdp
##    <fct>        <int>            <dbl>           <dbl>     <dbl>      <dbl>     <dbl>
##  1 Albania       2011             14.3            77.4      1.75    2886010   6.32e 9
##  2 Algeria       2011             22.8            76.1      2.83   36717132   8.11e10
##  3 Angola        2011            107.             58.1      6.1    21942296   2.70e10
##  4 Antigua and…  2011              7.2            75.9      2.12      88152   8.02e 8
##  5 Argentina     2011             12.7            76        2.2    41655616   4.73e11
##  6 Armenia       2011             15.3            73.5      1.5     2967984   4.29e 9
##  7 Australia     2011              3.8            82.2      1.88   22542371   5.73e11
##  8 Austria       2011              3.4            80.7      1.44    8423559   2.31e11
##  9 Azerbaijan    2011             32.5            70.8      1.96    9227512   2.14e10
## 10 Bahamas       2011             11.1            72.6      1.9      366711   6.76e 9
## # … with 158 more rows, and 3 more variables: continent <fct>, region <fct>,
## #   log_gdp <dbl>
```

---

## Modeling life expectancy

.pull-left[

Interested in modeling a country's __life expectancy__

```r
clean_gapminder %>%
  ggplot(aes(x = life_expectancy)) +
  geom_histogram(color = "black", 
                 fill = "darkblue",
                 alpha = 0.3) +
  theme_bw() +
  labs(x = "Life expectancy")
```

]
.pull-right[

]

---

### Relationship between life expectancy and log(GDP)

.pull-left[

```r
gdp_plot <- clean_gapminder %>%
* ggplot(aes(x = log_gdp,
             y = life_expectancy)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  labs(x = "log(GDP)",
       y = "Life expectancy")
gdp_plot
```

We fit linear regression models using `lm()`, formula is input as: `response ~ predictor`

```r
*init_lm <- lm(life_expectancy ~ log_gdp,
*             data = clean_gapminder)
```

]
.pull-right[
<img src="11-Linear-reg_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

---

## View the model `summary()`

```r
summary(init_lm)
```

```
## 
## Call:
## lm(formula = life_expectancy ~ log_gdp, data = clean_gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.901  -4.781   1.879   5.335  13.962 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   24.174      5.758   4.198 4.38e-05 ***
## log_gdp        1.975      0.242   8.161 7.87e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.216 on 166 degrees of freedom
## Multiple R-squared:  0.2864,	Adjusted R-squared:  0.2821 
## F-statistic: 66.61 on 1 and 166 DF,  p-value: 7.865e-14
```

---

## Inference with OLS

Reports the intercept and coefficient estimates: `$\quad \hat{\beta}_0 \approx 24.174 \quad, \quad \hat{\beta}_1 \approx 1.975$`

Estimates of uncertainty for `$\beta$`s via __standard errors__: `$\quad \widehat{SE}(\hat{\beta}_0) \approx 5.758 \quad, \quad \widehat{SE}(\hat{\beta}_1) \approx 0.242$`

`$t$`-statistics are coefficients `Estimates` / `Std. Error`, i.e., number of standard deviations from 0

- _p-values_ (i.e., `Pr(>|t|)`): estimated probability observing value as extreme as |`t value`| __given the null hypothesis__ `$\beta = 0$`
  
  - p-value `$<$` conventional threshold of `$\alpha = 0.05$`, __sufficient evidence to reject the null hypothesis that the coefficient is zero__,
  
  - Typically |`t values`| `$> 2$` indicate __significant__ relationship at `$\alpha = 0.05$`

- i.e., there is a __significant__ association between `life_expectancy` and `log_gdp`

---

## Be careful!

__Caveats to keep in mind regarding p-values:__

If the true value of a coefficient `$\beta = 0$`, then the p-value is sampled from a [Uniform(0,1) distribution](https://en.wikipedia.org/wiki/Continuous_uniform_distribution)

- i.e., it is just as likely to have value 0.45 as 0.16 or 0.84 or 0.9999 or 0.00001...

`$\Rightarrow$` Hence why we typically only reject for low `$\alpha$` values like 0.05

- Controlling the Type 1 error rate at `$\alpha = 0.05$`, i.e., the probability of a __false positive__ mistake
  
  - 5% chance that you'll conclude there's a significant association between `$x$` and `$y$` __even when there is none__

Remember what a standard error is? `$SE = \frac{\sigma}{\sqrt{n}}$`

- `$\Rightarrow$` As `$n$` gets large __standard error goes to zero__, and *all* predictors are eventually deemed significant

- While the p-values might be informative, we will explore other approaches to determine which subset of predictors to include (e.g., holdout performance)

---

## Back to the model summary: `Multiple R-squared`

Back to the connection between the coefficient and correlation:

`$$r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y} \quad \Rightarrow \quad r^2_{X,Y} = \widehat{\beta}_{1}^2\cdot \frac{s_X^2}{s_Y^2}$$`

Compute the correlation with `cor()`:

```r
*with(clean_gapminder, cor(log_gdp, life_expectancy))
```

```
## [1] 0.5351189
```

The squared `cor` matches the reported `Multiple R-squared`

```r
*with(clean_gapminder, cor(log_gdp, life_expectancy))^2
```

```
## [1] 0.2863522
```

---

## Back to the model summary: `Multiple R-squared`

Back to the connection between the coefficient and correlation:

`$$r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y} \quad \Rightarrow \quad r^2_{X,Y} = \widehat{\beta}_{1}^2\cdot \frac{s_X^2}{s_Y^2}$$`

`$r^2$` (or also `$R^2$`) estimates the __proportion of the variance__ of `$Y$` explained by `$X$`

- More generally: variance of model predictions / variance of `$Y$`

```r
*var(predict(init_lm)) / var(clean_gapminder$life_expectancy)
```

```
## [1] 0.2863522
```

---

## Generating predictions

We can use the `predict()` function to either get the fitted values of the regression:

```r
train_preds <- predict(init_lm)
head(train_preds)
```

```
##        1        2        3        4        5        6 
## 68.74401 73.78465 71.61243 64.66585 77.26605 67.97876
```

Which is equivalent to using:

```r
head(init_lm$fitted.values)
```

```
##        1        2        3        4        5        6 
## 68.74401 73.78465 71.61243 64.66585 77.26605 67.97876
```

---

## Predictions for new data

Or we can provide it `newdata` which __must contain the explanatory variables__:

.pull-left[

```r
us_data <- clean_gapminder %>% 
  filter(country == "United States")

new_us_data <- us_data %>%
  dplyr::select(country, gdp) %>%
* slice(rep(1, 3)) %>%
  mutate(adj_factor = c(0.25, 0.5, 0.75),
         log_gdp = log(gdp * adj_factor))
new_us_data$pred_life_exp <- 
* predict(init_lm, newdata = new_us_data)
gdp_plot +
  geom_point(data = new_us_data,
             aes(x = log_gdp,
                 y = pred_life_exp),
             color = "darkred", size = 5)
```
]
.pull-right[
<img src="11-Linear-reg_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />

]

---

### Plot observed values against predictions

Useful diagnostic (for __any type of model__, not just linear regression!)

.pull-left[

```r
clean_gapminder %>%
* mutate(pred_vals = predict(init_lm)) %>%
  ggplot(aes(x = pred_vals,
             y = life_expectancy)) +
  geom_point(alpha = 0.5) +
* geom_abline(slope = 1, intercept = 0,
              linetype = "dashed",
              color = "red",
              size = 2) +
  theme_bw()
```

- "Perfect" model will follow __diagonal__

]
.pull-right[
<img src="11-Linear-reg_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />

]

---

### Plot observed values against predictions

Can augment the data with model output using the [`broom` package](https://cran.r-project.org/web/packages/broom/vignettes/broom.html)

.pull-left[

```r
clean_gapminder <- 
* broom::augment(init_lm, clean_gapminder)
clean_gapminder %>%
  ggplot(aes(x = .fitted, 
*            y = life_expectancy)) +
  geom_point(alpha = 0.5) +
* geom_abline(slope = 1, intercept = 0,
              linetype = "dashed",
              color = "red",
              size = 2) +
  theme_bw()
```

- Adds various columns from model fit we can use in plotting for model diagnostics

]
.pull-right[
<img src="11-Linear-reg_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />

]

---

## Plot residuals against predicted values

.pull-left[

- Residuals = observed - predicted

- Conditional on the predicted values, the __residuals should have a mean of zero__

```r
clean_gapminder %>%
  ggplot(aes(x = .fitted, 
*            y = .resid)) +
  geom_point(alpha = 0.5) +
* geom_hline(yintercept = 0,
             linetype = "dashed",
             color = "red",
             size = 2) +
  # To plot the residual mean
* geom_smooth(se = FALSE) +
  theme_bw()
```

- Residuals __should NOT display any pattern__

]
.pull-right[
<img src="11-Linear-reg_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />

]

---

## Multiple regression

We can include as many variables as we want (assuming `$n > p$`!)

`$$Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon$$`

OLS estimates in matrix notation ( `$\boldsymbol{X}$` is a `$n \times p$` matrix):
  
`$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X} ^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$$`

Can just add more variables to the formula in `R`

```r
*multiple_lm <- lm(life_expectancy ~ log_gdp + fertility,
                  data = clean_gapminder)
```

- Use the `Adjusted R-squared` when including multiple variables `$= 1 - \frac{(1 - R^2)(n - 1)}{(n - p - 1)}$`

- Adjusts for the number of variables in the model `$p$`  
  
  - Adding more variables __will always increase__ `Multiple R-squared`

---

### What about the Normal distribution assumption???

`$$Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon$$`

- `$\epsilon_i$` is the __random__ noise: assume __independent, identically distributed__ (_iid_) from Normal distribution

`$$\epsilon_i \overset{iid}{\sim}N(0, \sigma^2) \quad \text{ with constant variance } \sigma^2$$`

__OLS doesn't care about this assumption__, it's just estimating coefficients!

In order to perform inference, __we need to impose additional assumptions__

By assuming `$\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)$`, what we really mean is:

`$$Y \overset{iid}{\sim}N(\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}, \sigma^2)$$`

So we're estimating the mean `$\mu$` of this conditional distribution, but what about `$\sigma^2$`?
  
--

[Unbiased estimate](https://bradleyboehmke.github.io/HOML/linear-regression.html#simple-linear-regression) `$\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}$`, its square root is the `Residual standard error`

- __Degrees of freedom__:  `$n - (p + 1)$`, data supplies us with `$n$` "degrees of freedom" and we used up `$p + 1$`

---

### Check the assumptions about normality with [`ggfortify`](https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_lm.html)

```r
library(ggfortify)
*autoplot(multiple_lm, ncol = 4) + theme_bw()
```

- Standardized residuals = residuals `/ sd(`residuals`)` (see also `.std.resid` from `augment`)