class: center, middle, inverse, title-slide

# Supervised Learning

## Logistic regression

### Ron Yurko

### 06/18/2020

---

## The setting: [Figure 4.2 (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/)

<img src="http://www.stat.cmu.edu/~pfreeman/Figure_4.2.png" width="90%" style="display: block; margin: auto;" />

--

.pull-left[

Left: Linear regression

- __predictions are not limited to be within [0, 1]!__

]

.pull-right[

Right: __Logistic regression__

- __respects the observed range of outcomes!__

]

---

## Generalized linear models (GLMs): review

Linear regression: estimate the __mean value__ of the response variable `\(Y\)`, given the predictor variables `\(X_1,\dots,X_p\)`:

$$
E[Y|X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

--

In a __GLM__, we include a __link function__ `\(g\)` that transforms the linear model:

$$
g(E[Y|X]) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

- Use `\(g\)` to reduce the range of possible values for `\(E[Y \vert X]\)` from `\((-\infty,\infty)\)` to, e.g., `\([0,1]\)` or `\([0,\infty)\)`, etc.

--

In a GLM you also specify a __probability distribution family__ that governs the observed response values

--

- e.g., if `\(Y\)` takes values in the non-negative integers (counts), the family could be [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution)

- e.g., if `\(Y\)` only takes the values 0 and 1, the family is [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution), which extends to [Binomial](https://en.wikipedia.org/wiki/Binomial_distribution) for `\(n\)` independent trials

---

## Logistic regression

Assuming that we are dealing with two classes, the possible observed values for `\(Y\)` are 0 and 1,

$$
Y \vert X \sim {\rm Binomial}(n=1,p=E[Y\vert X]) = \text{Bernoulli}(p = E[Y\vert X])
$$

--

To limit the regression to be between `\([0, 1]\)`: use the __logit__ function, aka the __log-odds ratio__

$$
\text{logit}(p(X)) = \log \left[ \frac{p(X)}{1 - p(X)} \right] = \log\left[\frac{E[Y \vert X]}{1-E[Y \vert X]}\right] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

--

meaning

$$
p(X) = E[Y \vert X] = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
$$
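
---

## Logistic regression

For intuition, here is a minimal sketch in R (the coefficients below are made up for illustration, not fit to any data) showing that the inverse-logit transform maps any value of the linear predictor into `\((0, 1)\)`:

```r
# Hypothetical coefficients for a single-predictor model (illustration only):
beta_0 <- 2
beta_1 <- -0.1
x <- seq(-100, 100, by = 25)
# The linear predictor (log-odds scale) can be any real number:
log_odds <- beta_0 + beta_1 * x
# The inverse-logit transform squeezes it into (0, 1):
p_x <- exp(log_odds) / (1 + exp(log_odds))
round(p_x, 4)
```

No matter how extreme the linear predictor gets, the resulting probabilities stay strictly between 0 and 1.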

---

## Major difference between linear and logistic regression

Logistic regression __involves numerical optimization__

- `\(Y_i\)` is the observed response for each of the `\(n\)` observations - either 0 or 1

--

- we need to use an iterative algorithm to find the `\(\beta\)`'s that maximize the __likelihood__

`$$\prod_{i=1}^{n} p\left(X_{i}\right)^{Y_{i}}\left(1-p\left(X_{i}\right)\right)^{1-Y_{i}}$$`

--

- Newton's method: start with an initial guess, calculate the gradient of the log-likelihood, add an amount proportional to the gradient to the parameters, moving up the log-likelihood surface

--

- this means logistic regression runs more slowly than linear regression

--

- if you're interested: [in practice you use iteratively re-weighted least squares, Section 12.3.1](http://www.stat.cmu.edu/~cshalizi/uADA/15/lectures/12.pdf)

---

## Inference with logistic regression

A __major motivation__ for logistic regression (and all GLMs) is __inference__

- how does the response change when we change a predictor by one unit?

--

For linear regression, the answer is straightforward

`$$E[Y \vert X] = \beta_0 + \beta_1 X_1$$`

--

For logistic regression it is a little _less_ straightforward,

$$
E[Y \vert X] = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
$$

--

- the predicted response varies __non-linearly__ with the predictor variable values

- one convention is to fall back upon the concept of __odds__

---

## The odds interpretation

Pretend the predicted probability is 0.8 given a particular predictor variable value

- just pretend we only have one predictor variable

--

This means that if we were to repeatedly sample response values given that predictor variable value, __we expect class 1 to appear 4 times as often as class 0__

`$$Odds = \frac{E[Y \vert X]}{1-E[Y \vert X]} = \frac{0.8}{1-0.8} = 4 = e^{\beta_0+\beta_1X}$$`

Thus we say that for the given predictor variable value, the `\(Odds\)` are 4 (or 4 to 1) in favor of class 1

--

How do the odds change if we change the value of the predictor variable by one unit?

--

`$$Odds_{\rm new} = e^{\beta_0+\beta_1(X+1)} = e^{\beta_0+\beta_1X}e^{\beta_1} = e^{\beta_1}Odds_{\rm old}$$`

For every unit change in `\(X\)`, the odds change by a __factor__ of `\(e^{\beta_1}\)`

---

## Example data: NFL field goal attempts

Created a dataset using [`nflscrapR-data`](https://github.com/ryurko/nflscrapR-data) of all NFL field goal attempts from 2009 to 2019

```r
nfl_fg_attempts <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/glm_examples/nfl_fg_attempt_data.csv")
nfl_fg_attempts
```

```
## # A tibble: 10,811 x 11
##    kicker_player_id kicker_player_n…   qtr score_different… home_team posteam posteam_type
##    <chr>            <chr>            <dbl>            <dbl> <chr>     <chr>   <chr>       
##  1 00-0020962       R.Bironas            1                0 PIT       TEN     away        
##  2 00-0020962       R.Bironas            2                0 PIT       TEN     away        
##  3 00-0020962       R.Bironas            4                0 PIT       TEN     away        
##  4 00-0020737       J.Reed               4               -3 PIT       PIT     home        
##  5 00-0020737       J.Reed               5                0 PIT       PIT     home        
##  6 00-0004091       P.Dawson             1                0 CLE       CLE     home        
##  7 00-0010072       R.Longwell           1               -3 CLE       MIN     away        
##  8 00-0004091       P.Dawson             2               -7 CLE       CLE     home        
##  9 00-0010072       R.Longwell           4               12 CLE       MIN     away        
## 10 00-0006800       J.Hanson             1              -14 NO        DET     away        
## # … with 10,801 more rows, and 4 more variables: kick_distance <dbl>, pbp_season <dbl>,
## #   abs_score_diff <dbl>, is_fg_made <dbl>
```

---

## Fitting a logistic regression model

.pull-left[

- We use the `glm` function (similar to `lm`)

- __Specify the family is `binomial`__

```r
*init_logit <- glm(is_fg_made ~ kick_distance, data = nfl_fg_attempts,
*                  family = "binomial")
```

- View the predicted probability relationship

```r
nfl_fg_attempts %>%
*  mutate(pred_prob = init_logit$fitted.values) %>%
  ggplot(aes(x = kick_distance)) +
  geom_line(aes(y = pred_prob), color = "blue") +
  geom_point(aes(y = is_fg_made),
             alpha = 0.3, color = "darkorange") +
  theme_bw()
```

]

.pull-right[

<img src="11-logit-reg_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

]

---

```r
summary(init_logit)
```

```
## 
## Call:
## glm(formula = is_fg_made ~ kick_distance, family = "binomial", 
##     data = nfl_fg_attempts)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7752   0.2420   0.4025   0.6252   1.5136  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.916656   0.145371   40.70   <2e-16 ***
## kick_distance -0.104365   0.003255  -32.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9593.1  on 10810  degrees of freedom
## Residual deviance: 8277.5  on 10809  degrees of freedom
## AIC: 8281.5
## 
## Number of Fisher Scoring iterations: 5
```

---

## Logistic regression output

```
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7752   0.2420   0.4025   0.6252   1.5136  
```

The __deviance residuals__ are, for each observation,

$$
d_i = \mbox{sign}(y_i-\hat{p}_i) \sqrt{-2[y_i \log \hat{p}_i + (1-y_i) \log (1 - \hat{p}_i)]}
$$

where `\(y_i\)` is the `\(i^{\rm th}\)` observed response and `\(\hat{p}_i\)` is the estimated probability of success

```
Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    5.916656   0.145371   40.70   <2e-16 ***
kick_distance -0.104365   0.003255  -32.06   <2e-16 ***
```

The odds of a make at a kick distance of 0 (the intercept) are `\(e^{5.916656}\)`, and `\(Odds_{\rm new}/Odds_{\rm old} = e^{-0.104365}\)` per additional yard of distance.
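
--

To put this on an interpretable scale, a quick sketch (using the `init_logit` object fit above) that exponentiates the fitted coefficients:

```r
# Exponentiate the fitted coefficients to get multiplicative odds changes:
exp(coef(init_logit))
# The kick_distance entry is exp(-0.104365), roughly 0.90: each additional
# yard of distance multiplies the odds of a make by about 0.90 (~10% drop)
```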

---

## Logistic regression output

```
    Null deviance: 9593.1  on 10810  degrees of freedom
Residual deviance: 8277.5  on 10809  degrees of freedom
AIC: 8281.5
```

```r
logLik(init_logit)   # the maximum log-likelihood value
```

```
## 'log Lik.' -4138.732 (df=2)
```

- the residual deviance is -2 times -4138.732, or 8277.5

- the AIC is `\(2k - 2\log{\cal L}\)` = `\(2 \cdot k - 2 \cdot (-4138.732)\)` = 8281.5, where `\(k\)` is the number of degrees of freedom (here, `df` = 2)

- These are all metrics of the quality of fit of the model, __but we will consider them less important than test-set performance__

---

## Logistic regression predictions

To generate logistic regression predictions there are a few things to keep in mind...

- the `fitted.values` __are on the probability scale__: all are between 0 and 1

--

- but the __default__ for `predict(init_logit)` is __the log-odds scale!__

--

- we change this with the `type` argument: `predict(init_logit, type = "response")`

--

How do we predict the class, e.g., make or miss the field goal?

--

```r
pred_fg_outcome <- ifelse(init_logit$fitted.values > 0.5,
                          "make", "miss")
```

- typically if the predicted probability is > 0.5 then we predict success, else failure
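
--

For example, a sketch of probability-scale predictions for a few new kick distances (the `new_kicks` distances below are made up purely for illustration):

```r
# Hypothetical kick distances (in yards), chosen only for illustration:
new_kicks <- tibble(kick_distance = c(20, 35, 50, 65))
# Probability-scale predictions from the fitted model:
new_probs <- predict(init_logit, newdata = new_kicks, type = "response")
# Corresponding class predictions using the 0.5 cutoff:
ifelse(new_probs > 0.5, "make", "miss")
```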

---

## Model assessment

The most straightforward way is the __confusion matrix__ (rows are predictions, columns are observed):

```r
table("Predictions" = pred_fg_outcome, "Observed" = nfl_fg_attempts$is_fg_made)
```

```
##            Observed
## Predictions    0    1
##        make 1662 8994
##        miss   94   61
```

--

__In-sample misclassification rate__:

```r
mean(ifelse(fitted(init_logit) < 0.5, 0, 1) != nfl_fg_attempts$is_fg_made)
```

```
## [1] 0.1593747
```

--

__Brier score__:

```r
mean((nfl_fg_attempts$is_fg_made - fitted(init_logit))^2)
```

```
## [1] 0.1197629
```

---

## __Well-calibrated__ if actual probabilities match predicted probabilities

.pull-left[

```r
nfl_fg_attempts %>%
  mutate(pred_prob = init_logit$fitted.values,
         bin_pred_prob = round(pred_prob / 0.05) * .05) %>%
  # Group by bin_pred_prob:
  group_by(bin_pred_prob) %>%
  # Calculate the calibration results:
  summarize(n_attempts = n(),
            bin_actual_prob = mean(is_fg_made)) %>%
  ggplot(aes(x = bin_pred_prob, y = bin_actual_prob)) +
  geom_point(aes(size = n_attempts)) +
  geom_smooth(method = "loess", se = FALSE) +
  geom_abline(slope = 1, intercept = 0,
              color = "black", linetype = "dashed") +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(size = "Number of attempts",
       x = "Estimated make probability",
       y = "Observed make probability") +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

- e.g., if the model says the probability of rain for a group of days is 50%, it better rain on half those days... or something is incorrect about the probability!

<img src="11-logit-reg_files/figure-html/unnamed-chunk-7-1.png" width="504" />

]

---

## BONUS: Leave-one-season-out cross-validation (with [`purrr`](https://purrr.tidyverse.org/))

In many datasets, rather than random holdout folds, you might have particular holdouts of interest (e.g., seasons, games, etc.)

--

```r
nfl_fg_loso_cv_preds <- # generate holdout predictions for every row based on season
  map_dfr(unique(nfl_fg_attempts$pbp_season),
          function(season) {
            # Separate test and training data:
            test_data <- nfl_fg_attempts %>%
              filter(pbp_season == season)
            train_data <- nfl_fg_attempts %>%
              filter(pbp_season != season)
            # Train model:
            fg_model <- glm(is_fg_made ~ kick_distance, data = train_data,
                            family = "binomial")
            # Return tibble of holdout results:
            tibble(test_pred_probs = predict(fg_model, newdata = test_data,
                                             type = "response"),
                   test_actual = test_data$is_fg_made,
                   test_season = season)
          })
```

---

## BONUS: overall holdout performance

__Misclassification rate__:

```r
nfl_fg_loso_cv_preds %>%
  mutate(test_pred = ifelse(test_pred_probs < .5, 0, 1)) %>%
  summarize(mcr = mean(test_pred != test_actual))
```

```
## # A tibble: 1 x 1
##     mcr
##   <dbl>
## 1 0.160
```

__Brier score__:

```r
nfl_fg_loso_cv_preds %>%
  summarize(brier_score = mean((test_actual - test_pred_probs)^2))
```

```
## # A tibble: 1 x 1
##   brier_score
##         <dbl>
## 1       0.120
```

---

## BONUS: holdout performance by season

```r
nfl_fg_loso_cv_preds %>%
  mutate(test_pred = ifelse(test_pred_probs < .5, 0, 1)) %>%
  group_by(test_season) %>%
  summarize(mcr = mean(test_pred != test_actual)) %>%
  ggplot(aes(x = test_season, y = mcr)) +
  geom_bar(stat = "identity", width = .1) +
  geom_point(size = 5) +
  theme_bw() +
  scale_x_continuous(breaks = unique(nfl_fg_loso_cv_preds$test_season))
```

<img src="11-logit-reg_files/figure-html/mcr-cv-year-1.png" width="504" style="display: block; margin: auto;" />
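
---

## BONUS: holdout Brier score by season

A sketch following the same pattern as the misclassification plot above, simply swapping in the holdout Brier score for each season (it reuses only the `nfl_fg_loso_cv_preds` object defined earlier):

```r
nfl_fg_loso_cv_preds %>%
  group_by(test_season) %>%
  # Brier score: mean squared difference between outcome and predicted probability
  summarize(brier_score = mean((test_actual - test_pred_probs)^2)) %>%
  ggplot(aes(x = test_season, y = brier_score)) +
  geom_bar(stat = "identity", width = .1) +
  geom_point(size = 5) +
  theme_bw() +
  scale_x_continuous(breaks = unique(nfl_fg_loso_cv_preds$test_season))
```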