class: center, middle, inverse, title-slide

# Supervised Learning

## Logistic regression

### Ron Yurko

### 06/18/2020

---

## The setting: [Figure 4.2 (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/)

<img src="http://www.stat.cmu.edu/~pfreeman/Figure_4.2.png" width="90%" style="display: block; margin: auto;" />

--

.pull-left[

Left: Linear regression

- __predictions are not limited to be within [0, 1]!__

]

.pull-right[

Right: __Logistic regression__

- __respects the observed range of outcomes!__

]

---

## Generalized linear models (GLMs): review

Linear regression: estimate the __mean value__ of the response variable `\(Y\)`, given the predictor variables `\(X_1,\dots,X_p\)`:

$$
E[Y|X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

--

In a __GLM__, we include a __link function__ `\(g\)` that transforms the linear model:

$$
g(E[Y|X]) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

- Use `\(g\)` to reduce the range of possible values for `\(E[Y \vert X]\)` from `\((-\infty,\infty)\)` to, e.g., `\([0,1]\)` or `\([0,\infty)\)`, etc.

--

In a GLM you also specify a __probability distribution family__ that governs the observed response values

--

- e.g., if `\(Y\)` takes values in the non-negative integers (counts), the family could be [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution)

- e.g., if `\(Y\)` only takes the values 0 and 1, the family is [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution), which extends to [Binomial](https://en.wikipedia.org/wiki/Binomial_distribution) for `\(n\)` independent trials

---

## Logistic regression

Assuming that we are dealing with two classes, the possible observed values for `\(Y\)` are 0 and 1,

$$
Y \vert X \sim {\rm Binomial}(n=1,p=E[Y\vert X]) = \text{Bernoulli}(p = E[Y\vert X])
$$

--

To limit the regression to be between `\([0, 1]\)`: use the __logit__ function, aka the __log-odds ratio__

$$
\text{logit}(p(X)) = \log \left[ \frac{p(X)}{1 - p(X)} \right] = \log\left[\frac{E[Y \vert X]}{1-E[Y \vert X]}\right] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

--

meaning

$$
p(X) = E[Y \vert X] = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
$$
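
---

## Logistic regression

For intuition, here is a minimal sketch in R (the coefficients below are made up for illustration, not fit to any data) showing that the inverse-logit transform maps any value of the linear predictor into `\((0, 1)\)`:

```r
# Hypothetical coefficients for a single-predictor model (illustration only):
beta_0 <- 2
beta_1 <- -0.1
x <- seq(-100, 100, by = 25)
# The linear predictor (log-odds scale) can be any real number:
log_odds <- beta_0 + beta_1 * x
# The inverse-logit transform squeezes it into (0, 1):
p_x <- exp(log_odds) / (1 + exp(log_odds))
round(p_x, 4)
```

No matter how extreme the linear predictor gets, the resulting probabilities stay strictly between 0 and 1.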

---

## Major difference between linear and logistic regression

Logistic regression __involves numerical optimization__

- `\(Y_i\)` is the observed response for each of the `\(n\)` observations - either 0 or 1

--

- we need to use an iterative algorithm to find the `\(\beta\)`'s that maximize the __likelihood__

`$$\prod_{i=1}^{n} p\left(X_{i}\right)^{Y_{i}}\left(1-p\left(X_{i}\right)\right)^{1-Y_{i}}$$`

--

- Newton's method: start with an initial guess, calculate the gradient of the log-likelihood, add an amount proportional to the gradient to the parameters, moving up the log-likelihood surface

--

- this means logistic regression runs more slowly than linear regression

--

- if you're interested: [in practice you use iteratively re-weighted least squares, Section 12.3.1](http://www.stat.cmu.edu/~cshalizi/uADA/15/lectures/12.pdf)

---

## Inference with logistic regression

A __major motivation__ for logistic regression (and all GLMs) is __inference__

- how does the response change when we change a predictor by one unit?

--

For linear regression, the answer is straightforward

`$$E[Y \vert X] = \beta_0 + \beta_1 X_1$$`

--

For logistic regression it is a little _less_ straightforward,

$$
E[Y \vert X] = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
$$

--

- the predicted response varies __non-linearly__ with the predictor variable values

- one convention is to fall back upon the concept of __odds__

---

## The odds interpretation

Pretend the predicted probability is 0.8 given a particular predictor variable value

- just pretend we only have one predictor variable

--

This means that if we were to repeatedly sample response values given that predictor variable value, __we expect class 1 to appear 4 times as often as class 0__

`$$Odds = \frac{E[Y \vert X]}{1-E[Y \vert X]} = \frac{0.8}{1-0.8} = 4 = e^{\beta_0+\beta_1X}$$`

Thus we say that for the given predictor variable value, the `\(Odds\)` are 4 (or 4 to 1) in favor of class 1

--

How do the odds change if we change the value of the predictor variable by one unit?

--

`$$Odds_{\rm new} = e^{\beta_0+\beta_1(X+1)} = e^{\beta_0+\beta_1X}e^{\beta_1} = e^{\beta_1}Odds_{\rm old}$$`

For every unit change in `\(X\)`, the odds change by a __factor__ of `\(e^{\beta_1}\)`

---

## Example data: NFL field goal attempts

Created a dataset using [`nflscrapR-data`](https://github.com/ryurko/nflscrapR-data) of all NFL field goal attempts from 2009 to 2019

```r
nfl_fg_attempts <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/glm_examples/nfl_fg_attempt_data.csv")
nfl_fg_attempts
```

```
## # A tibble: 10,811 x 11
##    kicker_player_id kicker_player_n…   qtr score_different… home_team posteam posteam_type
##    <chr>            <chr>            <dbl>            <dbl> <chr>     <chr>   <chr>       
##  1 00-0020962       R.Bironas            1                0 PIT       TEN     away        
##  2 00-0020962       R.Bironas            2                0 PIT       TEN     away        
##  3 00-0020962       R.Bironas            4                0 PIT       TEN     away        
##  4 00-0020737       J.Reed               4               -3 PIT       PIT     home        
##  5 00-0020737       J.Reed               5                0 PIT       PIT     home        
##  6 00-0004091       P.Dawson             1                0 CLE       CLE     home        
##  7 00-0010072       R.Longwell           1               -3 CLE       MIN     away        
##  8 00-0004091       P.Dawson             2               -7 CLE       CLE     home        
##  9 00-0010072       R.Longwell           4               12 CLE       MIN     away        
## 10 00-0006800       J.Hanson             1              -14 NO        DET     away        
## # … with 10,801 more rows, and 4 more variables: kick_distance <dbl>, pbp_season <dbl>,
## #   abs_score_diff <dbl>, is_fg_made <dbl>
```

---

## Fitting a logistic regression model

.pull-left[

- We use the `glm` function (similar to `lm`)

- __Specify the family is `binomial`__

```r
*init_logit <- glm(is_fg_made ~ kick_distance, data = nfl_fg_attempts,
*                  family = "binomial")
```

- View the predicted probability relationship

```r
nfl_fg_attempts %>%
*  mutate(pred_prob = init_logit$fitted.values) %>%
  ggplot(aes(x = kick_distance)) +
  geom_line(aes(y = pred_prob), color = "blue") +
  geom_point(aes(y = is_fg_made),
             alpha = 0.3, color = "darkorange") +
  theme_bw()
```

]

.pull-right[

<img src="11-logit-reg_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

]

---

```r
summary(init_logit)
```

```
## 
## Call:
## glm(formula = is_fg_made ~ kick_distance, family = "binomial", 
##     data = nfl_fg_attempts)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7752   0.2420   0.4025   0.6252   1.5136  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.916656   0.145371   40.70   <2e-16 ***
## kick_distance -0.104365   0.003255  -32.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9593.1  on 10810  degrees of freedom
## Residual deviance: 8277.5  on 10809  degrees of freedom
## AIC: 8281.5
## 
## Number of Fisher Scoring iterations: 5
```

---

## Logistic regression output

```
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7752   0.2420   0.4025   0.6252   1.5136  
```

The __deviance residuals__ are, for each observation,

$$
d_i = \mbox{sign}(y_i-\hat{p}_i) \sqrt{-2[y_i \log \hat{p}_i + (1-y_i) \log (1 - \hat{p}_i)]}
$$

where `\(y_i\)` is the `\(i^{\rm th}\)` observed response and `\(\hat{p}_i\)` is the estimated probability of success

```
Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    5.916656   0.145371   40.70   <2e-16 ***
kick_distance -0.104365   0.003255  -32.06   <2e-16 ***
```

The odds of a make at a kick distance of 0 (the intercept) are `\(e^{5.916656}\)`, and `\(Odds_{\rm new}/Odds_{\rm old} = e^{-0.104365}\)` per additional yard of distance.
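
--

To put this on an interpretable scale, a quick sketch (using the `init_logit` object fit above) that exponentiates the fitted coefficients:

```r
# Exponentiate the fitted coefficients to get multiplicative odds changes:
exp(coef(init_logit))
# The kick_distance entry is exp(-0.104365), roughly 0.90: each additional
# yard of distance multiplies the odds of a make by about 0.90 (~10% drop)
```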

---

## Logistic regression output

```
    Null deviance: 9593.1  on 10810  degrees of freedom
Residual deviance: 8277.5  on 10809  degrees of freedom
AIC: 8281.5
```

```r
logLik(init_logit)   # the maximum log-likelihood value
```

```
## 'log Lik.' -4138.732 (df=2)
```

- the residual deviance is -2 times -4138.732, or 8277.5

- the AIC is `\(2k - 2\log{\cal L}\)` = `\(2 \cdot k - 2 \cdot (-4138.732)\)` = 8281.5, where `\(k\)` is the number of degrees of freedom (here, `df` = 2)

- These are all metrics of the quality of fit of the model, __but we will consider them less important than test-set performance__

---

## Logistic regression predictions

To generate logistic regression predictions there are a few things to keep in mind...

- the `fitted.values` __are on the probability scale__: all are between 0 and 1

--

- but the __default__ for `predict(init_logit)` is __the log-odds scale!__

--

- we change this with the `type` argument: `predict(init_logit, type = "response")`

--

How do we predict the class, e.g., make or miss the field goal?

--

```r
pred_fg_outcome <- ifelse(init_logit$fitted.values > 0.5,
                          "make", "miss")
```

- typically if the predicted probability is > 0.5 then we predict success, else failure
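
--

For example, a sketch of probability-scale predictions for a few new kick distances (the `new_kicks` distances below are made up purely for illustration):

```r
# Hypothetical kick distances (in yards), chosen only for illustration:
new_kicks <- tibble(kick_distance = c(20, 35, 50, 65))
# Probability-scale predictions from the fitted model:
new_probs <- predict(init_logit, newdata = new_kicks, type = "response")
# Corresponding class predictions using the 0.5 cutoff:
ifelse(new_probs > 0.5, "make", "miss")
```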

---

## Model assessment

The most straightforward way is the __confusion matrix__ (rows are predictions, columns are observed):

```r
table("Predictions" = pred_fg_outcome, "Observed" = nfl_fg_attempts$is_fg_made)
```

```
##            Observed
## Predictions    0    1
##        make 1662 8994
##        miss   94   61
```

--

__In-sample misclassification rate__:

```r
mean(ifelse(fitted(init_logit) < 0.5, 0, 1) != nfl_fg_attempts$is_fg_made)
```

```
## [1] 0.1593747
```

--

__Brier score__:

```r
mean((nfl_fg_attempts$is_fg_made - fitted(init_logit))^2)
```

```
## [1] 0.1197629
```

---

## __Well-calibrated__ if actual probabilities match predicted probabilities

.pull-left[

```r
nfl_fg_attempts %>%
  mutate(pred_prob = init_logit$fitted.values,
         bin_pred_prob = round(pred_prob / 0.05) * .05) %>%
  # Group by bin_pred_prob:
  group_by(bin_pred_prob) %>%
  # Calculate the calibration results:
  summarize(n_attempts = n(),
            bin_actual_prob = mean(is_fg_made)) %>%
  ggplot(aes(x = bin_pred_prob, y = bin_actual_prob)) +
  geom_point(aes(size = n_attempts)) +
  geom_smooth(method = "loess", se = FALSE) +
  geom_abline(slope = 1, intercept = 0,
              color = "black", linetype = "dashed") +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(size = "Number of attempts",
       x = "Estimated make probability",
       y = "Observed make probability") +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

- e.g., if the model says the probability of rain for a group of days is 50%, it better rain on half those days... or something is incorrect about the probability!

<img src="11-logit-reg_files/figure-html/unnamed-chunk-7-1.png" width="504" />

]

---

## BONUS: Leave-one-season-out cross-validation (with [`purrr`](https://purrr.tidyverse.org/))

In many datasets, rather than random holdout folds, you might have particular holdouts of interest (e.g., seasons, games, etc.)

--

```r
nfl_fg_loso_cv_preds <- # generate holdout predictions for every row based on season
  map_dfr(unique(nfl_fg_attempts$pbp_season),
          function(season) {
            # Separate test and training data:
            test_data <- nfl_fg_attempts %>%
              filter(pbp_season == season)
            train_data <- nfl_fg_attempts %>%
              filter(pbp_season != season)
            # Train model:
            fg_model <- glm(is_fg_made ~ kick_distance, data = train_data,
                            family = "binomial")
            # Return tibble of holdout results:
            tibble(test_pred_probs = predict(fg_model, newdata = test_data,
                                             type = "response"),
                   test_actual = test_data$is_fg_made,
                   test_season = season)
          })
```

---

## BONUS: overall holdout performance

__Misclassification rate__:

```r
nfl_fg_loso_cv_preds %>%
  mutate(test_pred = ifelse(test_pred_probs < .5, 0, 1)) %>%
  summarize(mcr = mean(test_pred != test_actual))
```

```
## # A tibble: 1 x 1
##     mcr
##   <dbl>
## 1 0.160
```

__Brier score__:

```r
nfl_fg_loso_cv_preds %>%
  summarize(brier_score = mean((test_actual - test_pred_probs)^2))
```

```
## # A tibble: 1 x 1
##   brier_score
##         <dbl>
## 1       0.120
```

---

## BONUS: holdout performance by season

```r
nfl_fg_loso_cv_preds %>%
  mutate(test_pred = ifelse(test_pred_probs < .5, 0, 1)) %>%
  group_by(test_season) %>%
  summarize(mcr = mean(test_pred != test_actual)) %>%
  ggplot(aes(x = test_season, y = mcr)) +
  geom_bar(stat = "identity", width = .1) +
  geom_point(size = 5) +
  theme_bw() +
  scale_x_continuous(breaks = unique(nfl_fg_loso_cv_preds$test_season))
```

<img src="11-logit-reg_files/figure-html/mcr-cv-year-1.png" width="504" style="display: block; margin: auto;" />
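
---

## BONUS: holdout Brier score by season

A sketch following the same pattern as the misclassification plot above, simply swapping in the holdout Brier score for each season (it reuses only the `nfl_fg_loso_cv_preds` object defined earlier):

```r
nfl_fg_loso_cv_preds %>%
  group_by(test_season) %>%
  # Brier score: mean squared difference between outcome and predicted probability
  summarize(brier_score = mean((test_actual - test_pred_probs)^2)) %>%
  ggplot(aes(x = test_season, y = brier_score)) +
  geom_bar(stat = "identity", width = .1) +
  geom_point(size = 5) +
  theme_bw() +
  scale_x_continuous(breaks = unique(nfl_fg_loso_cv_preds$test_season))
```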