class: center, middle, inverse, title-slide

# Supervised Learning
## Generalized linear models
### Ron Yurko
### 06/17/2020

---

## Probability distributions

A __distribution__ is a mathematical function `\(f(x \vert \theta)\)` where

- `\(x\)` may take on continuous or discrete values over the _domain_ (i.e. all possible inputs) of `\(f(x \vert \theta)\)`

--

- `\(\theta\)` is a set of parameters governing the shape of the distribution

  - e.g. `\(\theta = \{\mu,\sigma^2\}\)` for a [normal or Gaussian](https://en.wikipedia.org/wiki/Normal_distribution) distribution

  - the `\(\vert\)` symbol means that the shape of the distribution is *conditional* on the values of `\(\theta\)`

--

- `\(f(x \vert \theta) \geq 0\)` for all `\(x\)`

- `\(\sum_x f(x \vert \theta) = 1\)` (discrete) or `\(\int_x f(x \vert \theta)\,dx = 1\)` (continuous)

--

We use `\(f\)` to denote the distribution for its:

- __probability density function (PDF)__ if `\(x\)` is continuous

- __probability mass function (PMF)__ if `\(x\)` is discrete

---

## Probability distribution examples: [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution)

.pull-left[

Normal distribution PDF (`dnorm`):

`$$f(x \vert \mu, \sigma^2)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$`

- we write `\(X \sim N(\mu, \sigma^2)\)`

- __standard normal__: `\(N(0, 1)\)`

- can plot density curves with __stat_function()__


```r
tibble(x = c(-5, 5)) %>%
  ggplot(aes(x)) +
* stat_function(fun = dnorm, n = 101,
*               args = list(mean = 0, sd = 1)) +
  stat_function(fun = dnorm, color = "red",
                args = list(mean = -2, sd = sqrt(0.5))) +
  theme_bw()
```

]

.pull-right[

<img src="10-glms_files/figure-html/unnamed-chunk-1-1.png" width="504" />

]

---

## Probability distribution examples: [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution)

.pull-left[

Binomial distribution PMF (`dbinom`):

`$$f(x \vert n, p)= \binom{n}{x} p^{x}(1-p)^{n-x}$$`

- model for the probability of `\(x\)` successes in `\(n\)` independent trials (`size`), each with success probability `\(p\)` (`prob`)

- we write `\(X \sim \text{Binomial}(n, p)\)`

- `R` uses `d` for both PDFs and PMFs


```r
tibble(x = 0:20) %>%
  mutate(binom1 = dbinom(x, size = 20, prob = 0.5),
         binom2 = dbinom(x, size = 20, prob = 0.1)) %>%
  ggplot(aes(x)) +
  geom_point(aes(y = binom1)) +
  geom_point(aes(y = binom2), color = "red") +
  theme_bw()
```

]

.pull-right[

<img src="10-glms_files/figure-html/unnamed-chunk-2-1.png" width="504" />

]
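---

## Checking the distribution properties in `R`

As a quick sanity check (a sketch using only base `R` functions), we can verify that a PMF sums to 1 over its domain and a PDF integrates to 1:


```r
# PMF: sum over the discrete domain of a Binomial(20, 0.5)
sum(dbinom(0:20, size = 20, prob = 0.5))

# PDF: numerically integrate the standard normal over its domain
integrate(dnorm, lower = -Inf, upper = Inf)

# both should return (approximately) 1
```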
---

## Distributions and regression

Why does this matter?

--

- Because linear regression, and generalized variants, __make assumptions__ about how observed data are distributed around the true regression line, conditional on a value of `\(x\)`

--

For simple linear regression, our goal is to estimate `\(E[Y \vert x]\)`, assuming that for every value of `\(x\)`...

--

- the distribution governing the possible values of `\(Y\)` is a __normal distribution__

  - _Note:_ we capitalize `\(Y\)` because its values are __random variables__ (random samples from the distribution)

--

- the __mean__ of the normal distribution is `\(E[Y \vert x] = \mu(y \vert x) = \beta_0 + \beta_1 x\)`

- the __variance__ of the normal distribution is `\(\sigma^2\)`, which is constant (i.e., does not vary with `\(x\)`)

--

- `\(\Rightarrow Y \sim N(\beta_0 + \beta_1 x, \sigma^2)\)`, same as yesterday: `\(Y = \beta_0 + \beta_1 x + \epsilon\)`, where `\(\epsilon \sim N(0, \sigma^2)\)`

--

However, just because these assumptions are made in simple linear regression doesn't mean that all linear regression-related models rely on them. __They don't__.

When we step back from these assumptions, we enter the realm of __generalized linear models (GLMs)__.

---

## Maximum likelihood estimation

In generalized regression, we

1. assume a (family of) distribution(s) that governs the observed response values `\(Y\)`, and

--

2. estimate the parameters `\(\theta\)` of that distribution.

--

Estimation is done by maximizing the __likelihood function__:

`$$\mathcal{L} = \prod_{i=1}^n f(Y_i \vert \theta)$$`

to find the __maximum likelihood estimators (MLEs)__ (in practice we typically maximize `\(\ell = \log{\mathcal{L}}\)`, the __log-likelihood__)

--

Sweeping many details under the rug:

- the maximum is a point at which the derivative of the likelihood function is zero

- for the likelihood functions we consider here, you don't need to check the second derivative: wherever the derivative equals zero, the likelihood is at a maximum, not a minimum

---

## MLE for regression

__Determining the value of `\(\theta\)` that achieves the maximum likelihood can be difficult__

--

It may require __numerical optimization__

- wherein the computer, using an algorithm, searches over possible values of `\(\theta\)` to find the optimal one

--

For linear regression, `\(\mathcal{L}\)` can be maximized analytically:

`$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$$`

- the `\(\hat{\boldsymbol{\beta}}\)` estimates that minimize the __residual sum of squares (RSS)__ are the MLEs!

- the MLE for `\(\sigma^2\)` is the __training data MSE__, `\(\hat{\sigma}^2 = \frac{1}{n} \sum_i^n (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2\)`

--

When we generalize linear regression to other distributions... we are usually not so lucky.
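---

## MLE for regression: a quick check

Before we leave the lucky case, a minimal sketch on simulated data (the `x` and `y` below are made up for illustration) confirms that the closed-form solution matches `lm()`:


```r
set.seed(10)
n <- 100
x <- runif(n, -2, 2)
y <- 3 + 2 * x + rnorm(n, mean = 0, sd = 1)

# design matrix with an intercept column
X <- cbind(1, x)

# closed-form MLE: solve (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat

# should match the coefficients from lm()
coef(lm(y ~ x))
```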
---

## Generalization example

In typical linear regression, the distribution is normal and the domain of `\(Y \vert x\)` is `\((-\infty,\infty)\)`.

--

What, however, happens if we know that

1. the domain of observed values of the response is actually `\([0,\infty)\)`? and

--

2. the observed values are __discrete__, with possible values 0, 1, 2, ...?

--

__The normal distribution doesn't hold here__

- Any idea of what distribution could possibly govern `\(Y \vert x\)`?

- Remember, we might not truly know how `\(Y \vert x\)` is distributed, but any assumption we make has to fit with the constraints imposed by points 1 and 2 above

---

## Generalization: Poisson regression

A distribution that fulfills the conditions imposed on the last slide is the [__Poisson__ distribution](https://en.wikipedia.org/wiki/Poisson_distribution),

`$$f(x \vert \lambda) = \frac{\lambda^x e^{-\lambda}}{x!},\text{ where }x = 0, 1, 2, \dots$$`

--

- has a single parameter `\(\lambda\)`, which is __both__ the mean AND variance of the distribution

  - in general the variance governs the distribution's shape

--

- distribution of the number of independent event occurrences in an interval, e.g. soccer goals in a match

  - `\(\lambda\)` is the average number of events in an interval

--

So, when we apply generalized linear regression in this context, we would identify the family as Poisson. But there's another step in generalization...

---

## Generalization: link function

Let's keep things in the realm of one predictor. The linear function is

`$$\beta_0 + \beta_1 x \,.$$`

--

The range of this function is `\((-\infty,\infty)\)`. But in our Poisson regression example, we know that `\(Y\)` cannot be negative, *so we need to transform the linear function somehow* so that the range of the transformed function is, in this case, `\((0,\infty)\)`.

(Or we could punt and use simple linear regression, but our results may not be meaningful, particularly if we estimate `\(\hat{Y}\)` as being negative!)

--

__There is usually no unique transformation__, but rather ones that are conventionally used. For this case:

`$$g(\lambda \vert x) = \log(\lambda \vert x) = \beta_0 + \beta_1 x$$`

`\(g\)` is the __link__ function.

--

To tie this all together, given response data whose values are limited to being either 0 or positive integers, with no upper bound, we

1. assume that `\(Y \vert x\)` is Poisson distributed;

2. assume that `\(\lambda \vert x = e^{\beta_0 + \beta_1 x}\)`; and

3. use numerical optimization to estimate `\(\beta_0\)` and `\(\beta_1\)` by maximizing the likelihood function.

Note that in the end we can still perform statistical inference, because we can still determine directly how the predicted response varies with the linear coefficients. (A sketch of this fit in `R` appears after the exercises.)

---

## Exercises

What family might be appropriate for...

--

1. `\(Y \vert x\)` continuous, but bounded between 0 and 1?

--

[Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution)

--

2. `\(Y \vert x\)` continuous, but bounded between 0 and `\(\infty\)`?

--

[Gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution)

--

3. `\(Y \vert x\)` discrete, but can only take on the values 0 and 1?

--

[Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)

Case 3 is special, as you will see.
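---

## Poisson regression in `R`

As promised, a minimal sketch of the full recipe on simulated data (the `x` and `y` below are made up for illustration): `glm()` with `family = poisson` uses the log link by default and maximizes the likelihood numerically for us.


```r
set.seed(30)
n <- 100
x <- runif(n, 0, 2)

# Poisson responses with a log-linear mean: lambda = exp(0.5 + 0.7 * x)
y <- rpois(n, lambda = exp(0.5 + 0.7 * x))

# fit the GLM; coefficient estimates are on the log (link) scale
poisson_fit <- glm(y ~ x, family = poisson)
summary(poisson_fit)

# predicted means on the response scale: exp(beta0_hat + beta1_hat * x)
head(predict(poisson_fit, type = "response"))
```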