class: center, middle, inverse, title-slide

# Supervised Learning
## Generalized linear models
### Ron Yurko
### 06/17/2020

---

## Probability distributions

A __distribution__ is a mathematical function `\(f(x \vert \theta)\)` where

- `\(x\)` may take on continuous or discrete values over the _domain_ (i.e. all possible inputs) of `\(f(x \vert \theta)\)`

--

- `\(\theta\)` is a set of parameters governing the shape of the distribution

  - e.g. `\(\theta = \{\mu,\sigma^2\}\)` for a [normal or Gaussian](https://en.wikipedia.org/wiki/Normal_distribution) distribution

  - the `\(\vert\)` symbol means that the shape of the distribution is *conditional* on the values of `\(\theta\)`

--

- `\(f(x \vert \theta) \geq 0\)` for all `\(x\)`

- `\(\sum_x f(x \vert \theta) = 1\)` (discrete) or `\(\int_x f(x \vert \theta)\,dx = 1\)` (continuous)

--

We use `\(f\)` to denote the distribution for its:

- __probability density function (PDF)__ if `\(x\)` is continuous

- __probability mass function (PMF)__ if `\(x\)` is discrete

---

## Probability distribution examples: [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution)

.pull-left[

Normal distribution PDF (`dnorm`):

`$$f(x \vert \mu, \sigma^2)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$`

- we write `\(X \sim N(\mu, \sigma^2)\)`

- __standard normal__: `\(N(0, 1)\)`

- can plot density curves with __stat_function()__


```r
tibble(x = c(-5, 5)) %>%
  ggplot(aes(x)) +
* stat_function(fun = dnorm, n = 101,
*               args = list(mean = 0, sd = 1)) +
  stat_function(fun = dnorm, color = "red",
                args = list(mean = -2, sd = sqrt(0.5))) +
  theme_bw()
```

]

.pull-right[

<img src="10-glms_files/figure-html/unnamed-chunk-1-1.png" width="504" />

]

---

## Probability distribution examples: [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution)

.pull-left[

Binomial distribution PMF (`dbinom`):

`$$f(x \vert n, p)= \binom{n}{x} p^{x}(1-p)^{n-x}$$`

- model for the probability of `\(x\)` successes in `\(n\)` independent trials (`size`), each with success probability `\(p\)` (`prob`)

- we write `\(X \sim \text{Binomial}(n, p)\)`

- `R` uses `d` for both PDFs and PMFs


```r
tibble(x = 0:20) %>%
  mutate(binom1 = dbinom(x, size = 20, prob = 0.5),
         binom2 = dbinom(x, size = 20, prob = 0.1)) %>%
  ggplot(aes(x)) +
  geom_point(aes(y = binom1)) +
  geom_point(aes(y = binom2), color = "red") +
  theme_bw()
```

]

.pull-right[

<img src="10-glms_files/figure-html/unnamed-chunk-2-1.png" width="504" />

]
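---

## Checking the distribution properties in `R`

As a quick sanity check (a sketch using only base `R` functions), we can verify that a PMF sums to 1 over its domain and a PDF integrates to 1:


```r
# PMF: sum over the discrete domain of a Binomial(20, 0.5)
sum(dbinom(0:20, size = 20, prob = 0.5))

# PDF: numerically integrate the standard normal over its domain
integrate(dnorm, lower = -Inf, upper = Inf)

# both should return (approximately) 1
```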
---

## Distributions and regression

Why does this matter?

--

- Because linear regression, and generalized variants, __make assumptions__ about how observed data are distributed around the true regression line, conditional on a value of `\(x\)`

--

For simple linear regression, our goal is to estimate `\(E[Y \vert x]\)`, assuming that for every value of `\(x\)`...

--

- the distribution governing the possible values of `\(Y\)` is a __normal distribution__

  - _Note:_ we capitalize `\(Y\)` because its values are __random variables__ (random samples from the distribution)

--

- the __mean__ of the normal distribution is `\(E[Y \vert x] = \mu(y \vert x) = \beta_0 + \beta_1 x\)`

- the __variance__ of the normal distribution is `\(\sigma^2\)`, which is constant (i.e., does not vary with `\(x\)`)

--

- `\(\Rightarrow Y \sim N(\beta_0 + \beta_1 x, \sigma^2)\)`, same as yesterday: `\(Y = \beta_0 + \beta_1 x + \epsilon\)`, where `\(\epsilon \sim N(0, \sigma^2)\)`

--

However, just because these assumptions are made in simple linear regression doesn't mean that all linear regression-related models rely on them. __They don't__.

When we step back from these assumptions, we enter the realm of __generalized linear models (GLMs)__.

---

## Maximum likelihood estimation

In generalized regression, we

1. assume a (family of) distribution(s) that governs the observed response values `\(Y\)`, and

--

2. estimate the parameters `\(\theta\)` of that distribution.

--

Estimation is done by maximizing the __likelihood function__:

`$$\mathcal{L} = \prod_{i=1}^n f(Y_i \vert \theta)$$`

to find the __maximum likelihood estimators (MLEs)__ (in practice we typically maximize `\(\ell = \log{\mathcal{L}}\)`, the __log-likelihood__)

--

Sweeping many details under the rug:

- the maximum is a point at which the derivative of the likelihood function is zero

- for the likelihood functions we consider here, you don't need to check the second derivative: wherever the derivative equals zero, the likelihood is at a maximum, not a minimum

---

## MLE for regression

__Determining the value of `\(\theta\)` that achieves the maximum likelihood can be difficult__

--

It may require __numerical optimization__

- wherein the computer, using an algorithm, searches over possible values of `\(\theta\)` to find the optimal one

--

For linear regression, `\(\mathcal{L}\)` can be maximized analytically:

`$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}$$`

- the `\(\hat{\boldsymbol{\beta}}\)` estimates that minimize the __residual sum of squares (RSS)__ are the MLEs!

- the MLE for `\(\sigma^2\)` is the __training data MSE__, `\(\hat{\sigma}^2 = \frac{1}{n} \sum_i^n (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2\)`

--

When we generalize linear regression to other distributions... we are usually not so lucky.
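---

## MLE for regression: a quick check

Before we leave the lucky case, a minimal sketch on simulated data (the `x` and `y` below are made up for illustration) confirms that the closed-form solution matches `lm()`:


```r
set.seed(10)
n <- 100
x <- runif(n, -2, 2)
y <- 3 + 2 * x + rnorm(n, mean = 0, sd = 1)

# design matrix with an intercept column
X <- cbind(1, x)

# closed-form MLE: solve (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat

# should match the coefficients from lm()
coef(lm(y ~ x))
```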
---

## Generalization example

In typical linear regression, the distribution is normal and the domain of `\(Y \vert x\)` is `\((-\infty,\infty)\)`.

--

What, however, happens if we know that

1. the domain of observed values of the response is actually `\([0,\infty)\)`? and

--

2. the observed values are __discrete__, with possible values 0, 1, 2, ...?

--

__The normal distribution doesn't hold here__

- Any idea of what distribution could possibly govern `\(Y \vert x\)`?

- Remember, we might not truly know how `\(Y \vert x\)` is distributed, but any assumption we make has to fit with the constraints imposed by points 1 and 2 above

---

## Generalization: Poisson regression

A distribution that fulfills the conditions imposed on the last slide is the [__Poisson__ distribution](https://en.wikipedia.org/wiki/Poisson_distribution),

`$$f(x \vert \lambda) = \frac{\lambda^x e^{-\lambda}}{x!},\text{ where }x = 0, 1, 2, \dots$$`

--

- has a single parameter `\(\lambda\)`, which is __both__ the mean AND variance of the distribution

  - in general the variance governs the distribution's shape

--

- distribution of the number of independent event occurrences in an interval, e.g. soccer goals in a match

  - `\(\lambda\)` is the average number of events in an interval

--

So, when we apply generalized linear regression in this context, we would identify the family as Poisson. But there's another step in generalization...

---

## Generalization: link function

Let's keep things in the realm of one predictor. The linear function is

`$$\beta_0 + \beta_1 x \,.$$`

--

The range of this function is `\((-\infty,\infty)\)`. But in our Poisson regression example, we know that `\(Y\)` cannot be negative, *so we need to transform the linear function somehow* so that the range of the transformed function is, in this case, `\((0,\infty)\)`.

(Or we could punt and use simple linear regression, but our results may not be meaningful, particularly if we estimate `\(\hat{Y}\)` as being negative!)

--

__There is usually no unique transformation__, but rather ones that are conventionally used. For this case:

`$$g(\lambda \vert x) = \log(\lambda \vert x) = \beta_0 + \beta_1 x$$`

`\(g\)` is the __link__ function.

--

To tie this all together, given response data whose values are limited to being either 0 or positive integers, with no upper bound, we

1. assume that `\(Y \vert x\)` is Poisson distributed;

2. assume that `\(\lambda \vert x = e^{\beta_0 + \beta_1 x}\)`; and

3. use numerical optimization to estimate `\(\beta_0\)` and `\(\beta_1\)` by maximizing the likelihood function.

Note that in the end we can still perform statistical inference, because we can still determine directly how the predicted response varies with the linear coefficients. (A sketch of this fit in `R` appears after the exercises.)

---

## Exercises

What family might be appropriate for...

--

1. `\(Y \vert x\)` continuous, but bounded between 0 and 1?

--

[Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution)

--

2. `\(Y \vert x\)` continuous, but bounded between 0 and `\(\infty\)`?

--

[Gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution)

--

3. `\(Y \vert x\)` discrete, but can only take on the values 0 and 1?

--

[Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)

Case 3 is special, as you will see.
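---

## Poisson regression in `R`

As promised, a minimal sketch of the full recipe on simulated data (the `x` and `y` below are made up for illustration): `glm()` with `family = poisson` uses the log link by default and maximizes the likelihood numerically for us.


```r
set.seed(30)
n <- 100
x <- runif(n, 0, 2)

# Poisson responses with a log-linear mean: lambda = exp(0.5 + 0.7 * x)
y <- rpois(n, lambda = exp(0.5 + 0.7 * x))

# fit the GLM; coefficient estimates are on the log (link) scale
poisson_fit <- glm(y ~ x, family = poisson)
summary(poisson_fit)

# predicted means on the response scale: exp(beta0_hat + beta1_hat * x)
head(predict(poisson_fit, type = "response"))
```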