class: center, middle, inverse, title-slide

# Supervised Learning
## Model assessment vs selection
### June 22nd, 2022

---

## Supervised learning

What is __statistical learning?__

--

[Preface of Introduction to Statistical Learning with Applications in R (ISLR)](https://www.statlearning.com/):

> _refers to a set of tools for modeling and understanding complex datasets_

--

What is __supervised learning?__

__Goal__: uncover associations between a set of __predictor__ (aka __independent__ / __explanatory__) variables / features and a single __response__ (or __dependent__) variable

--

- Ex: MLB records the batting average (hits / at-bats) for every player in each season. Can we accurately predict player X's batting average in year `\(t+1\)` using their batting average in year `\(t\)`? Can we uncover meaningful relationships between other measurements (which are __predictors__ / __features__) and batting average in year `\(t+1\)` (which is the __response__)?

--

- Ex: We are provided player tracking data from the NFL: the x,y coordinates of every player on the field at every tenth of a second. Can we predict how far a ball-carrier will go at any given moment they have the football?

---

## Examples of statistical learning methods / algorithms

__You are probably already familiar with statistical learning__ - even if you did not know what the phrase meant before

--

Examples of statistical learning algorithms include:

- Generalized linear models (GLMs) and penalized versions (Lasso, elastic net)

- Smoothing splines, Generalized additive models (GAMs)

- Decision trees and their variants (e.g., random forest, boosting)

- Neural networks (e.g., convolutional neural networks)

--

Two main types of estimation __given values for predictors__:

- __Regression__ models: estimate the _average_ value of the response

- __Classification__ models: determine _the most likely_ class among a set of discrete response variable classes

---

## Which method should I use in my analysis?

--

__Depends on your goal__ - the big picture: __inference__ vs __prediction__

Let `\(Y\)` be the response variable and `\(X\)` be the predictors; then the __learned__ model will take the form:

$$
\hat{Y}=\hat{f}(X)
$$

--

- Care about the details of `\(\hat{f}(X)\)`? `\(\Rightarrow\)` You want to perform statistical inference

- Fine with treating `\(\hat{f}(X)\)` as an obscure/mystical machine? `\(\Rightarrow\)` Your interest is prediction

--

Any algorithm can be used for prediction; however, options are limited for inference

_Active area of research on using more mystical models for statistical inference_

---

### Model flexibility vs interpretability

Generally speaking: there is a __tradeoff__ between a model's _flexibility_ (i.e. how "wiggly" it is) and how __interpretable__ it is

- The simpler the parametric form of the model `\(\Rightarrow\)` the easier it is to interpret

- Hence why __linear regression__ is popular in practice

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

.footnote[[ISLR Figure 2.7](https://www.statlearning.com/)]

---

## Model flexibility vs interpretability

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

- __Parametric__ models, for which we can write down a mathematical expression for `\(f(X)\)` __before observing the data__, _a priori_ (e.g. linear regression), __are inherently less flexible__

--

- __Nonparametric__ models, in which `\(f(X)\)` is __estimated from the data__ (e.g. kernel regression), are more flexible
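---

### Aside: parametric vs nonparametric fits in `R`

A minimal sketch on simulated data (not from ISLR or these slides' code) contrasting a parametric `lm()` fit, whose form is fixed _a priori_, with a nonparametric `loess()` smoother, whose shape is estimated from the data:

```r
set.seed(101)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)      # assumed truth: f(X) = sin(X), plus noise

lm_fit    <- lm(y ~ x)                  # parametric: a straight line, fixed a priori
loess_fit <- loess(y ~ x, span = 0.5)   # nonparametric: shape learned from the data

x_grid <- seq(min(x), max(x), length.out = 200)
plot(x, y, pch = 16, col = "gray")
lines(x_grid, predict(lm_fit, data.frame(x = x_grid)), col = "orange", lwd = 2)
lines(x_grid, predict(loess_fit, data.frame(x = x_grid)), col = "darkgreen", lwd = 2)
```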
---

## Model flexibility vs interpretability

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

- If your goal is prediction `\(\Rightarrow\)` your model can be as arbitrarily flexible as it needs to be

- We'll discuss how one estimates the optimal amount of flexibility shortly...

---

## Looks about right...

<img src="https://66.media.tumblr.com/c4886c7b12f2a9a7d81cba3e8d8f1d00/bc9f1aa7fb6adf6d-7c/s1280x1920/a01569c35bebdac425baf4ed3360f1481580d4d6.jpg" width="90%" style="display: block; margin: auto;" />

---

## Model assessment vs selection, what's the difference?

--

__Model assessment__:

- __evaluating how well a learned model performs__, via the use of a single-number metric

--

__Model selection__:

- selecting the best model from a suite of learned models (e.g., linear regression, random forest, etc.)

---

## Model __flexibility__ ([ISLR Figure 2.9](https://www.statlearning.com/))

<img src="http://www.stat.cmu.edu/~pfreeman/Flexibility.png" width="60%" style="display: block; margin: auto;" />

- Left panel: an intuitive notion of the meaning of model flexibility

- Data are generated from a smoothly varying non-linear model (shown in black), with random noise added:

$$
Y = f(X) + \epsilon
$$

---

## Model __flexibility__

<img src="http://www.stat.cmu.edu/~pfreeman/Flexibility.png" width="60%" style="display: block; margin: auto;" />

Orange line: an inflexible, fully parametrized model (simple linear regression)

--

- __Cannot__ provide a good estimate of `\(f(X)\)`

--

- Cannot __overfit__ by modeling the noisy deviations of the data from `\(f(X)\)`

---

## Model __flexibility__

<img src="http://www.stat.cmu.edu/~pfreeman/Flexibility.png" width="60%" style="display: block; margin: auto;" />

Green line: an overly flexible, nonparametric model

--

- __It can__ provide a good estimate of `\(f(X)\)`

--

- __BUT__ it goes too far and overfits by modeling the noise

--

__This is NOT generalizable__: it does a poor job of predicting the response for new data NOT used in learning the model
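---

### A quick look at overfitting in `R`

A minimal sketch with simulated data (not the code behind the figure): an overly flexible smoothing spline partly models the noise in the data used to fit it, so its error there is low, but its error on held-out data is noticeably higher - which motivates the next slide.

```r
set.seed(101)
n <- 100
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)      # assumed truth: f(X) = sin(X)
train <- sample(n, n / 2)             # random half used to fit the model

# overly flexible fit: smoothing spline with many effective degrees of freedom
wiggly <- smooth.spline(x[train], y[train], df = 40)

mse <- function(obs, pred) mean((obs - pred)^2)
mse(y[train],  predict(wiggly, x[train])$y)    # low: partly fitting the noise
mse(y[-train], predict(wiggly, x[-train])$y)   # higher: does not generalize
```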
---

## So... how do we deal with flexibility?

__GOAL__: We want to learn a statistical model that provides a good estimate of `\(f(X)\)` __without overfitting__

--

There are two common approaches:

- We can __split the data into two groups__:

  - __training__ data: data used to train models

  - __test__ data: data used to test them

  - By assessing models using "held-out" test set data, we act to ensure that we get a __generalizable(!)__ estimate of `\(f(X)\)`

--

- We can __repeat data splitting `\(k\)` times__:

  - Each observation is placed in the "held-out" / test data exactly once

  - This is called __k-fold cross validation__ (typically set `\(k\)` to 5 or 10)

--

`\(k\)`-fold cross validation is the preferred approach, but the tradeoff is that CV analyses take `\({\sim}k\)` times longer than analyses that utilize data splitting

---

### Model assessment

<img src="http://www.stat.cmu.edu/~pfreeman/Flexibility.png" width="60%" style="display: block; margin: auto;" />

- The right panel shows __a metric of model assessment__, the mean squared error (MSE), as a function of flexibility for both the training and test datasets

--

- Training error (gray line) __decreases as flexibility increases__

--

- Test error (red line) decreases as flexibility increases __until__ the point where a good estimate of `\(f(X)\)` is reached, and then it __increases as the model overfits to the training data__

---

## Brief note on reproducibility

An important aspect of a statistical analysis is that it be reproducible. You should...

1. Record your analysis in a notebook, via, e.g., `R Markdown` or `Jupyter`. A notebook should be complete such that if you give it and the datasets to someone, that someone should be able to recreate the entire analysis and achieve the exact same results. To ensure the achievement of the exact same results, you should...

2. Manually set the random-number generator seed before each instance of random sampling in your analysis (such as when you assign data to training or test sets, or to folds):

```r
set.seed(101) # can be any number...
sample(10, 3) # sample three numbers between 1 and 10 inclusive
```

```
## [1] 9 10 6
```

```r
set.seed(101)
sample(10, 3) # voila: the same three numbers!
```

```
## [1] 9 10 6
```

---

## Model assessment metrics

A __loss function__ (aka _objective_ or _cost_ function) is a metric that represents __the quality of fit of a model__

--

For regression we typically use __mean squared error (MSE)__ - quadratic loss: the squared differences between model predictions `\(\hat{f}(X)\)` and observed data `\(Y\)`

`$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{f}(X_i))^2$$`

--

For classification, the situation is not quite so clear-cut

- __misclassification rate (MCR)__: percentage of predictions that are wrong

- __area under curve (AUC)__ (we'll come back to this later)

--

- interpretation can be affected by __class imbalance__:

  - if two classes are equally represented in a dataset, an MCR of 2% is good

  - but if one class comprises 99% of the data, that 2% is no longer such a good result...
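---

### Example: estimating test MSE with 5-fold cross-validation

A minimal base-`R` sketch of the `\(k\)`-fold procedure described earlier, on simulated data and with MSE as the assessment metric (a toy illustration; in practice dedicated packages can handle this bookkeeping):

```r
set.seed(101)
n <- 100
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)
dat <- data.frame(x, y)

k <- 5
fold_id <- sample(rep(1:k, length.out = n))   # each observation lands in exactly one fold

cv_mse <- sapply(1:k, function(f) {
  fit <- lm(y ~ x, data = dat[fold_id != f, ])          # train on the other k - 1 folds
  preds <- predict(fit, newdata = dat[fold_id == f, ])  # predict the held-out fold
  mean((dat$y[fold_id == f] - preds)^2)                 # held-out MSE for this fold
})
mean(cv_mse)   # cross-validated estimate of the test MSE
```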
---

## Back to model selection

__Model selection__: picking the best model from a suite of possible models

--

- Picking the best covariance constraints (e.g. _VVV_) or number of clusters with BIC

- Picking the best regression model based on __MSE__, or the best classification model based on __MCR__

--

Two things to keep in mind:

1. __Ensure an apples-to-apples comparison of metrics__

  - every model should be learned using __the same training and test data__! Do not resample the data between, e.g., fitting the linear regression model and fitting the random forest

2. __An assessment metric is a random variable__, i.e., if you choose different data to be in your training set, the metric will be different

--

For regression, a third point should be kept in mind: __a metric like the MSE is unit-dependent__

- an MSE of 0.001 in one analysis context is not necessarily better or worse than an MSE of 100 in another context

---

## An example __true__ model

<img src="10-Model-assess_files/figure-html/unnamed-chunk-10-1.png" width="432" style="display: block; margin: auto;" />

---

## The repeated experiments...

<img src="10-Model-assess_files/figure-html/unnamed-chunk-11-1.png" width="1008" style="display: block; margin: auto;" />

---

## The linear regression fits

<img src="10-Model-assess_files/figure-html/unnamed-chunk-13-1.png" width="1008" style="display: block; margin: auto;" />

Look at the plots. For any given value of `\(x\)`:

- The *average* estimated `\(y\)` value is offset from the truth (__high bias__)

- The dispersion (variance) in the estimated `\(y\)` values is relatively small (__low variance__)

---

## The spline fits

<img src="10-Model-assess_files/figure-html/unnamed-chunk-15-1.png" width="1008" style="display: block; margin: auto;" />

Look at the plots. For any given value of `\(x\)`:

--

- The *average* estimated `\(y\)` value approximately matches the truth (__low bias__)

- The dispersion (variance) in the estimated `\(y\)` values is relatively large (__high variance__)

---

### Bias-variance tradeoff

The "best" model minimizes the test-set MSE, where the __true__ test MSE can be decomposed into:

--

$$
{\rm MSE} = {\rm (Bias)}^2 + {\rm Variance} + \sigma^2_{\epsilon}
$$

where `\(\sigma^2_{\epsilon}\)` is the irreducible error coming from the noise term `\(\epsilon\)`

--

<img src="http://www.stat.cmu.edu/~pfreeman/Flexibility.png" width="60%" style="display: block; margin: auto;" />

Towards the left: high bias, low variance. Towards the right: low bias, high variance.

__Optimal amount of flexibility lies somewhere in the middle__
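---

### A simulation sketch of bias and variance

A small sketch of the decomposition above (simulated data and hypothetical settings, not the code behind the earlier plots): repeat the experiment many times, then compare the bias and variance of a simple linear fit and a flexible spline fit at a single value `\(x_0\)`.

```r
set.seed(101)
f  <- function(x) sin(x)              # assumed "true" model for illustration
x0 <- 5                               # evaluate bias / variance at this point
x  <- seq(0, 10, length.out = 100)    # fixed design, as in the repeated experiments

one_experiment <- function() {
  y <- f(x) + rnorm(length(x), sd = 0.5)
  c(linear = unname(predict(lm(y ~ x), data.frame(x = x0))),
    spline = predict(smooth.spline(x, y, df = 25), x = x0)$y)
}
preds <- replicate(500, one_experiment())   # 500 repeated experiments

rowMeans(preds) - f(x0)   # bias at x0: the linear fit is clearly offset, the spline is near 0
apply(preds, 1, var)      # variance at x0: the spline predictions vary more than the linear ones
```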