\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\ModelClass}{\mathcal{M}} \newcommand{\OptimalModel}{m^*} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\tr}{tr} \newcommand{\Rademacher}{\mathcal{R}} \]

Reading: Hand, Mannila, and Smyth (2001), ch. 7, “Score Functions for Data Mining Algorithms”; Shalizi (n.d.), ch. 3, “Model Evaluation”

1 Background and agenda

  • We’ve looked at a bunch of different data-mining tasks
    • Regression: Guess the value of a quantitative variable from other features
    • Classification: Guess the value of a qualitative variable from other features
    • Dimension reduction: Boil a large number of features into a smaller set of continuous variables, while preserving information
    • Clustering: Boil a large number of features into a small set of categorical variables, while preserving information
  • We’ve approached all of these as prediction problems, and looked at ways of measuring how good the predictions are
  • The agenda for today is to handle this “how good are our predictions?” question in a systematic, unified way
    • Many classes would cover this before the various specific instances of the question that we’ve looked at, but a lot of students find that too abstract and hard to connect to the more concrete questions
  • Specifically, we’ll look at:
    • Loss functions that measure how bad (or good!) a prediction is
    • The idea of risk, or expected loss on new data
    • Why average loss on training data is an optimistic estimate of true risk
      • Why picking the regression with the highest \(R^2\) is a bad idea
    • Why the optimism is especially big for complicated, flexible, “high capacity” models
    • Some ways of getting better estimates of risk

2 Loss functions

2.1 How much does being wrong cost us?

  • We try to predict \(Y\), and make the guess \(\hat{Y}\)
  • Usually, \(\hat{Y} \neq Y\), so we’re wrong, but how badly wrong?
  • A loss function is a function \(\ell(y, \hat{y})\) which we choose to use to measure how bad it is when we predict \(\hat{y}\) but the actual outcome is \(y\)
    • Typically, we require \(\ell(y, \hat{y}) \geq 0\), but sometimes it makes sense to allow it to go negative (negative loss = gain)
      • This most often makes sense when “loss” and “gain” have actual monetary meanings
      • Let’s ignore this possibility for the rest of today
    • Often, but not always, we require \(\ell(y, y) = 0\) (no loss if we’re right)
    • Often, but not always, we require \(\ell(y, \hat{y}) > 0\) if \(\hat{y} \neq y\) (always some loss if we’re wrong)
  • We want to make predictions where the loss will typically be small
    • More about “typically” in a moment
  • There are usually multiple loss functions we could use
    • Ideally, we have some real-world sense of what’s important and what’s not, and use that
      • Actual costs and benefits can be hard to determine
      • Actual costs and benefits may change over time
      • The same analyses/predictions may need to be re-used in different contexts with different actual costs and benefits
      • Whose costs and benefits?
    • In practice, we often use “standard” loss functions, for a variety of reasons:
      • Nice mathematical properties
      • Nice computational properties
      • Traditionally used in the field of study, so it’s easier to communicate and compare
      • Traditionally used, so there’s existing methods / software / folklore
  • Let’s look at some examples of loss functions

2.2 Some loss functions for regression

  • Squared error, \(\ell(y, \hat{y}) = (y-\hat{y})^2\)
    • Symmetric between negative and positive errors, 0 iff \(y=\hat{y}\)
    • The standard choice for regression since the late 1700s (not a typo: Farebrother (1999))
    • Minimizes nicely by calculus, because \(\frac{d\epsilon^2}{d\epsilon}=2\epsilon\)
    • We’ve seen that minimizing the mean squared error calls for the conditional expectation function, \(\Expect{Y|X}\)
      • Remember that if \(p(y)\) is the pdf, then \(\argmin_{m}{\int{(y-m)^2 p(y) dy}} = \int{y p(y) dy}\)
    • If \(|\epsilon| < 1\), then \(\epsilon^2 < |\epsilon|\), i.e., squared error doesn’t care very much about small errors
    • If \(|\epsilon| \gg 1\), then \(\epsilon^2 \gg |\epsilon|\), i.e., it’s really bothered by large errors
      • Said differently, because \(\frac{d\epsilon^2}{d\epsilon}=2\epsilon\), reducing big errors even a little improves the loss much more rapidly than reducing already-small errors does
    • This makes squared error sensitive to potential outliers
  • Absolute error, \(\ell(y, \hat{y}) = |y-\hat{y}|\)
    • Symmetric between negative and positive errors, 0 iff \(y=\hat{y}\)
    • Often used in earlier approaches to curve-fitting, before least squares took over
    • Does not minimize nicely by calculus, because \(\frac{d|\epsilon|}{d\epsilon} = +1\) if \(\epsilon > 0\), \(=-1\) if \(\epsilon < 0\), and is ill-defined at \(\epsilon=0\)
      • Can you find \(\argmin_{m}{\int{|y-m| p(y) dy}}\) (without looking it up)?
    • Because the slope of the loss function is constant, we get the same reduction in loss from equal-size improvements to a big error as to a small error
    • Absolute error is more resistant to potential outliers than squared error
      • But minimizing absolute error tries really hard to fit data points which are already pretty well predicted, unlike minimizing squared error
    • Whether resistance to outliers is good or bad depends on whether those points really are outliers which we should ignore, or whether they’re important pieces of information we should incorporate…
  • Huber’s “robust” loss: \[ \ell(y, \hat{y}) = \left\{ \begin{array}{cc} (y-\hat{y})^2 & |y-\hat{y}| \leq 1\\ 2|y-\hat{y}|-1 & |y-\hat{y}| \geq 1 \end{array} \right. \]
    • Acts like squared error for small errors
    • Acts like absolute error for big errors
    • Tries to be outlier-resistant while not also “sweating the small stuff”
    • Doesn’t minimize nicely by calculus
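To make these differences concrete, here’s a minimal sketch of the three regression losses in Python with numpy (the code and function names are illustrations, not part of the original notes):

```python
import numpy as np

def squared_error(y, y_hat):
    """Squared error: small errors barely register, big ones dominate."""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute error: the same slope everywhere, so more outlier-resistant."""
    return np.abs(y - y_hat)

def huber_loss(y, y_hat):
    """Huber's loss as defined above: squared error for |error| <= 1,
    2|error| - 1 for bigger errors (continuous at 1, with matching slopes)."""
    err = np.abs(y - y_hat)
    return np.where(err <= 1.0, err ** 2, 2.0 * err - 1.0)

# The three losses agree that zero error is best, but treat big errors
# very differently:
for err in (0.1, 1.0, 3.0, 10.0):
    print(err, squared_error(0.0, err), absolute_error(0.0, err),
          huber_loss(0.0, err))
```

At an error of 10, squared error charges 100, absolute error charges 10, and Huber charges 19: outlier-resistance in action.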

COMPREHENSION QUESTION: Why is Huber’s loss \(2|y-\hat{y}|-1\) for larger errors? That is, what explains the factor of 2 and the offset of \(-1\)?

2.3 Some loss functions for classification

  • “Mis-labeling” or “0/1” (or “0-1”) loss: \[ \ell(y, \hat{y}) = \Indicator{y \neq \hat{y}} \]
    • That is, the loss \(=0\) if \(\hat{y} = y\) and \(=1\) otherwise
    • Simple, easy to understand
    • We’ve seen that we minimize the average 0/1 loss by always predicting the most-probable class
      • Doesn’t matter if the probability is \(99.999999\)% or \(50.000001\)%
  • Margin-based loss: the distance to the classification boundary for mis-classified points, and minus that distance for correctly classified points
    • Not so simple, or related to probability, but does distinguish between cases which were hard to call (near the boundary) from those which should have been easy (far from the boundary)
    • Flips the signs from the way we defined margin in lecture 7, but that’s because now we want a loss, with big positive numbers if we’re wrong
    • Sometimes we truncate to 0 for correctly-classified points, as in the hinge loss
  • Log loss, a.k.a. log probability loss: we predict a probability for each class, call this distribution \(\hat{p}\), and our loss is \(-\log{\hat{p}(y)}\)
    • Notice: this depends on both the prediction (\(\hat{p})\) and the actual outcome (\(y\))
    • Notice: minus sign means it hurts when something we predict is improbable happens (\(\hat{p}(y)\) is low)
    • We only get 0 when we’re certain and we’re right (\(\hat{p}(y) = 1\))
    • Errors made when we’re confident are more painful than errors made when the probabilities are nearly balanced
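Here’s a minimal sketch of the 0/1 and log losses (Python; illustrative code, not from these notes), showing how log loss distinguishes a close call from a confident blunder while 0/1 loss can’t:

```python
import math

def zero_one_loss(y, y_hat):
    """0/1 loss: 1 if the predicted label is wrong, 0 if it's right."""
    return 0.0 if y == y_hat else 1.0

def log_loss(y, p_hat):
    """Log loss: minus the log probability the model assigned to the
    class y that actually occurred; p_hat maps classes to probabilities."""
    return -math.log(p_hat[y])

# 0/1 loss can't tell a near-miss from a blunder, but log loss can:
print(zero_one_loss("ham", "spam"))                        # 1.0 either way
print(log_loss("ham", {"spam": 0.50001, "ham": 0.49999}))  # ~0.69: close call
print(log_loss("ham", {"spam": 0.999,   "ham": 0.001}))    # ~6.9: confident and wrong
```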

2.4 Some loss functions for distributions

  • Suppose all our model predicts is a pdf, say \(\hat{p}\)
    • Think of mixture models in clustering, or factor models in dimension reduction, or kernel density estimates
      • (If you haven’t seen factor models in 402, we’ll look at them a bit soon in the context of recommender systems)
  • The log loss, a.k.a. log probability loss, a.k.a. negative normalized log likelihood, is still a (good) loss function: \(-\log{\hat{p}(y)}\)
  • Other potential loss functions need to be things we can evaluate using just the prediction \(\hat{p}\) and the observed value \(y\), without the true distribution
    • Integrated squared error (Hand, Mannila, and Smyth 2001, 219): \(\int{(\hat{p}(y) - p(y))^2 dy} = \int{\hat{p}^2(y) dy} + \int{p^2(y) dy} - 2\int{\hat{p}(y) p(y) dy}\)
      • The first term, \(\int{\hat{p}^2(y) dy}\), is just an integral we can do numerically
      • The second term, \(\int{p^2(y) dy}\), doesn’t care about our prediction, so it doesn’t matter for minimization
      • The third term, \(- 2\int{\hat{p}(y) p(y) dy}\), is going to be \(\approx -\frac{2}{n}\sum_{i=1}^{n}{\hat{p}(y_i)}\) in a large sample
      • We can use as our loss function \(\left(\int{\hat{p}^2(y) dy}\right) - 2\hat{p}(y_i)\)
    • Could you repeat this for the integrated absolute error, \(\int{|\hat{p}(y) - p(y)| dy}\)?
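As a concrete sketch of the integrated-squared-error construction above (Python with numpy and scipy; the kernel density estimator and all names here are stand-ins for whatever density estimate you’re actually using):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(7, 1, size=200)  # sample used to fit the density estimate
new = rng.normal(7, 1, size=200)    # new draws from the same (unknown) p

p_hat = gaussian_kde(train)         # our predicted pdf

# First term: the integral of p-hat squared, done numerically on a fine grid
grid = np.linspace(train.min() - 5, train.max() + 5, 4000)
dx = grid[1] - grid[0]
int_phat_sq = np.sum(p_hat(grid) ** 2) * dx

# Third term: -2 E[p-hat(Y)], approximated by averaging p-hat at new points
loss = int_phat_sq - 2 * np.mean(p_hat(new))
print(loss)  # the ISE, up to the constant integral of p squared, which we drop
```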

3 From loss to risk

  • We have our loss function \(\ell\), which works with the kind of prediction we’re making
  • Our prediction comes from some model \(m\), so our loss on a new data point \((X,Y)\) will be \(\ell(Y, m(X))\)
    • Or, for things like mixture models without predictor features \(X\), \(\ell(Y, m)\); you can make the little notational adjustments throughout as needed
  • Our loss on a new data point is a random quantity

COMPREHENSION QUESTION: Which elements of \(\ell(Y, m(X))\) are random?

  • Because the loss is random, we want to boil the whole distribution of losses into some sense of how big the typical loss is
  • The usual way to do this is to define the risk as the expected loss on new data: \[ r(m) \equiv \Expect{\ell(Y, m(X))} \]

COMPREHENSION QUESTION: For a fixed model \(m\), is the risk \(r(m)\) random? Why or why not?

  • Examples:
    • For regression with squared error loss, the risk is the expected squared error, and is minimized by using the conditional expectation function
      • Can you work out what minimizes the risk of a regression with absolute error loss?
    • For classification with 0/1 loss, the risk is the probability of mis-classifying, and is minimized by predicting the most-probable class
    • For classification with log loss, the risk is the cross-entropy \(-\sum_{x}{p(x)\sum_{y}{p(y|x) \log{\hat{p}(y|x)}}}\), and is minimized by predicting the true conditional distribution \(p(y|x)\)
    • For density estimation with log loss, the risk is (again) the cross-entropy \(-\int{p(y) \log{\hat{p}(y)} dy}\), and is minimized (again) by predicting the true pdf \(p(y)\)
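Because the risk is just an expectation, we can approximate it by simulation whenever we can generate new data. Here’s a minimal Monte Carlo sketch (Python/numpy, with an invented toy data-generating process, not from these notes):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(n):
    """Draw fresh (X, Y) pairs; here Y = 2X + 1 plus standard Gaussian noise."""
    x = rng.normal(size=n)
    y = 2.0 * x + 1.0 + rng.normal(size=n)
    return x, y

def monte_carlo_risk(model, loss, n_sim=200_000):
    """Approximate r(m) = E[loss(Y, m(X))] by averaging the loss over
    a large simulated sample of new data points."""
    x, y = simulate(n_sim)
    return np.mean(loss(y, model(x)))

squared = lambda y, yhat: (y - yhat) ** 2
true_cef = lambda x: 2.0 * x + 1.0          # the conditional expectation
off_model = lambda x: 2.5 * x               # some other prediction rule

print(monte_carlo_risk(true_cef, squared))   # ~1.0: the noise variance
print(monte_carlo_risk(off_model, squared))  # ~2.25: worse model, higher risk
```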

3.1 Why the expected loss?

  • Using the expected loss is natural but not mandatory!
  • Could have used the median loss instead
  • Or the mode of the loss distribution (if that’s well-defined)
  • Or the probability that the loss exceeds some critical value
  • Or some high quantile of the loss distribution (95%, 99%, 99.5%, etc.)
    • This is basically what people in finance mean by the “value at risk” (VaR) of a portfolio or trading strategy (yes, distinguishing “risk” from “value at risk” is confusing)
      • Back in 2007/2008/2009, a lot of financial firms which thought they had plenty of reserves to cover their 99% or even 99.99% VaR turned out to not have anything of the kind, leading to a lot of pain for everyone
      • Part of the problem was that they were using 10 years of data to estimate the size of once-in-a-hundred-years events
      • We’ll find out soon whether things are any better now
    • High quantiles of the loss function are also basically what people mean when they talk about “thirty-year floods” or “hundred-year storms”

3.2 In-sample or empirical risk

  • We have a (hopefully) big data set \((x_1, y_1), \ldots (x_n, y_n)\)
  • The in-sample risk or empirical risk is just the sample average of the loss: \[ \hat{r}(m) = \frac{1}{n}\sum_{i=1}^{n}{\ell(y_i, m(x_i))} \]
  • This is a random quantity, because the data are random
  • But, for a fixed model \(m\), as \(n\rightarrow\infty\) \[ \hat{r}(m) \rightarrow r(m) \] because sample averages converge on expectations, by the law of large numbers: \[ \frac{1}{n}\sum_{i=1}^{n}{\ell(y_i, m(x_i))} \rightarrow \Expect{\ell(Y, m(X))} \]
  • In fact, the empirical risk is an unbiased estimate of the true risk, \(\Expect{\hat{r}(m)} = r(m)\), for a fixed model
    • For future reference, define the deviation in the risk as \(\gamma(m) = \hat{r}(m) - r(m)\)
    • This is a random variable, with \(\Expect{\gamma(m)} = 0\), so it’s sometimes positive and sometimes negative
    • We can’t measure \(\gamma(m)\) directly, but it’ll be important to us in a bit

3.3 Empirical risk minimization

  • We have a big space or set of possible models we can work with, say \(\ModelClass\)
    • E.g., this could be the set of all linear functions on our \(p\) predictive features (as in 401)
    • E.g., all logistic-linear functions on our \(p\) predictive features (as in logistic regression)
    • E.g., all mixtures of up to 7 Gaussians
    • E.g., all mixtures of two multinomial distributions on four binary features, with conditional independence between features (as in HW 9)
    • E.g., all classification trees with a fixed set of splits on our \(p\) predictive features
    • E.g., all classification trees with an arbitrary set of splits on our \(p\) predictive features
  • Any particular model is a point in this set, \(m\in\ModelClass\)
  • For any loss function \(\ell\) and model class \(\ModelClass\), there is an optimal model, which minimizes the risk: \[ \OptimalModel = \argmin_{m \in \ModelClass}{r(m)} \]
    • What counts as the optimal model depends on both the loss function \(\ell\) and the space of alternatives \(\ModelClass\),
    • Unfortunately, we don’t know the true risks, so we can’t minimize them
  • A very simple strategy for estimating the model is to minimize the empirical risk (empirical risk minimization, ERM): \[ \hat{m} = \argmin_{m \in \ModelClass}{\hat{r}(m)} \]
  • This is such an obvious thing to try that it took centuries to realize there were alternatives, and so to give it a name
    • For squared error loss, this is the method of least squares
    • For log probability loss, this is the method of maximum likelihood
  • How well will \(\hat{m}\) approximate \(\OptimalModel\)?
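As a concrete instance, ERM with squared error loss over the class of cubic polynomials is exactly least squares; a minimal Python/numpy sketch (invented data, not from these notes):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=50)

# Model class M: polynomials in x of degree at most 3.  With squared
# error loss, ERM over M is the method of least squares, which
# np.polyfit carries out.
m_hat = np.poly1d(np.polyfit(x, y, deg=3))

empirical_risk = np.mean((y - m_hat(x)) ** 2)
print(empirical_risk)   # the minimized in-sample risk, r-hat(m-hat)
```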

3.4 Why empirical risk minimization is optimistic

  • The intuition: we’ve picked \(\hat{m}\) to fit this particular data set, so it’s partly going to pick up general patterns that will show up in new data, and partly exploiting the quirks and accidents of the training data (“memorize the noise”)
  • The math: re-write empirical risk minimization in terms of the true risk \(r(m)\) and the deviation in the risk \(\gamma(m)\): \[ \hat{m} = \argmin_{m \in \ModelClass}{r(m) + \gamma(m)} \]
    • Picking the model with the smallest empirical risk is partly about picking out the model with the smallest true risk (small value of \(r(m)\))
    • But it’s also partly about picking out models which were lucky on the sample, i.e., had very negative deviations \(\gamma(m)\)
      • Remember \(\Expect{\gamma(m)} = 0\) for every particular \(m\)
    • This implies that \(\Expect{\gamma(\hat{m})} < 0\)
    • So now we take expectations of both sides: \[\begin{eqnarray} \Expect{\hat{r}(\hat{m})} & = & \Expect{r(\hat{m})} + \Expect{\gamma(\hat{m})}\\ & < & \Expect{r(\hat{m})} \end{eqnarray}\]
  • The model we pick by minimizing the in-sample loss will do worse on new data (on average)

3.5 A toy example

  • Say \(Y \sim \mathcal{N}(7, 1)\), with \(n=20\) data points, shown by the tick marks on the horizontal axis in this plot:

  • Let’s use squared error, with \(\ModelClass =\) all constants, so we just want to predict the expected value of \(Y\)
  • True risk when predicting \(m\) is \(1+(m-7)^2\)
    • Why is this the true risk?
    • To think through: what would the true risk be if the variance of the Gaussian were \(\sigma^2 \neq 1\)? Does it matter that \(Y\) is Gaussian or would we get the same answer for any distribution with the same expected value and variance?
  • The true risk is clearly minimized at the true expected value:

  • The empirical risk is minimized at the sample mean:

  • The difference between the two curves is what we called the risk deviation:

    • \(\gamma(m)\) is really a random function (a stochastic process), and this is one draw from its distribution (one realization of the process)
  • Because I’m simulating everything, I can repeat this many times, to evaluate \(\gamma(m)\) for any fixed \(m\)
    • At every \(m\), I’m going to get a Gaussian distribution, centered at 0 (why?)

  • I can also do the same thing for the \(\hat{m}\) chosen by empirical risk minimization, and then things look different
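Here’s a minimal simulation sketch of this toy example (Python/numpy, not part of the original notes), checking both that \(\gamma(m)\) averages to zero at any fixed \(m\), and that \(\gamma(\hat{m})\) at the ERM choice is systematically negative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 10_000

gamma_at_7 = np.empty(reps)     # deviation at the fixed model m = 7 (the truth)
gamma_at_mhat = np.empty(reps)  # deviation at the ERM choice, the sample mean
for i in range(reps):
    y = rng.normal(7, 1, size=n)
    # gamma(m) = empirical risk - true risk, with true risk 1 + (m - 7)^2
    gamma_at_7[i] = np.mean((y - 7) ** 2) - 1.0
    m_hat = y.mean()
    gamma_at_mhat[i] = np.mean((y - m_hat) ** 2) - (1.0 + (m_hat - 7) ** 2)

print(gamma_at_7.mean())     # ~0: unbiased at any fixed m
print(gamma_at_mhat.mean())  # ~ -2/n = -0.1: ERM picks models that got lucky
```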

3.6 Why do we still try to minimize the empirical risk?

  • Because the deviations shrink!
  • By the law of large numbers, for any fixed model \(m\), as \(n\rightarrow\infty\), \[ \gamma(m) \rightarrow 0 \]
  • There’s a worst-case or maximal deviation: \[ \Gamma(\ModelClass) \equiv \max_{m \in \ModelClass}{|\gamma(m)|} \]
  • Suppose that \(\Gamma(\ModelClass) \rightarrow 0\); this is called uniform convergence (of the empirical risk to the true risk)
    • Usually true if there are a finite number of parameters; more details later
    • The important thing to know now: the larger the model space, the slower \(\Gamma\) goes to 0
  • If \(\Gamma(\ModelClass) \rightarrow 0\), then \(r(\hat{m}) \rightarrow r(\OptimalModel)\)
    • This is called risk consistency
    • Proof: see backup
    • Uniform convergence isn’t necessary for risk consistency, but it is sufficient

3.7 Optimal functions, and approximation and estimation errors

  • If we imagine working with all possible functions of the features, say \(\mathcal{F}\), there is some function which minimizes the risk: \[ f_0 \equiv \argmin_{f \in \mathcal{F}}{r(f)} \]
    • We usually can’t hope to find this risk-optimal function, because \(\mathcal{F}\) is just too large to search effectively, and anyway we don’t know the true risks
  • Instead, we use models, and classes of models like \(\ModelClass\)
  • The total risk of our estimated model \(\hat{m}\) can be written as \[ r(\hat{m}) = r(f_0) + (r(\OptimalModel) - r(f_0)) + (r(\hat{m}) - r(\OptimalModel)) \]
  • The first term is the risk which would still be there even for the best possible model
  • The second term is the approximation error, the extra risk due to using the model class \(\mathcal{M}\)
    • Approximation error is 0 if the truly optimal function \(f_0 \in \mathcal{M}\)
  • The third term is the estimation error, the extra risk due to not finding the best model within our class of models
  • We saw a version of all this before, as the bias-variance decomposition, but that’s specifically for squared error loss, and this is more general

3.8 The approximation-estimation tradeoff

  • Suppose we have two model classes, \(\ModelClass_1\) and \(\ModelClass_2\), with \(\ModelClass_1 \subset \ModelClass_2\)
  • There will (in general) be two optimal models, \(\OptimalModel_1\) and \(\OptimalModel_2\), and two estimates, \(\hat{m}_1\) and \(\hat{m}_2\)
  • Because \(\ModelClass_1 \subset \ModelClass_2\):
    1. \(\hat{r}(\hat{m}_1) \geq \hat{r}(\hat{m}_2)\): the in-sample risk can only be lowered by searching over a larger space of models
    2. \(r(\OptimalModel_1) \geq r(\OptimalModel_2)\): the best possible true risk can only be lowered by searching over a larger space of models
    3. \(\Gamma(\ModelClass_1) \leq \Gamma(\ModelClass_2)\): the maximal deviation can only be increased by searching over a larger space of models
    • Point 1 means that comparing in-sample risks always prefers the bigger, more flexible model class
    • Point 2 means that bigger, more flexible model classes can get closer to the optimal function, and will have smaller approximation error
    • Point 3 means that estimates in bigger, more flexible model classes are noisier, so they have bigger estimation error
  • So we have a trade-off between approximation error and estimation error
    • Larger, more flexible models with smaller approximation error have bigger estimation error, and the in-sample loss is more optimistic as an estimate of the true risk
    • Again, the bias-variance trade-off is a particular case of this

3.9 Moral: never evaluate prediction errors on the training data

  • Don’t use the in-sample error on the training data to estimate risk
    • This runs into conflict with the scientific desire to “fit the data”, and with the organizational desire to claim impressive, optimistic results
    • Nonetheless, you should hold firm on this point
      • According to the alumni I talk to, this is the single most important lesson from this class!
  • If you need to estimate the true risk, estimate the true risk

3.9.1 Why you should not use \(R^2\)

  • \(R^2\) for a regression is of course directly related to the in-sample error for squared loss
    • Specifically, \(R^2 = 1-\frac{\mathrm{MSE}}{\hat{\sigma}^2_Y}\), where \(\hat{\sigma}^2_Y\) is the in-sample variance of \(Y\)
    • Picking the model which maximizes \(R^2\) is picking the model with the best in-sample MSE, and is a recipe for over-fitting (see Shalizi (n.d.), ch. 3, especially Figure 3.6 and accompanying discussion in the text)
    • “Adjusted” \(R^2\) doesn’t fix the problem (it still overfits badly, see same citation)
    • For more on why \(R^2\) and adjusted \(R^2\) are things you’re just generally better off ignoring, see Shalizi (2015), ch. 10
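To see the problem in action, here’s a minimal sketch (Python/numpy, invented data) where \(Y\) is pure noise, so no regression genuinely helps, and yet \(R^2\) climbs as we pile on junk predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)         # Y is pure noise: nothing here to predict
X = rng.normal(size=(n, 50))   # 50 junk predictors, independent of Y

for p in (1, 10, 25, 50):
    Xp = np.column_stack([np.ones(n), X[:, :p]])   # intercept + p junk columns
    beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)  # least-squares fit
    mse = np.mean((y - Xp @ beta) ** 2)
    print(p, round(1 - mse / np.var(y), 3))        # R^2 keeps climbing with p
```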

4 Simple estimates of the true risk

In order of increasing pain/hassle:

  1. The empirical risk
  2. Corrections based on model complexity
  3. Data splitting or cross-validation
  4. Statistical learning theory

4.1 The empirical risk

  • Unbiased if you’ve got a completely fixed model, with no adjustments or adaptations to the training data
    • This can happen in scientific applications if you’ve got a strong theory, but it’s rare for data mining projects
  • Otherwise, negatively biased (as we’ve seen)
  • The negative bias is bigger for more complicated models (as we’ve seen)
  • Generally, not a good idea

4.2 Simple corrections / penalizations

  • We pick \(\hat{m}\) by minimizing \(\hat{r}(m) = r(m) + \gamma(m)\)
  • What if we minimized \(\hat{r}(m) - g(m)\) instead?
    • This is called penalized estimation, and \(g(m)\) is the penalty function
  • The ideal \(g(m) = \gamma(m)\), but it’s hard to see how we could get that without knowing the true risk function
  • Instead, people use a lot of approximations; the next few sub-sections go over a very prominent example

4.2.1 The covariance penalty and Stein’s unbiased risk estimate

  • Suppose we’re doing regression, with squared error loss. We have data \((X_1, Y_1), \ldots (X_n, Y_n)\), from which we estimate a regression function \(\hat{m}\). The in-sample MSE is \[ \hat{r}(\hat{m}) = \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \hat{m}(X_i))^2} \]
  • Now consider a new data set \((X_1, Y^{\prime}_1), \ldots (X_n, Y^{\prime}_n)\) where the regressor features \(X\) are the same, and, for each \(i\), \(Y_i\) and \(Y_i^{\prime}\) have the same distribution, but are conditionally independent given \(X_i\). The MSE on this data set would be \[ r(\hat{m}|X_{1:n}) = \frac{1}{n}\sum_{i=1}^{n}{(Y^{\prime}_i - \hat{m}(X_i))^2} \]
    • This is conditional on the \(X_i\)s, so it’s not quite \(r(\hat{m})\) as we’ve defined it, but if the \(X_i\)s are randomly generated and \(n\) is large, by the law of large numbers, \(r(\hat{m}|X_{1:n}) \approx r(\hat{m})\).
  • It is possible to show (Efron 2004; Ye 1998) that \[ \Expect{r(\hat{m}|X_{1:n}) - \hat{r}(\hat{m})} = \frac{2}{n}\sum_{i=1}^{n}{\Cov{Y_i, \hat{m}(X_i)}} \]
    • In fact, you will show this in HW10 (by easy steps, not all at once)
  • Turn this around: \[ \hat{r}(\hat{m}) + \frac{2}{n}\sum_{i=1}^{n}{\Cov{Y_i, \hat{m}(X_i)}} \] is an unbiased estimate of the risk
  • This is called using the “covariance penalty”

4.2.2 Linear smoothers and Stein’s unbiased risk estimate

  • The problem with the covariance penalty is: how on Earth do we figure out \(\Cov{Y_i, \hat{m}(X_i)}\)?!?
  • One situation where we can find it fairly easily is with linear smoothers
    • Remember that in a linear smoother, our regression predictions are weighted sums of the observed \(Y\) values, \[ \hat{m}(x_0) = \sum_{i=1}^{n}{w(x_i, x_0) y_i} \]
    • In particular, if we stack the observed \(Y\) values into an \(n\times 1\) matrix \(\mathbf{y}\), the fitted values are \(\mathbf{w} \mathbf{y}\), where \(w_{ij} = w(x_j, x_i)\)
    • Linear regression and nearest neighbors are both linear smoothers; so are regression trees (with a fixed choice of splits); so are kernel smoothers and splines (which most of you will have seen in other classes)
  • If we’re willing to assume that \(Y = \mu(X) + \epsilon\), where \(\Var{\epsilon} = \sigma^2\) and \(\epsilon\) is uncorrelated from one data point to the next, then \[ \Cov{Y_i, \hat{m}(X_i)} = \sigma^2 w_{ii} \]
    • You will also prove this in HW10 (again, by easy stages)
  • So here is our unbiased estimate of the risk of a linear smoother: \[ \hat{r}(\hat{m}) + 2\frac{\sigma^2}{n}\tr{\mathbf{w}} \]
    • This is called “Stein’s unbiased risk estimator” (SURE), after Stein (1981)
    • We usually still need to estimate \(\sigma^2\)
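Here’s a minimal sketch of SURE for the simplest linear smoother, ordinary least squares, whose smoother matrix is the familiar hat matrix (Python/numpy; \(\sigma^2\) is known here by construction, though in practice we’d have to estimate it):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 100, 5, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.ones(p) + rng.normal(scale=np.sqrt(sigma2), size=n)

# OLS is a linear smoother: the fitted values are w y, with w the hat matrix
w = X @ np.linalg.solve(X.T @ X, X.T)
fitted = w @ y

in_sample_mse = np.mean((y - fitted) ** 2)
sure = in_sample_mse + 2 * sigma2 * np.trace(w) / n

print(np.trace(w))          # ~p: the effective degrees of freedom of OLS
print(in_sample_mse, sure)  # SURE undoes the in-sample optimism
```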

4.2.3 Effective degrees of freedom

  • As you saw in HW5, for a linear regression with \(p\) coefficients, and so \(p\) degrees of freedom, \[ \tr{\mathbf{w}} = p \]
  • We therefore define \(\tr{\mathbf{w}}\) to be the effective degrees of freedom of a linear smoother with smoother matrix \(\mathbf{w}\)

COMPREHENSION QUESTION: Explain why \(k\) nearest neighbor regression has \(n/k\) effective degrees of freedom

  • For regressions that aren’t linear smoothers, but where the noise variance is still always \(\sigma^2\), the number of effective degrees of freedom is \[ \frac{1}{\sigma^2}\sum_{i=1}^{n}{\Cov{Y_i, \hat{m}(X_i)}} \]
    • If each \(Y_i\) has its own variance \(\sigma^2_i\), you could define the degrees of freedom as \(\sum_{i=1}^{n}{\Cov{Y_i, \hat{m}(X_i)}/\sigma^2_i}\), but by this point you might as well just talk about the covariance penalty, rather than the degrees of freedom

4.2.4 Summing up

  • For ordinary linear regression, the number of degrees of freedom is important mostly because it tells us about over-fitting
  • We can generalize the idea of degrees of freedom to linear smoothers, where we look at how much influence each \(Y_i\) has on the corresponding fitted value
  • We can generalize the idea of effective degrees of freedom still further, to the covariances between \(Y_i\)s and fitted values
  • Effective degrees of freedom / summed covariances give us unbiased estimates of regression risk
    • Didn’t say anything about the variance of that estimate, though…

4.3 Data splitting or cross-validation

  • Since we want to know how well we generalize to new data, try generalizing to new data
  • More formally (there’s a code sketch after this list):
    • Put the data in some random order
    • Use data points \(1, 2, \ldots n/2\) to estimate \(\hat{m}\)
    • Then \(\hat{m}\) is statistically independent of data points \((n/2)+1, (n/2)+2, \ldots n\)
    • So \(\frac{1}{n/2}\sum_{i=(n/2)+1}^{n}{\ell(y_i, \hat{m}(x_i))}\) will be an unbiased estimate of \(r(\hat{m})\), and will converge on \(r(\hat{m})\) as \(n\rightarrow \infty\) (law of large numbers again)
    • 50-50 split isn’t essential, just simplifies the notation
  • Cross-validation essentially averages over multiple different splits
  • If we use data splitting or cross-validation to pick a winner, the score of the winner on the test set will still be a bit optimistic as an estimate of the true risk
    • See Hand, Mannila, and Smyth (2001), sec. 7.5
    • See Tibshirani and Tibshirani (2009) for a way of correcting for this
  • Data splitting and cross-validation are conceptually simple, but computationally more involved than something like SURE
    • In \(v\)-fold cross-validation, we need to re-estimate each model \(v\) times, and we end up calculating \(n\) predictions for each model
  • We’ll come back to cross-validation in great detail next week
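In code, data splitting takes only a few lines; a minimal Python/numpy sketch with an invented data-generating process:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)

idx = rng.permutation(n)                  # put the data in random order
train, test = idx[: n // 2], idx[n // 2:] # 50-50 split

m_hat = np.poly1d(np.polyfit(x[train], y[train], deg=5))  # fit on one half

train_risk = np.mean((y[train] - m_hat(x[train])) ** 2)
test_risk = np.mean((y[test] - m_hat(x[test])) ** 2)
print(train_risk, test_risk)  # the held-out half gives the honest risk estimate
```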

4.4 Statistical learning theory

  • For a fixed model \(m\), \(\gamma(m)\) has a well-behaved distribution
    • Expectation 0 (by definition of \(\gamma\))
    • Variance \(\propto 1/n\) (for independent samples)
    • Gaussian distribution for large \(n\) (by central limit theorem)
  • Unfortunately, for our random, adaptively-chosen \(\hat{m}\), \(\gamma(\hat{m})\) has a weirder distribution
    • Expectation \(< 0\) (as we’ve seen)
    • Variance, limiting distribution…?
  • Statistical learning theory tries to give results that look like this: if we pick \(\hat{m}\) from \(\mathcal{M}\) by minimizing the empirical error on a sample of size \(n\), then with probability at least \(\delta\), \[ r(\hat{m}) \leq \hat{r}(\hat{m}) + q(n, \delta, \mathcal{M}) \]
    • This is basically a confidence interval for the true risk of \(\hat{m}\), where \(\delta\) is the confidence level
    • The function \(q(n, \delta, \mathcal{M})\) tells us about the width of the confidence interval, so
      • \(q(n,\delta,\mathcal{M})\) should \(\rightarrow 0\) as \(n\rightarrow\infty\)
      • \(q(n,\delta,\mathcal{M})\) should shrink as \(\delta \rightarrow 0\), and widen as \(\delta \rightarrow 1\)
      • \(q(n, \delta, \mathcal{M})\) should get wider as \(\mathcal{M}\) gets more complicated and more flexible
  • All of the work lies in figuring out that function \(q\)
  • Some ways of getting at the function \(q\):
    • If we replaced the real \(Y\)’s with random noise, how well could \(\mathcal{M}\) match that noise, on average? (Rademacher complexity and variants: see backup)
    • How many different data sets of size \(n\) could \(\mathcal{M}\) fit perfectly? (stochastic complexity)
    • How many different ways could models from \(\mathcal{M}\) label the points in a data set of size \(n\)? (growth function)
      • Modifications of that idea for continuous-valued models
    • What’s the largest \(n\) for which \(\ModelClass\) can fit some data set of \(n\) points perfectly, no matter how those points are labeled? (Vapnik-Chervonenkis (VC) dimension, which controls the growth function)
  • This is a fascinating but very intricate subject
    • Take 36-702, or read Mohri, Rostamizadeh, and Talwalkar (2012)
    • A lot of probability theory goes into getting estimates of \(\Gamma(\ModelClass)\)
    • Bounds on performance may be rather loose, but often they hold with very, very few assumptions
    • Sometimes more inspirational (“this is the sort of thing we should look at”) than practical

5 Key things to remember

  • We need to pick a loss function to decide what counts as good or bad predictions
  • We usually want low expected loss = risk
  • Comparing in-sample risks is a recipe for over-fitting, and so for trouble later
  • There are some simple penalty terms which try to correct for over-fitting (like effective degrees of freedom / the covariance penalty)
  • Or we could use data splitting or cross-validation

6 Backup material

6.1 Uniform convergence of empirical risks implies risk consistency

Claim 1: \(r(\hat{m}) - r(\OptimalModel) \leq 2\Gamma(\ModelClass)\)

Proof: \[\begin{eqnarray} r(\hat{m}) - r(\OptimalModel) & = & r(\hat{m}) - \hat{r}(\OptimalModel) + \hat{r}(\OptimalModel) - r(\OptimalModel)\\ & \leq & r(\hat{m}) - \hat{r}(\hat{m}) + \hat{r}(\OptimalModel) - r(\OptimalModel)\\ & \leq & |r(\hat{m}) - \hat{r}(\hat{m})| + |r(\OptimalModel) - \hat{r}(\OptimalModel)|\\ & \leq & 2\Gamma(\ModelClass) \end{eqnarray}\] \(\Box\)

  • (Why is the second line true? That is, what gives us the inequality?)
  • This bound is often loose, meaning \(r(\hat{m})\) might be closer to \(r(\OptimalModel)\) than \(2\Gamma(\ModelClass)\), but it’s certainly no worse

Claim 2: If \(\Gamma(\ModelClass) \rightarrow 0\) as \(n\rightarrow \infty\), then \(r(\hat{m}) \rightarrow r(\OptimalModel)\)

Proof: By the definition of \(\OptimalModel\) and Claim 1, \[ r(\OptimalModel) \leq r(\hat{m}) \leq r(\OptimalModel) + 2\Gamma(\ModelClass) \] \(\Box\)

6.2 Rademacher complexity bounds on risk

  • The Rademacher complexity of the model class \(\ModelClass\) is \[ \Rademacher_n(\ModelClass) \equiv \Expect{ \max_{m\in \ModelClass}{\frac{1}{n} \sum_{i=1}^{n}{Z_i m(X_i) }}} \] where the \(Z_i\) are IID random variables, \(\pm 1\) with equal probability, independent of the \(X_i\)
    • The \(Z_i\)s are called Rademacher random variables
    • The expectation is over the \(X_i\) and the \(Z_i\) together
    • This is basically “How well could \(\mathcal{M}\) seem to fit pure noise, on average?”
  • If \(\mathcal{M}\) is also binary-valued, and we use the 0/1 loss, then, with probability at least \(\delta\), \[ \max_{m \in \ModelClass}{\left( r(m) - \hat{r}(m)\right)} \leq \Rademacher_n(\ModelClass) + \sqrt{\frac{-\ln{(1-\delta)}}{2n}} \]
    • So we can use the right-hand side as a bound for how much worse \(r(\hat{m})\) could be than \(\hat{r}(\hat{m})\)
    • Notice that this bound will only go to zero if \(\Rademacher_n(\ModelClass) \rightarrow 0\) as \(n\rightarrow \infty\)
    • There are generalizations to other loss functions, regression, etc., but this gives the basic idea (Bartlett and Mendelson 2002)
  • There are many situations where we can either calculate \(\Rademacher_n(\ModelClass)\) exactly, or calculate upper bounds on it (Bartlett and Mendelson 2002; Mohri, Rostamizadeh, and Talwalkar 2012)
  • Otherwise, we can estimate \(\Rademacher_n(\ModelClass)\) by simulating the \(Z_i\)’s and trying to fit them using \(\ModelClass\), then averaging over many simulations (sketched in code below)
    • This holds the \(X_i\) fixed, but if \(n\) is big, the sample average over the \(X_i\)s should be close to the expectation over new \(X\)s anyway
    • There are formulas to account for how much wider we need to make the confidence intervals if we estimate Rademacher complexity this way (Bartlett and Mendelson 2002)
  • Remarkably, Rademacher complexity seems to predict how well humans (or at least undergraduates) do at learning tasks (Zhu, Rogers, and Gibson 2009)
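Here’s what that simulation might look like for a toy class of one-dimensional threshold classifiers (a Python/numpy sketch; the model class and all names are illustrative, not from these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 100, 2000
x = np.sort(rng.normal(size=n))   # hold the X_i fixed across simulations

# Model class: m_{t,s}(x) = s * sign(x - t), a threshold t and a sign flip s.
# Thresholds at the data points (plus one below all of them) realize every
# distinct labeling, so we can enumerate them.
thresholds = np.concatenate(([-np.inf], x))
preds = np.sign(x[None, :] - thresholds[:, None])  # one row of +/-1s per t
preds[preds == 0] = 1

best_fits = np.empty(n_sims)
for i in range(n_sims):
    z = rng.choice([-1, 1], size=n)        # Rademacher noise in place of Y
    scores = preds @ z / n                 # (1/n) sum_i Z_i m(X_i), per model
    best_fits[i] = np.max(np.abs(scores))  # max over t; |.| covers the flip s
print(best_fits.mean())  # simulated Rademacher complexity of this class
```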

References

Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” Journal of Machine Learning Research 3:463–82. http://jmlr.csail.mit.edu/papers/v3/bartlett02a.html.

Efron, Bradley. 2004. “The Estimation of Prediction Error: Covariance Penalties and Cross-Validation.” Journal of the American Statistical Association 99:619–32. https://doi.org/10.1198/016214504000000692.

Farebrother, Richard William. 1999. Fitting Linear Relationships: A History of the Calculus of Observations 1750–1900. New York: Springer-Verlag.

Hand, David, Heikki Mannila, and Padhraic Smyth. 2001. Principles of Data Mining. Cambridge, Massachusetts: MIT Press.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. Cambridge, Massachusetts: MIT Press.

Shalizi, Cosma Rohilla. 2015. “The Truth About Linear Regression.” Online Manuscript. http://www.stat.cmu.edu/~cshalizi/TALR.

———. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Stein, Charles M. 1981. “Estimation of the Mean of a Multivariate Normal Distribution.” Annals of Statistics 9:1135–51. https://doi.org/10.1214/aos/1176345632.

Tibshirani, Ryan J., and Robert Tibshirani. 2009. “A Bias Correction for the Minimum Error Rate in Cross-Validation.” Annals of Applied Statistics 3:822–29. http://arxiv.org/abs/0908.2904.

Ye, Jianming. 1998. “On Measuring and Correcting the Effects of Data Mining and Model Selection.” Journal of the American Statistical Association 93:120–31. https://doi.org/10.1080/01621459.1998.10474094.

Zhu, Xiaojin, Timothy Rogers, and Bryan Gibson. 2009. “Human Rademacher Complexity.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, John Lafferty, C. K. I. Williams, and A. Culotta, 2322–30. http://papers.nips.cc/paper/3771-human-rademacher-complexity.