Model Selection I, Mostly Cross-Validation

36-465/665, Spring 2021

25 March 2021 (Lecture 15)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \]

Previously

Model selection

A reasonable (?) goal for model selection

Data splitting, a.k.a. sample splitting, a.k.a. hold-out

Data splitting gives an unbiased estimate of the risk

A crude but still informative result on splitting

Pick \(\hat{k}\) by data splitting from among \(q\) candidate models, and let \(n_s\) be the number of points in the held-out set. Suppose the loss function is bounded, \(0 \leq \Loss \leq m\). Then, for any probability \(\alpha \in (0,1)\), \[ \Prob{\Risk(\hat{s}_{\hat{k}}) \leq \Risk(\hat{s}_{k^*}) + m\sqrt{\frac{2\log{(2q/\alpha)}}{n_s}}} \geq 1-\alpha \]
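The selection procedure the bound describes can be sketched in a few lines. This is a minimal, illustrative example (the data, the polynomial model class, and the loss bound \(m\) are all assumptions made here, not part of the result): fit each of \(q\) candidate models on one half of the data, estimate each model's risk on the held-out half, pick the minimizer, and compute the slack term from the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): choose among q polynomial degrees by splitting
# n points into a training half and a held-out (selection) half.
n, q = 200, 5
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)

train = np.arange(n) < n // 2
sel = ~train
n_s = int(sel.sum())  # size of the held-out set

risks = []
for k in range(1, q + 1):
    coef = np.polyfit(x[train], y[train], deg=k)   # fit on training half
    resid = y[sel] - np.polyval(coef, x[sel])
    risks.append(np.mean(resid ** 2))              # empirical risk on held-out half

k_hat = int(np.argmin(risks)) + 1  # model picked by data splitting

# Slack term from the bound, with a hypothetical loss bound m = 4:
m, alpha = 4.0, 0.05
slack = m * np.sqrt(2 * np.log(2 * q / alpha) / n_s)
```

With high probability the risk of the selected model is within `slack` of the best candidate's risk; note the slack shrinks like \(1/\sqrt{n_s}\) but grows only logarithmically in \(q\).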

Proving the result on splitting

Proving the result on splitting (2)

Proving the result on splitting (3)

Why not stop here?

Cross-validation (CV)

Simple or leave-one-out CV

\(k\)-fold CV (or \(v\)-fold CV)
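As a concrete sketch of the \(k\)-fold procedure (all specifics here, the synthetic data and the degree-3 least-squares fit, are illustrative assumptions): partition the data into \(k\) disjoint folds, hold out each fold in turn, fit on the rest, and average the held-out risks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative k-fold CV estimate of risk for one fixed model
# (a degree-3 polynomial fit by least squares) on synthetic data.
n, k = 120, 5
x = rng.uniform(-1, 1, n)
y = x ** 3 - x + rng.normal(0, 0.2, n)

# Randomly partition indices into k disjoint folds of ~n/k points each.
folds = np.array_split(rng.permutation(n), k)

fold_risks = []
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)   # everything but this fold
    coef = np.polyfit(x[train], y[train], deg=3)
    resid = y[held_out] - np.polyval(coef, x[held_out])
    fold_risks.append(np.mean(resid ** 2))         # risk on the held-out fold

cv_risk = float(np.mean(fold_risks))  # the k-fold CV risk estimate
```

Repeating this for each candidate model and picking the minimizer of `cv_risk` gives \(k\)-fold CV model selection; LOOCV is the special case \(k = n\).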

Cross-validation: why, roughly?

Bias-variance again

Why proving things about CV is hard

Morals/guidelines from the actual results

Predict well, or find the truth?

LOOCV: Do we really have to?

A short-cut for linear smoothers
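The short-cut here is the standard identity for linear smoothers under squared-error loss: if the fitted values are a linear function of the observations, \(\hat{\mathbf{y}} = \mathbf{w}\mathbf{y}\) for some influence matrix \(\mathbf{w}\) not depending on \(\mathbf{y}\), then the LOOCV risk can be computed from a single fit, \[ \mathrm{LOOCV} = \frac{1}{n}\sum_{i=1}^{n}{\left( \frac{y_i - \hat{y}_i}{1 - w_{ii}} \right)^2}, \] so there is no need to actually re-fit the model \(n\) times.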

What if we don’t have a linear smoother?

Akaike’s Information Criterion (AIC)

AIC (2)

Summing up

Backup: more on the computational complexity of LOOCV vs KFCV

Backup: History of CV

Backup: Didn’t I promise to say how much we should regularize?!?

Backup: History of AIC

Backup: Further reading

References

Akaike, Hirotugu. 1970. “Statistical Predictor Identification.” Annals of the Institute of Statistical Mathematics 22:203–17. https://doi.org/10.1007/BF02506337.

———. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceedings of the Second International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki, 267–81. Budapest: Akadémiai Kiadó.

Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4:40–79. https://doi.org/10.1214/09-SS054.

Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge, England: Cambridge University Press.

Cornec, Matthieu. 2017. “Concentration Inequalities of the Cross-Validation Estimator for Empirical Risk Minimizer.” Statistics 51:43–60. https://doi.org/10.1080/02331888.2016.1261479.

Geisser, Seymour. 1975. “The Predictive Sample Reuse Method with Applications.” Journal of the American Statistical Association 70:320–28. https://doi.org/10.1080/01621459.1975.10479865.

Geisser, Seymour, and William F. Eddy. 1979. “A Predictive Approach to Model Selection.” Journal of the American Statistical Association 74:153–60. https://doi.org/10.1080/01621459.1979.10481632.

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Homrighausen, Darren, and Daniel J. McDonald. 2013. “The Lasso, Persistence, and Cross-Validation.” In Proceedings of the \(30^{th}\) International Conference on Machine Learning, edited by Sanjoy Dasgupta and David McAllester, 28:1031–9. http://jmlr.org/proceedings/papers/v28/homrighausen13.html.

———. 2014. “Leave-One-Out Cross-Validation Is Risk Consistent for Lasso.” Machine Learning 97:65–78. https://doi.org/10.1007/s10994-014-5438-z.

———. 2017. “Risk Consistency of Cross-Validation with Lasso-Type Procedures.” Statistica Sinica 27:1017–36. https://doi.org/10.5705/ss.202015.0355.

Kearns, Michael J., and Dana Ron. 1999. “Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.” Neural Computation 11:1427–53. https://doi.org/10.1162/089976699300016304.

Laan, Mark J. van der, and Sandrine Dudoit. 2003. “Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” 130. U.C. Berkeley Division of Biostatistics Working Paper Series. http://www.bepress.com/ucbbiostat/paper130/.

Lecué, Guillaume, and Charles Mitchell. 2012. “Oracle Inequalities for Cross-Validation Type Procedures.” Electronic Journal of Statistics 6:1803–37. https://doi.org/10.1214/12-EJS730.

Mitchell, Charles, and Sara van de Geer. 2009. “General Oracle Inequalities for Model Selection.” Electronic Journal of Statistics 3:176–204. https://doi.org/10.1214/08-EJS254.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Annals of Statistics 6:461–64. http://projecteuclid.org/euclid.aos/1176344136.

Stone, M. 1974. “Cross-Validatory Choice and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society B 36:111–47. http://www.jstor.org/stable/2984809.

Tibshirani, Ryan J., and Robert Tibshirani. 2009. “A Bias Correction for the Minimum Error Rate in Cross-Validation.” Annals of Applied Statistics 3:822–29. http://arxiv.org/abs/0908.2904.

Vaart, Aad W. van der, Sandrine Dudoit, and Mark J. van der Laan. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” Statistics and Decisions 24:351–71. https://doi.org/10.1524/stnd.2006.24.3.351.

Wahba, Grace. 1990. Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.