Model Selection I, Mostly Cross-Validation

36-465/665, Spring 2021

25 March 2021 (Lecture 15)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \]

Previously

Model selection

A reasonable (?) goal for model selection

Data splitting, a.k.a. sample splitting, a.k.a. hold-out

Data splitting gives an unbiased estimate of the risk

A crude but still informative result on splitting

Pick \(\hat{k}\) by data splitting from among \(q\) candidate models, and let \(n_s\) be the number of points in the held-out set. Suppose the loss function is bounded, \(0 \leq \Loss \leq m\). Then, for any probability \(\alpha \in (0,1)\), \[ \Prob{\Risk(\hat{s}_{\hat{k}}) \leq \Risk(\hat{s}_{k^*}) + m\sqrt{\frac{2\log{(2q/\alpha)}}{n_s}}} \geq 1-\alpha \]
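The selection procedure the bound describes can be sketched in a few lines. This is a minimal, illustrative example (the data, the polynomial model class, and the loss bound \(m\) are all assumptions made here, not part of the result): fit each of \(q\) candidate models on one half of the data, estimate each model's risk on the held-out half, pick the minimizer, and compute the slack term from the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): choose among q polynomial degrees by splitting
# n points into a training half and a held-out (selection) half.
n, q = 200, 5
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)

train = np.arange(n) < n // 2
sel = ~train
n_s = int(sel.sum())  # size of the held-out set

risks = []
for k in range(1, q + 1):
    coef = np.polyfit(x[train], y[train], deg=k)   # fit on training half
    resid = y[sel] - np.polyval(coef, x[sel])
    risks.append(np.mean(resid ** 2))              # empirical risk on held-out half

k_hat = int(np.argmin(risks)) + 1  # model picked by data splitting

# Slack term from the bound, with a hypothetical loss bound m = 4:
m, alpha = 4.0, 0.05
slack = m * np.sqrt(2 * np.log(2 * q / alpha) / n_s)
```

With high probability the risk of the selected model is within `slack` of the best candidate's risk; note the slack shrinks like \(1/\sqrt{n_s}\) but grows only logarithmically in \(q\).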

Proving the result on splitting

Proving the result on splitting (2)

Proving the result on splitting (3)

Why not stop here?

Cross-validation (CV)

Simple or leave-one-out CV

\(k\)-fold CV (or \(v\)-fold CV)
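As a concrete sketch of the \(k\)-fold procedure (all specifics here, the synthetic data and the degree-3 least-squares fit, are illustrative assumptions): partition the data into \(k\) disjoint folds, hold out each fold in turn, fit on the rest, and average the held-out risks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative k-fold CV estimate of risk for one fixed model
# (a degree-3 polynomial fit by least squares) on synthetic data.
n, k = 120, 5
x = rng.uniform(-1, 1, n)
y = x ** 3 - x + rng.normal(0, 0.2, n)

# Randomly partition indices into k disjoint folds of ~n/k points each.
folds = np.array_split(rng.permutation(n), k)

fold_risks = []
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)   # everything but this fold
    coef = np.polyfit(x[train], y[train], deg=3)
    resid = y[held_out] - np.polyval(coef, x[held_out])
    fold_risks.append(np.mean(resid ** 2))         # risk on the held-out fold

cv_risk = float(np.mean(fold_risks))  # the k-fold CV risk estimate
```

Repeating this for each candidate model and picking the minimizer of `cv_risk` gives \(k\)-fold CV model selection; LOOCV is the special case \(k = n\).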

Cross-validation: why, roughly?

Bias-variance again

Why proving things about CV is hard

Morals/guidelines from the actual results

Predict well, or find the truth?

LOOCV: Do we really have to?

A short-cut for linear smoothers
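The short-cut here is the standard identity for linear smoothers under squared-error loss: if the fitted values are a linear function of the observations, \(\hat{\mathbf{y}} = \mathbf{w}\mathbf{y}\) for some influence matrix \(\mathbf{w}\) not depending on \(\mathbf{y}\), then the LOOCV risk can be computed from a single fit, \[ \mathrm{LOOCV} = \frac{1}{n}\sum_{i=1}^{n}{\left( \frac{y_i - \hat{y}_i}{1 - w_{ii}} \right)^2}, \] so there is no need to actually re-fit the model \(n\) times.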

What if we don’t have a linear smoother?

Akaike’s Information Criterion (AIC)

AIC (2)

Summing up

Backup: more on the computational complexity of LOOCV vs KFCV

Backup: History of CV

Backup: Didn’t I promise to say how much we should regularize?!?

Backup: History of AIC

Backup: Further reading

References

Akaike, Hirotugu. 1970. “Statistical Predictor Identification.” Annals of the Institute of Statistical Mathematics 22:203–17. https://doi.org/10.1007/BF02506337.

———. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceedings of the Second International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki, 267–81. Budapest: Akadémiai Kiadó.

Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4:40–79. https://doi.org/10.1214/09-SS054.

Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge, England: Cambridge University Press.

Cornec, Matthieu. 2017. “Concentration Inequalities of the Cross-Validation Estimator for Empirical Risk Minimizer.” Statistics 51:43–60. https://doi.org/10.1080/02331888.2016.1261479.

Geisser, Seymour. 1975. “The Predictive Sample Reuse Method with Applications.” Journal of the American Statistical Association 70:320–28. https://doi.org/10.1080/01621459.1975.10479865.

Geisser, Seymour, and William F. Eddy. 1979. “A Predictive Approach to Model Selection.” Journal of the American Statistical Association 74:153–60. https://doi.org/10.1080/01621459.1979.10481632.

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Homrighausen, Darren, and Daniel J. McDonald. 2013. “The Lasso, Persistence, and Cross-Validation.” In Proceedings of the \(30^{th}\) International Conference on Machine Learning, edited by Sanjoy Dasgupta and David McAllester, 28:1031–9. http://jmlr.org/proceedings/papers/v28/homrighausen13.html.

———. 2014. “Leave-One-Out Cross-Validation Is Risk Consistent for Lasso.” Machine Learning 97:65–78. https://doi.org/10.1007/s10994-014-5438-z.

———. 2017. “Risk Consistency of Cross-Validation with Lasso-Type Procedures.” Statistica Sinica 27:1017–36. https://doi.org/10.5705/ss.202015.0355.

Kearns, Michael J., and Dana Ron. 1999. “Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.” Neural Computation 11:1427–53. https://doi.org/10.1162/089976699300016304.

Laan, Mark J. van der, and Sandrine Dudoit. 2003. “Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” 130. U.C. Berkeley Division of Biostatistics Working Paper Series. http://www.bepress.com/ucbbiostat/paper130/.

Lecué, Guillaume, and Charles Mitchell. 2012. “Oracle Inequalities for Cross-Validation Type Procedures.” Electronic Journal of Statistics 6:1803–37. https://doi.org/10.1214/12-EJS730.

Mitchell, Charles, and Sara van de Geer. 2009. “General Oracle Inequalities for Model Selection.” Electronic Journal of Statistics 3:176–204. https://doi.org/10.1214/08-EJS254.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Annals of Statistics 6:461–64. http://projecteuclid.org/euclid.aos/1176344136.

Stone, M. 1974. “Cross-Validatory Choice and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society B 36:111–47. http://www.jstor.org/stable/2984809.

Tibshirani, Ryan J., and Robert Tibshirani. 2009. “A Bias Correction for the Minimum Error Rate in Cross-Validation.” Annals of Applied Statistics 3:822–29. http://arxiv.org/abs/0908.2904.

Vaart, Aad W. van der, Sandrine Dudoit, and Mark J. van der Laan. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” Statistics and Decisions 24:351–71. https://doi.org/10.1524/stnd.2006.24.3.351.

Wahba, Grace. 1990. Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.