Model Regularization and Model Complexity

36-465/665, Spring 2021

23 March 2021 (Lecture 14)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \]

Previously

Capacity of a class of models or functions

Capacity and model complexity

Capacity and model complexity (2)

What makes these measures of model complexity?

Why would we ever want high complexity models?

What does regularization do?

How much regularization?

Why we should regularize less and less as \(n\) grows

“The method of sieves”

Let’s be a little concrete

Let’s be a little concrete (2)

Dashed lines: approximation error terms, \(\propto 1/c\) and independent of \(n\). Dotted lines: estimation error terms, \(\propto c/\sqrt{n}\). Solid lines: the sum of the approximation and estimation error terms. Blue: smaller \(n\); green: larger \(n\). The horizontal axis is on a log scale to show detail. Observe that the minimum of each solid curve (\(=\) the optimal level of the constraint) occurs at a larger \(c\) for the larger sample size.
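The trade-off in the figure can be checked numerically. The following is a minimal sketch, not the code behind the plot: it assumes the two terms are exactly \(a/c\) and \(b c/\sqrt{n}\), with hypothetical constants \(a = b = 1\) (the caption only gives proportionality), and the function names are invented for illustration. Setting the derivative \(-a/c^2 + b/\sqrt{n}\) to zero gives the minimizer \(c^* = \sqrt{a/b}\, n^{1/4}\), which grows with \(n\), and a minimum value of \(2\sqrt{ab}\, n^{-1/4}\), which shrinks with \(n\).

```python
import math

def excess_risk_bound(c, n, a=1.0, b=1.0):
    """Approximation term a/c plus estimation term b*c/sqrt(n).

    a and b are assumed constants; the caption only says the
    terms are proportional to 1/c and c/sqrt(n).
    """
    return a / c + b * c / math.sqrt(n)

def optimal_constraint(n, a=1.0, b=1.0):
    """Closed-form minimizer of a/c + b*c/sqrt(n) over c > 0."""
    return math.sqrt(a / b) * n ** 0.25

def grid_minimizer(n, a=1.0, b=1.0, lo=-2.0, hi=4.0, steps=6001):
    """Brute-force check on a log-spaced grid of c from 10**lo to 10**hi."""
    grid = [10 ** (lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]
    return min(grid, key=lambda c: excess_risk_bound(c, n, a, b))

# Compare a smaller and a larger sample size, as in the figure.
for n in (100, 10_000):
    c_star = optimal_constraint(n)
    print(n, c_star, grid_minimizer(n), excess_risk_bound(c_star, n))
```

Under these assumptions, going from \(n = 100\) to \(n = 10{,}000\) moves the optimal constraint level from about 3.16 to 10, while the bound at the optimum falls from about 0.63 to 0.2: the constraint loosens as the sample grows, and the total error still goes down.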

Summing up

Backup: Further reading on sieves

Backup: “Stochastic complexity”

References

Geer, Sara A. van de. 2000. Empirical Processes in M-Estimation. Cambridge, England: Cambridge University Press.

Geman, Stuart, and Chii-Ruey Hwang. 1982. “Nonparametric Maximum Likelihood Estimation by the Method of Sieves.” Annals of Statistics 10:401–14. https://doi.org/10.1214/aos/1176345782.

Grenander, Ulf. 1981. Abstract Inference. New York: Wiley.

———. 1996. Elements of Pattern Theory. Baltimore, Maryland: Johns Hopkins University Press.

Grenander, Ulf, Y. Chow, and D. M. Keenan. 1991. Hands: A Pattern Theoretic Study of Biological Shapes. New York: Springer-Verlag.

Grenander, Ulf, and Kevin M. Manbeck. 1993. “A Stochastic Shape and Color Model for Defect Detection in Potatoes.” Journal of Computational and Graphical Statistics 2:131–51. https://doi.org/10.2307/1390696.

Grenander, Ulf, and Murray Rosenblatt. 1957. Statistical Analysis of Stationary Time Series. New York: Wiley.

Grünwald, Peter D. 2007. The Minimum Description Length Principle. Cambridge, Massachusetts: MIT Press.