Stability and Generalization

36-465/665, Spring 2021

18 March 2021 (Lecture 13)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \]

Previously

Today

The learning algorithm

Stability

Hypothesis stability

The algorithm \(A\) is hypothesis stable when, for any two data sets \(Z_{1:n}\) and \(Z^{\prime}_{1:n}\) that differ in only one data point, the outputs \(A(Z_{1:n})\) and \(A(Z^{\prime}_{1:n})\) are close to each other.
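As a small worked illustration (an assumed toy case, not necessarily the example used in lecture), take the algorithm that returns the sample mean of data confined to an interval \([a, b]\), and measure closeness by absolute difference. If \(Z_{1:n}\) and \(Z^{\prime}_{1:n}\) differ only in position \(i\), then

\[ \left| A(Z_{1:n}) - A(Z^{\prime}_{1:n}) \right| = \left| \frac{1}{n}\sum_{j=1}^{n}{Z_j} - \frac{1}{n}\sum_{j=1}^{n}{Z^{\prime}_j} \right| = \frac{|Z_i - Z^{\prime}_i|}{n} \leq \frac{b-a}{n} \]

so the two outputs get closer as \(n\) grows, no matter which point was changed or what it was changed to.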

Error stability

The algorithm \(A\) is \(\beta_n\)-error stable (or just error stable) when, for any two data sets \(Z_{1:n}\) and \(Z^{\prime}_{1:n}\) that differ in only one data point, and for every new data point \(z\), \(|\Loss(z, A(Z_{1:n})) - \Loss(z, A(Z^{\prime}_{1:n}))| \leq \beta_n\).
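The definition can be probed numerically. The following is a minimal sketch, assuming a simulated bounded data set, a ridge fit that minimizes mean squared error plus \(\lambda\) times the squared coefficient norm, and squared-error loss; the function names (`simulate`, `ridge_fit`, `observed_loss_change`) are hypothetical. Because the definition takes the worst case over all pairs of data sets and all new points, the largest change seen over a handful of random swaps can only be a lower bound on \(\beta_n\), not \(\beta_n\) itself.

```python
import numpy as np

def simulate(n, rng, p=3):
    """Draw n bounded (x, y) pairs: x uniform on [-1, 1]^p, y linear plus bounded noise."""
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    coef = np.arange(1, p + 1) / p           # hypothetical true coefficients
    y = X @ coef + rng.uniform(-0.5, 0.5, size=n)
    return X, y

def ridge_fit(X, y, lam):
    """Minimize mean squared error + lam * ||b||^2 (no intercept; an assumed convention)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def observed_loss_change(n, lam, rng, n_swaps=50, n_new=200):
    """Largest change in squared-error loss at new points after replacing one training point."""
    X, y = simulate(n, rng)                  # training data Z_{1:n}
    X_new, y_new = simulate(n_new, rng)      # new points z at which to compare losses
    X_rep, y_rep = simulate(n_swaps, rng)    # candidate replacement points
    base_loss = (y_new - X_new @ ridge_fit(X, y, lam)) ** 2
    worst = 0.0
    for k in range(n_swaps):
        i = rng.integers(n)                  # position of the data point to replace
        X2, y2 = X.copy(), y.copy()
        X2[i], y2[i] = X_rep[k], y_rep[k]    # Z'_{1:n} differs from Z_{1:n} in one point
        swapped_loss = (y_new - X_new @ ridge_fit(X2, y2, lam)) ** 2
        worst = max(worst, float(np.max(np.abs(base_loss - swapped_loss))))
    return worst

rng = np.random.default_rng(0)
changes = {n: observed_loss_change(n, lam=0.1, rng=rng) for n in (50, 200, 800)}
for n, c in changes.items():
    print(f"n = {n:4d}: largest observed loss change = {c:.5f}")
```

In this setup the observed change should shrink as \(n\) grows, which is the behavior one would expect from an algorithm whose \(\beta_n\) decreases with sample size.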

Error stability implies a generalization error bound

Error stability implies a generalization error bound (2)

\[ \Prob{\Risk(A(Z_{1:n})) \leq \EmpRisk(A(Z_{1:n})) + \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]
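To get a feel for the sizes involved, the right-hand side of the bound can be evaluated directly. This is a sketch that assumes \(m\) is the upper bound on the loss (as in the ridge example later on); the function name `stability_bound` and the numbers plugged in are hypothetical.

```python
import math

def stability_bound(emp_risk, beta_n, m, n, alpha):
    """Upper limit on the risk that holds with probability at least 1 - alpha,
    for a beta_n-error-stable algorithm with loss bounded above by m."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    margin = beta_n + (2 * n * beta_n + m) * math.sqrt(math.log(1 / alpha) / (2 * n))
    return emp_risk + margin

# Hypothetical inputs: empirical risk 0.20, beta_n = 0.001, loss bounded by 1,
# n = 1000 data points, 95% confidence.
bound = stability_bound(emp_risk=0.20, beta_n=0.001, m=1.0, n=1000, alpha=0.05)
print(f"risk <= {bound:.4f} with probability at least 0.95")
```

Note that the term \(2n\beta_n\) multiplies the square-root factor, so the bound is only useful when \(\beta_n\) is small relative to \(1/\sqrt{n}\).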

Increasingly stable

\[ \Prob{\Risk(A(Z_{1:n})) - \EmpRisk(A(Z_{1:n})) \leq \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]
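A short worked calculation, under the assumption that "increasingly stable" refers to \(\beta_n\) shrinking as \(n\) grows, shows how fast it has to shrink for the right-hand side to vanish. With a hypothetical constant \(c > 0\), if \(\beta_n = c/n\) then

\[ \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}} = \frac{c}{n} + (2c + m)\sqrt{\frac{\log{1/\alpha}}{2n}} \rightarrow 0 \]

whereas if \(\beta_n = c/\sqrt{n}\), the stability part of the margin stays constant:

\[ 2n\beta_n\sqrt{\frac{\log{1/\alpha}}{2n}} = 2c\sqrt{n}\sqrt{\frac{\log{1/\alpha}}{2n}} = c\sqrt{2\log{1/\alpha}} \]

So, for a bound of this form, the gap between risk and empirical risk is guaranteed to go to zero when \(\sqrt{n}\beta_n \rightarrow 0\).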

A stability bound for ridge regression

Suppose we’re doing ridge regression with penalty factor \(\lambda\), that all the \(X\) vectors are of bounded length, \(\Prob{\|X\| \leq \rho} = 1\), and that the squared-error loss is bounded above by \(m\). Then for any \(\alpha \in (0,1)\), \[ \Prob{\Risk(\text{ridge}) \leq \EmpRisk(\text{ridge}) + \frac{4m\rho^2}{\lambda n} + \left(\frac{8m\rho^2}{\lambda} + m\right)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \] This is the general error-stability bound with \(\beta_n = 4m\rho^2/(\lambda n)\), so that \(2n\beta_n = 8m\rho^2/\lambda\).
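The margin in this bound is easy to tabulate. Here is a minimal sketch, with a hypothetical function name `ridge_stability_margin` and made-up values of \(m\), \(\rho\), \(\lambda\) and \(n\), that plugs \(\beta_n = 4m\rho^2/(\lambda n)\) into the general formula and shows how the margin responds to the penalty and the sample size.

```python
import math

def ridge_stability_margin(m, rho, lam, n, alpha):
    """Margin added to the empirical risk in the ridge bound:
    beta_n + (2 n beta_n + m) sqrt(log(1/alpha) / (2n)), with beta_n = 4 m rho^2 / (lam n)."""
    beta_n = 4 * m * rho**2 / (lam * n)
    return beta_n + (2 * n * beta_n + m) * math.sqrt(math.log(1 / alpha) / (2 * n))

# Hypothetical settings: loss bounded by m = 1, ||X|| <= rho = 1, 95% confidence.
for lam in (0.1, 1.0, 10.0):
    for n in (100, 10_000, 1_000_000):
        margin = ridge_stability_margin(m=1.0, rho=1.0, lam=lam, n=n, alpha=0.05)
        print(f"lambda = {lam:5.1f}, n = {n:8d}: margin = {margin:.4f}")
```

Larger \(\lambda\) and larger \(n\) both shrink the margin; with a small penalty the factor \(8m\rho^2/\lambda\) is large, and the bound needs a very large \(n\) before it says anything non-trivial.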

Stability vs. Rademacher complexity

Summing up

Backup: History

References

Bousquet, Olivier, and André Elisseeff. 2002. “Stability and Generalization.” Journal of Machine Learning Research 2:499–526. http://jmlr.csail.mit.edu/papers/v2/bousquet02a.html.

Domingos, Pedro. 1999. “Process-Oriented Estimation of Generalization Error.” In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 714–19. San Francisco: Morgan Kaufmann. http://www.cs.washington.edu/homes/pedrod/papers/ijcai99.pdf.

Kearns, Michael J., and Dana Ron. 1999. “Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.” Neural Computation 11:1427–53. https://doi.org/10.1162/089976699300016304.

Laber, Eric B., Daniel J. Lizotte, Min Qian, William E. Pelham, and Susan A. Murphy. 2014. “Dynamic Treatment Regimes: Technical Challenges and Applications.” Electronic Journal of Statistics 8:1225–72. https://doi.org/10.1214/14-EJS920.