\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \]

Previously

We learn a strategy by minimizing the empirical risk, \[ \hat{s} = \argmin_{s \in \ModelClass}{\EmpRisk(s)} \]
Or maybe we learn a strategy by minimizing a regularized version of the empirical risk \[ \hat{s} = \argmin_{s \in \ModelClass}{\EmpRisk(s) + \lambda \Penalty(s)} \]
- Equivalently: minimize over \(s\) where \(\Penalty(s) \leq c\)
Generalization error bounds have come from uniform convergence of empirical risks to true risks, \(\max_{s}{|\EmpRisk(s) - \Risk(s)|} \rightarrow 0\)
- Rademacher complexity bounds this maximum deviation in a distribution-dependent way
- Growth function, covering number, VC dimension all bound the maximum deviation in a distribution-free way
Regularization \(\Rightarrow\) faster uniform convergence
Approximate, numerical optimization \(\Rightarrow\) an extra source of error

Today

Uniform convergence uses the properties of the strategy set \(\ModelClass\) and the data-generating distribution
From this perspective the optimization algorithm is just a nuisance
But: the algorithm never even looks at most of \(\ModelClass\) in fitting
Do we really need to worry so much that strategies we never thought about might be badly behaved?
Maybe we can use properties of the fitting algorithm to guarantee generalization
Important property: stability, little changes to the data can’t make a big difference

The learning algorithm

For us, a learning algorithm is a function \(A\) from data sets \(Z_1, Z_2, \ldots Z_n\), for short \(Z_{1:n}\), to strategies in \(S\)
- For prediction problems each \(Z_i = (X_i, Y_i)\)
- Symbols: \(A: \mathcal{Z}^n \mapsto S\)
  - More strictly \(A: \mathcal{Z}^* \mapsto S\), the space of countable \(\mathcal{Z}\)-valued sequences
- In a CS course we’d add more restrictions related to computability
Think of the optimization problem plus the method used to approximate the optimum
Randomized algorithms are allowed
- In particular this applies to stochastic gradient descent…
Usually we insist that \(A\) doesn’t care about the order of the \(Z_i\)s but this isn’t always necessary

Stability

Roughly: an algorithm \(A\) is “stable” when small changes to the inputs lead to small (or no) changes to the output
Many, many different specific forms of stability
For learning algorithms: hypothesis stability vs. error stability
- “The output doesn’t change much” vs “The output works about as well”

Hypothesis stability

The algorithm \(A\) is hypothesis stable when, for any two data sets \(Z_{1:n}\) and \(Z^{\prime}_{1:n}\) that differ in only one data point, \(A(Z_{1:n})\) and \(A(Z^{\prime}_{1:n})\) must be close

In words: changing one data point can’t change the output very much
Difficulty: how do we measure distances between models or strategies?
- Not too hard when the output is, say, a curve or surface
- What to do when the output is, e.g., a medical treatment plan? (Laber et al. 2014)
- We don’t really care about the strategy, we care about its risk!
- Think: many local minima nearly as good as the global optimum

Error stability

The algorithm \(A\) is \(\beta_n\)-error stable (or just error stable) when, for any two data sets \(Z_{1:n}\) and \(Z^{\prime}_{1:n}\) that differ in only one data point, and any new data point \(z\), \(|\Loss(z, A(Z_{1:n})) - \Loss(z, A(Z^{\prime}_{1:n}))| \leq \beta_n\)

In words: changing one data point can’t change the performance of the output too much
Avoids comparing outputs, just compares their losses
\(\beta_n\) does usually change with \(n\) (as the notation implies)
- we’ll see that we’d like it to \(\rightarrow 0\) as \(n\rightarrow\infty\), and fairly fast at that

Error stability implies a generalization error bound

The definition of error stability might bring to mind the bounded difference property…
Suppose the loss function is bounded, \(0 \leq \Loss \leq m\), and that the learning algorithm \(A\) is \(\beta_n\)-stable. Pick any probability \(\alpha \in (0,1)\). Then \[ \Prob{\Risk(A(Z_{1:n})) \leq \EmpRisk(A(Z_{1:n})) + \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]
Proof: Deferred to homework, but the big idea is in two parts:
- Showing that error stability implies \(\Risk(A(Z_{1:n})) - \EmpRisk(A(Z_{1:n}))\) has the bounded difference property, so by the McDiarmid/bounded difference inequality it concentrates around its expected value
- Showing that the expectation is at most \(\beta_n\)

Error stability implies a generalization error bound (2)

\[ \Prob{\Risk(A(Z_{1:n})) \leq \EmpRisk(A(Z_{1:n})) + \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]

Bounds the risk of the output of our algorithm
Does so because of a property of the algorithm (error stability)
This bound says nothing about the risk of other strategies
In particular, does not need uniform convergence, \(\max_{s \in \ModelClass}{|\EmpRisk(s) - \Risk(s)|} \rightarrow 0\)

Increasingly stable

\[ \Prob{\Risk(A(Z_{1:n})) - \EmpRisk(A(Z_{1:n})) \leq \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]

We’d like the width of the bound to \(\rightarrow 0\) as \(n\rightarrow \infty\)
For that to be true, 1st part of the bound demands \(\beta_n \rightarrow 0\), or \(\beta_n = o(1)\)
2nd part of the bound demands more, \(\frac{n\beta_n}{\sqrt{n}} \rightarrow 0\), or \(\beta_n = o(1/\sqrt{n})\)
We need the algorithm to be increasingly stable as \(n\) grows, and reasonably quickly
It turns out that many algorithms actually have \(\beta_n = O(1/n)\)

A stability bound for ridge regression

Suppose we’re doing ridge regression, with penalty factor \(\lambda\), and all the \(X\) vectors are of bounded length, \(\Prob{\|X\| \leq \rho} = 1\), and that the squared-error loss is bounded above by \(m\). Then for any \(\alpha \in (0,1)\), \[ \Prob{\Risk(\text{ridge}) \leq \EmpRisk(\text{ridge}) + \frac{4m\rho^2}{\lambda n} + \left(\frac{8m\rho^2}{\lambda} + m\right)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]

That is, under these conditions ridge regression is error-stable, with \(\beta_n = 4m\rho^2/\lambda n\)
- Notice \(\beta_n \propto 1/n\) so it’s \(o(1/\sqrt{n})\) and this bound gets tighter and tighter as \(n\) grows
Proof is deferred until we look at kernel methods in a few weeks, because this is a special case of a more general result about kernel regression
Higher \(\lambda\) \(\Rightarrow\) smaller \(\beta_n\) \(\Rightarrow\) more stable algorithm
For the bound to be small, need \(\frac{1}{\lambda \sqrt{n}} \ll 1\), or \(\lambda \gg 1/\sqrt{n}\), but usually we want to let \(\lambda \rightarrow 0\) as \(n\) grows…

Stability vs. Rademacher complexity

Here’s our stability bound again: \[ \Prob{\Risk(A(Z_{1:n})) - \EmpRisk(A(Z_{1:n})) \leq \beta_n + (2n\beta_n + m)\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]
And here’s our Rademacher bound from lecture 9: \[ \Prob{\max_{s \in \mathcal{S}}{\Risk(s) - \EmpRisk(s)} \leq 2\Rademacher_n + m\sqrt{\frac{\log{1/\alpha}}{2n}}} \geq 1-\alpha \]
\(\Rademacher_n\) typically \(O(1/\sqrt{n})\), good algorithms have \(\beta_n = O(1/n)\), so both bounds end up \(O(1/\sqrt{n})\)
\(\beta_n\) is acting kind of like \(\Rademacher_n\) (or vice versa)
The Rademacher bound holds uniformly over all strategies, including the ERM \(\hat{s}\) and whatever the algorithm produces, \(A(Z_{1:n})\)
- In that sense, the Rademacher bound is stronger bound than the stability bound
- On the other hand, the stability bound can hold even if uniform convergence doesn’t
The stability bound only tells us about the output of the algorithm \(A\)
The Rademacher bound will force \(\Risk(\hat{s}) \rightarrow \Risk(\OptimalModel)\) (as we’ve seen)
The stability bound will not guarantee that \(\Risk(A(Z_{1:n})) \rightarrow \Risk(\OptimalModel)\)
- The algorithm might be stable, but not converge on the optimal strategy
- Extreme case: \(A\) just ignores the data and always outputs its favorite \(s_0 \in \ModelClass\); very tight generalization error bound, much stability, no convergence to the optimum (unless that’s \(s_0\))

Summing up

Stability of the learning algorithm is another, separate root to a generalization error bound
This can work even if uniform convergence doesn’t
Stability often comes from penalties or constraints on the optimization
There are some advantages to being able to rely on uniform convergence though

Backup: History

Some work on algorithmic stability back to the 1970s
Key paper: Kearns and Ron (1999) first clearly distinguished hypothesis stability and error stability, and argued for the latter
- More intuitive definition of error stability, that adding one new data point can’t change the performance very much; leads to slightly more awkward bounds, though
Modern stability bounds (like the one quoted) descend from Bousquet and Elisseeff (2002)
Domingos (1999) gives an interesting related approach, basically looking at how many distinct models the optimization algorithm searched through

References

Bousquet, Olivier, and André Elisseeff. 2002. “Stability and Generalization.” Journal of Machine Learning Research 2:499–526. http://jmlr.csail.mit.edu/papers/v2/bousquet02a.html.

Domingos, Pedro. 1999. “Process-Oriented Estimation of Generalization Error.” In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 714–19. San Francisco: Morgan Kaufmann. http://www.cs.washington.edu/homes/pedrod/papers/ijcai99.pdf.

Kearns, Michael J., and Dana Ron. 1999. “Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation.” Neural Computation 11:1427–53. https://doi.org/10.1162/089976699300016304.

Laber, Eric B., Daniel J. Lizotte, Min Qian, William E. Pelham, and Susan A. Murphy. 2014. “Dynamic Treatment Regimes: Technical Challenges and Applications.” Electronic Journal of Statistics 8:1225–72. https://doi.org/10.1214/14-EJS920.