Recap

We want to control (or at least estimate) $r(\hat{s}) - \hat{r}(\hat{s})$ and $r(\hat{s}) - r(s^*)$
If we can find the maximum deviation $\Gamma_n = \max_{s \in S}{|\hat{r}(s) - r(s)|}$, that’d be enough
- Because $|r(\hat{s}) - \hat{r}(\hat{s})| \leq \Gamma_n$ and $r(\hat{s}) - r(s^*) \leq 2\Gamma_n$
$\Gamma_n \rightarrow 0$ is uniform convergence (of $\hat{r}(s)$ to $r(s)$)
Could try to prove uniform convergence by direct approximation arguments, but the math is too hard
You were promised a short-cut…
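Where the factor of 2 in the recap comes from, as a sketch: this assumes (as in the usual set-up, though it is not restated on this slide) that $\hat{s}$ minimizes the empirical risk $\hat{r}$ over $S$ and $s^*$ minimizes the true risk $r$ over $S$. \[\begin{eqnarray} r(\hat{s}) - r(s^*) & = & \left( r(\hat{s}) - \hat{r}(\hat{s}) \right) + \left( \hat{r}(\hat{s}) - \hat{r}(s^*) \right) + \left( \hat{r}(s^*) - r(s^*) \right)\\ & \leq & \Gamma_n + 0 + \Gamma_n = 2\Gamma_n \end{eqnarray}\]
- The middle term is $\leq 0$ because $\hat{s}$ minimizes $\hat{r}$; the outer two terms are each at most $\Gamma_n$ in absolute value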

What we really need is the expected maximum deviation

Remember $|r(\hat{s}) - \hat{r}(\hat{s})| \leq \Gamma_n$
Suppose $0 \leq \ell\leq m$
- $\Rightarrow$ $\hat{r}$ has bounded-difference property with bound $m/n$
- $\Rightarrow$ $\Gamma_n$ also has BDP with bound $m/n$
Put this all together, for $\epsilon > \mathbb{E}\left[ \Gamma_n \right]$: \[\begin{eqnarray} \mathbb{P}\left( r(\hat{s}) \geq \hat{r}(\hat{s}) + \epsilon \right) & \leq & \mathbb{P}\left( \Gamma_n \geq \epsilon \right)\\ & = & \mathbb{P}\left( \Gamma_n - \mathbb{E}\left[ \Gamma_n \right] \geq \epsilon - \mathbb{E}\left[ \Gamma_n \right] \right)\\ & \leq & \exp{\left( -2n(\epsilon-\mathbb{E}\left[ \Gamma_n \right])^2 / m^2 \right)} \end{eqnarray}\]
Pick an $\alpha \in (0,1)$, set that equal to the right-hand side, solve for $\epsilon$: \[ \mathbb{P}\left( r(\hat{s}) \geq \hat{r}(\hat{s}) + \mathbb{E}\left[ \Gamma_n \right] + m\sqrt{\frac{\log{1/\alpha}}{2n}} \right) \leq \alpha \]
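To get a feel for the size of the confidence term $m\sqrt{\log(1/\alpha)/2n}$, here is a minimal Python sketch. The function name `confidence_margin` and the example values are made up for illustration; they are not from the slides.

```python
import math

def confidence_margin(m: float, n: int, alpha: float) -> float:
    """Return m * sqrt(log(1/alpha) / (2n)), the slack term in the bound.

    m is the upper bound on the loss, n the sample size, and alpha the
    allowed probability that the bound fails.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if n <= 0 or m < 0:
        raise ValueError("need n > 0 and m >= 0")
    return m * math.sqrt(math.log(1.0 / alpha) / (2.0 * n))

if __name__ == "__main__":
    # Hypothetical values: 0-1 loss (m = 1), alpha = 0.05.
    for n in (100, 1_000, 10_000):
        print(n, round(confidence_margin(1.0, n, 0.05), 4))
```

The margin shrinks like $1/\sqrt{n}$, so a hundred times more data only buys a ten-fold smaller margin.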
Maybe we can calculate/estimate/approximate $\mathbb{E}\left[ \Gamma_n \right]$?

Maximum deviation vs. expected maximum deviation

We’ll come back to losses and risks specifically later; for now we just have $Z_1, \ldots Z_n$ all IID, and some class of functions $\mathcal{F}$
- Abbreviate $Z_1, Z_2, \ldots Z_n$ by $Z_{1:n}$

\[ \Gamma_n = \max_{f \in \mathcal{F}}{\left|\frac{1}{n}\sum_{i=1}^{n}{f(Z_i)} - \mathbb{E}\left[ f \right]\right|} \]

$\Gamma_n$ is the maximum deviation over $\mathcal{F}$
- pedantically: maximum absolute deviation of a function from its expected value…
Just as on last slide, what matters is $\mathbb{E}\left[ \Gamma_n \right] \rightarrow 0$, plus $\Gamma_n$ concentrating around $\mathbb{E}\left[ \Gamma_n \right]$
We’ve already got a handle on $\Gamma_n$ concentrating around its expectation (via the bounded-difference property)
How do we get a handle on $\mathbb{E}\left[ \Gamma_n \right]$, the expected maximum deviation?

Expected maximum deviation vs. expected maximum discrepancy

Pretend we have a ghost sample $Z^{\prime}_1, \ldots Z^{\prime}_n$ which have the same distribution as $Z_i$ but are independent \[\begin{eqnarray} \left| \left(\frac{1}{n}\sum_{i=1}^{n}{f(Z_i)}\right) - \mathbb{E}\left[ f \right] \right| & = & \left| \frac{1}{n}\sum_{i=1}^{n}{(f(Z_i) - \mathbb{E}\left[ f \right])}\right|\\ & = & \left|\frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - \mathbb{E}\left[ f(Z^{\prime}_i) \right]}\right|\\ & = & \left| \frac{1}{n}\sum_{i=1}^{n}{\mathbb{E}\left[ f(Z_i) - f(Z^{\prime}_i)|Z_i \right]} \right| \\ & = & \left| \mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z^{\prime}_i)|Z_{1:n}} \right] \right| \end{eqnarray}\] because all the $Z^{\prime}_i$s are independent of the $Z_i$s
$n^{-1} \sum_{i=1}^{n}{f(Z_i) - f(Z^{\prime}_i)}$ is the discrepancy between two sample averages for $f$
- $=$ how much the sample average changes just due to sampling noise

Expected maximum deviation vs. expected maximum discrepancy (2)

\[\begin{eqnarray} \Gamma_n = \max_{f \in \mathcal{F}}{\left| \left( \frac{1}{n}\sum_{i=1}^{n}{f(Z_i)} \right) - \mathbb{E}\left[ f \right] \right| } & = & \max_{f \in \mathcal{F}}{ \left| \mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z^{\prime}_i)} \mid Z_{1:n} \right] \right| }\\ \end{eqnarray}\]

But $\max{\left| \mathbb{E}\left[ \cdot \right] \right| } \leq \mathbb{E}\left[ \max{\left| \cdot \right|} \right]$ so \[\begin{eqnarray} \Gamma_n & \leq & \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z_i^{\prime})} \right|} \mid Z_{1:n} \right]\\ \mathbb{E}\left[ \Gamma_n \right] & \leq & \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z_i^{\prime})} \right|} \right] \end{eqnarray}\]
The expected maximum deviation $\leq$ the expected maximum discrepancy
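A quick simulation can make the inequality concrete. This is a sketch under assumptions that are not in the slides: the $Z_i$ are uniform on $(0,1)$, and $\mathcal{F}$ is a finite class of threshold indicators $f_t(z) = \mathbb{1}\left\{ z \leq t \right\}$ on a grid of $t$ values, so that $\mathbb{E}\left[ f_t \right] = t$ is known exactly. All names are illustrative.

```python
import numpy as np

def simulate(n=50, reps=2000, seed=0):
    """Monte Carlo estimates of E[max deviation] and E[max discrepancy]."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.05, 0.95, 19)  # the finite class F (assumed)
    dev = np.empty(reps)
    disc = np.empty(reps)
    for r in range(reps):
        z = rng.uniform(size=n)        # the real sample
        z_ghost = rng.uniform(size=n)  # the ghost sample
        # Sample average of f_t for every t (rows: data points, columns: t).
        avg = (z[:, None] <= thresholds).mean(axis=0)
        avg_ghost = (z_ghost[:, None] <= thresholds).mean(axis=0)
        dev[r] = np.max(np.abs(avg - thresholds))  # Gamma_n, since E[f_t] = t
        disc[r] = np.max(np.abs(avg - avg_ghost))  # maximum discrepancy
    return dev.mean(), disc.mean()

if __name__ == "__main__":
    expected_dev, expected_disc = simulate()
    print("E[Gamma_n]     ~", round(expected_dev, 4))
    print("E[discrepancy] ~", round(expected_disc, 4))
```

The first number should come out no larger than the second, up to simulation noise.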

Expected maximum discrepancy to Rademacher complexity

We want to understand \[ \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z_i^{\prime})} \right|} \right] \]
Focus on $n^{-1} \sum_{i=1}^{n}{f(Z_i) - f(Z_i^{\prime})}$
$Z^{\prime}_i \stackrel{d}{=}Z_i$ and they’re independent, so $f(Z_i) - f(Z_i^{\prime}) \stackrel{d}{=}f(Z_i^{\prime}) - f(Z_i)$
Introduce new random variables $\sigma_i$, $=\pm 1$ with equal probability, independent of each other and of $Z$ and $Z^{\prime}$
- Rademacher random variables (but $\sigma$ for “s” for “sign”) \[\begin{eqnarray} \sigma_i (f(Z_i) - f(Z_i^{\prime})) & \stackrel{d}{=}& f(Z_i) - f(Z_i^{\prime})\\ \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z_i^{\prime})} & \stackrel{d}{=}& \frac{1}{n}\sum_{i=1}^{n}{\sigma_i (f(Z_i) - f(Z_i^{\prime}))}\\ \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{f(Z_i) - f(Z_i^{\prime})} \right|} \right] & = & \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i (f(Z_i) - f(Z_i^{\prime}))} \right|} \right]\\ & \leq & \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} \right] + \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{-\sigma_i f(Z^{\prime}_i)}\right|} \right]\\ & = & 2\mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} \right]\\ & \equiv & 2 \mathcal{R}_n(\mathcal{F}) \end{eqnarray}\]

Rademacher complexity

The Rademacher complexity of a function class $\mathcal{F}$, on $n$ samples, is \[ \mathcal{R}_n(\mathcal{F}) \equiv \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} \right] \]
- Often omit the $\mathcal{F}$ when clear from context
Some people include a factor of 2 in the definition because, as we just saw, \[ \mathbb{E}\left[ \Gamma_n \right] \leq 2 \mathcal{R}_n \]
Unpack $\mathcal{R}_n$ from the inside out: \[ \left|\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right| = \text{magnitude of sample covariance between function $f$ and binary noise $\sigma$} \] \[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} = \text{maximum apparent covariance between our functions and binary noise} \] \[ \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} \right] = \text{expected maximum apparent covariance between our functions and noise} \]

Rademacher complexity (2)

$\mathcal{R}_n(\mathcal{F}) =$ how well can functions from $\mathcal{F}$ correlate with pure binary noise?
High Rademacher complexity $=$ $\mathcal{F}$ seems like it can fit anything (including video snow)
Low Rademacher complexity $=$ $\mathcal{F}$ just can’t fit pure noise well
Aside: Humans (or at least undergrads at my alma mater) are remarkably good at coming up with stories that seem to fit with random noise, and their learning performance is predicted by their Rademacher complexities (Zhu, Rogers, and Gibson 2009)
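The "can it fit noise?" reading can be checked exactly on a toy case. The sketch below holds the data points fixed and averages over all $2^n$ sign sequences, so it is only feasible for small $n$; the function name `exact_rademacher` and the two toy classes are assumptions made up for illustration.

```python
from itertools import product
import numpy as np

def exact_rademacher(values):
    """Exact average over sigma of max_f |(1/n) sum_i sigma_i f(z_i)|.

    values[j, i] = f_j(z_i) for a finite class at fixed data points.
    Enumerates all 2^n sign vectors, so keep n small.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        corr = values @ np.array(signs) / n  # one correlation per function
        total += np.max(np.abs(corr))
    return total / 2 ** n

if __name__ == "__main__":
    n = 8
    # Rich class: every possible +/-1 labelling of the n points.
    rich = np.array(list(product((-1.0, 1.0), repeat=n)))
    # Poor class: a single constant function.
    poor = np.ones((1, n))
    print("rich class:", exact_rademacher(rich))
    print("poor class:", exact_rademacher(poor))
```

The rich class matches every sign sequence perfectly, so its value is exactly 1; the single constant function can only pick up the chance imbalance between $+1$s and $-1$s, which is much smaller.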

Rademacher complexity and generalization error

Remember from above: for any $\alpha \in (0,1)$, \[ \mathbb{P}\left( r(\hat{s}) \geq \hat{r}(\hat{s}) + \mathbb{E}\left[ \Gamma_n \right] + m\sqrt{\frac{\log{1/\alpha}}{2n}} \right) \leq \alpha \]
But $\mathbb{E}\left[ \Gamma_n \right] \leq 2\mathcal{R}_n$ so \[ \mathbb{P}\left( r(\hat{s}) \geq \hat{r}(\hat{s}) + 2\mathcal{R}_n + m\sqrt{\frac{\log{1/\alpha}}{2n}} \right) \leq \alpha \]
This is our first generalization error bound
- Remember this assumes a bounded loss function, $0 \leq \ell\leq m$

How do we calculate the Rademacher complexity?

\[ \mathcal{R}_n(\mathcal{F}) = \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} \right] \]

Could try to do math…
- Lots of clever results on $\mathcal{R}_n(\mathcal{F})$ for various $\mathcal{F}$
- Some useful references for such results: Bartlett and Mendelson (2002), Mohri, Rostamizadeh, and Talwalkar (2012) (see under particular model classes)

How do we calculate the Rademacher complexity? (2)

\[ \mathcal{R}_n(\mathcal{F}) = \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|} \right] \]

We could do statistics: fix (condition on) $Z_1, \ldots Z_n$, and average over $\sigma$: \[\begin{eqnarray} \hat{\mathcal{R}}_n(\mathcal{F}) & \equiv & \mathbb{E}\left[ \max_{f \in \mathcal{F}}{\left| \frac{1}{n}\sum_{i=1}^{n}{\sigma_i f(Z_i)}\right|}|Z_{1:n} \right]\\ \mathcal{R}_n(\mathcal{F}) & = & \mathbb{E}\left[ \hat{\mathcal{R}}_n(\mathcal{F}) \right] \end{eqnarray}\]
$\hat{\mathcal{R}}_n(\mathcal{F})$ is the empirical Rademacher complexity of $\mathcal{F}$, and it is random (because it’s a function of $Z_{1:n}$)
Offline: $\hat{\mathcal{R}}_n$ again has the bounded difference property, with bound $m/n$ for each $Z_i$
$\therefore$ $\mathbb{P}\left( \mathcal{R}_n \geq \hat{\mathcal{R}}_n + \epsilon \right) \leq \exp{\left( -2n\epsilon^2/m^2 \right)}$ and we can use the empirical Rademacher complexity in our bounds (at some cost in confidence / margin of error) \[ \mathbb{P}\left( r(\hat{s}) \geq \hat{r}(\hat{s}) + 2\hat{\mathcal{R}}_n + 3m\sqrt{\frac{\log{2/\alpha}}{2n}} \right) \leq \alpha \]
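One way to see where the $3m$ and the $\log{2/\alpha}$ come from, as a sketch using a union bound (this book-keeping is not spelled out on the slide): write $\delta = m\sqrt{\log(2/\alpha)/2n}$, so that each of the two concentration bounds fails with probability at most $\alpha/2$, \[\begin{eqnarray} \mathbb{P}\left( \Gamma_n \geq \mathbb{E}\left[ \Gamma_n \right] + \delta \right) & \leq & \alpha/2\\ \mathbb{P}\left( \mathcal{R}_n \geq \hat{\mathcal{R}}_n + \delta \right) & \leq & \alpha/2 \end{eqnarray}\]
Except on an event of probability at most $\alpha$, neither of these happens, and then \[ r(\hat{s}) - \hat{r}(\hat{s}) \leq \Gamma_n < \mathbb{E}\left[ \Gamma_n \right] + \delta \leq 2\mathcal{R}_n + \delta < 2(\hat{\mathcal{R}}_n + \delta) + \delta = 2\hat{\mathcal{R}}_n + 3\delta \]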

Even more approximation

Simulate a $\sigma_i$ sequence and do the optimization $\max_{f \in \mathcal{F}}{| n^{-1}\sum_{i=1}^{n}{\sigma_i f(z_i)} |}$
- i.e., hold the $Z$s fixed at their observed values, then independently make up $\sigma$s, then search for the function which maximizes the sample covariance between $\sigma_i$ and $f(z_i)$
This is an unbiased estimate of $\hat{\mathcal{R}}_n$
Again, the bounded-difference property applies, so we can use the average of such simulations to estimate $\hat{\mathcal{R}}_n$
$\hat{\mathcal{R}}_n$ estimates $\mathcal{R}_n$
$2\mathcal{R}_n$ upper-bounds $\mathbb{E}\left[ \Gamma_n \right]$
which gives us a bound on $r(\hat{s}) - \hat{r}(\hat{s})$
Some book-keeping but we’ll do that in the homework!
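The simulation recipe above might look like the following in Python. This is a sketch for the special case of a finite class, where the "optimization" is just a maximum over rows of a matrix; `values[j, i]` holds $f_j(z_i)$ at the observed data. The function name, the threshold class, and the stand-in sample are all invented for illustration. For richer classes the inner maximization would need a real optimizer.

```python
import numpy as np

def empirical_rademacher(values, n_sims=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    values[j, i] = f_j(z_i) for a finite class evaluated at the fixed,
    observed data points. Returns (estimate, simulation standard error).
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    rng = np.random.default_rng(seed)
    # Each row of sigma is one simulated sequence of n random signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_sims, n))
    # corr[b, j] = (1/n) * sum_i sigma[b, i] * f_j(z_i)
    corr = sigma @ values.T / n
    maxima = np.abs(corr).max(axis=1)  # max over f, one value per simulation
    return maxima.mean(), maxima.std(ddof=1) / np.sqrt(n_sims)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.uniform(size=200)  # stand-in for the observed data
    thresholds = np.linspace(0.05, 0.95, 19)
    values = (z[None, :] <= thresholds[:, None]).astype(float)
    est, se = empirical_rademacher(values)
    print(f"R-hat_n ~ {est:.4f} (simulation s.e. {se:.4f})")
```

Averaging over many simulated sign sequences reduces the simulation noise, but it does nothing about the remaining gap between $\hat{\mathcal{R}}_n$ and $\mathcal{R}_n$.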

Summing up

We want to know the expected maximum deviation $\mathbb{E}\left[ \Gamma_n \right]$ of the empirical risk from the true risk
$\mathbb{E}\left[ \Gamma_n \right] \leq$ the expected maximum discrepancy between empirical risks on two independent samples
The expected maximum discrepancy is $\leq 2\times$ expected maximum covariance with binary noise
$\therefore$ $\mathbb{E}\left[ \Gamma_n \right] \leq 2\mathcal{R}_n$
Sometimes people can calculate $\mathcal{R}_n$
Otherwise we approximate it on the training data (empirical Rademacher complexity, simulations)

Backup: Strategies vs. Losses

Set $Z_i = (X_i, Y_i)$
Set $f(Z_i) = \ell(Y_i, s(X_i))$
Each strategy $s \in S$ corresponds to an $f \in \mathcal{F}$
- $\mathcal{F}$ is called the loss class; people sometimes even call an $f$ a loss function, but that’s really confusing so please don’t
The Rademacher complexity we really care about is the R.C. of $\mathcal{F}$
Often it’s easier to get the Rademacher complexity of $S$, especially when “actions” take numerical values
Useful definition: a function $g$ is Lipschitz with constant $h$, or $h$-Lipschitz, if $|g(x) - g(y)| \leq h d(x,y)$ for all $x$ and $y$, where $d(x,y)$ is the distance (in whatever metric makes sense) between $x$ and $y$
- If $g$ is differentiable, the smallest such $h$ is the supremum of the absolute derivative $|g^{\prime}|$
- Examples: $|x|$ is Lipschitz; $\sin{x}$ is Lipschitz; $x^2$ is Lipschitz on the interval $[-m, m]$ (can you find the constants in each case?)
- Non-examples: $\mathbb{1}\left\{ x \leq a \right\}$ is not Lipschitz; $1/x$ is not Lipschitz on $(0, a)$; $\log{p}$ is not Lipschitz on $(0,1)$
Useful fact (which we will not prove): if $\ell(y, y^{\prime})$ is $h$-Lipschitz in $y-y^{\prime}$, with $\ell(y,y)=0$, then $\mathcal{R}_n(\mathcal{F}) \leq 2h \mathcal{R}_n(S)$ (Bartlett and Mendelson 2002, theorem 12)
Why it’s a useful fact: it lets us calculate the Rademacher (or empirical Rademacher) complexity of the strategies and then multiply to get the Rademacher complexity we need for generalization error bounds
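Two worked instances of the useful fact, as a sketch. The bound $b$ on $|y - y^{\prime}|$ in the second case is an extra assumption introduced here for illustration, not something from the slides.
- Absolute-error loss, $\ell(y, y^{\prime}) = |y - y^{\prime}|$: it is $1$-Lipschitz in $y - y^{\prime}$ and $\ell(y,y) = 0$, so \[ \mathcal{R}_n(\mathcal{F}) \leq 2 \mathcal{R}_n(S) \]
- Squared-error loss, $\ell(y, y^{\prime}) = (y - y^{\prime})^2$, when $|y - y^{\prime}| \leq b$ always: the derivative of $u^2$ is $2u$, which is at most $2b$ in absolute value on $[-b, b]$, so the loss is $2b$-Lipschitz and \[ \mathcal{R}_n(\mathcal{F}) \leq 4b \mathcal{R}_n(S) \]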

References

Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” Journal of Machine Learning Research 3:463–82. http://jmlr.csail.mit.edu/papers/v3/bartlett02a.html.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. Cambridge, Massachusetts: MIT Press.

Zhu, Xiaojin, Timothy Rogers, and Bryan Gibson. 2009. “Human Rademacher Complexity.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, John Lafferty, C. K. I. Williams, and A. Culotta, 2322–30. http://papers.nips.cc/paper/3771-human-rademacher-complexity.

Bounding the Maximum Deviation with Rademacher Complexity
