Low-Regret Learning I

36-465/665, Spring 2021

27 April 2021 (Lecture 23)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \DeclareMathOperator{\sgn}{sgn} \newcommand{\Margin}{M} \newcommand{\CumLoss}{L} \newcommand{\EnsembleAction}{\overline{a}} \newcommand{\CumEnsembleLoss}{\overline{\CumLoss}} \newcommand{\Regret}{R} \]

Previously

Low regret learning

Risk vs. Regret

The game (“prediction with expert advice”)

Multiplicative weight training

Regret of the ensemble (vs. the best expert)

Sub-linear regret

A regret bound

If \(\Loss(y,a) \in [0,1]\) and is convex in \(a\), and we use multiplicative weight training with learning rate \(\beta\) over \(q\) experts, then \[ \Regret_n \leq \frac{n\beta}{8} + \frac{\log{q}}{\beta} \] and, with the right choice of \(\beta\), \[ \Regret_n \leq \sqrt{\frac{n}{2}\log{q}} \]
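As a concrete illustration (not from the lecture), here is a minimal sketch of multiplicative weight training in Python. It assumes squared loss, which is convex in the action and stays in \([0,1]\) when outcomes and predictions do; the function name and interface are my own.

```python
import numpy as np

def multiplicative_weights(expert_preds, y, beta):
    """Multiplicative-weight training for prediction with expert advice.

    expert_preds: (n, q) array, expert i's action s_i(t) at each round t
    y: (n,) array of outcomes, assumed in [0, 1]
    beta: learning rate
    Uses squared loss, which is convex in the action and lies in [0, 1]
    when actions and outcomes do. Returns the per-round ensemble losses
    and each expert's cumulative loss L_{i,n}.
    """
    n, q = expert_preds.shape
    w = np.ones(q)                       # initial weights w_{i,0} = 1
    ensemble_losses = np.empty(n)
    cum_expert_loss = np.zeros(q)
    for t in range(n):
        u = w / w.sum()                  # normalized weights u_{i,t-1}
        a_bar = u @ expert_preds[t]      # ensemble action = weighted average
        ensemble_losses[t] = (y[t] - a_bar) ** 2
        losses = (y[t] - expert_preds[t]) ** 2
        cum_expert_loss += losses
        w = w * np.exp(-beta * losses)   # multiplicative update
    return ensemble_losses, cum_expert_loss
```

With \(\beta = \sqrt{8\log{q}/n}\), the resulting regret (ensemble cumulative loss minus the best expert's) should always come in under \(\sqrt{(n/2)\log{q}}\).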

Proving the regret bound

Lower bound

\[\begin{eqnarray} \log{\frac{W_n}{W_0}} & = & \log{\sum_{i=1}^{q}{w_{i,n}}} - \log{\sum_{i=1}^{q}{w_{i,0}}}\\ & = & \log{\sum_{i=1}^{q}{\myexp{-\beta \CumLoss_{i,n}}}} - \log{q}\\ & \geq & \log{\max_{i \in 1:q}{\myexp{-\beta \CumLoss_{i,n}}}} - \log{q}\\ & = & -\beta\min_{i \in 1:q}{\CumLoss_{i,n}} - \log{q} \end{eqnarray}\]

Proving the regret bound (2)

Upper bound

\[\begin{eqnarray} \frac{W_n}{W_0} & = & \frac{W_n}{W_{n-1}}\frac{W_{n-1}}{W_{n-2}} \ldots \frac{W_1}{W_0}\\ \log{\frac{W_n}{W_0}} & = & \sum_{t=1}^{n}{\log{\frac{W_{t}}{W_{t-1}}}}\\ \log{\frac{W_{t}}{W_{t-1}}} & = & \log{\frac{\sum_{i=1}^{q}{w_{i,t}}}{\sum_{j=1}^{q}{w_{j,t-1}}}}\\ & = & \log{\frac{\sum_{i=1}^{q}{w_{i,t-1}\myexp{-\beta \Loss(y_t, s_i(t))}}}{\sum_{j=1}^{q}{w_{j,t-1}}}}\\ & = & \log{\sum_{i=1}^{q}{u_{i,t-1} \myexp{-\beta \Loss(y_t, s_i(t))}}} \end{eqnarray}\]

where \(u_{i,t-1} \equiv w_{i,t-1}/\sum_{j=1}^{q}{w_{j,t-1}}\) are the normalized weights, so the last line is a log of an expectation over experts

Proving the regret bound (3)

Upper bound

By Hoeffding's lemma, for any random variable \(X \in [0,1]\), \(\log{\Expect{\myexp{-\beta X}}} \leq -\beta \Expect{X} + \frac{\beta^2}{8}\). Apply this with \(X = \Loss(y_t, s_i(t))\) and expectations taken under the weights \(u_{i,t-1}\); then, because \(\Loss\) is convex in \(a\) and \(\EnsembleAction_t = \sum_{i=1}^{q}{u_{i,t-1} s_i(t)}\), Jensen's inequality gives \(\Loss(y_t, \EnsembleAction_t) \leq \sum_{i=1}^{q}{u_{i,t-1} \Loss(y_t, s_i(t))}\), so

\[\begin{eqnarray} \log{\sum_{i=1}^{q}{u_{i,t-1} \myexp{-\beta \Loss(y_t, s_i(t))}}} & \leq & -\beta \sum_{i=1}^{q}{u_{i,t-1} \Loss(y_t, s_i(t))} + \frac{\beta^2}{8}\\ & \leq & -\beta \Loss(y_t, \EnsembleAction_t) + \frac{\beta^2}{8} \end{eqnarray}\]

Proving the regret bound (4)

Upper bound

\[\begin{eqnarray} \log{\frac{W_n}{W_0}} & = & \sum_{t=1}^{n}{\log{\frac{W_{t}}{W_{t-1}}}}\\ & \leq & \sum_{t=1}^{n}{\left(-\beta \Loss(y_t, \EnsembleAction_t) + \frac{\beta^2}{8}\right)}\\ & = & -\beta\CumEnsembleLoss_n + \frac{n \beta^2}{8} \end{eqnarray}\]

Combine upper and lower bounds

\[\begin{eqnarray} -\beta\min_{i \in 1:q}{\CumLoss_{i,n}} - \log{q} & \leq & \log{\frac{W_n}{W_0}} \leq -\beta\CumEnsembleLoss_n + \frac{n \beta^2}{8}\\ \beta\left(\CumEnsembleLoss_n - \min_{i \in 1:q}{\CumLoss_{i,n}}\right) & \leq & \frac{n \beta^2}{8} + \log{q}\\ \Regret_n & \leq & \frac{n \beta}{8} + \frac{\log{q}}{\beta} ~ \Box \end{eqnarray}\]
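The sandwich above can be checked numerically. The following small simulation (my own, not from the lecture) runs multiplicative weights with absolute-error loss, which is convex and bounded in \([0,1]\) here, and verifies both sides of the bound on \(\log{(W_n/W_0)}\) as well as the final regret bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 500, 10
beta = np.sqrt(8 * np.log(q) / n)   # the optimal learning rate

# Experts make arbitrary predictions in [0,1] of a binary sequence;
# loss |y - a| is convex in a and bounded in [0, 1].
preds = rng.random((n, q))
y = rng.integers(0, 2, size=n).astype(float)

w = np.ones(q)                      # w_{i,0} = 1, so W_0 = q
ens_loss = np.zeros(n)
cum_loss = np.zeros(q)
for t in range(n):
    u = w / w.sum()                 # normalized weights
    a_bar = u @ preds[t]            # ensemble action
    ens_loss[t] = abs(y[t] - a_bar)
    losses = np.abs(y[t] - preds[t])
    cum_loss += losses
    w *= np.exp(-beta * losses)     # multiplicative update

log_ratio = np.log(w.sum()) - np.log(q)           # log(W_n / W_0)
lower = -beta * cum_loss.min() - np.log(q)        # lower bound on log-ratio
upper = -beta * ens_loss.sum() + n * beta**2 / 8  # upper bound on log-ratio
regret = ens_loss.sum() - cum_loss.min()

assert lower <= log_ratio <= upper
assert regret <= np.sqrt(n / 2 * np.log(q))
```

The two assertions mirror the two halves of the proof: the log of the total weight is squeezed between a function of the best expert's loss and a function of the ensemble's loss, and combining them yields the regret bound.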

Tricks, modifications

Unbounded horizons, changing learning rates

Summing up