36-465/665, Spring 2021
29 April 2021 (Lecture 24)
A “one-armed bandit” is another name for a “slot machine”, a gambling device where you put a coin in a slot and then get to pull a lever or arm, which generates a random pay-off, usually zero. It’s called a “bandit” because it takes your money (with high probability). Some machines have two arms, one on each side, where typically one arm has a higher probability of small rewards and the other a lower probability of larger rewards. (Both arms will have negative expected rewards, because the owner of the slot machine wants to make money.) The “two-armed bandit” statistical problem is to figure out, from observation, which arm has the higher expected reward. This has many real-world applications (which medical treatment or educational technique works better, on average?). Calling this a “bandit problem” is yet another example, like “bootstrap” or “Monte Carlo”, of a phrase that began as a joke and hardened into obscure jargon.
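As a concrete illustration of the estimation problem the footnote describes, here is a minimal simulation sketch, not taken from the lecture: the payoff probabilities, payoff sizes, and the epsilon-greedy rule below are all illustrative assumptions (per-pull coin cost is ignored, so we compare gross expected payoffs).

```python
import random

# Hypothetical two-armed bandit: arm 0 pays 1 unit with probability 0.12
# (gross expected payoff 0.12); arm 1 pays 5 units with probability 0.03
# (gross expected payoff 0.15). The coin you feed the machine is ignored,
# so on a real machine both net expectations would be negative.
PAYOFF_PROB = [0.12, 0.03]
PAYOFF_SIZE = [1.0, 5.0]

def pull(arm):
    """Pull one arm and return its random payoff (usually zero)."""
    return PAYOFF_SIZE[arm] if random.random() < PAYOFF_PROB[arm] else 0.0

def epsilon_greedy(n_pulls=10_000, epsilon=0.1):
    """Estimate each arm's expected payoff while mostly playing the
    current favorite (exploit) and occasionally trying the other (explore)."""
    totals = [0.0, 0.0]  # cumulative payoff per arm
    counts = [0, 0]      # number of pulls per arm
    for _ in range(n_pulls):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(2)  # explore (or: no data yet on some arm)
        else:
            means = [totals[a] / counts[a] for a in range(2)]
            arm = max(range(2), key=lambda a: means[a])  # exploit the leader
        totals[arm] += pull(arm)
        counts[arm] += 1
    return [totals[a] / counts[a] for a in range(2)], counts

means, counts = epsilon_greedy()
print("estimated expected payoffs:", means, "pull counts:", counts)
```

Run long enough, the pull counts concentrate on the arm with the higher estimated mean (here arm 1, with gross expected payoff 0.15 versus 0.12), which is exactly the “which arm is better, on average?” question the footnote poses.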