36-465/665, Spring 2021
29 April 2021 (Lecture 24)
A “one-armed bandit” is another name for a “slot machine”, a gambling device where you put a coin in a slot and then get to pull a lever or arm, which generates a random pay-off, usually zero. It’s called a “bandit” because it takes your money (with high probability). Some machines have two arms, one on each side, where typically one arm has a higher probability of small rewards and the other a lower probability of larger rewards. (Both arms will have negative expected rewards, because the owner of the slot machine wants to make money.) The “two-armed bandit” statistical problem is to figure out, from observation, which arm has the higher expected reward. This has many real-world applications (which medical treatment or educational technique works better, on average?). Calling this a “bandit problem” is yet another example, like “bootstrap” or “Monte Carlo”, of a phrase that began as a joke and hardened into obscure jargon.
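As a concrete illustration of the estimation problem the footnote describes, here is a minimal simulation sketch, not taken from the lecture: the payoff probabilities, payoff sizes, and the epsilon-greedy rule below are all illustrative assumptions (per-pull coin cost is ignored, so we compare gross expected payoffs).

```python
import random

# Hypothetical two-armed bandit: arm 0 pays 1 unit with probability 0.12
# (gross expected payoff 0.12); arm 1 pays 5 units with probability 0.03
# (gross expected payoff 0.15). The coin you feed the machine is ignored,
# so on a real machine both net expectations would be negative.
PAYOFF_PROB = [0.12, 0.03]
PAYOFF_SIZE = [1.0, 5.0]

def pull(arm):
    """Pull one arm and return its random payoff (usually zero)."""
    return PAYOFF_SIZE[arm] if random.random() < PAYOFF_PROB[arm] else 0.0

def epsilon_greedy(n_pulls=10_000, epsilon=0.1):
    """Estimate each arm's expected payoff while mostly playing the
    current favorite (exploit) and occasionally trying the other (explore)."""
    totals = [0.0, 0.0]  # cumulative payoff per arm
    counts = [0, 0]      # number of pulls per arm
    for _ in range(n_pulls):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(2)  # explore (or: no data yet on some arm)
        else:
            means = [totals[a] / counts[a] for a in range(2)]
            arm = max(range(2), key=lambda a: means[a])  # exploit the leader
        totals[arm] += pull(arm)
        counts[arm] += 1
    return [totals[a] / counts[a] for a in range(2)], counts

means, counts = epsilon_greedy()
print("estimated expected payoffs:", means, "pull counts:", counts)
```

Run long enough, the pull counts concentrate on the arm with the higher estimated mean (here arm 1, with gross expected payoff 0.15 versus 0.12), which is exactly the “which arm is better, on average?” question the footnote poses.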