Deviation Bounds I: Markov's Inequality, etc.

36-465/665, Spring 2021

16 February 2021 (Lecture 5)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \]

Previously

What we’re building towards

If we have \(n\) samples, then with probability at least \(1-\alpha\), \[ \Risk(\hat{s}) \leq \EmpRisk(\hat{s}) + g(n,\alpha) \] for some function \(g\) we can calculate

If we have \(n\) samples, then with probability at least \(1-\alpha\), \[ \Risk(\hat{s}) \leq \Risk(s^*) + h(n,\alpha) \] for some function \(h\) we can calculate

Why we’ll detour through probability theory

Markov’s inequality

Markov’s inequality (2)

\[ \Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon} \]
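
As a quick sanity check (a sketch added here, not from the original slides), recall that Markov's inequality needs \(Z \geq 0\) and \(\epsilon > 0\). The snippet below draws non-negative random variables and confirms that the empirical tail probability never exceeds \(\Expect{Z}/\epsilon\); the exponential distribution and the sample size are arbitrary choices.

```python
# Minimal numerical check of Markov's inequality (illustrative sketch):
# for non-negative Z and eps > 0, P(Z >= eps) <= E[Z]/eps.
import numpy as np

rng = np.random.default_rng(465)             # arbitrary seed
Z = rng.exponential(scale=1.0, size=10**6)   # non-negative draws, E[Z] = 1

for eps in (0.5, 1.0, 2.0, 5.0):
    tail = np.mean(Z >= eps)                 # empirical P(Z >= eps)
    bound = Z.mean() / eps                   # Markov bound E[Z]/eps
    print(f"eps={eps:3.1f}  P(Z >= eps) = {tail:.4f}  <=  E[Z]/eps = {bound:.4f}")
```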

New inequalities from old

\[ \Prob{f(Z) \geq \epsilon} \leq \frac{\Expect{f(Z)}}{\epsilon} \]

From Markov to Chebyshev
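
A sketch of the standard derivation, filled in here for reference: apply the generalized Markov inequality above with \(f(z) = (z - \Expect{Z})^2\) and threshold \(\epsilon^2\),

\[ \Prob{\left|Z - \Expect{Z}\right| \geq \epsilon} = \Prob{(Z - \Expect{Z})^2 \geq \epsilon^2} \leq \frac{\Expect{(Z - \Expect{Z})^2}}{\epsilon^2} = \frac{\Var{Z}}{\epsilon^2} \]

which is Chebyshev's inequality.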

Our first deviation inequality

How Chebyshev proved the law of large numbers

How good is Chebyshev?
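
To see how loose the bound can be, here is a small numerical comparison (an illustrative sketch, not a figure from the slides): for a standard Gaussian, Chebyshev gives \(\Prob{|Z| \geq \epsilon} \leq 1/\epsilon^2\), while the exact tail probability shrinks like \(e^{-\epsilon^2/2}\).

```python
# Chebyshev bound vs. exact two-sided Gaussian tail for Z ~ N(0, 1):
# Chebyshev gives P(|Z| >= eps) <= 1/eps^2, while the true tail decays
# like exp(-eps^2 / 2), so the bound becomes increasingly loose.
from scipy.stats import norm

for eps in (1.0, 2.0, 3.0, 4.0, 5.0):
    exact = 2 * norm.sf(eps)     # true P(|Z| >= eps) for a standard Gaussian
    chebyshev = 1.0 / eps**2     # Var[Z]/eps^2 with Var[Z] = 1
    print(f"eps={eps:.0f}  exact={exact:.2e}  Chebyshev={chebyshev:.2e}  ratio={chebyshev/exact:.1f}")
```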

Why it matters that the Chebyshev bound is so loose for Gaussians

A very hand-wavy introduction to large deviations

(This conclusion is correct but there’s a missing assumption we should be explicit about: what? [see backup])

The exponential Markov inequality

Exponential Markov inequalities / Chernoff Bounds

\[\begin{eqnarray} \Prob{X \geq \epsilon} & \leq & \min_{t \geq 0}{e^{-t\epsilon}\Expect{e^{tX}}}\\ \Prob{X \leq \epsilon} & \leq & \min_{t > 0}{e^{t\epsilon}\Expect{e^{-tX}}} \end{eqnarray}\]
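
To make this concrete (an illustrative sketch; the optimizing \(t\) is found here by a crude grid search): for \(X \sim N(0,1)\) the moment generating function is \(\Expect{e^{tX}} = e^{t^2/2}\), so the upper-tail Chernoff bound becomes \(\min_{t \geq 0}{e^{-t\epsilon + t^2/2}} = e^{-\epsilon^2/2}\), attained at \(t = \epsilon\).

```python
# Chernoff bound for the upper tail of X ~ N(0, 1):
# P(X >= eps) <= min_{t >= 0} exp(-t*eps) * E[exp(t*X)], with E[exp(t*X)] = exp(t^2/2).
# A crude grid search over t recovers the optimum t = eps and bound exp(-eps^2/2).
import numpy as np
from scipy.stats import norm

eps = 3.0
t_grid = np.linspace(0.0, 10.0, 10001)               # candidate values of t
bounds = np.exp(-t_grid * eps + t_grid**2 / 2.0)     # e^{-t eps} * MGF(t)
t_star = t_grid[np.argmin(bounds)]

print(f"optimal t ~ {t_star:.3f} (theory: t = eps = {eps})")
print(f"Chernoff bound: {bounds.min():.3e} (theory: exp(-eps^2/2) = {np.exp(-eps**2/2):.3e})")
print(f"exact tail P(X >= {eps}): {norm.sf(eps):.3e}")
```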

Moment generating functions (1)

\[\begin{eqnarray} e^u & = & \sum_{k=0}^{\infty}{\frac{u^k}{k!}}\\ \Expect{e^{tX}} & = & \sum_{k=0}^{\infty}{\frac{t^k}{k!}\Expect{X^k}} \end{eqnarray}\]
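
As a concrete example (a standard fact, worked in here rather than taken from the slide): if \(X \sim N(\mu, \sigma^2)\), then

\[ \Expect{e^{tX}} = \myexp{\mu t + \frac{\sigma^2 t^2}{2}} \]

so the moment generating function is finite, and the series above converges, for every \(t\).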

Moment generating functions (2)

Exponential bounds

Exponential bounds (2)

“Sub-Gaussian” distributions
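
For reference, the usual definition (the slide presumably states some version of this): a mean-zero random variable \(X\) is sub-Gaussian with variance proxy \(\sigma^2\) if

\[ \Expect{e^{tX}} \leq \myexp{\frac{\sigma^2 t^2}{2}} ~ \text{for all } t \in \mathbb{R} \]

in which case the Chernoff argument above gives \(\Prob{X \geq \epsilon} \leq \myexp{-\epsilon^2 / 2\sigma^2}\).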

Summing up

Backup: Lower bounds on deviation probabilities

Backup: An ungraded, character-building exercise

Suppose the \(Z_n\) are random variables with a common mean \(\mu\):

  1. Use Chebyshev to show that if \(\Var{Z_n} \rightarrow 0\), then \(\Prob{(Z_n -\mu)^2 \geq \epsilon} \rightarrow 0\) (no matter how small \(\epsilon\) is)
  2. Can you use Paley-Zygmund to get a lower bound on \(\Prob{(Z_n -\mu)^2 \geq \epsilon}\), just assuming \(\Var{Z_n}\) is finite? If not, what else do you need to assume about the distribution of the \(Z_n\)?
  3. Now assume that \(\Var{Z_n} \rightarrow \sigma^2 > 0\). Can you use Paley-Zygmund to show that \(\Prob{(Z_n - \mu)^2 \geq \epsilon} \not\rightarrow 0\)? If not, what more do you need to assume about the distribution of the \(Z_n\)?

Backup: Large deviations theory

Why there aren’t exponential Chebyshev inequalities

Backup: Central limit theorem and moment generating functions

Backup: Moment generating functions vs. cumulant generating functions

Backup: The implicit assumption on slide 14