Inference II — Ergodic Theory

36-467/667

15 October 2020 (Lecture 14)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]

In our last episode…

Agenda for today

Ergodic theory

Second-order stationary and not-too-correlated

Our first ergodic theorem

\[\begin{eqnarray} \overline{X}_n & \equiv & \frac{1}{n}\sum_{t=1}^{n}{X(t)}\\ \Expect{\left(\overline{X}_n - \mu\right)^2} & = & \left(\Expect{\overline{X}_n - \mu}\right)^2 + \Var{\overline{X}_n}\\ \Expect{\overline{X}_n} & = & \frac{1}{n}\sum_{t=1}^{n}{\Expect{X(t)}} = \mu\\ \Var{\overline{X}_n} & = & \frac{1}{n^2}\left(\sum_{t=1}^{n}{\Var{X(t)}} + 2\sum_{t=1}^{n-1}{\sum_{s=t+1}^{n}{\Cov{X(t), X(s)}}}\right)\\ & = & \frac{1}{n^2}\left(n \gamma(0) + \sum_{t=1}^{n}{\sum_{s\neq t}{\gamma(t-s)}}\right)\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\gamma(t-s)}}\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=1-t}^{n-t}{\gamma(h)}} \\ & \rightarrow & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=-\infty}^{\infty}{\gamma(h)}} = \frac{1}{n^2}\sum_{t=1}^{n}{\gamma(0)\tau} = \frac{\gamma(0)\tau}{n} \end{eqnarray}\]
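This calculation can be sanity-checked by simulation (a sketch not in the original slides; the parameter values are illustrative). For a stationary AR(1) with coefficient \(\phi\) and innovation variance \(\sigma^2\), we have \(\gamma(0) = \sigma^2/(1-\phi^2)\) and \(\tau = (1+\phi)/(1-\phi)\), so the empirical variance of \(\overline{X}_n\) across many replicates should match \(\gamma(0)\tau/n\):

```python
import numpy as np

rng = np.random.default_rng(42)
phi, sigma, n, reps = 0.5, 1.0, 1000, 2000   # illustrative values

gamma0 = sigma**2 / (1 - phi**2)   # gamma(0) for a stationary AR(1)
tau = (1 + phi) / (1 - phi)        # correlation time: sum over h of rho(h) = phi^|h|

# Simulate `reps` independent stationary AR(1) paths of length n
x = np.empty((reps, n))
x[:, 0] = rng.normal(0.0, np.sqrt(gamma0), reps)   # start in stationarity
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + rng.normal(0.0, sigma, (reps,))

empirical = x.mean(axis=1).var()   # observed variance of the sample mean
predicted = gamma0 * tau / n       # the slide's asymptotic formula
print(empirical, predicted)
```

With \(\phi = 0.5\) the predicted variance is \(\sigma^2/((1-\phi)^2 n) = 0.004\), and the empirical value should agree up to Monte Carlo error.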

Our first ergodic theorem

If \(\tau < \infty\), then

\[\begin{eqnarray} \Expect{\left(\overline{X}_n - \mu\right)^2} & = & 0 + \Var{\overline{X}_n} \approx \frac{\gamma(0)\tau}{n} \rightarrow 0 \end{eqnarray}\]

Equivalently: if \(\tau < \infty\), then

\[\begin{eqnarray} \overline{X}_n & \rightarrow & \mu \end{eqnarray}\]

in mean square, and hence in probability.

How sensible is \(\tau < \infty\)?

Not every process has \(\tau < \infty\):

Effective sample size
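Since \(\Var{\overline{X}_n} \approx \gamma(0)\tau/n\), the \(n\) correlated observations carry about as much information about the mean as \(n_{\mathrm{eff}} = n/\tau\) independent ones would. A plug-in estimate of \(n_{\mathrm{eff}}\) can be sketched as follows (not from the slides; the truncation rule is an illustrative choice, not the only one):

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """Crude plug-in n_eff = n / tau_hat, with tau_hat = 1 plus twice the
    sum of sample autocorrelations up to max_lag (default: sqrt(n))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = int(np.sqrt(n))
    xc = x - x.mean()
    gamma0 = np.dot(xc, xc) / n
    rho = [np.dot(xc[:-h], xc[h:]) / (n * gamma0) for h in range(1, max_lag + 1)]
    tau_hat = 1.0 + 2.0 * sum(rho)
    return n / max(tau_hat, 1.0)   # clamp so we never report more than n

# Demo: AR(1) with phi = 0.5 has tau = (1+phi)/(1-phi) = 3, so 10,000
# correlated observations are "worth" roughly 3,300 IID ones.
rng = np.random.default_rng(0)
n = 10_000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
ess_ar = effective_sample_size(x)
ess_iid = effective_sample_size(rng.normal(size=n))   # should be close to n
print(round(ess_ar), round(ess_iid))
```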

Generalizing: non-stationary case

Application: Stationary AR(1)
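For concreteness (a hedged reconstruction in the slide's notation, assuming the standard stationary AR(1) model \(X(t) = \phi X(t-1) + \epsilon(t)\) with \(|\phi| < 1\) and innovation variance \(\sigma^2\)):

\[\begin{eqnarray} \gamma(0) & = & \frac{\sigma^2}{1-\phi^2}, \quad \rho(h) = \phi^{|h|}\\ \tau & = & \sum_{h=-\infty}^{\infty}{\phi^{|h|}} = 1 + \frac{2\phi}{1-\phi} = \frac{1+\phi}{1-\phi}\\ \Var{\overline{X}_n} & \approx & \frac{\gamma(0)\tau}{n} = \frac{\sigma^2}{(1-\phi)^2 n} \end{eqnarray}\]

So \(\tau < \infty\) for every \(|\phi| < 1\), but \(\tau\) blows up as \(\phi \rightarrow 1\).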

Application: Not-necessarily-stationary AR(1)

Application: AR(1)

Looking beyond the simplest ergodic theorem

Convergence of the log-likelihood

Convergence of the log-likelihood (II)
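The gist of these two slides can be compressed into one display (a hedged sketch, assuming a stationary Markov process so that the likelihood factors over one-step transitions):

\[\begin{eqnarray} \frac{1}{n}\log{p(x_1, \ldots, x_n)} & = & \frac{\log{p(x_1)}}{n} + \frac{1}{n}\sum_{t=2}^{n}{\log{p(x_t \mid x_{t-1})}} \rightarrow \Expect{\log{p(X(2) \mid X(1))}} \end{eqnarray}\]

The summands \(\log{p(x_t \mid x_{t-1})}\) form a stationary sequence, so an ergodic theorem applies to their time average, while the initial term vanishes as \(O(1/n)\).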

Central limit theorems and weak dependence
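To illustrate the flavor of such CLTs (a simulation sketch not in the slides, using illustrative AR(1) parameters): under weak dependence, \(\sqrt{n}\left(\overline{X}_n - \mu\right)\) should be approximately \(N(0, \gamma(0)\tau)\), so the standardized sample means should have unit standard deviation and roughly 95% of them should fall within \(\pm 1.96\):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n, reps = 0.6, 2000, 4000           # illustrative values; mu = 0
gamma0 = 1.0 / (1 - phi**2)              # gamma(0), unit innovation variance
tau = (1 + phi) / (1 - phi)              # correlation time

# Simulate `reps` independent stationary AR(1) paths of length n
x = np.empty((reps, n))
x[:, 0] = rng.normal(0.0, np.sqrt(gamma0), reps)
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + rng.normal(0.0, 1.0, (reps,))

# Standardize the sample means by the asymptotic sd sqrt(gamma(0) tau / n)
z = x.mean(axis=1) / np.sqrt(gamma0 * tau / n)
coverage = np.mean(np.abs(z) < 1.96)     # should be near 0.95
print(z.std(), coverage)
```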

Summary

Backup: Boltzmann

(Photo credit: Tom Schneider, downloaded 2008 from an apparently-defunct website)

Backup: “Ergodic”, “Ergodicity”

Backup: More on ergodic theory

Backup: Weak dependence and central limit theorems

References

Batchelor, G. K. 1996. The Life and Legacy of G. I. Taylor. Cambridge, England: Cambridge University Press.

Boltzmann, Ludwig. 1964. Lectures on Gas Theory. Berkeley: University of California Press.

Castiglione, Patrizia, Massimo Falcioni, Annick Lesne, and Angelo Vulpiani. 2008. Chaos and Coarse Graining in Statistical Mechanics. Cambridge, England: Cambridge University Press.

Cover, Thomas M., and Joy A. Thomas. 2006. Elements of Information Theory. 2nd ed. New York: John Wiley.

Frisch, Uriel. 1995. Turbulence: The Legacy of A. N. Kolmogorov. Cambridge, England: Cambridge University Press.

Gray, Robert M. 1990. Entropy and Information Theory. New York: Springer-Verlag. http://ee.stanford.edu/~gray/it.html.

———. 2009. Probability, Random Processes, and Ergodic Properties. 2nd ed. New York: Springer-Verlag. http://ee.stanford.edu/~gray/arp.html.

Grimmett, G. R., and D. R. Stirzaker. 1992. Probability and Random Processes. 2nd ed. Oxford: Oxford University Press.

Lebowitz, Joel L. 1999. “Statistical Mechanics: A Selective Review of Two Central Issues.” Reviews of Modern Physics 71:S346–S357. http://arxiv.org/abs/math-ph/0010018.

Mackey, Michael C. 1992. Time’s Arrow: The Origins of Thermodynamic Behavior. Berlin: Springer-Verlag.

Plato, Jan von. 1994. Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective. Cambridge, England: Cambridge University Press.

Ruelle, David. 1991. Chance and Chaos. Princeton, New Jersey: Princeton University Press.

Taylor, G. I. 1922. “Diffusion by Continuous Movements.” Proceedings of the London Mathematical Society, 2nd ser., 20:196–212. https://doi.org/10.1112/plms/s2-20.1.196.

Yu, Bin. 1994. “Rates of Convergence for Empirical Processes of Stationary Mixing Sequences.” Annals of Probability 22:94–116. https://doi.org/10.1214/aop/1176988849.