In our last episode

Autoregressive models of order \(p\) (AR\((p)\)) for time univariate time series:
- \(X(t) = a+\left(\sum_{j=1}^{p}{b_j X(t-j)}\right) + \epsilon(t)\)
- Innovations \(\epsilon\) uncorrelated with each other and with past of \(X\)
- Generating a time series = step through that rule over and over
- Predictions = plug in to the deterministic part
Vector autoregressions of order \(p\) (VAR\((p)\)) for multivariate time series:
- \(\vec{X}(t) = \vec{a} + \left(\sum_{j=1}^{p}{\mathbf{b}_j \vec{X}(t-j)}\right) + \vec{\epsilon}(t)\)
- All vectors have dimension \(k\), matrices dimension \(k\times k\)
In general:
- Can eliminate higher-order dependence in favor of first-order dependence with more variables (to keep track of memory)
- In a first-order linear system, the eigenvalues and eigenvectors tell us everything

Today

What to do with purely spatial data?
What to do with spatio-temporal data?
A little about estimating, emphasizing AR(1)

In space, no one can decide which way to go

For time, it makes sense to generate the future from the past
Space isn’t directed the same way time is (Reichenbach 1956)
This makes purely spatial generative models much trickier, and much harder to interpret

Conditional autoregressive (CAR) spatial models

Data \(X(r_1), X(r_2), \ldots X(r_n)\)
Assume \(r_i\) are on a regular grid
Write \(\Neighbors(r)\) for the locations next to \(r\)
CAR(1): \[ X(r) | \{X(q): q \in \Neighbors(r)\} = \sum_{j\in\Neighbors}{b_j X(r+j)} + \epsilon(r) \]
- All neighborhoods have the same shape
- Each neighbor gets to have its own coefficient
- CAR\((p)\) is a sum of neighbors, plus neighbors’ neighbors, etc.
- Typically assume \(\epsilon\) is IID
- Easy to include a global mean

The Gaussian CAR model

Assume \(\epsilon(r)\) is IID Gaussian, variance \(\tau^2\)
Assemble a matrix \(\mathbf{b}\) where \(b_{ij} =\) coefficient of \(X(r_j)\) in the model for \(X(r_i)\)
Then \(X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1})\)
Conditional law: \[ X(r_i)|\text{all other} ~ X(r_k) \sim \mathcal{N}(\mu(r_i), \tau^2) \] with \[ \mu(r_i) = \sum_{k=1}^{n}{b_{ik} X(r_k)} \]
- Intuition: we’ve built in conditional independence
- Rigor: general properties of Gaussians; see back-up

Some points about CAR models

Generally, you can write down any conditional regression model you like, and this is internally consistent or self-consistent
- e.g., non-linear but with Gaussian noise
- e.g., logistic regression on neighbors for binary-valued \(X\)
- e.g., non-linear and non-Gaussian, etc., etc.
Conditional prediction is straightforward from the specification
How do we generate a new field?
- There’s a global solution for the linear-Gaussian case…
- … but finding a global solution otherwise is really hard
- The model only specifies conditional distributions

The “Gibbs sampler” trick

Start with some value for \(X(r_1), X(r_2), \ldots X(r_n)\)
Calculate the distribution of \(X(r_1) | X(\Neighbors(r_1))\)
Draw a random value \(X^*\) from that distribution
Update: \(X(r_1) \leftarrow X^*\)
Now go on to \(X(r_2)\) and condition on its neighbors:
- Calculate the distribution of \(X(r_2) | X(\Neighbors(r_2))\)
- Draw a random value \(X^*\) from that distribution
- Update: \(X(r_2) \leftarrow X^*\)
Iterate through to \(X(r_n)\), then go back to \(X(r_1)\)
Stop after a few passes through all the sites
- Optional variants: randomize the order in which you visit sites, etc.
We’ve made a Markov chain, and this converges when the Markov chain has a unique, attracting invariant distribution
- That jargon will make more sense when we’ve done Markov chains in a few weeks

Simultaneous autoregressions (SAR):

A non-conditional specification: \[ X(r_i) = \sum_{j=1}^{n}{b_{ij} X(r_j)} + \epsilon(r) \]
Each \(X(r_i)\) appears on the left-hand side of one equation, and on the right-hand-side of all the other equations, and it needs to be the same value whether it’s on the left or the right
If \(\epsilon\) is IID Gaussian with variance \(\tau^2\), the joint distribution is Gaussian (see backup): \[ X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}) \]
This is not the same as the CAR model with the same \(\mathbf{b}\), which would be \(X \sim \mathcal{N}(0, \tau^2(\mathbf{I} - \mathbf{b})^{-1})\)

Simultaneous autoregressions (SAR):

SAR models are self-referential and weird
e.g., in one dimension with nearest-neighbor interactions, we have the simultaneous equations \[\begin{eqnarray} X(r-1) & = & b_1 X(r-2) + b_2 X(r) + \epsilon(r-1)\\ X(r) & = & b_1 X(r-1) + b_2 X(r+1) + \epsilon(r)\\ X(r+1) & = & b_1 X(r) + b_2 X(r+2) + \epsilon(r+1)\\ \end{eqnarray}\]
Economists are bizarrely fond of SARs
- Supposed to model equilibrium, where you take account of what your neighbors do, who are taking account of what you do, etc.
Generally, there is no joint distribution of the \(X(r_i)\) which satisfies the simultaneous equations
- Unless everything is linear and Gaussian
  - So good luck generalizing to nonlinear models, or discrete-valued observations, etc.
- Every (linear-Gaussian) SAR is equivalent to some (linear-Gaussian) CAR (see back-up)

Spatio-temporal

If you just want to predict \(X(r,t)\) from \(X(\Neighbors(r), t-1)\), you have a bunch of VARs
If you must include dependence of \(X(r,t)\) on \(X(q,t)\), then try to use a conditional model

Estimating AR(1)

Until we can figure out how to estimate a temporal auto-regression, we’re not ready to deal with a spatial or a spatio-temporal model
If we can estimate an AR(1), we can figure out how to estimate a VAR(1)
If we can estimate a VAR(1), we should be able to estimate an AR(p)
So focus on estimating an AR(1) \[ X(t) = a+b X(t-1) + \epsilon(t) \]
Given data \(X(0), X(1), X(2), \ldots X(t), \ldots X(n)\), how do we fit this?
- If we can find \(a, b\) we can get distribution of \(\epsilon\) from residuals
- If we can find \(b\) we can find \(a\) in lots of ways
\(\Rightarrow\) focus on estimating \(b\)

Estimating \(b\): Yule-Walker approach

Assume stationarity (so \(|b| < 1\))
Work out covariance function \(\gamma(h; b) = b^h \gamma(0)\)
- More algebra for an \(AR(p)\) or VAR but doable
Pick the \(b\) that best fits observed \(\hat{\gamma}(h)\)
Seems reasonable if stationary
and if \(\hat{\gamma}(h) \rightarrow \gamma_{\mathrm{true}}(h)\) as \(n\rightarrow \infty\); why should we believe that?

Estimating \(b\): Ordinary least squares approach

Assume \(a=0\) to simplify book-keeping
Regress \(X(1), X(2), \ldots X(n)\) on \(X(0), X(1), \ldots X(n-1)\)
- i.e., “shut up and calculate”
The usual math leads to \[\begin{eqnarray} \hat{b} & = & \frac{\sum_{t=0}^{n-1}{X(t)X(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & \frac{\sum_{t=0}^{n-1}{X(t)(b X(t) + \epsilon(t+1))}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & b + \frac{\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}} \end{eqnarray}\]
\(X(t)\) and \(\epsilon(t+1)\) are uncorrelated so maybe the non-\(b\) term goes to zero as \(n\) grows?

Estimating \(b\): Maximum likelihood approach

Assume a pdf \(f\) for \(\epsilon(t)\) with parameter \(\theta\)
- Gaussian is traditional but not mandatory
Assume \(\epsilon(t)\) are IID, not just uncorrelated
Conditional likelihood, given \(X(0)\), is \[ \prod_{t=1}^{n}{f(X(t) - a - bX(t-1); \theta)} \]
Normalized conditional log-likelihood, given \(X(0)\), is \[ L_n(a,b,\theta) = \frac{1}{n}\sum_{t=1}^{n}{\log{f(X(t) - a - bX(t-1); \theta)}} \]
Maximize this over \(a, b, \theta\)
Might work if \(L_n(a,b,\theta)\) converges to a well-behaved limit

Estimating AR(1)

We will see later what we need to assume to get these approaches to work
Stationarity will be sufficient, but OLS and MLE sometimes works for non-stationary processes

Summary

Spatial autoregressions come in two forms:
- Conditional autoregressions are interpretable
- Simultaneous autoregressions are bizarre
Spatio-temporal autoregressions come in two forms:
- Dependence on the past alone is just a VAR
- Dependence on past and present is a (basically) a CAR
Still need to see how to really estimate everything
- When do sample averages converge on expectations?
- When will log-likelihoods tend to good limits?

Backup: Converting an SAR to a CAR

Assume a linear, Gaussian SAR, with matrix \(\mathbf{b}\)
The global distribution is \(X \sim \mathcal{N}(0, (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1})\)
The global distribution of a CAR with matrix \(\mathbf{\beta}\) is \(\mathcal{N}(0, (\mathbf{I} - \mathbf{\beta})^{-1})\) \[\begin{eqnarray} (\mathbf{I}-\mathbf{\beta})^{-1} & = & (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}\\ \mathbf{I}-\mathbf{\beta} & = & (\mathbf{I}-\mathbf{b})(\mathbf{I}-\mathbf{b})\\ \mathbf{I}-\mathbf{\beta} & = & \mathbf{I} - 2\mathbf{b} + \mathbf{b}^2\\ \mathbf{\beta} & = & 2\mathbf{b} + \mathbf{b}^2 \end{eqnarray}\]
So a nearest-neighbors SAR corresponds to a CAR with next-nearest-neighbors, etc.
Query: Can you always find an SAR matrix to match a CAR?

After discussion at the end of class

Q: When will a CAR with coefficient matrix \(\mathbf{b}\) have the same distribution as a SAR with coefficient matrix \(\mathbf{b}\)?

A: Use the last equation above but require \(\mathbf{\beta} = \mathbf{b}\): \[\begin{eqnarray} \mathbf{b} & = & 2 \mathbf{b} + \mathbf{b}^2\\ \mathbf{0} & = & \mathbf{b} + \mathbf{b}^2\\ \mathbf{0} & = & \mathbf{b}(\mathbf{I} + \mathbf{b}) \end{eqnarray}\] so \(\mathbf{0} = 0\) or \(\mathbf{b} = - \mathbf{I}\)

The first case, \(\mathbf{b}=0\), is where there’s actually no dependence between neighbors, so it’s actually reassuring that the SAR and CAR have the same distribution then
The second case, \(\mathbf{b} = -\mathbf{I}\), doesn’t make any sense in either model

Backup: Intuition for \((\mathbf{I} - \mathbf{b})^{-1}\)

Remember the geometric series for scalars: if \(|b| < 1\), then \(\frac{1}{1-b} = 1 + b + b^2 + b^3 + \ldots = \sum_{k=0}^{\infty}{b^k}\)
Pretend that this works for matrices: \[ (\mathbf{I} - \mathbf{b})^{-1} = \mathbf{I} + \mathbf{b} + \mathbf{b}^2 + \ldots \]
\(\Var{X} = \tau^2(\mathbf{I} - \mathbf{b})^{-1}\), so \[\begin{eqnarray} \Var{X} & = & \tau^2 \mathbf{I} + \tau^2 \mathbf{b} + \tau^2 \mathbf{b}^2 + \ldots\\ & = & \Var{\epsilon} + \text{(propagation of innovations to neighbors)} + \text{(propagation of innovations to neighbors' neighbors)} + \ldots \end{eqnarray}\]

Backup: Deriving the global distribution of a linear-Gaussian CAR

(after Guttorp (1995, 206–7))

Abbreviate \(X(r_i)\) as \(X_i\); “all other \(X\)’s” as \(X_{-i}\); and “all other \(X\)’s, with 0s from position \(j\) on” as \(X_{-i;j}\)

\[\begin{eqnarray} \log{\frac{p(\mathbf{X}=\mathbf{x})}{p(\mathbf{X}=\mathbf{0})}} & = & \sum_{i=1}^{n}{\log{p(X_i=x_i|X_{-i}=x_{-i;i+1})} - \log{p(X_i=0|X_{-i}=x_{-i;i+1})}} ~ \text{(Brooks representation, see backup)}\\ & = & -\frac{1}{2\tau^2}\left(\sum_{i=1}^{n}{\left(x_i - \sum_{j=1}^{i-1}{b_{ij} x_j}\right)^2 - \left(0 - \sum_{j=1}^{i-1}{b_{ij} x_j}\right)^2}\right)\\ & = & -\frac{1}{2\tau^2}\left(\sum_{i=1}^{n}{x_i^2 - 2 x_i \sum_{j=1}^{i-1}{b_{ij} x_j}}\right)\\ & = & \frac{1}{2\tau^2}\left(\sum_{i=1}^{n}{x_i^2 - x_i (\mathbf{b} \mathbf{x})_{i}}\right) ~ \text{(symmetry of}~\mathbf{b})\\ & = & \frac{1}{2\tau^2}\mathbf{x}^T(\mathbf{I}-\mathbf{b})\mathbf{x}\\ \end{eqnarray}\]

This is the form of a multivariate Gaussian, with variance-covariance matrix \(\tau^2(\mathbf{I}-\mathbf{b})^{-1}\)

Backup: Deriving the global distribution of a linear-Gaussian SAR

(after Guttorp (1995, 219–21, Exercise 4.1))

Again, abbreviate \(X(r_i)\) as \(X_i\), etc.
For each \(i\), \[ X_i = \sum_{j=1}^{n}{b_{ij} X_j} + \epsilon_i \]
System of simultaneous equations \(\Leftrightarrow\) one matrix equation: \[ \mathbf{X} = \mathbf{b}\mathbf{X} + \mathbf{\epsilon} \]
Solve: \[ \mathbf{X} = (\mathbf{I}-\mathbf{b})^{-1} \mathbf{\epsilon} \]
By assumption, \(\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\)
So by general properties of multivariate Gaussians, \[ \mathbf{X} \sim \mathcal{N}\left(\mathbf{0}, \tau^2 (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}\right) \] using the symmetry of \(\mathbf{b}\)

Backup: The “Brooks representation” result

After Guttorp (1995, 7, Proposition 1.1)

\(\mathbf{X}\) is an \(n\)-dimensional random vector. Assume \(p(\mathbf{x}) > 0\) for all \(\mathbf{x}\). Then \[ \frac{p(\mathbf{x})}{p(\mathbf{y})} = \prod_{i=1}^{n}{\frac{p(x_i|x_{1:i-1}, y_{i+1:n})}{p(y_i|x_{1:i-1}, y_{i+1:n})}} \] Since \(\sum_{\mathbf{x}}{p(\mathbf{x})}=1\), this implies that the conditional distributions uniquely fix the whole distribution

Backup: Proof of the Brooks representation result

Again, after Guttorp (1995, 7)

\[\begin{eqnarray} p(\mathbf{x}) & = & p(x_n|x_{1:n-1}) p(x_{1:n-1})\\ & = & p(x_n|x_{1:n-1})p(x_{1:n-1})\frac{p(y_n|x_{1:n-1})}{p(y_n|x_{1:n-1})}\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})}p(x_{1:n-1}, y_n)\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} p(x_{n-1}| x_{1:n-2}, y_n) p(x_{1:n-2}, y_n)\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} \frac{p(x_{n-1}| x_{1:n-2}, y_n)}{p(y_{n-1}| x_{1:n-2}, y_n)}p(x_{1:n-2}, y_{n-1:n})\\ & \vdots &\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} \frac{p(x_{n-1}| x_{1:n-2}, y_n)}{p(y_{n-1}| x_{1:n-2}, y_n)} \ldots \frac{p(x_1|y_{2:n})}{p(y_1|y_{2:n})} \end{eqnarray}\]

Backup: The Gibbs Sampler…

… was not invented by Gibbs, but by Geman and Geman (1984)
- As they said (p. 722), it’s very similar to older ideas in computational physics, going back to Metropolis et al. (1953)
J. W. Gibbs was an American physicist who pioneered “statistical mechanics” — building probability models for physical systems
Gibbs’s models were based on the idea that the probability of the system being in state \(x\) would be \(\propto e^{- k U(x)/T}\), where \(U(x)\) is the energy of the system in state \(x\), \(T\) is temperature, and \(k\) is a universal constant
- Normalizing, \(\mathrm{Pr}(X=x) = \frac{e^{-k U(x)/T}}{\sum_{x^{\prime}}{e^{-k U(x^{\prime})/T}}} = Z^{-1}(T) e^{-k U(x)/T}\)
- Computing the partition function \(Z(T) = \sum_{x}{e^{-k U(x)/T}}\) is generally hard, because there are a very large number of states
  - Physicists have spent whole careers getting short-cut formulas for \(Z(T)\) for particular models
Distributions like this came to be called “Gibbs distributions” or “Gibbs measures”
- Statisticians would call them exponential families (Mandelbrot 1962)
In the 1970s, people realized that every Gibbs distribution corresponds to a Markov random field, and vice versa, every Markov random field corresponds to a Gibbs distribution (Griffeath 1976)
- We’ll cover Markov random fields later in the course; or read ahead in Guttorp (1995)
Geman and Geman (1984) were making statistical models of images, and modeled them as Gibbs distributions, where the “energy” favored things like “color changes slowly from one pixel to the next”, but then they wanted / needed to generate images from their model
- Directly sampling from \(\mathrm{Pr}(X=x)\) would require calculating \(Z(T)\), which was intractable even for their \(64 \times 64\) images; with 5 level grey-scale, that’s \(5^{64 \times 64} = 9.6 \times 10^{2862}\) states!
Geman and Geman used the correspondence between Gibbs distributions and Markov random fields to get the conditional distribution of each pixel given the neighboring pixels
- Each conditional distribution is easy to simulate from
- Sweeping through all the conditional distributions a few times gives something close to the joint distribution

References

Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721–41. https://doi.org/10.1109/TPAMI.1984.4767596.

Griffeath, David. 1976. “Introduction to Markov Random Fields.” In Denumerable Markov Chains, edited by John G. Kemeny, J. Laurie Snell, and Anthony W. Knapp, Second, 425–57. Berlin: Springer-Verlag.

Guttorp, Peter. 1995. Stochastic Modeling of Scientific Data. London: Chapman; Hall.

Mandelbrot, Benoit. 1962. “The Role of Sufficiency and of Estimation in Thermodynamics.” Annals of Mathematical Statistics 33:1021–38. http://projecteuclid.org/euclid.aoms/1177704470.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equations of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21:1087–92. https://doi.org/10.1063/1.1699114.

Reichenbach, Hans. 1956. The Direction of Time. Berkeley: University of California Press.

Linear Generative Models for Spatial and Spatio-Temporal Data

In our last episode

Today

In space, no one can decide which way to go

Conditional autoregressive (CAR) spatial models

The Gaussian CAR model

Some points about CAR models

The “Gibbs sampler” trick

Simultaneous autoregressions (SAR):

Simultaneous autoregressions (SAR):

Spatio-temporal

Estimating AR(1)

Estimating \(b\): Yule-Walker approach

Estimating \(b\): Ordinary least squares approach

Estimating \(b\): Maximum likelihood approach

Estimating AR(1)

Summary

Backup: Converting an SAR to a CAR

After discussion at the end of class

Backup: Intuition for \((\mathbf{I} - \mathbf{b})^{-1}\)

Backup: Deriving the global distribution of a linear-Gaussian CAR

Backup: Deriving the global distribution of a linear-Gaussian SAR

Backup: The “Brooks representation” result

Backup: Proof of the Brooks representation result

Backup: The Gibbs Sampler…

References