\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\Y}{\mathbf{Y}} \newcommand{\NoiseVar}{\mathbf{\Sigma}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\w}{\mathbf{w}} \]

Agenda

Sometimes you really just want to run a regression of \(Y\) on \(X\)
You know how to do this when the data are IID pairs \((X_i, Y_i)\)
Dependence makes things more complicated

The basic linear model

\[ Y_i= X_i \cdot \beta + \epsilon_i \]

\(Y\) is the regressand (“thing to be regressed”)
\(X\) is the vector of regressors (“things doing the regressing”)
- The baby-stats vocabulary of “dependent variable” and “independent variables” is too awkward when we’re talking about dependence in the \(X\)s and/or \(\epsilon\)s
Assuming the \((X_i, Y_i)\) pairs are IID implies \(\Var{\mathbf{\epsilon}} = \sigma^2 \mathbf{I}\) for some \(\sigma^2 > 0\)
Bundle the data into an \(n\times 1\) matrix \(\y\) and an \(n\times p\) matrix \(\x\)
Mean squared error (MSE) is \(n^{-1}(\y - \x \beta)^T (\y - \x\beta)\)
Ordinary least squares (OLS) estimate is \[ \hat{\beta} = \argmin_{b}{n^{-1}(\y - \x b)^T (\y - \x b)} \]
Explicitly, this is \[ \hat{\beta} = \left(\frac{1}{n} \x^T \x\right)^{-1} \frac{1}{n} \x^T \y \]
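The closed form above can be checked numerically. This is a minimal sketch in Python with NumPy, assuming simulated IID data; the function name `ols`, the coefficient values, and the noise level are all made up for illustration.

```python
import numpy as np

def ols(x, y):
    """OLS estimate: solve (x^T x / n) b = (x^T y / n) for b."""
    n = x.shape[0]
    # Solving the normal equations avoids forming the inverse explicitly.
    return np.linalg.solve(x.T @ x / n, x.T @ y / n)

rng = np.random.default_rng(0)
n, p = 500, 3
x = rng.normal(size=(n, p))                    # n x p matrix of regressors
beta = np.array([1.0, -2.0, 0.5])              # hypothetical true coefficients
y = x @ beta + rng.normal(scale=0.3, size=n)   # IID noise
beta_hat = ols(x, y)
print(beta_hat)                                # should land close to beta
```

The factors of \(1/n\) cancel, so this matches what a standard least-squares routine returns.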

Adding Correlations to the Noise

Keep the linear model but say that \(\Var{\mathbf{\epsilon}} = \NoiseVar \neq \sigma^2 \mathbf{I}\)
- Still assuming \(\Expect{\mathbf{\epsilon}} = 0\)
The OLS estimate is still \[ \hat{\beta} = (\frac{1}{n} \x^T \x)^{-1} \frac{1}{n} \x^T \y \]
OLS is still unbiased: \[\begin{eqnarray} \Expect{\hat{\beta}|\x} & = & \Expect{(\x^T \x)^{-1} \x^T \Y|\x}\\ & = & (\x^T \x)^{-1} \x^T \Expect{\Y|\x}\\ & = & (\x^T \x)^{-1} \x^T \x \beta = \beta\\ \Expect{\hat{\beta}} & = & \Expect{\Expect{\hat{\beta}|\x}}\\ & = & \Expect{\beta} = \beta \end{eqnarray}\]
Variance is different: \[\begin{eqnarray} \Var{\hat{\beta}|\x} & = & \Var{(\x^T \x)^{-1} \x^T \Y|\x}\\ & = & (\x^T \x)^{-1} \x^T \Var{\Y|\x} \x (\x^T \x)^{-1}\\ & = & (\x^T \x)^{-1} \x^T \Var{\mathbf{\epsilon}|\x} \x (\x^T \x)^{-1} \end{eqnarray}\]
This is \(\neq \sigma^2 (\x^T \x)^{-1}\), the variance we’d get in the IID case
- Usually bigger (both in its diagonal entries and in the positive-definite ordering of matrices), especially if the autocorrelations of \(\epsilon\) are positive
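To get a feel for the size of the gap, here is a small sketch comparing the sandwich formula above with the IID formula. Everything specific is an assumption for illustration: AR(1) noise with \(\Var{\mathbf{\epsilon}}_{ij} = \rho^{|i-j|}\), \(\rho = 0.8\), and a design with an intercept and a centered time trend.

```python
import numpy as np

n, rho = 100, 0.8                       # hypothetical sample size and AR(1) coefficient
t = np.arange(n)
x = np.column_stack([np.ones(n), t - t.mean()])   # intercept + centered time trend

# Assumed noise covariance with unit marginal variance: Sigma_ij = rho^|i-j|
Sigma = rho ** np.abs(np.subtract.outer(t, t))

xtx_inv = np.linalg.inv(x.T @ x)
naive = 1.0 * xtx_inv                            # sigma^2 (x^T x)^{-1} with sigma^2 = 1
sandwich = xtx_inv @ x.T @ Sigma @ x @ xtx_inv   # Var[beta_hat | x] under Sigma

# Ratio of true to naive variance for each coefficient
ratio = np.diag(sandwich) / np.diag(naive)
print(ratio)
```

With positively autocorrelated noise and a smooth regressor, both ratios come out above 1, so standard errors computed from the IID formula would be too small.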

Generalized/weighted least squares

Introduce a symmetric, positive-definite weighting matrix \(\w\) \[ WSE(b) \equiv (\y - \x b)^T \w (\y-\x b) \]
- So ordinary MSE is choosing \(\w = \mathbf{I}/n\)
The WLS estimate is \[ \tilde{\beta} = (\x^T\w\x)^{-1}\x^T\w\y \]
This is also going to be unbiased, regardless of \(\w\)
- Parallel argument to previous slide
The variance: \[ \Var{\tilde{\beta}|\x} = (\x^T \w \x)^{-1} \x^T \w \Var{\epsilon} \w \x (\x^T \w \x)^{-1} \]
- Again, parallel argument to previous slide
This is minimized (in the positive-definite ordering) when \(\w = (\Var{\epsilon})^{-1}\)
- This is the Gauss-Markov theorem, in its generalized form (due to Aitken)
- This is also the maximum likelihood estimate if the noise has a Gaussian distribution
Some people call this WLS only if \(\w\) is diagonal, and generalized least squares (GLS) otherwise; some people call it all WLS
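A sketch of the estimator and its variance formula, assuming \(\Var{\epsilon}\) is known exactly. The function names `gls` and `gls_variance`, and the particular covariance matrix (AR(1)-style correlations with unequal variances), are illustrative choices, not part of the slides.

```python
import numpy as np

def gls(x, y, w):
    """Weighted/generalized least squares: (x^T w x)^{-1} x^T w y."""
    xtw = x.T @ w
    return np.linalg.solve(xtw @ x, xtw @ y)

def gls_variance(x, w, Sigma):
    """(x^T w x)^{-1} x^T w Sigma w x (x^T w x)^{-1}."""
    bread = np.linalg.inv(x.T @ w @ x)
    return bread @ x.T @ w @ Sigma @ w @ x @ bread

rng = np.random.default_rng(1)
n = 60
idx = np.arange(n)
x = np.column_stack([np.ones(n), rng.normal(size=n)])

# Hypothetical noise covariance: correlated and heteroskedastic
sd = np.linspace(0.5, 2.0, n)
Sigma = np.outer(sd, sd) * 0.6 ** np.abs(np.subtract.outer(idx, idx))

# Draw noise with covariance Sigma via its Cholesky factor
beta = np.array([1.0, 2.0])                      # hypothetical true coefficients
y = x @ beta + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

beta_ols = gls(x, y, np.eye(n))                  # w proportional to I gives OLS
beta_gls = gls(x, y, np.linalg.inv(Sigma))       # optimal weighting

var_ols = gls_variance(x, np.eye(n), Sigma)
var_gls = gls_variance(x, np.linalg.inv(Sigma), Sigma)
gap = var_ols - var_gls                          # should be positive semi-definite
print(np.linalg.eigvalsh((gap + gap.T) / 2))
```

With \(\w = (\Var{\epsilon})^{-1}\) the sandwich collapses to \((\x^T \w \x)^{-1}\), and the eigenvalues of the gap are non-negative up to rounding.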

So how do we estimate \(\Var{\epsilon}\)?

Two important cases:

\(\Var{\epsilon}\) is diagonal, but not \(\sigma^2 \mathbf{I}\) (heteroskedasticity)
- or heteroscedasticity
\(\Var{\epsilon}\) has off-diagonal entries (correlated noise)
- possibly heteroskedastic as well

Estimating \(\Var{\epsilon}\): heteroskedastic but not autocorrelated

\(\Expect{\epsilon_i^2} = \Var{\mathbf{\epsilon}}_{ii}\)
Iterate:
1. Start with \(\w = \mathbf{I}\)
2. Estimate \(\hat{\beta}\) using current \(\w\), get residuals \(r_i = y_i - x_i \hat{\beta}\)
3. Use squared residuals, \(r_i^2\), as diagonal entries in \(\w^{-1}\)
4. Go to (2) until converged
Optionally: smooth squared residuals \(r_i^2\) against the regressor variables, and use fitted values from that smoothing in \(\w^{-1}\)
- More stability if conditional variance function is indeed smooth
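A sketch of the iteration, using the smoothed variant. The smoother here is an assumption made for brevity: a quadratic polynomial fit to \(\log r_i^2\) against a single regressor `z`. Any reasonable smoother could take its place. Fitting on the log scale shifts the variance estimates by a constant factor, which does not change the WLS estimate.

```python
import numpy as np

def wls_diag(x, y, weights):
    """WLS with diagonal w, passed as a vector of weights."""
    xw = x * weights[:, None]                    # same as w @ x for diagonal w
    return np.linalg.solve(xw.T @ x, xw.T @ y)

def iterative_wls(x, y, z, max_iter=50, tol=1e-8, degree=2):
    """z: the regressor against which squared residuals are smoothed."""
    weights = np.ones(len(y))                    # start with w = I
    beta_hat = wls_diag(x, y, weights)
    for _ in range(max_iter):
        r2 = (y - x @ beta_hat) ** 2             # squared residuals
        # Assumed smoother: polynomial fit to log r_i^2 (small offset avoids log 0)
        coefs = np.polyfit(z, np.log(r2 + 1e-12), degree)
        var_hat = np.exp(np.polyval(coefs, z))   # diagonal entries of w^{-1}
        weights = 1.0 / var_hat
        beta_new = wls_diag(x, y, weights)
        done = np.max(np.abs(beta_new - beta_hat)) < tol
        beta_hat = beta_new
        if done:
            break
    return beta_hat, var_hat

rng = np.random.default_rng(2)
n = 2000
z = rng.uniform(0.0, 3.0, size=n)
x = np.column_stack([np.ones(n), z])
beta = np.array([1.0, 2.0])                      # hypothetical true coefficients
true_sd = 0.2 + z                                # noise SD grows with z
y = x @ beta + true_sd * rng.normal(size=n)
beta_hat, var_hat = iterative_wls(x, y, z)
print(beta_hat)
```

Using raw \(r_i^2\) instead of the smoothed values is the same loop with `var_hat = r2`, but a residual near zero then gets an enormous weight.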

Estimating \(\Var{\epsilon}\): autocorrelated

Again, \(\Expect{\epsilon_i \epsilon_j} = \Var{\mathbf{\epsilon}}_{ij}\)
Start with an OLS regression
Fit a correlation function to the residuals we get with \(\hat{\beta}\)
- All the tricks for estimating correlation functions we looked at in time-series smoothing and kriging
Build an estimate of \(\Var{\mathbf{\epsilon}}\) using the correlation function
Re-estimate \(\beta\), get new residuals, etc., until convergence
We can still improve our estimate even if the shape of the correlation function isn’t exactly right
- “Working covariance model” or “working correlation function”
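A sketch of this loop for a time series, assuming an AR(1) working correlation function \(\rho^{|i-j|}\) with \(\rho\) estimated by the lag-1 sample autocorrelation of the residuals. The function names and the simulated data are illustrative. The overall noise variance is left at 1, since rescaling \(\w\) does not change the estimate.

```python
import numpy as np

def simulate_ar1(n, rho, rng):
    """One stationary AR(1) series with unit-variance innovations."""
    innov = rng.normal(size=n)
    series = np.empty(n)
    series[0] = innov[0] / np.sqrt(1.0 - rho ** 2)
    for t in range(1, n):
        series[t] = rho * series[t - 1] + innov[t]
    return series

def ar1_corr(n, rho):
    """Working correlation matrix with entries rho^|i-j|."""
    idx = np.arange(n)
    return rho ** np.abs(np.subtract.outer(idx, idx))

def feasible_gls_ar1(x, y, max_iter=20, tol=1e-8):
    n = len(y)
    beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]    # start with OLS
    rho_hat = 0.0
    for _ in range(max_iter):
        r = y - x @ beta_hat
        rho_new = (r[1:] @ r[:-1]) / (r @ r)           # lag-1 autocorrelation
        w = np.linalg.inv(ar1_corr(n, rho_new))        # invert the estimate
        xtw = x.T @ w
        beta_new = np.linalg.solve(xtw @ x, xtw @ y)   # re-estimate beta
        done = (np.max(np.abs(beta_new - beta_hat)) < tol
                and abs(rho_new - rho_hat) < tol)
        beta_hat, rho_hat = beta_new, rho_new
        if done:
            break
    return beta_hat, rho_hat

rng = np.random.default_rng(3)
n, rho = 300, 0.7                                      # hypothetical values
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 1.5])                            # hypothetical true coefficients
y = x @ beta + simulate_ar1(n, rho, rng)
beta_hat, rho_hat = feasible_gls_ar1(x, y)
print(beta_hat, rho_hat)
```

For spatial data the same loop applies with a fitted spatial correlation function in place of `ar1_corr`.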

Summing Up on Regression with Correlated Noise

Start with an OLS regression: inefficient and over-confident but unbiased
Model the heteroskedasticity and correlations in the residuals
Use the model to estimate \(\Var{\mathbf{\epsilon}}\), invert it, and then do weighted/generalized least squares with that \(\w\)
Re-estimate \(\Var{\mathbf{\epsilon}}\) with the new residuals, etc., until converged/tired

“Spurious” Correlations and Regressions

GLS with \(\w = (\Var{\mathbf{\epsilon}})^{-1}\) minimizes the variance conditional on \(\x\)
With spatiotemporal data, \(X\) is often generated by the same kind of process as \(Y\); \(X\) itself has lots of autocorrelations
If \(X\) and \(Y\) are really uncorrelated, \(\Cov{X,Y} = 0\)
So, generally, the sample covariance \(\rightarrow 0\)
But if both \(X\) and \(Y\) are autocorrelated, the convergence can be very slow
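A rough calculation shows how slow. It rests on simplifying assumptions not made above: \(X\) and \(Y\) are independent, stationary time series with known mean zero, and autocovariance functions \(\gamma_X(k)\) and \(\gamma_Y(k)\) (notation introduced here). Then \[ \Var{\frac{1}{n}\sum_{t=1}^{n}{X_t Y_t}} = \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}{\Expect{X_s X_t}\Expect{Y_s Y_t}} = \frac{1}{n}\sum_{k=-(n-1)}^{n-1}{\left(1 - \frac{|k|}{n}\right)\gamma_X(k)\gamma_Y(k)} \]
If both series are IID, only the \(k=0\) term survives, giving \(\sigma_X^2 \sigma_Y^2 / n\).
If instead both are AR(1), with \(\gamma_X(k) = \sigma_X^2 \rho_X^{|k|}\) and \(\gamma_Y(k) = \sigma_Y^2 \rho_Y^{|k|}\), then for large \(n\) \[ \Var{\frac{1}{n}\sum_{t=1}^{n}{X_t Y_t}} \approx \frac{\sigma_X^2 \sigma_Y^2}{n} \frac{1 + \rho_X \rho_Y}{1 - \rho_X \rho_Y} \]
- With \(\rho_X = \rho_Y = 0.9\), the inflation factor is \(1.81/0.19 \approx 9.5\): roughly nine and a half times as much data is needed for the same precision as in the IID case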

Situations Where This Matters

Regressing one time series on another
Regressing one spatial field on another
Regressing one spatiotemporal field on another
The usual significance tests will be misleading: the \(p\)-values will be too small, the confidence intervals will be too narrow, etc.

This is an old problem but it keeps happening

Noted at least since the 1920s
- by G. Udny Yule, who gave us the name “spurious correlations”, and by Francis Galton
Class website links to some recent papers pointing out big modern literatures which run into this problem:
- Spatial regressions in economics (Kelly, “Standard Errors of Persistence”)
- Dependence across related languages and cultures (Pepinsky)
- Dependence across neighbors in social networks (Lee and Ogburn)
- These are all well-written papers and I recommend trying to read them

Advice

Simulate, and use simulation to get at confidence intervals
If you must just test whether \(\beta=0\), simulate an autocorrelated \(X\), and an independent autocorrelated \(Y\), and look at the sampling distribution of \(\hat{\beta}\) from the simulations
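A sketch of that simulation test, assuming both series are AR(1). The autocorrelation coefficients (0.9), the series length, the number of simulations, and the "observed" slope of 0.3 are all placeholders. In practice the simulation model would be fitted to the actual \(X\) and \(Y\).

```python
import numpy as np

def simulate_ar1(n, rho, rng):
    """One stationary AR(1) series with unit-variance innovations."""
    innov = rng.normal(size=n)
    series = np.empty(n)
    series[0] = innov[0] / np.sqrt(1.0 - rho ** 2)
    for t in range(1, n):
        series[t] = rho * series[t - 1] + innov[t]
    return series

def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

def null_slopes(n, rho_x, rho_y, n_sims, rng):
    """Slopes from regressing independent AR(1) series on each other."""
    return np.array([slope(simulate_ar1(n, rho_x, rng),
                           simulate_ar1(n, rho_y, rng))
                     for _ in range(n_sims)])

def p_value(observed, null):
    """Two-sided p-value: share of simulated slopes at least as extreme."""
    return np.mean(np.abs(null) >= abs(observed))

rng = np.random.default_rng(4)
n, n_sims = 100, 500
null_ar = null_slopes(n, 0.9, 0.9, n_sims, rng)    # autocorrelated X and Y
null_iid = null_slopes(n, 0.0, 0.0, n_sims, rng)   # IID X and Y, for contrast

print(null_ar.std(), null_iid.std())
print(p_value(0.3, null_ar), p_value(0.3, null_iid))
```

The spread of `null_ar` is much wider than that of `null_iid`, so a slope that looks highly significant against the IID reference can be unremarkable against the autocorrelated one.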

Regression with Dependent Noise and Observations