Optimal Linear Prediction
36-467/667
20 September 2020 (Lecture 7)
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our previous episodes
- Linear smoothers
- Predictions are linear combinations of the data
- How to choose the weights?
- PCA
- Use correlations to break the data into additive components
Today: use correlations to do prediction
Optimal prediction in general
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m \in \mathbb{R}}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m \in \mathbb{R}}{\left(\Var{(Y-m)} + (\Expect{Y-m})^2\right)}\\
& = & \argmin_{m \in \mathbb{R}}{\left(\Var{Y} + (\Expect{Y} - m)^2\right)}\\
& = & \argmin_{m \in \mathbb{R}}{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
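A minimal Python check of this, using a made-up non-Gaussian distribution: scan a grid of candidate constants \(m\) and see that the mean squared error bottoms out at (about) the sample mean.
```python
# Toy check that the MSE-minimizing constant guess is E[Y].
# The gamma distribution, sample size, and seed are arbitrary choices.
import numpy as np

rng = np.random.default_rng(467)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # deliberately non-Gaussian; E[Y] = 6

grid = np.linspace(y.min(), y.max(), 1001)          # candidate constant guesses m
mse = np.array([np.mean((y - m)**2) for m in grid])
print(grid[np.argmin(mse)], y.mean())               # the minimizer is ~ the sample mean
```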
Optimal prediction in general
- Now we get a covariate \(Z\) which takes values in some arbitrary space \(\mathcal{Z}\)
- What’s the best function of \(Z\) to guess for \(Y\)? \[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m: ~ \mathcal{Z} \mapsto \mathbb{R}}{\Expect{(Y-m(Z))^2}}\\
& = & \argmin_{m: ~ \mathcal{Z} \mapsto \mathbb{R}}{\Expect{\Expect{(Y-m(Z))^2|Z}}}\\
& = & \argmin_{m: ~ \mathcal{Z} \mapsto \mathbb{R}}{\left( \int_{\mathcal{Z}}{\Expect{(Y-m(z))^2|Z=z} p(z) dz}\right)}
\end{eqnarray}\]
For each \(z \in \mathcal{Z}\), best \(m(z)\) is \(\Expect{Y|Z=z}\) (by previous slide), so \[
\TrueRegFunc(z) = \Expect{Y|Z=z}
\]
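A quick simulation sketch, with an assumed toy process where \(\Expect{Y|Z=z} = z^2\): the conditional mean has smaller mean squared error than other guesses.
```python
# Toy check that the conditional mean beats other functions of Z in MSE.
# The data-generating process (Y = Z^2 + noise) is assumed for illustration.
import numpy as np

rng = np.random.default_rng(467)
z = rng.uniform(-2, 2, size=200_000)
y = z**2 + rng.standard_normal(z.size)   # so E[Y|Z=z] = z^2

mse = lambda pred: np.mean((y - pred)**2)
print(mse(z**2))                         # conditional mean: ~1, the noise variance
print(mse(np.full_like(z, y.mean())))    # best constant guess: larger
print(mse(z))                            # an arbitrary linear guess: larger still
```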
Optimal prediction in general
- Learning arbitrary functions is hard!
- Who knows what the right function might be?
- What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
m(z) = a + b z
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a \in \mathbb{R}, b \in \mathbb{R}}{\Expect{(Y-(a+bZ))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bZ))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bZ)} + \Expect{(a+bZ)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YZ} +\\
& & a^2 + 2 ab \Expect{Z} + b^2 \Expect{Z^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bZ))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{Z} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{Z}
\end{eqnarray}\]
Remember: optimal linear predictor is \(\alpha + \beta Z\)
\(\therefore\) optimal linear predictor looks like \[
\Expect{Y} + \beta(Z-\Expect{Z})
\] \(\Rightarrow\) centering \(Z\) and/or \(Y\) won’t change the slope
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bZ))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YZ} + 2\alpha \Expect{Z} + 2\beta \Expect{Z^2} = 0\\
0 & = & -\Expect{YZ} + (\Expect{Y} - \beta\Expect{Z})\Expect{Z} + \beta\Expect{Z^2} \\
0 & = & \Expect{Y}\Expect{Z} - \Expect{YZ} + \beta(\Expect{Z^2} - \Expect{Z}^2)\\
0 & = & -\Cov{Y,Z} + \beta \Var{Z}\\
\beta & = & \frac{\Cov{Y,Z}}{\Var{Z}}
\end{eqnarray}\]
The optimal linear predictor of \(Y\) from \(Z\)
The optimal linear predictor of \(Y\) from a single \(Z\) is always
\[
\alpha + \beta Z = \Expect{Y} + \left(\frac{\Cov{Z,Y}}{\Var{Z}}\right) (Z - \Expect{Z})
\]
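A sketch of computing \(\alpha\) and \(\beta\) from sample moments, on a made-up example that is nonlinear, heteroskedastic, and non-Gaussian; the formula does not care.
```python
# Optimal linear predictor from sample moments; the toy process below is assumed.
import numpy as np

rng = np.random.default_rng(467)
z = rng.exponential(scale=1.0, size=200_000)            # non-Gaussian predictor
y = np.sqrt(z) + (1 + z) * rng.standard_normal(z.size)  # nonlinear, heteroskedastic

C = np.cov(y, z)                     # 2x2 sample covariance matrix
beta = C[0, 1] / C[1, 1]             # Cov(Y,Z) / Var(Z)
alpha = y.mean() - beta * z.mean()   # E[Y] - beta E[Z]
print(alpha, beta)
```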
What did we not assume?
- That the true relationship between \(Y\) and \(Z\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
- NONE OF THAT MATTERS for the optimal linear predictor
A little worked example (I)
- We see \(X(r_1, t_1)\), for short \(X_1\)
- We want to guess \(X(r_0, t_0)\), for short \(X_0\)
- Assume: \(\Expect{X_0} = \Expect{X_1} = \mu\)
- Assume: \(\Var{X_0} = \Var{X_1} = \sigma^2\)
- Assume: \(\Cov{X_0, X_1} = \gamma\)
We know: \[
m(Z) = \Expect{Y} + \frac{\Cov{Y, Z}}{\Var{Z}}(Z - \Expect{Z})
\]
Substituting in: \[
m(X_1) = \mu + \frac{\gamma}{\sigma^2}(X_1 - \mu)
\]
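A quick numeric check with made-up values \(\mu = 10\), \(\sigma^2 = 4\), \(\gamma = 2\):
```python
# Worked example (I) with made-up numbers.
mu, sigma2, gamma = 10.0, 4.0, 2.0
x1 = 14.0
print(mu + (gamma / sigma2) * (x1 - mu))   # 12.0: pulled halfway from mu toward x1
```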
Some general properties of the optimal linear predictor
- The prediction errors average out to zero
- The prediction errors are uncorrelated with \(Z\)
- The variance of the prediction errors \(\leq\) the variance of \(Y\)
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-m(Z)} & = & \Expect{Y - (\Expect{Y} + \beta(Z-\Expect{Z}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{Z} - \Expect{Z}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-m(Z)|Z} \neq 0\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\
& = & \Var{Y - \beta Z}\\
& = & \Var{Y} + \beta^2\Var{Z} - 2\beta\Cov{Y,Z}
\end{eqnarray}\]
but \(\beta = \Cov{Y,Z}/\Var{Z}\) so
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y} + \frac{(\Cov{Y,Z})^2}{\Var{Z}} - 2\frac{(\Cov{Y,Z})^2}{\Var{Z}}\\
& = & \Var{Y} - \frac{(\Cov{Y,Z})^2}{\Var{Z}}\\
& < & \Var{Y} ~ \text{unless}\ \Cov{Y,Z} = 0
\end{eqnarray}\]
\(\Rightarrow\) Optimal linear predictor is (almost) always better than nothing…
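A simulation sketch, on an assumed toy process whose true regression is not linear, checking all three properties at once:
```python
# Errors average to zero, are uncorrelated with Z, and have reduced variance.
import numpy as np

rng = np.random.default_rng(467)
z = rng.standard_normal(500_000)
y = np.exp(z / 2) + rng.standard_normal(z.size)   # true regression is not linear

C = np.cov(y, z)
beta = C[0, 1] / C[1, 1]
alpha = y.mean() - beta * z.mean()
err = y - (alpha + beta * z)

print(err.mean())                                   # ~ 0
print(np.cov(err, z)[0, 1])                         # ~ 0: uncorrelated with Z
print(err.var(), y.var() - C[0, 1]**2 / C[1, 1])    # ~ equal, and both < Var(Y)
```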
Multivariate case
We try to predict \(Y\) from a whole bunch of variables
Bundle those predictor variables into \(\vec{Z}\)
Solution:
\[
m(\vec{Z}) = \alpha+\vec{\beta}\cdot \vec{Z} = \Expect{Y} + (\Var{\vec{Z}})^{-1} \Cov{\vec{Z},Y} \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
and
\[
\Var{Y-m(\vec{Z})} = \Var{Y} - \Cov{Y,\vec{Z}}^T (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}}
\]
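A minimal numpy sketch of this formula, on made-up correlated predictors and a made-up nonlinear truth:
```python
# beta = Var(Z)^{-1} Cov(Z, Y), computed from sample moments.
import numpy as np

rng = np.random.default_rng(467)
n, p = 200_000, 3
Z = rng.standard_normal((n, p)) @ np.array([[1.0, 0.5, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 1.0]])   # correlated columns
y = np.sin(Z[:, 0]) + Z[:, 1]**2 + rng.standard_normal(n)       # nonlinear truth

Zc = Z - Z.mean(axis=0)
yc = y - y.mean()
var_Z = Zc.T @ Zc / n                    # Var(Z), p x p
cov_Zy = Zc.T @ yc / n                   # Cov(Z, Y), length p
beta = np.linalg.solve(var_Z, cov_Zy)    # avoids forming the inverse explicitly
alpha = y.mean() - beta @ Z.mean(axis=0)
print(alpha, beta)
```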
What we don’t assume, again
- That anything is Gaussian
- Anything else about the distributions of \(Y\) or \(\vec{Z}\)
- That the linear predictor is correct
Some possible contexts
- Interpolating or extrapolating one variable over space and/or time
- Predicting one variable from another
- Predicting one variable from 2+ others
- Prediction for \(X(r_0, t_0)\) is a linear combination of \(X\) at other points \[\begin{eqnarray}
m(r_0, t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(r_1, t_1) \\ X(r_2, t_2) \\ \vdots \\ X(r_n, t_n) \end{array}\right]\\
\alpha & = & \Expect{X(r_0, t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(r_1, t_1)}\\ \Expect{X(r_2, t_2)} \\ \vdots \\ \Expect{X(r_n, t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\
\vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(r_1, t_1)} & \Cov{X(r_1, t_1), X(r_2, t_2)} & \ldots & \Cov{X(r_1, t_1), X(r_n, t_n)}\\
\Cov{X(r_1, t_1), X(r_2, t_2)} & \Var{X(r_2, t_2)} & \ldots & \Cov{X(r_2, t_2), X(r_n, t_n)}\\
\vdots & \vdots & \ldots & \vdots\\
\Cov{X(r_1, t_1), X(r_n, t_n)} & \Cov{X(r_2, t_2), X(r_n, t_n)} & \ldots & \Var{X(r_n, t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(r_0, t_0), X(r_1, t_1)}\\
\Cov{X(r_0, t_0), X(r_2, t_2)}\\ \vdots \\ \Cov{X(r_0, t_0), X(r_n, t_n)}\end{array}\right]
\end{eqnarray}\]
- looks a lot like a linear smoother
- specifically \(\vec{\beta}\) looks like one row of \(\mathbf{w}\)
- best choice of weights \(\mathbf{w}\) comes from variances and covariances
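A sketch of how an assumed covariance model turns into these weights; the exponential covariance function, the 1-D coordinates, and all numbers below are made up for illustration.
```python
# From an assumed covariance function to prediction weights for X(r_0).
import numpy as np

coords = np.array([0.0, 1.0, 2.5, 4.0])   # observation locations r_1 .. r_n (1-D)
r0 = 1.7                                   # where we want a prediction
mu, sigma2, ell = 5.0, 2.0, 1.5            # assumed mean, variance, correlation length

def cov(r, s):                             # assumed covariance model
    return sigma2 * np.exp(-np.abs(r - s) / ell)

V = cov(coords[:, None], coords[None, :])  # Var of the observed X's (n x n)
c0 = cov(coords, r0)                       # Cov of each observed X with X(r_0)
beta = np.linalg.solve(V, c0)              # the weights: one row of w

x = np.array([4.2, 5.8, 5.1, 4.7])         # pretend observed values
m0 = mu + beta @ (x - mu)                  # the prediction
print(beta, m0)
```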
A little worked example (II)
- We see \(X(r_1, t_1)\) and \(X(r_2, t_2)\), for short \(X_1\) and \(X_2\)
- We want to predict \(X(r_0, t_0)\), for short \(X_0\)
- Assume: \(\Expect{X(r, t)} = \mu\) for all \(r,t\)
- Assume: \(\Var{X(r,t)} = \sigma^2\) ditto
- Assume: \(\Cov{X_1, X_0} = \Cov{X_2, X_0} = \gamma\)
- Assume: \(\Cov{X_1, X_2} = \rho\)
Work out \(\vec{\beta}\) (off-line!) and get \[\begin{eqnarray}
m(x_1, x_2) = \mu + \frac{\gamma}{\sigma^2 + \rho}\left( (x_1 - \mu) + (x_2 - \mu)\right)
\end{eqnarray}\] vs. with one predictor \[
\mu + \frac{\gamma}{\sigma^2}(x_1 - \mu)
\]
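Numeric check, with made-up values for \(\sigma^2\), \(\gamma\), \(\rho\): solving the \(2 \times 2\) system directly gives the same answer as the closed form \(\gamma/(\sigma^2 + \rho)\).
```python
# Worked example (II): solve the system vs. the closed form.
import numpy as np

sigma2, gamma, rho = 4.0, 2.0, 1.0
V = np.array([[sigma2, rho], [rho, sigma2]])
c = np.array([gamma, gamma])
beta = np.linalg.solve(V, c)
print(beta, gamma / (sigma2 + rho))   # both components equal 0.4
```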
Predicting one variable from another
- Given: values of variable \(U\) at many points, \(U(r_1, t_1), \ldots U(r_n, t_n)\)
- Desired: estimate of \(X\) at point \((r_0, t_0)\), \(X\neq U\)
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), U(r_2, t_2), \ldots U(r_n, t_n)]\\
\end{eqnarray}\]
- Need to find covariances of the \(U\)s with each other, and their covariances with \(X\)
Predicting one variable from 2+ others
- Given: values of two variables \(U\), \(V\) at many points
- Desired: estimate of \(X\) at one point
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), V(r_1, t_1), U(r_2, t_2), V(r_2, t_2), \ldots U(r_n, t_n), V(r_n, t_n)]
\end{eqnarray}\]
- Need to find covariances of \(U\)s and \(V\)s with each other, and with \(X\)
Optimal prediction depends on variances and covariances
so how do we get these?
- Repeat the experiment many times
- OR make assumptions
- E.g., some covariances should be the same
- E.g., covariances should change smoothly in time or space
- E.g., covariances should follow a particular model
Summing up
- We can always decide to use a linear predictor, \(m(\vec{Z}) = \alpha + \vec{\beta} \cdot \vec{Z}\)
- The optimal linear predictor of \(Y\) from \(\vec{Z}\) always takes the same form: \[
m(\vec{Z}) = \Expect{Y} + (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}} \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
- Doing linear prediction requires finding the covariances
- Next few lectures: how to find and use covariances over time, over space, over both
Backup: Gory details for multivariate predictors
\[\begin{eqnarray}
m(\vec{Z}) & = & a + \vec{b} \cdot \vec{Z}\\
(\alpha, \vec{\beta}) & = & \argmin_{a \in \mathbb{R}, \vec{b} \in \mathbb{R}^n}{\Expect{(Y-(a + \vec{b} \cdot \vec{Z}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{Z}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{Z})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{Z})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{Z}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{Z} \otimes \vec{Z}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{Z}} + 2a\vec{b}\cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
(\(\vec{u} \otimes \vec{v}\) is the outer product, the square matrix where \((\vec{u} \otimes \vec{v})_{ij} = u_i v_j\))
Backup: Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0 at \(a=\alpha\), \(\vec{b}=\vec{\beta}\):
\[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{Z}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
just like when \(Z\) was univariate
Backup: Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{Z}} + 2 \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta} + 2 \alpha \Expect{\vec{Z}} & = & 0\\
\Expect{Y\vec{Z}} - \alpha\Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Expect{Y\vec{Z}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}) \Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Expect{Y\vec{Z}} - \Expect{Y}\Expect{\vec{Z}} & = & \left(\Expect{\vec{Z} \otimes \vec{Z}} - \Expect{\vec{Z}} \otimes \Expect{\vec{Z}}\right) \vec{\beta}\\
\Cov{Y,\vec{Z}} & = & \Var{\vec{Z}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,Z}/\Var{Z}\) when \(Z\) is univariate
Backup: Gory details: the PCA view
The factor of \((\Var{\vec{Z}})^{-1}\) rotates and scales \(\vec{Z}\) to uncorrelated, unit-variance variables
- Start with the eigendecomposition of \(\Var{\vec{Z}}\): \[\begin{eqnarray}
\Var{\vec{Z}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
(\Var{\vec{Z}})^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
(\Var{\vec{Z}})^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & (\Var{\vec{Z}})^{-1/2} \left((\Var{\vec{Z}})^{-1/2}\right)^T\\
\end{eqnarray}\]
- (For the idea of the square root of a matrix, see further backup)
- Use this to motivate defining new, uncorrelated, unit-variance variables: \[\begin{eqnarray}
\vec{U} & \equiv & \vec{Z} \Var{\vec{Z}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\end{eqnarray}\]
Now replace \(\vec{Z}\) in the linear predictor with \(\vec{U}\) and see how simple it is: \[\begin{eqnarray}
\vec{Z}\cdot\vec{\beta} & = & \vec{Z} \cdot (\Var{\vec{Z}})^{-1} \Cov{\vec{Z}, Y}\\
& = & \vec{Z} \Var{\vec{Z}}^{-1/2} \cdot \left(\Var{\vec{Z}}^{-1/2}\right)^T \Cov{\vec{Z}, Y}\\
& = & \vec{U} \cdot \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
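A sketch of the whitening step on toy data: build \((\Var{\vec{Z}})^{-1/2}\) from the eigendecomposition and check that the new variables have (sample) identity covariance.
```python
# Whitening Z with Var(Z)^{-1/2}; the toy data below is assumed.
import numpy as np

rng = np.random.default_rng(467)
n = 200_000
Z = rng.standard_normal((n, 3)) @ np.array([[2.0, 0.7, 0.3],
                                            [0.0, 1.0, 0.5],
                                            [0.0, 0.0, 1.5]])

Zc = Z - Z.mean(axis=0)
var_Z = Zc.T @ Zc / n
lam, w = np.linalg.eigh(var_Z)            # Var(Z) = w diag(lam) w^T
root_inv = w @ np.diag(lam ** -0.5)       # one version of Var(Z)^{-1/2}
U = Zc @ root_inv                         # the new variables
print(np.round(U.T @ U / n, 3))           # ~ identity: uncorrelated, unit variance
```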
Backup: Estimation I: “plug-in”
- We don’t see the true expectations, variances, covariances
- But we can have sample/empirical values
- One estimate of the optimal linear predictor: plug in the sample values
so for univariate \(Z\), \[
\hat{m}(z) = \overline{y} + \frac{\widehat{\Cov{Y,Z}}}{\widehat{\Var{Z}}}(z-\overline{z})
\]
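A sketch of the plug-in estimator on a made-up toy sample:
```python
# Plug-in estimate of the optimal linear predictor, univariate Z.
import numpy as np

rng = np.random.default_rng(467)
z = rng.standard_normal(1_000)
y = 3 - 2 * z + rng.standard_normal(z.size)   # toy truth

C = np.cov(y, z)                  # sample covariance matrix
beta_hat = C[0, 1] / C[1, 1]      # sample Cov(Y,Z) / sample Var(Z)

def m_hat(znew):
    return y.mean() + beta_hat * (znew - z.mean())

print(m_hat(0.0), m_hat(1.0))     # ~ 3 and ~ 1 for this toy truth
```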
Backup: Estimation II: ordinary least squares
- We don’t see the true expected squared error, but we do have the sample mean squared error
- Minimize that
- Leads to exactly the same results as plug-in approach!
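A quick check on toy data that the two approaches coincide; the ratio of sample covariance to sample variance doesn't depend on the \(n\) vs. \(n-1\) convention, so the match is exact up to rounding.
```python
# OLS (minimize sample MSE) vs. plug-in, on made-up toy data.
import numpy as np

rng = np.random.default_rng(467)
z = rng.standard_normal(1_000)
y = 3 - 2 * z + rng.standard_normal(z.size)

C = np.cov(y, z)
beta_plug = C[0, 1] / C[1, 1]
alpha_plug = y.mean() - beta_plug * z.mean()

A = np.column_stack([np.ones_like(z), z])            # design matrix with intercept
alpha_ols, beta_ols = np.linalg.lstsq(A, y, rcond=None)[0]
print(alpha_plug - alpha_ols, beta_plug - beta_ols)  # both ~ 0, up to rounding
```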
Backup: Estimation: When does OLS/plug-in work?
- Jointly sufficient conditions:
- Sample means converge on expectation values
- Sample covariances converge on true covariance
- Sample variances converge on true, invertible variance
- Then by continuity OLS coefficients converge on true \(\beta\)
- This can all happen even when everything is dependent on everything else!
Backup: Square roots of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
- Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
- Just like every positive real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\)
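A sketch, with a made-up symmetric positive-definite matrix, of the eigendecomposition square root and of the non-uniqueness under orthogonal rotation:
```python
# Matrix square roots via the eigendecomposition.
import numpy as np

c = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])            # made-up symmetric positive-definite matrix
lam, w = np.linalg.eigh(c)                 # c = w diag(lam) w^T
d = w @ np.diag(np.sqrt(lam))              # one square root of c
print(np.allclose(d @ d.T, c))             # True

o, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((3, 3)))  # orthogonal o
print(np.allclose((d @ o) @ (d @ o).T, c))  # also True: d o is another square root
```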