Optimal Linear Prediction
36-467/36-667
18 September 2018
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our previous episodes
- Linear smoothers
  - Predictions are linear combinations of the data
  - How to choose the weights?
- PCA
  - Use correlations to break the data into additive components
Today: use correlations to do prediction
Optimal prediction in general
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\
& = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\
& = & \argmin_m{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
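A minimal numerical sketch of this claim in Python (the simulated distribution and the grid of guesses are purely illustrative): the empirical mean squared error over a grid of constant guesses is smallest near the sample mean.

```python
import numpy as np

# Illustrative check: E[(Y - m)^2] is minimized (empirically) at the mean of Y.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)      # any distribution works here

m_grid = np.linspace(0.0, 5.0, 501)               # candidate constant guesses
mse = np.array([np.mean((y - m) ** 2) for m in m_grid])
print(m_grid[np.argmin(mse)], y.mean())           # the two should nearly agree
```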
Optimal prediction in general
What’s the best function of \(Z\) to guess for \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(Z))^2}}\\
& = & \argmin_{m}{\Expect{\Expect{(Y-m(Z))^2|Z}}}
\end{eqnarray}\]
For each \(z\), best \(m(z)\) is \(\Expect{Y|Z=z}\)
\[
\TrueRegFunc(z) = \Expect{Y|Z=z}
\]
Optimal prediction in general
Learning arbitrary functions is hard!
Who knows what the right function might be?
What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
m(z) = a + b z
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bZ))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bZ))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bZ)} + \Expect{(a+bZ)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YZ} +\\
& & a^2 + 2 ab \Expect{Z} + b^2 \Expect{Z^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bZ))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{Z} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{Z}
\end{eqnarray}\]
\(\therefore\) optimal linear predictor looks like \[
\Expect{Y} + \beta(Z-\Expect{Z})
\] \(\Rightarrow\) centering \(Z\) and/or \(Y\) won’t change the slope
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bZ))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YZ} + 2\alpha \Expect{Z} + 2\beta \Expect{Z^2} = 0\\
0 & = & -\Expect{YZ} + (\Expect{Y} - \beta\Expect{Z})\Expect{Z} + \beta\Expect{Z^2} \\
0 & = & \Expect{Y}\Expect{Z} - \Expect{YZ} + \beta(\Expect{Z^2} - \Expect{Z}^2)\\
0 & = & -\Cov{Y,Z} + \beta \Var{Z}\\
\beta & = & \frac{\Cov{Y,Z}}{\Var{Z}}
\end{eqnarray}\]
The optimal linear predictor of \(Y\) from \(Z\)
The optimal linear predictor of \(Y\) from a single \(Z\) is always
\[
\alpha + \beta Z = \Expect{Y} + \left(\frac{\Cov{Z,Y}}{\Var{Z}}\right) (Z - \Expect{Z})
\]
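A sketch of this result in Python, with a deliberately nonlinear, non-Gaussian simulation standing in for the population (all names and distributions below are illustrative): plugging moments into \(\alpha = \Expect{Y} - \beta\Expect{Z}\) and \(\beta = \Cov{Y,Z}/\Var{Z}\) gives coefficients that no perturbation can improve on.

```python
import numpy as np

# Illustrative check of the univariate formulas on a nonlinear relationship.
rng = np.random.default_rng(1)
z = rng.uniform(0.0, 3.0, size=200_000)
y = np.sin(z) + 0.5 * z ** 2 + rng.standard_normal(z.size)   # not linear, not Gaussian

beta = np.cov(y, z, bias=True)[0, 1] / np.var(z)   # Cov(Y,Z) / Var(Z)
alpha = y.mean() - beta * z.mean()                 # E[Y] - beta E[Z]

mse = lambda a, b: np.mean((y - (a + b * z)) ** 2)
print(mse(alpha, beta))
print(mse(alpha + 0.05, beta), mse(alpha - 0.05, beta))   # both strictly larger
print(mse(alpha, beta + 0.05), mse(alpha, beta - 0.05))   # both strictly larger
```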
What did we not assume?
- That the true relationship between \(Y\) and \(Z\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
NONE OF THAT MATTERS for the optimal linear predictor
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-m(Z)} & = & \Expect{Y - (\Expect{Y} + \beta(Z-\Expect{Z}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{Z} - \Expect{Z}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-m(Z)|Z} \neq 0\) (illustrated in the sketch below)
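A sketch of that last point (simulated example, purely illustrative): when the truth is nonlinear, the errors of the optimal linear predictor average to zero overall, but not conditionally on \(Z\).

```python
import numpy as np

# Errors average to zero unconditionally, but E[Y - m(Z) | Z] != 0 here.
rng = np.random.default_rng(2)
z = rng.uniform(-2.0, 2.0, size=200_000)
y = np.exp(z) + rng.standard_normal(z.size)        # truth is nonlinear in z

beta = np.cov(y, z, bias=True)[0, 1] / np.var(z)
errors = y - (y.mean() + beta * (z - z.mean()))
print(errors.mean())                               # ~ 0

bins = np.digitize(z, np.linspace(-2.0, 2.0, 9))   # average error within bins of z
print([round(errors[bins == b].mean(), 2) for b in np.unique(bins)])  # far from 0
```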
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\
& = & \Var{Y - \beta Z}\\
\end{eqnarray}\]
In-class exercise: finish this! Answer in terms of \(\Var{Y}\), \(\Var{Z}\), \(\Cov{Y,Z}\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\
& = & \Var{Y - \beta Z}\\
& = & \Var{Y} + \beta^2\Var{Z} - 2\beta\Cov{Y,Z}
\end{eqnarray}\]
but \(\beta = \Cov{Y,Z}/\Var{Z}\) so
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y} + \frac{\Cov{Y,Z}^2}{\Var{Z}} - 2\frac{\Cov{Y,Z}^2}{\Var{Z}}\\
& = & \Var{Y} - \frac{\Cov{Y,Z}^2}{\Var{Z}}\\
& < & \Var{Y} ~\text{unless}~ \Cov{Y,Z} = 0
\end{eqnarray}\]
\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
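A quick simulation check of the error-variance formula (names and distributions are illustrative only):

```python
import numpy as np

# Empirical error variance vs. Var(Y) - Cov(Y,Z)^2 / Var(Z).
rng = np.random.default_rng(3)
z = rng.gamma(2.0, size=200_000)
y = np.log1p(z) + rng.standard_normal(z.size)

cov_yz = np.cov(y, z, bias=True)[0, 1]
beta = cov_yz / np.var(z)
alpha = y.mean() - beta * z.mean()
errors = y - (alpha + beta * z)

print(np.var(errors))                              # empirical error variance
print(np.var(y) - cov_yz ** 2 / np.var(z))         # formula from the slide; the two agree
```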
Multivariate case
We try to predict \(Y\) from a whole bunch of variables
Bundle those predictor variables into \(\vec{Z}\)
Solution:
\[
m(\vec{Z}) = \alpha+\vec{\beta}\cdot \vec{Z} = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{\vec{Z},Y}\right) \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
and
\[
\Var{Y-m(\vec{Z})} = \Var{Y} - \Cov{Y,\vec{Z}}^T \Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}}
\]
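A sketch of how this solution might be computed, with sample moments standing in for the true ones (the function name and simulated data below are my own, for illustration):

```python
import numpy as np

def optimal_linear_predictor(Z, y):
    """alpha, beta with beta = Var(Z)^{-1} Cov(Z, Y) and alpha = E[Y] - beta . E[Z]."""
    var_Z = np.cov(Z, rowvar=False, bias=True)                    # p x p variance matrix of Z
    cov_Zy = np.array([np.cov(Z[:, j], y, bias=True)[0, 1]        # Cov(Z_j, Y), j = 1..p
                       for j in range(Z.shape[1])])
    beta = np.linalg.solve(var_Z, cov_Zy)                         # solve Var(Z) beta = Cov(Z, Y)
    alpha = y.mean() - beta @ Z.mean(axis=0)
    return alpha, beta

rng = np.random.default_rng(4)
Z = rng.standard_normal((50_000, 3))
y = 2.0 + Z @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(50_000)
print(optimal_linear_predictor(Z, y))   # close to 2 and (1, -0.5, 0.25)
```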
What we don’t assume, again
- Anything about the distributions of \(Y\) or \(\vec{Z}\)
- That the linear predictor is correct
- That anything is Gaussian
Some possible contexts
- Interpolating or extrapolating one variable over space and/or time
- Predicting one variable from another
- Predicting one variable from 2+ others
Prediction for \(X(r_0, t_0)\) is a linear combination of \(X\) at other points
\[\begin{eqnarray}
\EstRegFunc(r_0, t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(r_1, t_1) \\ X(r_2, t_2) \\ \vdots \\ X(r_n, t_n) \end{array}\right]\\
\alpha & = & \Expect{X(r_0, t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(r_1, t_1)}\\ \Expect{X(r_2, t_2)} \\ \vdots \\ \Expect{X(r_n, t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\
\vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(r_1, t_1)} & \Cov{X(r_1, t_1), X(r_2, t_2)} & \ldots & \Cov{X(r_1, t_1), X(r_n, t_n)}\\
\Cov{X(r_2, t_2), X(r_1, t_1)} & \Var{X(r_2, t_2)} & \ldots & \Cov{X(r_2, t_2), X(r_n, t_n)}\\
\vdots & \vdots & \ddots & \vdots\\
\Cov{X(r_1, t_1), X(r_n, t_n)} & \Cov{X(r_2, t_2), X(r_n, t_n)} & \ldots & \Var{X(r_n, t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(r_0, t_0), X(r_1, t_1)}\\
\Cov{X(r_0, t_0), X(r_2, t_2)}\\ \vdots \\ \Cov{X(r_0, t_0), X(r_n, t_n)}\end{array}\right]
\end{eqnarray}\]
- looks a lot like a linear smoother
- best choice of weights \(\mathbf{w}\) comes from the variances and covariances (sketched in code below)
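A sketch of this interpolation formula in code. The notes leave the covariances unspecified, so the exponentially-decaying covariance function below is purely an assumption for illustration:

```python
import numpy as np

def interpolation_weights(obs_points, target_point, cov_fn):
    """beta = Var(X at observed points)^{-1} Cov(X at observed points, X at target)."""
    K = np.array([[cov_fn(p, q) for q in obs_points] for p in obs_points])
    k0 = np.array([cov_fn(p, target_point) for p in obs_points])
    return np.linalg.solve(K, k0)

# Assumed covariance: decays exponentially with distance (illustrative choice only).
exp_cov = lambda p, q, scale=1.0: np.exp(-np.linalg.norm(np.subtract(p, q)) / scale)

obs = [(0.0,), (1.0,), (2.5,)]                       # observation locations (1-D for simplicity)
print(interpolation_weights(obs, (1.2,), exp_cov))   # weight concentrates near the target point
```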
Predicting one variable from another
- Given: values of variable \(U\) at many points, \(U(r_1, t_1), \ldots U(r_n, t_n)\)
- Desired: estimate of \(X\) at point \((r_0, t_0)\), \(X\neq U\)
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), U(r_2, t_2), \ldots U(r_n, t_n)]\\
\end{eqnarray}\]
- Need to find covariances of the \(U\)s with each other, and their covariances with \(X\)
Predicting one variable from 2+ others
- Given: values of two variables \(U\), \(V\) at many points
- Desired: estimate of \(X\) at one point
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), V(r_1, t_1), U(r_2, t_2), V(r_2, t_2), \ldots U(r_n, t_n), V(r_n, t_n)]
\end{eqnarray}\]
- Need to find covariances of \(U\)s and \(V\)s with each other, and with \(X\)
Optimal prediction depends on variances and covariances
so how do we get these?
- Repeat the experiment many times
- OR make assumptions
  - E.g., some covariances should be the same
  - E.g., covariances should change smoothly in time or space
  - E.g., covariances should follow a particular model
Summing up
- We can always decide to use a linear predictor, \(m(\vec{Z}) = \alpha + \vec{\beta} \cdot \vec{Z}\)
- The optimal linear predictor of \(Y\) from \(\vec{Z}\) always takes the same form: \[
m(\vec{Z}) = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}}\right) \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
- Doing linear prediction requires finding the covariances
- Next few lectures: how to find and use covariances over time, over space, over both
Gory details for multivariate predictors
\[\begin{eqnarray}
m(\vec{Z}) & = & a + \vec{b} \cdot \vec{Z}\\
(\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{Z}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{Z}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{Z})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{Z})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{Z}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{Z} \otimes \vec{Z}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{Z}} + 2a\vec{b}\cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0:
\[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{Z}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
just like when \(Z\) was univariate
Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{Z}} + 2 \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta} + 2 \alpha \Expect{\vec{Z}} & = & 0\\
\Expect{Y\vec{Z}} - \alpha\Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Expect{Y\vec{Z}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}) \Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Cov{Y,\vec{Z}} & = & \Var{\vec{Z}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,Z}/\Var{Z}\) when \(Z\) is univariate
Gory details: the PCA view
The factor of \(\Var{\vec{Z}}^{-1}\) rotates and scales \(\vec{Z}\) to uncorrelated, unit-variance variables
\[\begin{eqnarray}
\Var{\vec{Z}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
\Var{\vec{Z}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
\Var{\vec{Z}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T\\
\vec{U} & \equiv & \vec{Z} \Var{\vec{Z}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\vec{Z}\cdot\vec{\beta} & = & \vec{Z} \cdot \Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\\
& = & \vec{Z} \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T \Cov{\vec{Z}, Y}\\
& = & \vec{U} \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
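A sketch of this whitening view in Python (simulation and variable names are illustrative): transform \(\vec{Z}\) with an inverse square root of its variance matrix, check that the result has (approximately) identity variance, and check that the two routes to the prediction agree.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
Z = rng.standard_normal((100_000, 3)) @ A.T          # correlated predictors
y = Z @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(100_000)

var_Z = np.cov(Z, rowvar=False, bias=True)
eigval, W = np.linalg.eigh(var_Z)                    # var_Z = W diag(eigval) W^T
inv_sqrt = W @ np.diag(eigval ** -0.5)               # a square root of var_Z^{-1}
U = (Z - Z.mean(axis=0)) @ inv_sqrt                  # whitened predictors
print(np.cov(U, rowvar=False, bias=True).round(3))   # ~ identity matrix

cov_Zy = np.array([np.cov(Z[:, j], y, bias=True)[0, 1] for j in range(3)])
cov_Uy = np.array([np.cov(U[:, j], y, bias=True)[0, 1] for j in range(3)])
beta = np.linalg.solve(var_Z, cov_Zy)
print(np.allclose((Z - Z.mean(axis=0)) @ beta, U @ cov_Uy))   # same prediction, two routes
```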
Estimation I: “plug-in”
- We don’t see the true expectations, variances, covariances
- But we can have sample/empirical values
- One estimate of the optimal linear predictor: plug in the sample values
so for univariate \(Z\),
\[
\EstRegFunc(z) = \overline{y} + \frac{\widehat{\Cov{Y,Z}}}{\widehat{\Var{Z}}}(z-\overline{z})
\]
Estimation II: ordinary least squares
- We don’t see the true expected squared error, but we do have the sample mean
- Minimize that
- Leads to exactly the same results as the plug-in approach (see the sketch below)!
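A quick numerical check of that claim (simulated data, illustrative only): the plug-in coefficients and an ordinary-least-squares fit coincide.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.uniform(-2.0, 2.0, size=1_000)
y = np.exp(z / 2.0) + 0.3 * rng.standard_normal(z.size)

beta_hat = np.cov(y, z, bias=True)[0, 1] / np.var(z)    # plug-in slope
alpha_hat = y.mean() - beta_hat * z.mean()              # plug-in intercept

X = np.column_stack([np.ones_like(z), z])               # design matrix with intercept
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)        # ordinary least squares
print(alpha_hat, beta_hat)
print(ols_coef)                                         # identical up to round-off
```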
When does OLS/plug-in work?
- Jointly sufficient conditions:
  - Sample means converge on the expectations
  - Sample covariances converge on the true covariances
  - Sample variances converge on the true, invertible variance matrix
- Then, by continuity, the OLS coefficients converge on the true \(\vec{\beta}\)
- This can all happen even when everything is dependent on everything else!
Square root of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
  - Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
  - \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
  - Just like every positive real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\) (sketched in code below)
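A sketch of that last recipe (all names illustrative): build one square root from the eigendecomposition, verify it, then check that rotating it by an orthogonal matrix gives another square root.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
c = A @ A.T                                        # symmetric, positive definite

eigval, W = np.linalg.eigh(c)                      # c = W diag(eigval) W^T
d = W @ np.diag(np.sqrt(eigval))                   # one square root: w Lambda^{1/2}
print(np.allclose(d @ d.T, c))                     # True

o, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random orthogonal matrix
print(np.allclose((d @ o) @ (d @ o).T, c))         # True: d o is another square root
```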