Optimal Linear Prediction

36-467/36-667

18 September 2018

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]

In our previous episodes

Today: use correlations to do prediction

Optimal prediction in general

What’s the best constant guess for a random variable \(Y\)?

\[\begin{eqnarray} \TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\ & = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\ & = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\ & = & \argmin_m{ (\Expect{Y} - m)^2}\\ & = & \Expect{Y} \end{eqnarray}\]
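A quick numerical check of this fact, as a minimal Python sketch (the distribution and the grid of guesses are my own illustrative choices):

```python
import numpy as np

# Empirical version of argmin_m E[(Y-m)^2]: scan constant guesses m
# and confirm the minimizer is (close to) the sample mean of Y.
rng = np.random.default_rng(467)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)  # any distribution works; E[Y] = 6

m_grid = np.linspace(y.mean() - 5, y.mean() + 5, 201)
mse = [np.mean((y - m) ** 2) for m in m_grid]

print(m_grid[np.argmin(mse)], y.mean())  # both ~ 6
```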

Optimal prediction in general

What’s the best function of \(Z\) to guess for \(Y\)?

\[\begin{eqnarray} \TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(Z))^2}}\\ & = & \argmin_{m}{\Expect{\Expect{(Y-m(Z))^2|Z}}} \end{eqnarray}\]

For each \(z\), best \(m(z)\) is \(\Expect{Y|Z=z}\)

\[ \TrueRegFunc(z) = \Expect{Y|Z=z} \]
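For a concrete illustration (a toy example of my own, where we happen to know the conditional mean): when \(Y = Z^2 + \text{noise}\), predicting with \(\Expect{Y|Z=z} = z^2\) beats the best constant guess by a wide margin.

```python
import numpy as np

# Y = Z^2 + noise, so E[Y|Z=z] = z^2; compare its MSE to the best constant's.
rng = np.random.default_rng(36)
z = rng.normal(size=100_000)
y = z**2 + rng.normal(scale=0.5, size=z.size)

print(np.mean((y - z**2) ** 2))      # ~ 0.25, using the conditional mean
print(np.mean((y - y.mean()) ** 2))  # ~ 2.25, using the best constant E[Y]
```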

Optimal prediction in general

Learning arbitrary functions is hard!

Who knows what the right function might be?

What if we decide to make our predictions linear?

Optimal linear prediction with univariate predictor

Our prediction will be of the form \[ m(z) = a + b z \] and we want the best \(a, b\)

Optimal linear prediction, univariate case

\[ (\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bZ))^2}} \]

Expand out that expectation, then take derivatives and set them to 0

The intercept

\[\begin{eqnarray} \Expect{(Y-(a+bZ))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bZ)} + \Expect{(a+bZ)^2}\\ & = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YZ} +\\ & & a^2 + 2 ab \Expect{Z} + b^2 \Expect{Z^2}\\ \left. \frac{\partial}{\partial a}\Expect{(Y-(a+bZ))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{Z} = 0\\ \alpha & = & \Expect{Y} - \beta\Expect{Z} \end{eqnarray}\]

\(\therefore\) optimal linear predictor looks like \[ \Expect{Y} + \beta(Z-\Expect{Z}) \] \(\Rightarrow\) centering \(Z\) and/or \(Y\) won’t change the slope

The slope

\[\begin{eqnarray} \left. \frac{\partial}{\partial b}\Expect{(Y-(a+bZ))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YZ} + 2\alpha \Expect{Z} + 2\beta \Expect{Z^2} = 0\\ 0 & = & -\Expect{YZ} + (\Expect{Y} - \beta\Expect{Z})\Expect{Z} + \beta\Expect{Z^2} \\ 0 & = & \Expect{Y}\Expect{Z} - \Expect{YZ} + \beta(\Expect{Z^2} - \Expect{Z}^2)\\ 0 & = & -\Cov{Y,Z} + \beta \Var{Z}\\ \beta & = & \frac{\Cov{Y,Z}}{\Var{Z}} \end{eqnarray}\]

The optimal linear predictor of \(Y\) from \(Z\)

The optimal linear predictor of \(Y\) from a single \(Z\) is always

\[ \alpha + \beta Z = \Expect{Y} + \left(\frac{\Cov{Z,Y}}{\Var{Z}}\right) (Z - \Expect{Z}) \]
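As a sanity check, here is a short Python sketch (on simulated data of my own devising) that recovers \(\alpha\) and \(\beta\) from sample moments:

```python
import numpy as np

# Simulate Y = 2 + 3 Z + noise, so beta = Cov(Y,Z)/Var(Z) should be about 3
# and alpha = E[Y] - beta E[Z] should be about 2.
rng = np.random.default_rng(18)
z = rng.normal(loc=1.0, scale=2.0, size=100_000)
y = 2.0 + 3.0 * z + rng.normal(scale=1.0, size=z.size)

beta = np.cov(y, z)[0, 1] / np.var(z, ddof=1)  # Cov(Y,Z)/Var(Z)
alpha = y.mean() - beta * z.mean()             # E[Y] - beta E[Z]
print(alpha, beta)                             # ~ 2, ~ 3
```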

What did we not assume?

We did not assume that the true relationship between \(Y\) and \(Z\) is linear, that \(Y\) or \(Z\) is Gaussian, or anything else about their distributions beyond means, variances, and covariances.

NONE OF THAT MATTERS for the optimal linear predictor

The prediction errors average out to zero

\[\begin{eqnarray} \Expect{Y-m(Z)} & = & \Expect{Y - (\Expect{Y} + \beta(Z-\Expect{Z}))}\\ & = & \Expect{Y} - \Expect{Y} - \beta(\Expect{Z} - \Expect{Z}) = 0 \end{eqnarray}\]

The prediction errors are uncorrelated with \(Z\)

\[\begin{eqnarray} \Cov{Z, Y-m(Z)} & = & \Expect{Z(Y-m(Z))} ~\text{(by previous slide)}\\ & = & \Expect{Z(Y - \Expect{Y} - \frac{\Cov{Y,Z}}{\Var{Z}}(Z-\Expect{Z}))}\\ & = & \Expect{ZY - Z\Expect{Y} - \frac{\Cov{Y,Z}}{\Var{Z}}(Z^2) + \frac{\Cov{Y,Z}}{\Var{Z}} (Z \Expect{Z})}\\ & = & \Expect{ZY} - \Expect{Z}\Expect{Y} - \frac{\Cov{Y,Z}}{\Var{Z}}\Expect{Z^2} + \frac{\Cov{Y,Z}}{\Var{Z}} (\Expect{Z})^2\\ & = & \Cov{Z,Y} - \frac{\Cov{Y,Z}}{\Var{Z}}(\Var{Z})\\ & = & 0 \end{eqnarray}\]

The prediction errors are uncorrelated with \(Z\)

Alternate take:

\[\begin{eqnarray} \Cov{Z, Y-m(Z)} & = & \Cov{Z, Y} - \Cov{Z, \alpha + \beta Z}\\ & = & \Cov{Y,Z} - \Cov{Z, \beta Z}\\ & = & \Cov{Y,Z} - \beta\Cov{Z,Z}\\ & = & \Cov{Y,Z} - \beta\Var{Z}\\ & = & \Cov{Y,Z} - \Cov{Y,Z} = 0 \end{eqnarray}\]
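Both properties are easy to check numerically. In this sketch (again my own example) the true relationship is deliberately nonlinear, and the errors of the linear predictor are still mean-zero and uncorrelated with \(Z\):

```python
import numpy as np

# Nonlinear truth, linear predictor: the errors Y - m(Z) still have
# mean ~0 and correlation ~0 with Z.
rng = np.random.default_rng(7)
z = rng.uniform(-2, 2, size=100_000)
y = np.exp(z) + rng.normal(scale=0.3, size=z.size)

beta = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
alpha = y.mean() - beta * z.mean()
errors = y - (alpha + beta * z)

print(errors.mean())                 # ~ 0
print(np.corrcoef(z, errors)[0, 1])  # ~ 0
```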

How big are the prediction errors?

\[\begin{eqnarray} \Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\ & = & \Var{Y - \beta Z}\\ \end{eqnarray}\]

In-class exercise: finish this! Answer in terms of \(\Var{Y}\), \(\Var{Z}\), \(\Cov{Y,Z}\)

How big are the prediction errors?

\[\begin{eqnarray} \Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\ & = & \Var{Y - \beta Z}\\ & = & \Var{Y} + \beta^2\Var{Z} - 2\beta\Cov{Y,Z} \end{eqnarray}\]

but \(\beta = \Cov{Y,Z}/\Var{Z}\) so

\[\begin{eqnarray} \Var{Y-m(Z)} & = & \Var{Y} + \frac{\Cov{Y,Z}^2}{\Var{Z}} - 2\frac{\Cov{Y,Z}^2}{\Var{Z}}\\ & = & \Var{Y} - \frac{\Cov{Y,Z}^2}{\Var{Z}}\\ & < & \Var{Y} ~\text{unless}~ \Cov{Y,Z} = 0 \end{eqnarray}\]

\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
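A numerical check of the variance formula (simulated data, with parameters of my choosing):

```python
import numpy as np

# Residual variance should match Var(Y) - Cov(Y,Z)^2/Var(Z).
rng = np.random.default_rng(95)
z = rng.normal(size=100_000)
y = 1.0 + 0.5 * z + rng.normal(scale=2.0, size=z.size)

var_y = np.var(y, ddof=1)
cov_yz = np.cov(y, z)[0, 1]
var_z = np.var(z, ddof=1)

beta = cov_yz / var_z
print(np.var(y - beta * z, ddof=1))  # the two printed values agree,
print(var_y - cov_yz**2 / var_z)     # and both are < Var(Y) ~ 4.25
```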

Multivariate case

We try to predict \(Y\) from a whole bunch of variables

Bundle those predictor variables into \(\vec{Z}\)

Solution:

\[ m(\vec{Z}) = \alpha+\vec{\beta}\cdot \vec{Z} = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{\vec{Z},Y}\right) \cdot (\vec{Z} - \Expect{\vec{Z}}) \]

and

\[ \Var{Y-m(\vec{Z})} = \Var{Y} - \Cov{Y,\vec{Z}}^T \Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}} \]
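In code (a sketch with simulated, correlated predictors of my own construction), \(\vec{\beta}\) comes from solving one linear system:

```python
import numpy as np

# Multivariate optimal linear predictor: beta = Var(Z)^{-1} Cov(Z, Y).
rng = np.random.default_rng(109)
n, p = 100_000, 3
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
Z = rng.normal(size=(n, p)) @ mixing   # correlated predictors
y = 1.0 + Z @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

var_Z = np.cov(Z, rowvar=False)        # p x p matrix Var(Z)
cov_Zy = np.array([np.cov(Z[:, j], y)[0, 1] for j in range(p)])

beta = np.linalg.solve(var_Z, cov_Zy)  # solve Var(Z) beta = Cov(Z, Y)
alpha = y.mean() - beta @ Z.mean(axis=0)
print(alpha, beta)                     # ~ 1, [2, -1, 0.5]
```

Solving the linear system is numerically preferable to forming \(\Var{\vec{Z}}^{-1}\) explicitly.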

What we don’t assume, again

As before, nothing about the distributions matters beyond means, variances, and covariances; in particular, the true regression function need not be linear.

Some possible contexts

Interpolating or extrapolating a single variable

\[\begin{eqnarray} Y & = & X(r_0, t_0)\\ \vec{Z} & = & [X(r_1, t_1), X(r_2, t_2), \ldots, X(r_n, t_n)] \end{eqnarray}\]

Prediction for \(X(r_0, t_0)\) is a linear combination of \(X\) at other points

\[\begin{eqnarray} \EstRegFunc(r_0, t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(r_1, t_1) \\ X(r_2, t_2) \\ \vdots \\ X(r_n, t_n) \end{array}\right]\\ \alpha & = & \Expect{X(r_0, t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(r_1, t_1)}\\ \Expect{X(r_2, t_2)} \\ \vdots \\ \Expect{X(r_n, t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\ \vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(r_1, t_1)} & \Cov{X(r_1, t_1), X(r_2, t_2)} & \ldots & \Cov{X(r_1, t_1), X(r_n, t_n)}\\ \Cov{X(r_1, t_1), X(r_2, t_2)} & \Var{X(r_2, t_2)} & \ldots & \Cov{X(r_2, t_2), X(r_n, t_n)}\\ \vdots & \vdots & \ddots & \vdots\\ \Cov{X(r_1, t_1), X(r_n, t_n)} & \Cov{X(r_2, t_2), X(r_n, t_n)} & \ldots & \Var{X(r_n, t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(r_0, t_0), X(r_1, t_1)}\\ \Cov{X(r_0, t_0), X(r_2, t_2)}\\ \vdots \\ \Cov{X(r_0, t_0), X(r_n, t_n)}\end{array}\right] \end{eqnarray}\]
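A sketch of what this looks like in code, for a purely temporal example; the exponential covariance function and the mean-zero assumption are illustrative choices of mine, not something the formula requires:

```python
import numpy as np

# Predict X(t0) from X at observed times, ASSUMING mean zero (so alpha = 0)
# and a known stationary covariance Cov(X(s), X(t)) = exp(-|s - t|).
def cov_fn(s, t):
    return np.exp(-np.abs(s - t))

t_obs = np.array([0.0, 1.0, 2.0, 4.0])      # observation times
t0 = 3.0                                    # where we want a prediction

V = cov_fn(t_obs[:, None], t_obs[None, :])  # Var of the observed vector
c = cov_fn(t_obs, t0)                       # Cov(X(t0), X(t_i))

weights = np.linalg.solve(V, c)             # the vector beta
print(weights)  # prediction = weights @ x_obs for observed values x_obs
```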

Predicting one variable from another

\[\begin{eqnarray} Y & = & X(r_0, t_0)\\ \vec{Z} & = & [U(r_1, t_1), U(r_2, t_2), \ldots, U(r_n, t_n)] \end{eqnarray}\]

Predicting one variable from 2+ others

\[\begin{eqnarray} Y & = & X(r_0, t_0)\\ \vec{Z} & = & [U(r_1, t_1), V(r_1, t_1), U(r_2, t_2), V(r_2, t_2), \ldots, U(r_n, t_n), V(r_n, t_n)] \end{eqnarray}\]

Optimal prediction depends on variances and covariances

so how do we get these?

Summing up

The optimal linear predictor needs only means, variances, and covariances; its errors are mean-zero and uncorrelated with the predictors; and it almost always improves on the best constant guess.

Gory details for multivariate predictors

\[\begin{eqnarray} m(\vec{Z}) & = & a + \vec{b} \cdot \vec{Z}\\ (\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{Z}))^2}}\\ \Expect{(Y-(a+\vec{b}\cdot \vec{Z}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{Z})^2}\\ \nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{Z})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{Z}}\\ & = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{Z} \otimes \vec{Z}} \vec{b} \\ \nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{Z}} + 2a\vec{b}\cdot \Expect{\vec{Z}} \end{eqnarray}\]

Gory details: the intercept

Take derivative w.r.t. \(a\), set to 0:

\[\begin{eqnarray} 0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{Z}} + 2\alpha \\ \alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}} \end{eqnarray}\]

just like when \(Z\) was univariate

Gory details: the slopes

\[\begin{eqnarray} -2 \Expect{Y\vec{Z}} + 2 \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta} + 2 \alpha \Expect{\vec{Z}} & = & 0\\ \Expect{Y\vec{Z}} - \alpha\Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\ \Expect{Y\vec{Z}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}) \Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\ \Cov{Y,\vec{Z}} & = & \Var{\vec{Z}} \vec{\beta}\\ \vec{\beta} & = & (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}} \end{eqnarray}\]

Reduces to \(\Cov{Y,Z}/\Var{Z}\) when \(Z\) is univariate

Gory details: the PCA view

The factor of \(\Var{\vec{Z}}^{-1}\) rotates and scales \(\vec{Z}\) to uncorrelated, unit-variance variables

\[\begin{eqnarray} \Var{\vec{Z}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\ \Var{\vec{Z}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\ \Var{\vec{Z}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\ & = & \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T\\ \vec{U} & \equiv & \vec{Z} \Var{\vec{Z}}^{-1/2}\\ \Var{\vec{U}} & = & \mathbf{I}\\ \vec{Z}\cdot\vec{\beta} & = & \vec{Z} \cdot \Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\\ & = & \vec{Z} \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T \Cov{\vec{Z}, Y}\\ & = & \vec{U} \Cov{\vec{U}, Y}\\ \end{eqnarray}\]
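A quick numerical confirmation of the whitening step (on a two-dimensional example of my own):

```python
import numpy as np

# Build W = w Lambda^{-1/2} from the eigendecomposition of Var(Z);
# then U = Z W has (sample) covariance ~ the identity matrix.
rng = np.random.default_rng(159)
Z = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=100_000)

var_Z = np.cov(Z, rowvar=False)
eigvals, w = np.linalg.eigh(var_Z)  # Var(Z) = w Lambda w^T
W = w @ np.diag(eigvals ** -0.5)    # the slides' Var(Z)^{-1/2}

U = Z @ W
print(np.cov(U, rowvar=False))      # ~ identity
```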

Estimation I: “plug-in”

Plug in sample means, variances, and covariances wherever the formulas call for the true ones; so for univariate \(Z\),

\[ \EstRegFunc(z) = \overline{y} + \frac{\widehat{\Cov{Y,Z}}}{\widehat{\Var{Z}}}(z-\overline{z}) \]
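A minimal plug-in sketch (simulated data; the names are my own):

```python
import numpy as np

# Plug-in: substitute sample moments for the true variance and covariance.
rng = np.random.default_rng(161)
z = rng.normal(size=1000)
y = -1.0 + 0.7 * z + rng.normal(scale=0.5, size=z.size)

beta_hat = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
alpha_hat = y.mean() - beta_hat * z.mean()

def mu_hat(z_new):
    """Estimated optimal linear predictor."""
    return alpha_hat + beta_hat * z_new

print(alpha_hat, beta_hat)  # ~ -1, ~ 0.7
```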

Estimation II: ordinary least squares
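For linear prediction the two estimators coincide: the least-squares slope is exactly the sample covariance over the sample variance. A quick check (my own simulated data, using numpy's least-squares polynomial fit):

```python
import numpy as np

# OLS and plug-in give the same fitted line, up to floating point.
rng = np.random.default_rng(167)
z = rng.normal(size=1000)
y = -1.0 + 0.7 * z + rng.normal(scale=0.5, size=z.size)

slope_ols, intercept_ols = np.polyfit(z, y, deg=1)  # least squares
beta_plugin = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
alpha_plugin = y.mean() - beta_plugin * z.mean()

print(slope_ols, beta_plugin)       # equal
print(intercept_ols, alpha_plugin)  # equal
```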

When does OLS/plug-in work?

Roughly: whenever the sample means, variances, and covariances converge on their true values, as they do for IID samples and for suitably well-behaved (ergodic) dependent data.

Square root of a matrix

Any symmetric, positive-definite matrix \(\mathbf{V} = \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\) has a square root \(\mathbf{V}^{1/2} = \mathbf{w} \mathbf{\Lambda}^{1/2}\), in the sense that \(\mathbf{V}^{1/2} \left(\mathbf{V}^{1/2}\right)^T = \mathbf{V}\); the eigenvalues in \(\mathbf{\Lambda}\) are all positive, so \(\mathbf{\Lambda}^{\pm 1/2}\) is well-defined.
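One way to compute the inverse square root used earlier (a sketch; `inv_sqrt` is my own helper, and it assumes a symmetric positive-definite input):

```python
import numpy as np

# Inverse square root via the eigendecomposition V = w Lambda w^T,
# following the convention above: W = w Lambda^{-1/2}, so W W^T = V^{-1}.
def inv_sqrt(V):
    eigvals, w = np.linalg.eigh(V)  # requires symmetric positive-definite V
    return w @ np.diag(eigvals ** -0.5)

V = np.array([[2.0, 0.8],
              [0.8, 1.0]])
W = inv_sqrt(V)
print(W @ W.T)           # matches...
print(np.linalg.inv(V))  # ...the inverse of V
```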