Optimal Linear Prediction
36-467/36-667
18 September 2018
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our previous episodes
- Linear smoothers
  - Predictions are linear combinations of the data
  - How to choose the weights?
- PCA
  - Use correlations to break the data into additive components
Today: use correlations to do prediction
Optimal prediction in general
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\
& = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\
& = & \argmin_m{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
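A minimal numerical sketch of this claim in Python (the simulated distribution and the grid of guesses are purely illustrative): the empirical mean squared error over a grid of constant guesses is smallest near the sample mean.

```python
import numpy as np

# Illustrative check: E[(Y - m)^2] is minimized (empirically) at the mean of Y.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)      # any distribution works here

m_grid = np.linspace(0.0, 5.0, 501)               # candidate constant guesses
mse = np.array([np.mean((y - m) ** 2) for m in m_grid])
print(m_grid[np.argmin(mse)], y.mean())           # the two should nearly agree
```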
Optimal prediction in general
What’s the best function of \(Z\) to guess for \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(Z))^2}}\\
& = & \argmin_{m}{\Expect{\Expect{(Y-m(Z))^2|Z}}}
\end{eqnarray}\]
For each \(z\), best \(m(z)\) is \(\Expect{Y|Z=z}\)
\[
\TrueRegFunc(z) = \Expect{Y|Z=z}
\]
Optimal prediction in general
Learning arbitrary functions is hard!
Who knows what the right function might be?
What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
m(z) = a + b z
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bZ))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bZ))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bZ)} + \Expect{(a+bZ)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YZ} +\\
& & a^2 + 2 ab \Expect{Z} + b^2 \Expect{Z^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bZ))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{Z} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{Z}
\end{eqnarray}\]
\(\therefore\) optimal linear predictor looks like \[
\Expect{Y} + \beta(Z-\Expect{Z})
\] \(\Rightarrow\) centering \(Z\) and/or \(Y\) won’t change the slope
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bZ))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YZ} + 2\alpha \Expect{Z} + 2\beta \Expect{Z^2} = 0\\
0 & = & -\Expect{YZ} + (\Expect{Y} - \beta\Expect{Z})\Expect{Z} + \beta\Expect{Z^2} \\
0 & = & \Expect{Y}\Expect{Z} - \Expect{YZ} + \beta(\Expect{Z^2} - \Expect{Z}^2)\\
0 & = & -\Cov{Y,Z} + \beta \Var{Z}\\
\beta & = & \frac{\Cov{Y,Z}}{\Var{Z}}
\end{eqnarray}\]
The optimal linear predictor of \(Y\) from \(Z\)
The optimal linear predictor of \(Y\) from a single \(Z\) is always
\[
\alpha + \beta Z = \Expect{Y} + \left(\frac{\Cov{Z,Y}}{\Var{Z}}\right) (Z - \Expect{Z})
\]
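A sketch of this result in Python, with a deliberately nonlinear, non-Gaussian simulation standing in for the population (all names and distributions below are illustrative): plugging moments into \(\alpha = \Expect{Y} - \beta\Expect{Z}\) and \(\beta = \Cov{Y,Z}/\Var{Z}\) gives coefficients that no perturbation can improve on.

```python
import numpy as np

# Illustrative check of the univariate formulas on a nonlinear relationship.
rng = np.random.default_rng(1)
z = rng.uniform(0.0, 3.0, size=200_000)
y = np.sin(z) + 0.5 * z ** 2 + rng.standard_normal(z.size)   # not linear, not Gaussian

beta = np.cov(y, z, bias=True)[0, 1] / np.var(z)   # Cov(Y,Z) / Var(Z)
alpha = y.mean() - beta * z.mean()                 # E[Y] - beta E[Z]

mse = lambda a, b: np.mean((y - (a + b * z)) ** 2)
print(mse(alpha, beta))
print(mse(alpha + 0.05, beta), mse(alpha - 0.05, beta))   # both strictly larger
print(mse(alpha, beta + 0.05), mse(alpha, beta - 0.05))   # both strictly larger
```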
What did we not assume?
- That the true relationship between \(Y\) and \(Z\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
NONE OF THAT MATTERS for the optimal linear predictor
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-m(Z)} & = & \Expect{Y - (\Expect{Y} + \beta(Z-\Expect{Z}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{Z} - \Expect{Z}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-m(Z)|Z} \neq 0\) (illustrated in the sketch below)
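A sketch of that last point (simulated example, purely illustrative): when the truth is nonlinear, the errors of the optimal linear predictor average to zero overall, but not conditionally on \(Z\).

```python
import numpy as np

# Errors average to zero unconditionally, but E[Y - m(Z) | Z] != 0 here.
rng = np.random.default_rng(2)
z = rng.uniform(-2.0, 2.0, size=200_000)
y = np.exp(z) + rng.standard_normal(z.size)        # truth is nonlinear in z

beta = np.cov(y, z, bias=True)[0, 1] / np.var(z)
errors = y - (y.mean() + beta * (z - z.mean()))
print(errors.mean())                               # ~ 0

bins = np.digitize(z, np.linspace(-2.0, 2.0, 9))   # average error within bins of z
print([round(errors[bins == b].mean(), 2) for b in np.unique(bins)])  # far from 0
```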
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\
& = & \Var{Y - \beta Z}\\
\end{eqnarray}\]
In-class exercise: finish this! Answer in terms of \(\Var{Y}\), \(\Var{Z}\), \(\Cov{Y,Z}\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\
& = & \Var{Y - \beta Z}\\
& = & \Var{Y} + \beta^2\Var{Z} - 2\beta\Cov{Y,Z}
\end{eqnarray}\]
but \(\beta = \Cov{Y,Z}/\Var{Z}\) so
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y} + \frac{\Cov{Y,Z}^2}{\Var{Z}} - 2\frac{\Cov{Y,Z}^2}{\Var{Z}}\\
& = & \Var{Y} - \frac{\Cov{Y,Z}^2}{\Var{Z}}\\
& < & \Var{Y} ~\text{unless}~ \Cov{Y,Z} = 0
\end{eqnarray}\]
\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
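A quick simulation check of the error-variance formula (names and distributions are illustrative only):

```python
import numpy as np

# Empirical error variance vs. Var(Y) - Cov(Y,Z)^2 / Var(Z).
rng = np.random.default_rng(3)
z = rng.gamma(2.0, size=200_000)
y = np.log1p(z) + rng.standard_normal(z.size)

cov_yz = np.cov(y, z, bias=True)[0, 1]
beta = cov_yz / np.var(z)
alpha = y.mean() - beta * z.mean()
errors = y - (alpha + beta * z)

print(np.var(errors))                              # empirical error variance
print(np.var(y) - cov_yz ** 2 / np.var(z))         # formula from the slide; the two agree
```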
Multivariate case
We try to predict \(Y\) from a whole bunch of variables
Bundle those predictor variables into \(\vec{Z}\)
Solution:
\[
m(\vec{Z}) = \alpha+\vec{\beta}\cdot \vec{Z} = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{\vec{Z},Y}\right) \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
and
\[
\Var{Y-m(\vec{Z})} = \Var{Y} - \Cov{Y,\vec{Z}}^T \Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}}
\]
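A sketch of how this solution might be computed, with sample moments standing in for the true ones (the function name and simulated data below are my own, for illustration):

```python
import numpy as np

def optimal_linear_predictor(Z, y):
    """alpha, beta with beta = Var(Z)^{-1} Cov(Z, Y) and alpha = E[Y] - beta . E[Z]."""
    var_Z = np.cov(Z, rowvar=False, bias=True)                    # p x p variance matrix of Z
    cov_Zy = np.array([np.cov(Z[:, j], y, bias=True)[0, 1]        # Cov(Z_j, Y), j = 1..p
                       for j in range(Z.shape[1])])
    beta = np.linalg.solve(var_Z, cov_Zy)                         # solve Var(Z) beta = Cov(Z, Y)
    alpha = y.mean() - beta @ Z.mean(axis=0)
    return alpha, beta

rng = np.random.default_rng(4)
Z = rng.standard_normal((50_000, 3))
y = 2.0 + Z @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(50_000)
print(optimal_linear_predictor(Z, y))   # close to 2 and (1, -0.5, 0.25)
```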
What we don’t assume, again
- Anything about the distributions of \(Y\) or \(\vec{Z}\)
- That the linear predictor is correct
- That anything is Gaussian
Some possible contexts
- Interpolating or extrapolating one variable over space and/or time
- Predicting one variable from another
- Predicting one variable from 2+ others
Prediction for \(X(r_0, t_0)\) is a linear combination of \(X\) at other points
\[\begin{eqnarray}
\EstRegFunc(r_0, t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(r_1, t_1) \\ X(r_2, t_2) \\ \vdots \\ X(r_n, t_n) \end{array}\right]\\
\alpha & = & \Expect{X(r_0, t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(r_1, t_1)}\\ \Expect{X(r_2, t_2)} \\ \vdots \\ \Expect{X(r_n, t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\
\vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(r_1, t_1)} & \Cov{X(r_1, t_1), X(r_2, t_2)} & \ldots & \Cov{X(r_1, t_1), X(r_n, t_n)}\\
\Cov{X(r_2, t_2), X(r_1, t_1)} & \Var{X(r_2, t_2)} & \ldots & \Cov{X(r_2, t_2), X(r_n, t_n)}\\
\vdots & \vdots & \ddots & \vdots\\
\Cov{X(r_1, t_1), X(r_n, t_n)} & \Cov{X(r_2, t_2), X(r_n, t_n)} & \ldots & \Var{X(r_n, t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(r_0, t_0), X(r_1, t_1)}\\
\Cov{X(r_0, t_0), X(r_2, t_2)}\\ \vdots \\ \Cov{X(r_0, t_0), X(r_n, t_n)}\end{array}\right]
\end{eqnarray}\]
- looks a lot like a linear smoother
- best choice of weights \(\mathbf{w}\) comes from the variances and covariances (sketched in code below)
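A sketch of this interpolation formula in code. The notes leave the covariances unspecified, so the exponentially-decaying covariance function below is purely an assumption for illustration:

```python
import numpy as np

def interpolation_weights(obs_points, target_point, cov_fn):
    """beta = Var(X at observed points)^{-1} Cov(X at observed points, X at target)."""
    K = np.array([[cov_fn(p, q) for q in obs_points] for p in obs_points])
    k0 = np.array([cov_fn(p, target_point) for p in obs_points])
    return np.linalg.solve(K, k0)

# Assumed covariance: decays exponentially with distance (illustrative choice only).
exp_cov = lambda p, q, scale=1.0: np.exp(-np.linalg.norm(np.subtract(p, q)) / scale)

obs = [(0.0,), (1.0,), (2.5,)]                       # observation locations (1-D for simplicity)
print(interpolation_weights(obs, (1.2,), exp_cov))   # weight concentrates near the target point
```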
Predicting one variable from another
- Given: values of variable \(U\) at many points, \(U(r_1, t_1), \ldots U(r_n, t_n)\)
- Desired: estimate of \(X\) at point \((r_0, t_0)\), \(X\neq U\)
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), U(r_2, t_2), \ldots U(r_n, t_n)]\\
\end{eqnarray}\]
- Need to find covariances of the \(U\)s with each other, and their covariances with \(X\)
Predicting one variable from 2+ others
- Given: values of two variables \(U\), \(V\) at many points
- Desired: estimate of \(X\) at one point
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), V(r_1, t_1), U(r_2, t_2), V(r_2, t_2), \ldots U(r_n, t_n), V(r_n, t_n)]
\end{eqnarray}\]
- Need to find covariances of \(U\)s and \(V\)s with each other, and with \(X\)
Optimal prediction depends on variances and covariances
so how do we get these?
- Repeat the experiment many times
- OR make assumptions
  - E.g., some covariances should be the same
  - E.g., covariances should change smoothly in time or space
  - E.g., covariances should follow a particular model
Summing up
- We can always decide to use a linear predictor, \(m(\vec{Z}) = \alpha + \vec{\beta} \cdot \vec{Z}\)
- The optimal linear predictor of \(Y\) from \(\vec{Z}\) always takes the same form: \[
m(\vec{Z}) = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}}\right) \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
- Doing linear prediction requires finding the covariances
- Next few lectures: how to find and use covariances over time, over space, over both
Gory details for multivariate predictors
\[\begin{eqnarray}
m(\vec{Z}) & = & a + \vec{b} \cdot \vec{Z}\\
(\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{Z}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{Z}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{Z})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{Z})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{Z}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{Z} \otimes \vec{Z}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{Z}} + 2a\vec{b}\cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0:
\[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{Z}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
just like when \(Z\) was univariate
Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{Z}} + 2 \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta} + 2 \alpha \Expect{\vec{Z}} & = & 0\\
\Expect{Y\vec{Z}} - \alpha\Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Expect{Y\vec{Z}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}) \Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Cov{Y,\vec{Z}} & = & \Var{\vec{Z}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,Z}/\Var{Z}\) when \(Z\) is univariate
Gory details: the PCA view
The factor of \(\Var{\vec{Z}}^{-1}\) rotates and scales \(\vec{Z}\) to uncorrelated, unit-variance variables
\[\begin{eqnarray}
\Var{\vec{Z}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
\Var{\vec{Z}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
\Var{\vec{Z}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T\\
\vec{U} & \equiv & \vec{Z} \Var{\vec{Z}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\vec{Z}\cdot\vec{\beta} & = & \vec{Z} \cdot \Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\\
& = & \vec{Z} \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T \Cov{\vec{Z}, Y}\\
& = & \vec{U} \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
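A sketch of this whitening view in Python (simulation and variable names are illustrative): transform \(\vec{Z}\) with an inverse square root of its variance matrix, check that the result has (approximately) identity variance, and check that the two routes to the prediction agree.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
Z = rng.standard_normal((100_000, 3)) @ A.T          # correlated predictors
y = Z @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(100_000)

var_Z = np.cov(Z, rowvar=False, bias=True)
eigval, W = np.linalg.eigh(var_Z)                    # var_Z = W diag(eigval) W^T
inv_sqrt = W @ np.diag(eigval ** -0.5)               # a square root of var_Z^{-1}
U = (Z - Z.mean(axis=0)) @ inv_sqrt                  # whitened predictors
print(np.cov(U, rowvar=False, bias=True).round(3))   # ~ identity matrix

cov_Zy = np.array([np.cov(Z[:, j], y, bias=True)[0, 1] for j in range(3)])
cov_Uy = np.array([np.cov(U[:, j], y, bias=True)[0, 1] for j in range(3)])
beta = np.linalg.solve(var_Z, cov_Zy)
print(np.allclose((Z - Z.mean(axis=0)) @ beta, U @ cov_Uy))   # same prediction, two routes
```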
Estimation I: “plug-in”
- We don’t see the true expectations, variances, covariances
- But we can have sample/empirical values
- One estimate of the optimal linear predictor: plug in the sample values
so for univariate \(Z\),
\[
\EstRegFunc(z) = \overline{y} + \frac{\widehat{\Cov{Y,Z}}}{\widehat{\Var{Z}}}(z-\overline{z})
\]
Estimation II: ordinary least squares
- We don’t see the true expected squared error, but we do have the sample mean
- Minimize that
- Leads to exactly the same results as the plug-in approach (see the sketch below)!
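A quick numerical check of that claim (simulated data, illustrative only): the plug-in coefficients and an ordinary-least-squares fit coincide.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.uniform(-2.0, 2.0, size=1_000)
y = np.exp(z / 2.0) + 0.3 * rng.standard_normal(z.size)

beta_hat = np.cov(y, z, bias=True)[0, 1] / np.var(z)    # plug-in slope
alpha_hat = y.mean() - beta_hat * z.mean()              # plug-in intercept

X = np.column_stack([np.ones_like(z), z])               # design matrix with intercept
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)        # ordinary least squares
print(alpha_hat, beta_hat)
print(ols_coef)                                         # identical up to round-off
```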
When does OLS/plug-in work?
- Jointly sufficient conditions:
  - Sample means converge on the expectations
  - Sample covariances converge on the true covariances
  - Sample variances converge on the true, invertible variance matrix
- Then, by continuity, the OLS coefficients converge on the true \(\vec{\beta}\)
- This can all happen even when everything is dependent on everything else!
Square root of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
  - Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
  - \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
  - Just like every positive real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\) (sketched in code below)
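A sketch of that last recipe (all names illustrative): build one square root from the eigendecomposition, verify it, then check that rotating it by an orthogonal matrix gives another square root.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
c = A @ A.T                                        # symmetric, positive definite

eigval, W = np.linalg.eigh(c)                      # c = W diag(eigval) W^T
d = W @ np.diag(np.sqrt(eigval))                   # one square root: w Lambda^{1/2}
print(np.allclose(d @ d.T, c))                     # True

o, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random orthogonal matrix
print(np.allclose((d @ o) @ (d @ o).T, c))         # True: d o is another square root
```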