Optimal Linear Prediction
36-467/667
20 September 2020 (Lecture 7)
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our previous episodes
- Linear smoothers
- Predictions are linear combinations of the data
- How to choose the weights?
- PCA
- Use correlations to break the data into additive components
Today: use correlations to do prediction
Optimal prediction in general
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m \in \mathbb{R}}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m \in \mathbb{R}}{\left(\Var{(Y-m)} + (\Expect{Y-m})^2\right)}\\
& = & \argmin_{m \in \mathbb{R}}{\left(\Var{Y} + (\Expect{Y} - m)^2\right)}\\
& = & \argmin_{m \in \mathbb{R}}{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
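A minimal Python check of this, using a made-up non-Gaussian distribution: scan a grid of candidate constants \(m\) and see that the mean squared error bottoms out at (about) the sample mean.
```python
# Toy check that the MSE-minimizing constant guess is E[Y].
# The gamma distribution, sample size, and seed are arbitrary choices.
import numpy as np

rng = np.random.default_rng(467)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # deliberately non-Gaussian; E[Y] = 6

grid = np.linspace(y.min(), y.max(), 1001)          # candidate constant guesses m
mse = np.array([np.mean((y - m)**2) for m in grid])
print(grid[np.argmin(mse)], y.mean())               # the minimizer is ~ the sample mean
```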
Optimal prediction in general
- Now we get a covariate \(Z\) which takes values in some arbitrary space \(\mathcal{Z}\)
- What’s the best function of \(Z\) to guess for \(Y\)? \[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m: ~ \mathcal{Z} \mapsto \mathbb{R}}{\Expect{(Y-m(Z))^2}}\\
& = & \argmin_{m: ~ \mathcal{Z} \mapsto \mathbb{R}}{\Expect{\Expect{(Y-m(Z))^2|Z}}}\\
& = & \argmin_{m: ~ \mathcal{Z} \mapsto \mathbb{R}}{\left( \int_{\mathcal{Z}}{\Expect{(Y-m(z))^2|Z=z} p(z) dz}\right)}
\end{eqnarray}\]
For each \(z \in \mathcal{Z}\), best \(m(z)\) is \(\Expect{Y|Z=z}\) (by previous slide), so \[
\TrueRegFunc(z) = \Expect{Y|Z=z}
\]
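A quick simulation sketch, with an assumed toy process where \(\Expect{Y|Z=z} = z^2\): the conditional mean has smaller mean squared error than other guesses.
```python
# Toy check that the conditional mean beats other functions of Z in MSE.
# The data-generating process (Y = Z^2 + noise) is assumed for illustration.
import numpy as np

rng = np.random.default_rng(467)
z = rng.uniform(-2, 2, size=200_000)
y = z**2 + rng.standard_normal(z.size)   # so E[Y|Z=z] = z^2

mse = lambda pred: np.mean((y - pred)**2)
print(mse(z**2))                         # conditional mean: ~1, the noise variance
print(mse(np.full_like(z, y.mean())))    # best constant guess: larger
print(mse(z))                            # an arbitrary linear guess: larger still
```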
Optimal prediction in general
- Learning arbitrary functions is hard!
- Who knows what the right function might be?
- What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
m(z) = a + b z
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a \in \mathbb{R}, b \in \mathbb{R}}{\Expect{(Y-(a+bZ))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bZ))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bZ)} + \Expect{(a+bZ)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YZ} +\\
& & a^2 + 2 ab \Expect{Z} + b^2 \Expect{Z^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bZ))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{Z} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{Z}
\end{eqnarray}\]
Remember: optimal linear predictor is \(\alpha + \beta Z\)
\(\therefore\) optimal linear predictor looks like \[
\Expect{Y} + \beta(Z-\Expect{Z})
\] \(\Rightarrow\) centering \(Z\) and/or \(Y\) won’t change the slope
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bZ))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YZ} + 2\alpha \Expect{Z} + 2\beta \Expect{Z^2} = 0\\
0 & = & -\Expect{YZ} + (\Expect{Y} - \beta\Expect{Z})\Expect{Z} + \beta\Expect{Z^2} \\
0 & = & \Expect{Y}\Expect{Z} - \Expect{YZ} + \beta(\Expect{Z^2} - \Expect{Z}^2)\\
0 & = & -\Cov{Y,Z} + \beta \Var{Z}\\
\beta & = & \frac{\Cov{Y,Z}}{\Var{Z}}
\end{eqnarray}\]
The optimal linear predictor of \(Y\) from \(Z\)
The optimal linear predictor of \(Y\) from a single \(Z\) is always
\[
\alpha + \beta Z = \Expect{Y} + \left(\frac{\Cov{Z,Y}}{\Var{Z}}\right) (Z - \Expect{Z})
\]
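A sketch of computing \(\alpha\) and \(\beta\) from sample moments, on a made-up example that is nonlinear, heteroskedastic, and non-Gaussian; the formula does not care.
```python
# Optimal linear predictor from sample moments; the toy process below is assumed.
import numpy as np

rng = np.random.default_rng(467)
z = rng.exponential(scale=1.0, size=200_000)            # non-Gaussian predictor
y = np.sqrt(z) + (1 + z) * rng.standard_normal(z.size)  # nonlinear, heteroskedastic

C = np.cov(y, z)                     # 2x2 sample covariance matrix
beta = C[0, 1] / C[1, 1]             # Cov(Y,Z) / Var(Z)
alpha = y.mean() - beta * z.mean()   # E[Y] - beta E[Z]
print(alpha, beta)
```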
What did we not assume?
- That the true relationship between \(Y\) and \(Z\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
- NONE OF THAT MATTERS for the optimal linear predictor
A little worked example (I)
- We see \(X(r_1, t_1)\), for short \(X_1\)
- We want to guess \(X(r_0, t_0)\), for short \(X_0\)
- Assume: \(\Expect{X_0} = \Expect{X_1} = \mu\)
- Assume: \(\Var{X_0} = \Var{X_1} = \sigma^2\)
- Assume: \(\Cov{X_0, X_1} = \gamma\)
We know: \[
m(Z) = \Expect{Y} + \frac{\Cov{Y, Z}}{\Var{Z}}(Z - \Expect{Z})
\]
Substituting in: \[
m(X_1) = \mu + \frac{\gamma}{\sigma^2}(X_1 - \mu)
\]
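A quick numeric check with made-up values \(\mu = 10\), \(\sigma^2 = 4\), \(\gamma = 2\):
```python
# Worked example (I) with made-up numbers.
mu, sigma2, gamma = 10.0, 4.0, 2.0
x1 = 14.0
print(mu + (gamma / sigma2) * (x1 - mu))   # 12.0: pulled halfway from mu toward x1
```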
Some general properties of the optimal linear predictor
- The prediction errors average out to zero
- The prediction errors are uncorrelated with \(Z\)
- The variance of the prediction errors \(\leq\) the variance of \(Y\)
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-m(Z)} & = & \Expect{Y - (\Expect{Y} + \beta(Z-\Expect{Z}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{Z} - \Expect{Z}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-m(Z)|Z} \neq 0\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\
& = & \Var{Y - \beta Z}\\
& = & \Var{Y} + \beta^2\Var{Z} - 2\beta\Cov{Y,Z}
\end{eqnarray}\]
but \(\beta = \Cov{Y,Z}/\Var{Z}\) so
\[\begin{eqnarray}
\Var{Y-m(Z)} & = & \Var{Y} + \frac{(\Cov{Y,Z})^2}{\Var{Z}} - 2\frac{(\Cov{Y,Z})^2}{\Var{Z}}\\
& = & \Var{Y} - \frac{(\Cov{Y,Z})^2}{\Var{Z}}\\
& < & \Var{Y} ~ \text{unless}\ \Cov{Y,Z} = 0
\end{eqnarray}\]
\(\Rightarrow\) Optimal linear predictor is (almost) always better than nothing…
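A simulation sketch, on an assumed toy process whose true regression is not linear, checking all three properties at once:
```python
# Errors average to zero, are uncorrelated with Z, and have reduced variance.
import numpy as np

rng = np.random.default_rng(467)
z = rng.standard_normal(500_000)
y = np.exp(z / 2) + rng.standard_normal(z.size)   # true regression is not linear

C = np.cov(y, z)
beta = C[0, 1] / C[1, 1]
alpha = y.mean() - beta * z.mean()
err = y - (alpha + beta * z)

print(err.mean())                                   # ~ 0
print(np.cov(err, z)[0, 1])                         # ~ 0: uncorrelated with Z
print(err.var(), y.var() - C[0, 1]**2 / C[1, 1])    # ~ equal, and both < Var(Y)
```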
Multivariate case
We try to predict \(Y\) from a whole bunch of variables
Bundle those predictor variables into \(\vec{Z}\)
Solution:
\[
m(\vec{Z}) = \alpha+\vec{\beta}\cdot \vec{Z} = \Expect{Y} + (\Var{\vec{Z}})^{-1} \Cov{\vec{Z},Y} \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
and
\[
\Var{Y-m(\vec{Z})} = \Var{Y} - \Cov{Y,\vec{Z}}^T (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}}
\]
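A minimal numpy sketch of this formula, on made-up correlated predictors and a made-up nonlinear truth:
```python
# beta = Var(Z)^{-1} Cov(Z, Y), computed from sample moments.
import numpy as np

rng = np.random.default_rng(467)
n, p = 200_000, 3
Z = rng.standard_normal((n, p)) @ np.array([[1.0, 0.5, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 1.0]])   # correlated columns
y = np.sin(Z[:, 0]) + Z[:, 1]**2 + rng.standard_normal(n)       # nonlinear truth

Zc = Z - Z.mean(axis=0)
yc = y - y.mean()
var_Z = Zc.T @ Zc / n                    # Var(Z), p x p
cov_Zy = Zc.T @ yc / n                   # Cov(Z, Y), length p
beta = np.linalg.solve(var_Z, cov_Zy)    # avoids forming the inverse explicitly
alpha = y.mean() - beta @ Z.mean(axis=0)
print(alpha, beta)
```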
What we don’t assume, again
- That anything is Gaussian
- Anything else about the distributions of \(Y\) or \(\vec{Z}\)
- That the linear predictor is correct
Some possible contexts
- Interpolating or extrapolating one variable over space and/or time
- Predicting one variable from another
- Predicting one variable from 2+ others
- Prediction for \(X(r_0, t_0)\) is a linear combination of \(X\) at other points \[\begin{eqnarray}
m(r_0, t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(r_1, t_1) \\ X(r_2, t_2) \\ \vdots \\ X(r_n, t_n) \end{array}\right]\\
\alpha & = & \Expect{X(r_0, t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(r_1, t_1)}\\ \Expect{X(r_2, t_2)} \\ \vdots \\ \Expect{X(r_n, t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\
\vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(r_1, t_1)} & \Cov{X(r_1, t_1), X(r_2, t_2)} & \ldots & \Cov{X(r_1, t_1), X(r_n, t_n)}\\
\Cov{X(r_1, t_1), X(r_2, t_2)} & \Var{X(r_2, t_2)} & \ldots & \Cov{X(r_2, t_2), X(r_n, t_n)}\\
\vdots & \vdots & \ldots & \vdots\\
\Cov{X(r_1, t_1), X(r_n, t_n)} & \Cov{X(r_2, t_2), X(r_n, t_n)} & \ldots & \Var{X(r_n, t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(r_0, t_0), X(r_1, t_1)}\\
\Cov{X(r_0, t_0), X(r_2, t_2)}\\ \vdots \\ \Cov{X(r_0, t_0), X(r_n, t_n)}\end{array}\right]
\end{eqnarray}\]
- looks a lot like a linear smoother
- specifically \(\vec{\beta}\) looks like one row of \(\mathbf{w}\)
- best choice of weights \(\mathbf{w}\) comes from variances and covariances
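A sketch of how an assumed covariance model turns into these weights; the exponential covariance function, the 1-D coordinates, and all numbers below are made up for illustration.
```python
# From an assumed covariance function to prediction weights for X(r_0).
import numpy as np

coords = np.array([0.0, 1.0, 2.5, 4.0])   # observation locations r_1 .. r_n (1-D)
r0 = 1.7                                   # where we want a prediction
mu, sigma2, ell = 5.0, 2.0, 1.5            # assumed mean, variance, correlation length

def cov(r, s):                             # assumed covariance model
    return sigma2 * np.exp(-np.abs(r - s) / ell)

V = cov(coords[:, None], coords[None, :])  # Var of the observed X's (n x n)
c0 = cov(coords, r0)                       # Cov of each observed X with X(r_0)
beta = np.linalg.solve(V, c0)              # the weights: one row of w

x = np.array([4.2, 5.8, 5.1, 4.7])         # pretend observed values
m0 = mu + beta @ (x - mu)                  # the prediction
print(beta, m0)
```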
A little worked example (II)
- We see \(X(r_1, t_1)\) and \(X(r_2, t_2)\), for short \(X_1\) and \(X_2\)
- We want to predict \(X(r_0, t_0)\), for short \(X_0\)
- Assume: \(\Expect{X(r, t)} = \mu\) for all \(r,t\)
- Assume: \(\Var{X(r,t)} = \sigma^2\) ditto
- Assume: \(\Cov{X_1, X_0} = \Cov{X_2, X_0} = \gamma\)
- Assume: \(\Cov{X_1, X_2} = \rho\)
Work out \(\vec{\beta}\) (off-line!) and get \[\begin{eqnarray}
m(x_1, x_2) = \mu + \frac{\gamma}{\sigma^2 + \rho}\left( (x_1 - \mu) + (x_2 - \mu)\right)
\end{eqnarray}\] vs. with one predictor \[
\mu + \frac{\gamma}{\sigma^2}(x_1 - \mu)
\]
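Numeric check, with made-up values for \(\sigma^2\), \(\gamma\), \(\rho\): solving the \(2 \times 2\) system directly gives the same answer as the closed form \(\gamma/(\sigma^2 + \rho)\).
```python
# Worked example (II): solve the system vs. the closed form.
import numpy as np

sigma2, gamma, rho = 4.0, 2.0, 1.0
V = np.array([[sigma2, rho], [rho, sigma2]])
c = np.array([gamma, gamma])
beta = np.linalg.solve(V, c)
print(beta, gamma / (sigma2 + rho))   # both components equal 0.4
```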
Predicting one variable from another
- Given: values of variable \(U\) at many points, \(U(r_1, t_1), \ldots U(r_n, t_n)\)
- Desired: estimate of \(X\) at point \((r_0, t_0)\), \(X\neq U\)
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), U(r_2, t_2), \ldots U(r_n, t_n)]\\
\end{eqnarray}\]
- Need to find covariances of the \(U\)s with each other, and their covariances with \(X\)
Predicting one variable from 2+ others
- Given: values of two variables \(U\), \(V\) at many points
- Desired: estimate of \(X\) at one point
\[\begin{eqnarray}
Y & = & X(r_0, t_0)\\
\vec{Z} & = & [U(r_1, t_1), V(r_1, t_1), U(r_2, t_2), V(r_2, t_2), \ldots U(r_n, t_n), V(r_n, t_n)]
\end{eqnarray}\]
- Need to find covariances of \(U\)s and \(V\)s with each other, and with \(X\)
Optimal prediction depends on variances and covariances
so how do we get these?
- Repeat the experiment many times
- OR make assumptions
- E.g., some covariances should be the same
- E.g., covariances should change smoothly in time or space
- E.g., covariances should follow a particular model
Summing up
- We can always decide to use a linear predictor, \(m(\vec{Z}) = \alpha + \vec{\beta} \cdot \vec{Z}\)
- The optimal linear predictor of \(Y\) from \(\vec{Z}\) always takes the same form: \[
m(\vec{Z}) = \Expect{Y} + (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}} \cdot (\vec{Z} - \Expect{\vec{Z}})
\]
- Doing linear prediction requires finding the covariances
- Next few lectures: how to find and use covariances over time, over space, over both
Backup: Gory details for multivariate predictors
\[\begin{eqnarray}
m(\vec{Z}) & = & a + \vec{b} \cdot \vec{Z}\\
(\alpha, \vec{\beta}) & = & \argmin_{a \in \mathbb{R}, \vec{b} \in \mathbb{R}^n}{\Expect{(Y-(a + \vec{b} \cdot \vec{Z}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{Z}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{Z})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{Z})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{Z}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{Z} \otimes \vec{Z}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{Z}} + 2a\vec{b}\cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
(\(\vec{u} \otimes \vec{v}\) is the outer product, the square matrix where \((\vec{u} \otimes \vec{v})_{ij} = u_i v_j\))
Backup: Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0 at \(a=\alpha\), \(\vec{b}=\vec{\beta}\):
\[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{Z}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}\\
\end{eqnarray}\]
just like when \(Z\) was univariate
Backup: Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{Z}} + 2 \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta} + 2 \alpha \Expect{\vec{Z}} & = & 0\\
\Expect{Y\vec{Z}} - \alpha\Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Expect{Y\vec{Z}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}) \Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\
\Expect{Y\vec{Z}} - \Expect{Y}\Expect{\vec{Z}} & = & \left(\Expect{\vec{Z} \otimes \vec{Z}} - \Expect{\vec{Z}} \otimes \Expect{\vec{Z}}\right) \vec{\beta}\\
\Cov{Y,\vec{Z}} & = & \Var{\vec{Z}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,Z}/\Var{Z}\) when \(Z\) is univariate
Backup: Gory details: the PCA view
The factor of \((\Var{\vec{Z}})^{-1}\) rotates and scales \(\vec{Z}\) to uncorrelated, unit-variance variables
- Start with the eigendecomposition of \(\Var{\vec{Z}}\): \[\begin{eqnarray}
\Var{\vec{Z}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
(\Var{\vec{Z}})^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
(\Var{\vec{Z}})^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & (\Var{\vec{Z}})^{-1/2} \left((\Var{\vec{Z}})^{-1/2}\right)^T\\
\end{eqnarray}\]
- (For the idea of the square root of a matrix, see further backup)
- Use this to motivate defining new, uncorrelated, unit-variance variables: \[\begin{eqnarray}
\vec{U} & \equiv & \vec{Z} \Var{\vec{Z}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\end{eqnarray}\]
Now replace \(\vec{Z}\) in the linear predictor with \(\vec{U}\) and see how simple it is: \[\begin{eqnarray}
\vec{Z}\cdot\vec{\beta} & = & \vec{Z} \cdot (\Var{\vec{Z}})^{-1} \Cov{\vec{Z}, Y}\\
& = & \vec{Z} \Var{\vec{Z}}^{-1/2} \cdot \left(\Var{\vec{Z}}^{-1/2}\right)^T \Cov{\vec{Z}, Y}\\
& = & \vec{U} \cdot \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
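A sketch of the whitening step on toy data: build \((\Var{\vec{Z}})^{-1/2}\) from the eigendecomposition and check that the new variables have (sample) identity covariance.
```python
# Whitening Z with Var(Z)^{-1/2}; the toy data below is assumed.
import numpy as np

rng = np.random.default_rng(467)
n = 200_000
Z = rng.standard_normal((n, 3)) @ np.array([[2.0, 0.7, 0.3],
                                            [0.0, 1.0, 0.5],
                                            [0.0, 0.0, 1.5]])

Zc = Z - Z.mean(axis=0)
var_Z = Zc.T @ Zc / n
lam, w = np.linalg.eigh(var_Z)            # Var(Z) = w diag(lam) w^T
root_inv = w @ np.diag(lam ** -0.5)       # one version of Var(Z)^{-1/2}
U = Zc @ root_inv                         # the new variables
print(np.round(U.T @ U / n, 3))           # ~ identity: uncorrelated, unit variance
```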
Backup: Estimation I: “plug-in”
- We don’t see the true expectations, variances, covariances
- But we can have sample/empirical values
- One estimate of the optimal linear predictor: plug in the sample values
so for univariate \(Z\), \[
\hat{m}(z) = \overline{y} + \frac{\widehat{\Cov{Y,Z}}}{\widehat{\Var{Z}}}(z-\overline{z})
\]
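A sketch of the plug-in estimator on a made-up toy sample:
```python
# Plug-in estimate of the optimal linear predictor, univariate Z.
import numpy as np

rng = np.random.default_rng(467)
z = rng.standard_normal(1_000)
y = 3 - 2 * z + rng.standard_normal(z.size)   # toy truth

C = np.cov(y, z)                  # sample covariance matrix
beta_hat = C[0, 1] / C[1, 1]      # sample Cov(Y,Z) / sample Var(Z)

def m_hat(znew):
    return y.mean() + beta_hat * (znew - z.mean())

print(m_hat(0.0), m_hat(1.0))     # ~ 3 and ~ 1 for this toy truth
```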
Backup: Estimation II: ordinary least squares
- We don’t see the true expected squared error, but we do have the sample mean squared error
- Minimize that
- Leads to exactly the same results as plug-in approach!
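A quick check on toy data that the two approaches coincide; the ratio of sample covariance to sample variance doesn't depend on the \(n\) vs. \(n-1\) convention, so the match is exact up to rounding.
```python
# OLS (minimize sample MSE) vs. plug-in, on made-up toy data.
import numpy as np

rng = np.random.default_rng(467)
z = rng.standard_normal(1_000)
y = 3 - 2 * z + rng.standard_normal(z.size)

C = np.cov(y, z)
beta_plug = C[0, 1] / C[1, 1]
alpha_plug = y.mean() - beta_plug * z.mean()

A = np.column_stack([np.ones_like(z), z])            # design matrix with intercept
alpha_ols, beta_ols = np.linalg.lstsq(A, y, rcond=None)[0]
print(alpha_plug - alpha_ols, beta_plug - beta_ols)  # both ~ 0, up to rounding
```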
Backup: Estimation: When does OLS/plug-in work?
- Jointly sufficient conditions:
- Sample means converge on expectation values
- Sample covariances converge on true covariance
- Sample variances converge on true, invertible variance
- Then by continuity OLS coefficients converge on true \(\beta\)
- This can all happen even when everything is dependent on everything else!
Backup: Square roots of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
- Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
- Just like every positive real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\)
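A sketch, with a made-up symmetric positive-definite matrix, of the eigendecomposition square root and of the non-uniqueness under orthogonal rotation:
```python
# Matrix square roots via the eigendecomposition.
import numpy as np

c = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])            # made-up symmetric positive-definite matrix
lam, w = np.linalg.eigh(c)                 # c = w diag(lam) w^T
d = w @ np.diag(np.sqrt(lam))              # one square root of c
print(np.allclose(d @ d.T, c))             # True

o, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((3, 3)))  # orthogonal o
print(np.allclose((d @ o) @ (d @ o).T, c))  # also True: d o is another square root
```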