The Truth About Linear Regression
36-462/36-662, Spring 2020
16 January 2020
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\newcommand{\OptLinPred}{m}
\newcommand{\EstLinPred}{\hat{m}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
Context
- We want to find patterns in our data which will let us predict one variable from another
- The oldest and most widely-used prediction method is linear regression
- We’re going to review linear regression, as a prediction method
- Without a lot of the mythology from 401
- But with an eye to more powerful methods, starting next week
Optimal prediction in general
- We want to predict a numerical random variable \(Y\)
- We want a one-number (“point”) prediction, not a range or a distribution
- We say how good our guess is by expected squared error
Optimal prediction in general (cont’d.)
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\
& = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\
& = & \argmin_m{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
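A quick numerical check of this claim (a minimal sketch with simulated data; the seed, the variable names, and the gamma distribution are just for illustration):

```r
# The constant that minimizes mean squared error on a sample is the sample mean
set.seed(462)                               # illustrative seed, not from the lecture
y <- rgamma(1e4, shape = 2, scale = 3)      # deliberately non-Gaussian
mse <- function(m) mean((y - m)^2)
optimize(mse, interval = range(y))$minimum  # numerically-found best constant guess...
mean(y)                                     # ...agrees with the sample mean
```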
Optimal prediction in general (cont’d.)
What’s the best function of \(X\) to guess for \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(X))^2}}\\
& = & \argmin_{m}{\Expect{\Expect{(Y-m(X))^2|X}}}
\end{eqnarray}\]
For each \(x\), best \(m(x)\) is \(\Expect{Y|X=x}\)
\[
\TrueRegFunc(x) = \Expect{Y|X=x}
\]
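As a sanity check by simulation (a sketch; the sine-shaped truth and all names are made up for illustration), when we know the true conditional mean by construction, no other predictor we try does better in mean squared error:

```r
# The true conditional mean mu(x) = E[Y|X=x] beats other predictors on MSE
set.seed(462)
n <- 1e5
x <- runif(n, -3, 3)
mu <- function(x) sin(x) + 0.5 * x          # true regression function, by construction
y <- mu(x) + rnorm(n, sd = 0.5)
mean((y - mu(x))^2)                         # ~ 0.25, the noise variance
mean((y - (0.2 + 0.9 * x))^2)               # some other (linear) guess: worse
mean((y - mean(y))^2)                       # the best constant guess: worse still
```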
Optimal prediction in general (cont’d.)
Learning arbitrary functions is hard!
Who knows what the right function might be?
What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
\OptLinPred(x) = a + b x
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bX))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bX))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bX)} + \Expect{(a+bX)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YX} +\\
& & a^2 + 2 ab \Expect{X} + b^2 \Expect{X^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bX))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{X} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{X}
\end{eqnarray}\]
\(\therefore\) optimal linear predictor \(m(X) = \alpha+\beta X\) looks like \[\begin{eqnarray}
m(X) & = & \alpha + \beta X\\
& = & \Expect{Y} - \beta\Expect{X} + \beta X\\
& = & \Expect{Y} + \beta(X-\Expect{X})
\end{eqnarray}\]
The optimal linear predictor only cares about how far \(X\) is from its expectation \(\Expect{X}\).
And when \(X=\Expect{X}\), we will always predict \(\Expect{Y}\).
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bX))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YX} + 2\alpha \Expect{X} + 2\beta \Expect{X^2} = 0\\
0 & = & -\Expect{YX} + (\Expect{Y} - \beta\Expect{X})\Expect{X} + \beta\Expect{X^2} \\
0 & = & \Expect{Y}\Expect{X} - \Expect{YX} + \beta(\Expect{X^2} - \Expect{X}^2)\\
0 & = & -\Cov{Y,X} + \beta \Var{X}\\
\beta & = & \frac{\Cov{Y,X}}{\Var{X}}
\end{eqnarray}\]
Notice: if we replace \(X\) with \(X' = X-\Expect{X}\), \(\beta\) doesn’t change
Notice: if we replace \(Y\) with \(Y' = Y-\Expect{Y}\), \(\beta\) doesn’t change
\(\therefore\) centering the variables doesn’t change the slope
The optimal linear predictor of \(Y\) from \(X\)
The optimal linear predictor of \(Y\) from a single \(X\) is always
\[
\alpha + \beta X = \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (X - \Expect{X})
\]
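A worked example with made-up numbers (not from the lecture): if \(\Expect{Y} = 10\), \(\Expect{X} = 2\), \(\Var{X} = 4\), and \(\Cov{X,Y} = 6\), then \[
\OptLinPred(X) = 10 + \frac{6}{4}(X - 2) = 7 + 1.5 X
\] no matter what the actual relationship between \(Y\) and \(X\) looks like.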
What did we not assume?
- That the true relationship between \(Y\) and \(X\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
NONE OF THAT MATTERS for the optimal linear predictor
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-\OptLinPred(X)} & = & \Expect{Y - (\Expect{Y} + \beta(X-\Expect{X}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{X} - \Expect{X}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-\OptLinPred(X)|X} \neq 0\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\
& = & \Var{Y - \beta X}\\
\end{eqnarray}\]
In-class exercise: finish this! Answer in terms of \(\Var{Y}\), \(\Var{X}\), \(\Cov{Y,X}\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\
& = & \Var{Y - \beta X}\\
& = & \Var{Y} + \beta^2\Var{X} - 2\beta\Cov{Y,X}
\end{eqnarray}\]
but \(\beta = \Cov{Y,X}/\Var{X}\) so
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y} + \frac{\Cov{Y,X}^2}{\Var{X}} - 2\frac{\Cov{Y,X}^2}{\Var{X}}\\
& = & \Var{Y} - \frac{\Cov{Y,X}^2}{\Var{X}}\\
& < & \Var{Y} \text{ unless } \Cov{Y,X} = 0
\end{eqnarray}\]
\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
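A numerical check of this identity (a sketch with simulated data, using sample quantities in place of the true ones; nothing here is linear or Gaussian on purpose):

```r
# Errors of the (plug-in) optimal linear predictor: mean ~ 0 and
# variance = Var(Y) - Cov(Y,X)^2 / Var(X)
set.seed(462)
n <- 1e5
x <- rexp(n)
y <- log(1 + x^2) + rt(n, df = 5)           # nonlinear truth, heavy-tailed noise
beta <- cov(y, x) / var(x)
alpha <- mean(y) - beta * mean(x)
err <- y - (alpha + beta * x)
mean(err)                                   # essentially 0
var(err)                                    # matches the formula:
var(y) - cov(y, x)^2 / var(x)
```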
Multivariate case
- We try to predict \(Y\) from a whole bunch of variables
- Bundle those predictor variables into \(\vec{X}\)
- We need a vector of slope coefficients \(\vec{\beta}\)
- Solution:
\[
\OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X} = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
and
\[
\Var{Y-\OptLinPred(\vec{X})} = \Var{Y} - \Cov{Y,\vec{X}}^T \Var{\vec{X}}^{-1} \Cov{Y,\vec{X}}
\]
(Gory details in the back-up slides)
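A sample-based check of these two formulas (a sketch; the data-generating process is invented, and deliberately not linear):

```r
# beta = Var(X)^{-1} Cov(X,Y) and the error-variance formula, with sample
# variances/covariances standing in for the true ones
set.seed(462)
n <- 1e5
X <- cbind(x1 = rnorm(n), x2 = runif(n), x3 = rexp(n))
y <- exp(0.3 * X[, "x1"]) - X[, "x2"] * X[, "x3"] + rnorm(n)
beta <- solve(var(X), cov(X, y))            # Var(X)^{-1} Cov(X, Y)
m <- mean(y) + scale(X, scale = FALSE) %*% beta
var(y - m)                                  # matches:
var(y) - t(cov(X, y)) %*% beta
```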
What we don’t assume, again
- Anything about the distributions of \(Y\) or \(\vec{X}\)
- That the linear predictor is correct
- That anything is Gaussian
Estimation I: “plug-in”
- We don’t see the true expectations, variances, covariances
- But we can have sample/empirical values
- One estimate of the optimal linear predictor: plug in the sample values
so for univariate \(X\),
\[
\EstLinPred(x) = \overline{y} + \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}}(x-\overline{x})
\]
Estimation II: ordinary least squares
- We don’t see the true expected squared error, but we do have the sample mean
- Minimize that
- Leads to exactly the same results as plug-in approach!
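A quick check of that claim in the univariate case (a sketch with invented data; the square-root truth is there just to emphasize that the linear model is wrong here):

```r
# Plug-in coefficients vs. ordinary least squares: the same numbers
set.seed(462)
x <- runif(200, 0, 10)
y <- sqrt(x) + rnorm(200, sd = 0.3)         # the true relationship is not linear
beta.hat <- cov(y, x) / var(x)
alpha.hat <- mean(y) - beta.hat * mean(x)
c(intercept = alpha.hat, slope = beta.hat)
coef(lm(y ~ x))                             # identical
```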
When does OLS/plug-in work?
- “Work” = converge on the optimal linear predictor
- Jointly sufficient conditions:
- Sample means converge on expectation values
- Sample covariances converge on true covariance
- Sample variances converge on true, invertible variance
- Then by continuity OLS coefficients converge on true \(\beta\)
- None of this requires that the linear model is right, that anything is Gaussian, etc.
What do the estimates look like?
- Bundle all the regressor values into an \(n\times (p+1)\) matrix \(\mathbf{x}\), with an extra column of 1s
- Bundle all the regressand values into an \(n\times 1\) matrix \(\mathbf{y}\)
- Then the \((p+1)\times 1\) matrix of least-squares coefficients \(\hat{\mathbf{\beta}}\) is \[
\hat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\] (as you learned in linear models)
- This is really \[\begin{eqnarray}
\hat{\mathbf{\beta}} & = & {\left(\frac{1}{n}\mathbf{x}^T\mathbf{x}\right)}^{-1} \left(\frac{1}{n}\mathbf{x}^T \mathbf{y}\right)\\
& = & (\mathrm{sample\ variance\ matrix\ of\ regressors})^{-1} (\mathrm{sample\ covariances\ between\ regressors\ and\ response})
\end{eqnarray}\]
What do the predictions look like?
- The prediction at an arbitrary point \(\vec{x} = (x_1, x_2, \ldots x_p)\) is \[
\EstLinPred(\vec{x}) = \begin{bmatrix} 1 & x_1 & x_2 & \ldots & x_p \end{bmatrix} \hat{\beta} = \begin{bmatrix} 1 & x_1 & x_2 & \ldots & x_p \end{bmatrix} \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\right)
\]
- The \(n\times 1\) matrix of fitted values, at the training points, is \[
\mathbf{\EstLinPred} = \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\]
- The matrix \(\mathbf{h} \equiv \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\) is the weight matrix, influence matrix or hat matrix and says how much \(y_j\) matters to the prediction of \(y_i\)
Fitted values and other predictions are weighted sums of the observations
\[\begin{eqnarray}
\EstLinPred(\vec{x}) & = & \vec{x} \hat{\beta} = \vec{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\
\mathbf{\EstLinPred} & = & \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\end{eqnarray}\]
- Every prediction we make is linear in \(\mathbf{y}\): it’s a weighted sum of all the observed values of the response
- How much weight do we put on \(y_i\) when we’re predicting at \(\vec{x}\)?
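A small sketch of these weights in R (the data, the regressors, and the new point are all made up for illustration):

```r
# The hat matrix, and the weights behind a prediction at a new point
set.seed(462)
n <- 20
x <- cbind(1, runif(n), rnorm(n))           # design matrix, including the column of 1s
y <- x[, 2] - 2 * x[, 3] + rnorm(n)
h <- x %*% solve(t(x) %*% x) %*% t(x)       # the hat matrix
all.equal(as.vector(h %*% y),               # fitted values are weighted sums of the y's
          unname(fitted(lm(y ~ x[, 2] + x[, 3]))))
w <- c(1, 0.5, -1) %*% solve(t(x) %*% x) %*% t(x)  # weights for predicting at (1, 0.5, -1)
sum(w * y)                                  # that prediction, as a weighted sum of the y's
```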
Generalizing: linear smoothers
- Linear regression is a special case of a linear smoother \[
\EstRegFunc(\vec{x}) = \sum_{i=1}^{n}{w(\vec{x}, \vec{x}_i) y_i}
\]
- Notice: Linear in the \(y\)’s, not in \(\vec{x}\)
- General idea: predict at \(\vec{x}\) by finding similar \(\vec{x}_i\)’s and averaging their \(y_i\)’s
- Different choices of \(w\) are different ideas about “similar” and about weighted averaging (see the sketch after this list)
- Most prediction methods used in practice are linear smoothers
- Nearest neighbors (next week)
- Trees and forests (week after next)
- Kernels (two weeks from now)
- Neural networks, a.k.a. “deep learning”, a.k.a. “artificial intelligence” (take another course)
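As a sketch of this template (not the lecture’s own code; the function names are invented), here is \(k\)-nearest-neighbor regression written explicitly as a linear smoother, with weight \(1/k\) on the \(k\) training points closest to \(x\) and weight 0 on the rest:

```r
# k-nearest-neighbor regression as a linear smoother:
# prediction at x0 is sum_i w(x0, x_i) y_i, with weight 1/k on the k nearest x_i
knn.weights <- function(x0, x, k = 5) {
  w <- numeric(length(x))
  w[order(abs(x - x0))[1:k]] <- 1 / k       # equal weight on the k nearest neighbors
  w
}
knn.predict <- function(x0, x, y, k = 5) sum(knn.weights(x0, x, k) * y)
# Tiny usage example on simulated data
set.seed(462)
x <- sort(runif(100, 0, 10)); y <- sin(x) + rnorm(100, sd = 0.2)
knn.predict(5, x, y, k = 7)                 # roughly sin(5)
```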
What about the rest of your linear models course?
- Say you want to test whether \(\beta_{37} = 1\), or to give a confidence set for \((\alpha, \beta_{12})\)
- That’s hard to do explicitly without more assumptions
- Generally, you need to know the sampling distribution of \(\mathbf{\hat{\beta}}\)
What about the rest of your linear models course? (cont’d)
- The usual (“401”) assumptions give nice formulas for the sampling distribution and so for inference:
1. The true regression function is exactly linear.
2. \(Y=\alpha + \vec{X} \cdot \vec{\beta} + \epsilon\), where \(\epsilon\) is independent of \(\vec{X}\).
3. \(\epsilon\) is independent across observations.
4. \(\epsilon \sim \mathcal{N}(0,\sigma^2)\).
- Assuming (1)–(4), the usual formulas (embedded in R) are exactly right
- Gaussian distribution for \(\hat{\mathbf{\beta}}\), centered at true \(\mathbf{\beta}\)
- Assuming (1)–(3), with possibly non-Gaussian noise, and assuming the influence of any one observation \(\rightarrow 0\), the usual formulas hold as \(n\rightarrow\infty\)
- Get back the Gaussian distribution by a (complicated) central limit theorem
- Do not use those formulas without checking those assumptions
The most important assumption to check
- If the true regression function is exactly linear, then \[
\Expect{Y-(\alpha + \vec{X} \cdot \vec{\beta})|\vec{X}=\vec{x}} = 0
\] for all \(\vec{x}\)
- So \[
\Expect{Y-(\hat{\alpha} + \vec{X} \cdot \hat{\vec{\beta}})|\vec{X}=\vec{x}} \approx 0
\]
- The residuals should have conditional mean 0 everywhere if the linear model is right
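One way to do the check in practice (a sketch; the quadratic truth and all names are illustrative): plot the residuals against the fitted values, or against each regressor, and add a smooth curve, which should stay close to zero if the linear model is right.

```r
# Residual check: does the conditional mean of the residuals look like 0 everywhere?
set.seed(462)
x <- runif(500, 0, 4)
y <- x^2 + rnorm(500)                       # the truth is deliberately not linear
fit <- lm(y ~ x)
scatter.smooth(fitted(fit), residuals(fit),
               xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = "dashed")               # here the smooth curve bends away from 0
```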
Summing up
- We can always decide to use a linear predictor, \(\OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X}\)
- The optimal linear predictor of \(Y\) from \(\vec{X}\) always takes the same form: \[
\OptLinPred(\vec{X}) = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
- Doing linear prediction requires finding those covariances
- We usually estimate them (implicitly) by ordinary least squares, which converges pretty robustly to the optimal linear predictor (not to the truth)
- Once we use OLS, all our predictions are weighted sums of the observations
- The weights for linear regression are weird and implausible
- We’re going to explore other ways of weighting
- The usual “401” assumptions are needed for inference, not prediction
Backup: Further reading
- See Berk (2008), chapter 1, and Hastie, Tibshirani, and Friedman (2009), sections 2.3.1 and 2.6 and chapter 3 (especially through section 3.4), for more on linear regression as a prediction method
- If you want more from the point of view taken here, see Shalizi (n.d.), chapter 2; if you want a lot more, see Shalizi (2015)
- If you need to do inference about the coefficients but the usual assumptions don’t hold, your best bet is the bootstrap (covered in 402, later in this course, and in some excellent textbooks like Davison and Hinkley (1997)); some of the complicated procedures developed in econometrics for “robust standard errors” in linear models turn out to be equivalent to simple bootstraps (Buja et al. 2014)
Backup: Gory details for multivariate predictors
\[\begin{eqnarray}
\OptLinPred(\vec{X}) & = & a + \vec{b} \cdot \vec{X}\\
(\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{X}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{X}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{X})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{X})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{X}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{X} \otimes \vec{X}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{X}} + 2a\vec{b}\cdot \Expect{\vec{X}}
\end{eqnarray}\]
Backup: Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0:
\[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{X}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}
\end{eqnarray}\]
just like when \(X\) was univariate
Backup: Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{X}} + 2 \Expect{\vec{X} \otimes \vec{X}} \vec{\beta} + 2 \alpha \Expect{\vec{X}} & = & 0\\
\Expect{Y\vec{X}} - \alpha\Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Expect{Y\vec{X}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}) \Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Cov{Y,\vec{X}} & = & \Var{\vec{X}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{X}})^{-1} \Cov{Y,\vec{X}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,X}/\Var{X}\) when \(X\) is univariate
Backup: Gory details: the PCA view
The factor of \(\Var{\vec{X}}^{-1}\) rotates and scales \(\vec{X}\) to uncorrelated, unit-variance variables
\[\begin{eqnarray}
\Var{\vec{X}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T\\
\vec{U} & \equiv & \vec{X} \Var{\vec{X}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\vec{X}\cdot\vec{\beta} & = & \vec{X} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\
& = & \vec{X} \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T \Cov{\vec{X}, Y}\\
& = & \vec{U} \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
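A numerical version of the argument above, using sample quantities (a sketch; the correlated regressors and the response are invented):

```r
# Whitening X with a square root of Var(X)^{-1}: Var(U) ~ I and X.beta = U . Cov(U,Y)
set.seed(462)
n <- 1e5
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X <- cbind(X, x3 = 0.7 * X[, "x1"] + 0.5 * rnorm(n))      # correlated columns
y <- X[, "x1"]^2 - X[, "x2"] + rnorm(n)
eig <- eigen(var(X), symmetric = TRUE)
V.inv.sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values))  # a square root of Var(X)^{-1}
U <- X %*% V.inv.sqrt
round(var(U), 3)                            # ~ identity matrix
beta <- solve(var(X), cov(X, y))
all.equal(as.vector(X %*% beta), as.vector(U %*% cov(U, y)))
```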
Backup: Square root of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
- Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
- Just like every real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\)
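A tiny numerical illustration (the matrices here are invented):

```r
# A square root of a symmetric positive-definite matrix via its eigendecomposition,
# and the fact that rotating a square root gives another square root
c.mat <- matrix(c(2, 1, 1, 3), 2, 2)
eig <- eigen(c.mat, symmetric = TRUE)
d <- eig$vectors %*% diag(sqrt(eig$values)) # one square root: w Lambda^{1/2}
all.equal(d %*% t(d), c.mat)                # recovers c.mat
o <- matrix(c(cos(1), sin(1), -sin(1), cos(1)), 2, 2)  # an orthogonal (rotation) matrix
all.equal((d %*% o) %*% t(d %*% o), c.mat)  # still a square root
```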
Backup/Aside: \(R^2\) is useless
- \(R^2 \equiv \Var{\mathrm{fitted\ values}}/\Var{Y}\)
- Suppose we know \(Y=\beta X+\epsilon\), with \(\Var{\epsilon} = \sigma^2\) and \(\epsilon\) independent of \(X\)
- Then \(R^2 = \frac{\beta^2 \Var{X}}{\beta^2 \Var{X} + \sigma^2}\)
- Consequence 1: \(R^2\) can be arbitrarily close to 0, even for the completely correct model (shrink \(\Var{X}\) and/or grow \(\sigma^2\))
- Consequence 2: If we make \(\Var{X}\) bigger, and the linear model is right, \(R^2\) will grow towards 1 (unless \(\beta = 0\)); \(R^2\) measures the range of the regressor, not goodness of fit (see the demo after this list)
- Suppose the linear model is wrong but we use it anyway; \(R^2 = \beta^2\Var{X}/\Var{Y}\)
- Consequence 3: \(R^2\) can be arbitrarily close to 1 even if every single assumption of the linear model is badly, obviously broken
- Conclusion: \(R^2\) tells us nothing about how good or bad our model is that we didn’t already know from the mean squared error; you’d be better off forgetting about it
- More: Shalizi (2015), chapter 10
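A quick simulation of Consequences 1 and 2 (a sketch; the numbers are invented): the model is exactly right every time, with the same coefficients and the same noise level, and only the spread of \(X\) changes.

```r
# Same correct linear model, same noise; R^2 is driven by the spread of X
set.seed(462)
r.squared <- function(x.sd) {
  x <- rnorm(1e4, sd = x.sd)
  y <- 2 + 3 * x + rnorm(1e4, sd = 5)       # exactly the model assumed by lm()
  summary(lm(y ~ x))$r.squared
}
sapply(c(0.1, 1, 10), r.squared)            # roughly 0.004, 0.26, 0.97
```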
References
Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.
Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2014. “Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression.” arxiv:1404.1578. http://arxiv.org/abs/1404.1578.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.