The Truth About Linear Regression

36-462/36-662, Spring 2020

16 January 2020

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\OptLinPred}{m} \newcommand{\EstLinPred}{\hat{m}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]

Context

Optimal prediction in general

Optimal prediction in general (cont’d.)

What’s the best constant guess for a random variable \(Y\)?

\[\begin{eqnarray} \TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\ & = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\ & = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\ & = & \argmin_m{ (\Expect{Y} - m)^2}\\ & = & \Expect{Y} \end{eqnarray}\]
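As a quick sanity check, here is a small Python/NumPy sketch (the distribution of \(Y\) is made up for illustration): scan over constant guesses, and the sample mean comes out as the minimizer of the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(462)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # any made-up distribution will do

guesses = np.linspace(y.min(), y.max(), 2001)
mse = np.array([np.mean((y - m) ** 2) for m in guesses])

print(guesses[np.argmin(mse)])   # (essentially) the sample mean
print(y.mean())
```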

Optimal prediction in general (cont’d.)

What’s the best function of \(X\) to guess for \(Y\)?

\[\begin{eqnarray} \TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(X))^2}}\\ & = & \argmin_{m}{\Expect{\Expect{(Y-m(X))^2|X}}} \end{eqnarray}\]

For each \(x\), best \(m(x)\) is \(\Expect{Y|X=x}\)

\[ \TrueRegFunc(x) = \Expect{Y|X=x} \]
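A small simulation sketch (Python/NumPy; the model \(Y = \sin(X) + \text{noise}\) is invented for illustration): the true conditional mean beats any other function of \(X\) in mean squared error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)   # here mu(x) = E[Y|X=x] = sin(x)

mu = np.sin(x)     # the conditional expectation
other = 0.8 * x    # some other function of x

print(np.mean((y - mu) ** 2))     # ~ 0.25, the irreducible noise variance
print(np.mean((y - other) ** 2))  # strictly larger
```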

Optimal prediction in general (cont’d.)

Learning arbitrary functions is hard!

Who knows what the right function might be?

What if we decide to make our predictions linear?

Optimal linear prediction with univariate predictor

Our prediction will be of the form \[ \OptLinPred(x) = a + b x \] and we want the best \(a, b\)

Optimal linear prediction, univariate case

\[ (\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bX))^2}} \]

Expand out that expectation, then take derivatives and set them to 0

The intercept

\[\begin{eqnarray} \Expect{(Y-(a+bX))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bX)} + \Expect{(a+bX)^2}\\ & = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YX} +\\ & & a^2 + 2 ab \Expect{X} + b^2 \Expect{X^2}\\ \left. \frac{\partial}{\partial a}\Expect{(Y-(a+bX))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{X} = 0\\ \alpha & = & \Expect{Y} - \beta\Expect{X} \end{eqnarray}\]

\(\therefore\) optimal linear predictor \(m(X) = \alpha+\beta X\) looks like \[\begin{eqnarray} m(X) & = & \alpha + \beta X\\ & = & \Expect{Y} - \beta\Expect{X} + \beta X\\ & = & \Expect{Y} + \beta(X-\Expect{X}) \end{eqnarray}\] The optimal linear predictor only cares about how far \(X\) is from its expectation \(\Expect{X}\). And when \(X=\Expect{X}\), we will always predict \(\Expect{Y}\).

The slope

\[\begin{eqnarray} \left. \frac{\partial}{\partial b}\Expect{(Y-(a+bX))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YX} + 2\alpha \Expect{X} + 2\beta \Expect{X^2} = 0\\ 0 & = & -\Expect{YX} + (\Expect{Y} - \beta\Expect{X})\Expect{X} + \beta\Expect{X^2} \\ 0 & = & \Expect{Y}\Expect{X} - \Expect{YX} + \beta(\Expect{X^2} - \Expect{X}^2)\\ 0 & = & -\Cov{Y,X} + \beta \Var{X}\\ \beta & = & \frac{\Cov{Y,X}}{\Var{X}} \end{eqnarray}\]

Notice: if we replace \(X\) with \(X' = X-\Expect{X}\), \(\beta\) doesn’t change
Notice: if we replace \(Y\) with \(Y' = Y-\Expect{Y}\), \(\beta\) doesn’t change
\(\therefore\) centering the variables doesn’t change the slope

The optimal linear predictor of \(Y\) from \(X\)

The optimal linear predictor of \(Y\) from a single \(X\) is always

\[ \alpha + \beta X = \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (X - \Expect{X}) \]
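A Python/NumPy sketch (the data-generating process is invented, and deliberately nonlinear): compute \(\beta = \Cov{Y,X}/\Var{X}\) and \(\alpha = \Expect{Y} - \beta\Expect{X}\) from the sample moments; nudging either coefficient can only increase the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500_000)
y = np.exp(x / 2) + rng.normal(scale=0.3, size=x.size)   # true regression is NOT linear

beta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)   # Cov[Y,X] / Var[X]
alpha = y.mean() - beta * x.mean()              # E[Y] - beta E[X]

def mse(a, b):
    return np.mean((y - (a + b * x)) ** 2)

print(mse(alpha, beta))            # the minimum over linear predictors of these data
print(mse(alpha + 0.05, beta))     # perturbing either coefficient only makes things worse
print(mse(alpha, beta * 1.05))
```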

What did we not assume?

NONE OF THAT MATTERS for the optimal linear predictor

The prediction errors average out to zero

\[\begin{eqnarray} \Expect{Y-\OptLinPred(X)} & = & \Expect{Y - (\Expect{Y} + \beta(X-\Expect{X}))}\\ & = & \Expect{Y} - \Expect{Y} - \beta(\Expect{X} - \Expect{X}) = 0 \end{eqnarray}\]

The prediction errors are uncorrelated with \(X\)

\[\begin{eqnarray} \Cov{X, Y-\OptLinPred(X)} & = & \Expect{X(Y-\OptLinPred(X))} ~\text{(by previous slide)}\\ & = & \Expect{X(Y - \Expect{Y} - \frac{\Cov{Y,X}}{\Var{X}}(X-\Expect{X}))}\\ & = & \Expect{XY - X\Expect{Y} - \frac{\Cov{Y,X}}{\Var{X}}(X^2) + \frac{\Cov{Y,X}}{\Var{X}} (X \Expect{X})}\\ & = & \Expect{XY} - \Expect{X}\Expect{Y} - \frac{\Cov{Y,X}}{\Var{X}}\Expect{X^2} + \frac{\Cov{Y,X}}{\Var{X}} (\Expect{X})^2\\ & = & \Cov{X,Y} - \frac{\Cov{Y,X}}{\Var{X}}(\Var{X})\\ & = & 0 \end{eqnarray}\]

The prediction errors are uncorrelated with \(X\)

Alternate take:

\[\begin{eqnarray} \Cov{X, Y-\OptLinPred(X)} & = & \Cov{X, Y} - \Cov{X, \alpha + \beta X}\\ & = & \Cov{Y,X} - \Cov{X, \beta X}\\ & = & \Cov{Y,X} - \beta\Cov{X,X}\\ & = & \Cov{Y,X} - \beta\Var{X}\\ & = & \Cov{Y,X} - \Cov{Y,X} = 0 \end{eqnarray}\]
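Both properties are easy to check numerically; a Python/NumPy sketch on made-up simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500_000)
y = np.exp(x / 2) + rng.normal(scale=0.3, size=x.size)   # made-up, nonlinear truth

beta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
errors = y - (alpha + beta * x)

print(errors.mean())            # ~ 0: the errors average out to zero
print(np.cov(errors, x)[0, 1])  # ~ 0: the errors are uncorrelated with x
```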

How big are the prediction errors?

\[\begin{eqnarray} \Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\ & = & \Var{Y - \beta X}\\ \end{eqnarray}\]

In-class exercise: finish this! Answer in terms of \(\Var{Y}\), \(\Var{X}\), \(\Cov{Y,X}\)

How big are the prediction errors?

\[\begin{eqnarray} \Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\ & = & \Var{Y - \beta X}\\ & = & \Var{Y} + \beta^2\Var{X} - 2\beta\Cov{Y,X} \end{eqnarray}\]

but \(\beta = \Cov{Y,X}/\Var{X}\) so

\[\begin{eqnarray} \Var{Y-\OptLinPred(X)} & = & \Var{Y} + \frac{\Cov{Y,X}^2}{\Var{X}} - 2\frac{\Cov{Y,X}^2}{\Var{X}}\\ & = & \Var{Y} - \frac{\Cov{Y,X}^2}{\Var{X}}\\ & < & \Var{Y} ~\text{unless}~ \Cov{Y,X} = 0 \end{eqnarray}\]

\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
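A numerical check of this variance formula (Python/NumPy sketch, with data simulated just for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=500_000)
y = 2.0 * x + rng.normal(scale=1.5, size=x.size)   # made-up data for illustration

beta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
errors = y - (alpha + beta * x)

lhs = np.var(errors, ddof=1)
rhs = np.var(y, ddof=1) - np.cov(y, x)[0, 1] ** 2 / np.var(x, ddof=1)
print(lhs, rhs)           # essentially identical
print(np.var(y, ddof=1))  # and both are smaller than Var[Y]
```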

Multivariate case

\[ \OptLinPred(\vec{X}) = \alpha+ \vec{\beta} \cdot \vec{X} = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}}) \]

and

\[ \Var{Y-\OptLinPred(\vec{X})} = \Var{Y} - \Cov{Y,\vec{X}}^T \Var{\vec{X}}^{-1} \Cov{Y,\vec{X}} \]

(Gory details in the back-up slides)
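A sketch of the multivariate formulas in Python/NumPy (the correlated design and coefficients are invented for illustration); `np.linalg.solve` stands in for multiplying by \(\Var{\vec{X}}^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200_000, 3
# made-up correlated predictors and coefficients, purely for illustration
X = rng.normal(size=(n, p)) @ np.array([[1.0, 0.3, 0.0],
                                        [0.0, 1.0, 0.5],
                                        [0.0, 0.0, 1.0]])
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

var_X = np.cov(X, rowvar=False)                                # Var[X], p x p
cov_Xy = np.cov(np.column_stack([X, y]), rowvar=False)[:p, p]  # Cov[X, Y], length p

beta = np.linalg.solve(var_X, cov_Xy)       # Var[X]^{-1} Cov[X, Y]
alpha = y.mean() - X.mean(axis=0) @ beta    # E[Y] - beta . E[X]

# Var[Y - m(X)] = Var[Y] - Cov[Y,X]^T Var[X]^{-1} Cov[Y,X]
err_var = np.var(y, ddof=1) - cov_Xy @ beta
print(alpha, beta)   # close to the coefficients used above
print(err_var)       # close to 1, the noise variance in this made-up example
```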

What we don’t assume, again

Estimation I: “plug-in”

Replace the population means, variances, and covariances with their sample versions; so for univariate \(X\),

\[ \EstLinPred(x) = \overline{y} + \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}}(x-\overline{x}) \]
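A Python/NumPy sketch of the plug-in estimate (data simulated for illustration). Note that the \(n-1\) (or \(n\)) denominators in \(\widehat{\Cov{Y,X}}\) and \(\widehat{\Var{X}}\) cancel, so the plug-in slope coincides with the OLS slope from `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.7 * x + rng.normal(size=x.size)   # made-up data for illustration

beta_hat = np.cov(y, x)[0, 1] / np.var(x, ddof=1)   # plug-in slope
alpha_hat = y.mean() - beta_hat * x.mean()          # plug-in intercept

def m_hat(x_new):
    """Plug-in linear predictor."""
    return y.mean() + beta_hat * (x_new - x.mean())

slope_ols, intercept_ols = np.polyfit(x, y, deg=1)  # OLS fit, for comparison
print(alpha_hat, beta_hat)
print(intercept_ols, slope_ols)   # identical up to rounding: the denominators cancel
print(m_hat(5.0))
```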

Estimation II: ordinary least squares

When does OLS/plug-in work?

  1. Sample means converge on expectation values
  2. Sample covariances converge on true covariance
  3. Sample variances converge on true, invertible variance

What do the estimates look like?

What do the predictions look like?

Fitted values and other predictions are weighted sums of the observations

\[\begin{eqnarray} \EstLinPred(\vec{x}) & = & \vec{x} \hat{\beta} = \vec{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\ \mathbf{\EstLinPred} & = & \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y} \end{eqnarray}\]
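A Python/NumPy sketch (small simulated design, invented for illustration) of the hat matrix \(\mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\): each fitted value is a weighted sum of all the observed \(y\)'s.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
# small made-up design matrix (intercept column plus two predictors)
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = x @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# hat matrix: fitted values are H @ y, so each one is a weighted sum of all the y's
H = x @ np.linalg.inv(x.T @ x) @ x.T
fitted = H @ y

beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]   # OLS coefficients
print(np.allclose(fitted, x @ beta_hat))          # True: H @ y are the OLS fitted values
print(H[0])            # the weights that produce the first fitted value
print(np.trace(H))     # trace of the hat matrix = number of coefficients (3 here)
```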

Explicit form of the weights for OLS

Generalizing: linear smoothers

What about the rest of your linear models course?

What about the rest of your linear models course? (cont’d)

  1. The true regression function is exactly linear.
  2. \(Y=\alpha + \vec{X} \cdot \vec{\beta} + \epsilon\) where \(\epsilon\) is independent of \(\vec{X}\).
  3. \(\epsilon\) is independent across observations.
  4. \(\epsilon \sim \mathcal{N}(0,\sigma^2)\).

The most important assumption to check

Summing up

Backup: Further reading

Backup: Gory details for multivariate predictors

\[\begin{eqnarray} \OptLinPred(\vec{X}) & = & a + \vec{b} \cdot \vec{X}\\ (\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{X}))^2}}\\ \Expect{(Y-(a+\vec{b}\cdot \vec{X}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{X})^2}\\ \nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{X})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{X}}\\ & = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{X} \otimes \vec{X}} \vec{b} \\ \nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{X}} + 2a\vec{b}\cdot \Expect{\vec{X}}\\ \end{eqnarray}\]

Backup: Gory details: the intercept

Take derivative w.r.t. \(a\), set to 0:

\[\begin{eqnarray} 0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{X}} + 2\alpha \\ \alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}\\ \end{eqnarray}\]

just like when \(X\) was univariate

Backup: Gory details: the slopes

\[\begin{eqnarray} -2 \Expect{Y\vec{X}} + 2 \Expect{\vec{X} \otimes \vec{X}} \vec{\beta} + 2 \alpha \Expect{\vec{X}} & = & 0\\ \Expect{Y\vec{X}} - \alpha\Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\ \Expect{Y\vec{X}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}) \Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\ \Cov{Y,\vec{X}} & = & \Var{\vec{X}} \vec{\beta}\\ \vec{\beta} & = & (\Var{\vec{X}})^{-1} \Cov{Y,\vec{X}} \end{eqnarray}\]

Reduces to \(\Cov{Y,X}/\Var{X}\) when \(X\) is univariate

Backup: Gory details: the PCA view

The factor of \(\Var{\vec{X}}^{-1}\) rotates and scales \(\vec{X}\) to uncorrelated, unit-variance variables

\[\begin{eqnarray} \Var{\vec{X}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\ \Var{\vec{X}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\ \Var{\vec{X}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\ & = & \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T\\ \vec{U} & \equiv & \vec{X} \Var{\vec{X}}^{-1/2}\\ \Var{\vec{U}} & = & \mathbf{I}\\ \vec{X}\cdot\vec{\beta} & = & \vec{X} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\ & = & \vec{X} \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T \Cov{\vec{X}, Y}\\ & = & \vec{U} \Cov{\vec{U}, Y}\\ \end{eqnarray}\]
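A Python/NumPy sketch of this whitening step (simulated, correlated predictors invented for illustration), using `np.linalg.eigh` for the eigendecomposition \(\mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500_000
# made-up correlated predictors, purely for illustration
X = rng.normal(size=(n, 3)) @ np.array([[2.0, 0.5, 0.0],
                                        [0.0, 1.0, 0.8],
                                        [0.0, 0.0, 0.5]])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

var_X = np.cov(X, rowvar=False)
lam, w = np.linalg.eigh(var_X)            # Var[X] = w diag(lam) w^T
root_inv = w @ np.diag(lam ** -0.5)       # Var[X]^{-1/2} = w Lambda^{-1/2}

Xc = X - X.mean(axis=0)
U = Xc @ root_inv                              # rotated and scaled predictors
print(np.round(np.cov(U, rowvar=False), 3))    # ~ identity: uncorrelated, unit variance

cov_Xy = np.cov(np.column_stack([X, y]), rowvar=False)[:3, 3]
cov_Uy = np.cov(np.column_stack([U, y]), rowvar=False)[:3, 3]
beta = np.linalg.solve(var_X, cov_Xy)
print(np.allclose(Xc @ beta, U @ cov_Uy))   # X . beta equals U . Cov[U, Y]
```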

Backup: Square root of a matrix

Backup/Aside: \(R^2\) is useless

References

Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.

Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2014. “Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression.” arxiv:1404.1578. http://arxiv.org/abs/1404.1578.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.

Shalizi, Cosma Rohilla. 2015. “The Truth About Linear Regression.” Online Manuscript. http://www.stat.cmu.edu/~cshalizi/TALR.

———. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.