The Truth About Linear Regression
36-462/36-662, Spring 2020
16 January 2020
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\newcommand{\OptLinPred}{m}
\newcommand{\EstLinPred}{\hat{m}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
Context
- We want to find patterns in our data which will let us predict one variable from another
- The oldest and most widely-used prediction method is linear regression
- We’re going to review linear regression, as a prediction method
- Without a lot of the mythology from 401
- But with an eye to more powerful methods, starting next week
Optimal prediction in general
- We want to predict a numerical random variable \(Y\)
- We want a one-number (“point”) prediction, not a range or a distribution
- We say how good our guess is by expected squared error
Optimal prediction in general (cont’d.)
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\
& = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\
& = & \argmin_m{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
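A quick numerical check of this claim (a minimal sketch with simulated data; the seed, the variable names, and the gamma distribution are just for illustration):

```r
# The constant that minimizes mean squared error on a sample is the sample mean
set.seed(462)                               # illustrative seed, not from the lecture
y <- rgamma(1e4, shape = 2, scale = 3)      # deliberately non-Gaussian
mse <- function(m) mean((y - m)^2)
optimize(mse, interval = range(y))$minimum  # numerically-found best constant guess...
mean(y)                                     # ...agrees with the sample mean
```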
Optimal prediction in general (cont’d.)
What’s the best function of \(X\) to guess for \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(X))^2}}\\
& = & \argmin_{m}{\Expect{\Expect{(Y-m(X))^2|X}}}
\end{eqnarray}\]
For each \(x\), best \(m(x)\) is \(\Expect{Y|X=x}\)
\[
\TrueRegFunc(x) = \Expect{Y|X=x}
\]
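As a sanity check by simulation (a sketch; the sine-shaped truth and all names are made up for illustration), when we know the true conditional mean by construction, no other predictor we try does better in mean squared error:

```r
# The true conditional mean mu(x) = E[Y|X=x] beats other predictors on MSE
set.seed(462)
n <- 1e5
x <- runif(n, -3, 3)
mu <- function(x) sin(x) + 0.5 * x          # true regression function, by construction
y <- mu(x) + rnorm(n, sd = 0.5)
mean((y - mu(x))^2)                         # ~ 0.25, the noise variance
mean((y - (0.2 + 0.9 * x))^2)               # some other (linear) guess: worse
mean((y - mean(y))^2)                       # the best constant guess: worse still
```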
Optimal prediction in general (cont’d.)
Learning arbitrary functions is hard!
Who knows what the right function might be?
What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
\OptLinPred(x) = a + b x
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bX))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bX))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bX)} + \Expect{(a+bX)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YX} +\\
& & a^2 + 2 ab \Expect{X} + b^2 \Expect{X^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bX))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{X} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{X}
\end{eqnarray}\]
\(\therefore\) optimal linear predictor \(m(X) = \alpha+\beta X\) looks like \[\begin{eqnarray}
m(X) & = & \alpha + \beta X\\
& = & \Expect{Y} - \beta\Expect{X} + \beta X\\
& = & \Expect{Y} + \beta(X-\Expect{X})
\end{eqnarray}\]
The optimal linear predictor only cares about how far \(X\) is from its expectation \(\Expect{X}\).
And when \(X=\Expect{X}\), we will always predict \(\Expect{Y}\).
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bX))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YX} + 2\alpha \Expect{X} + 2\beta \Expect{X^2} = 0\\
0 & = & -\Expect{YX} + (\Expect{Y} - \beta\Expect{X})\Expect{X} + \beta\Expect{X^2} \\
0 & = & \Expect{Y}\Expect{X} - \Expect{YX} + \beta(\Expect{X^2} - \Expect{X}^2)\\
0 & = & -\Cov{Y,X} + \beta \Var{X}\\
\beta & = & \frac{\Cov{Y,X}}{\Var{X}}
\end{eqnarray}\]
Notice: if we replace \(X\) with \(X' = X-\Expect{X}\), \(\beta\) doesn’t change
Notice: if we replace \(Y\) with \(Y' = Y-\Expect{Y}\), \(\beta\) doesn’t change
\(\therefore\) centering the variables doesn’t change the slope
The optimal linear predictor of \(Y\) from \(X\)
The optimal linear predictor of \(Y\) from a single \(X\) is always
\[
\alpha + \beta X = \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (X - \Expect{X})
\]
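A worked example with made-up numbers (not from the lecture): if \(\Expect{Y} = 10\), \(\Expect{X} = 2\), \(\Var{X} = 4\), and \(\Cov{X,Y} = 6\), then \[
\OptLinPred(X) = 10 + \frac{6}{4}(X - 2) = 7 + 1.5 X
\] no matter what the actual relationship between \(Y\) and \(X\) looks like.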
What did we not assume?
- That the true relationship between \(Y\) and \(X\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
NONE OF THAT MATTERS for the optimal linear predictor
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-\OptLinPred(X)} & = & \Expect{Y - (\Expect{Y} + \beta(X-\Expect{X}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{X} - \Expect{X}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-\OptLinPred(X)|X} \neq 0\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\
& = & \Var{Y - \beta X}\\
\end{eqnarray}\]
In-class exercise: finish this! Answer in terms of \(\Var{Y}\), \(\Var{X}\), \(\Cov{Y,X}\)
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\
& = & \Var{Y - \beta X}\\
& = & \Var{Y} + \beta^2\Var{X} - 2\beta\Cov{Y,X}
\end{eqnarray}\]
but \(\beta = \Cov{Y,X}/\Var{X}\) so
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y} + \frac{\Cov{Y,X}^2}{\Var{X}} - 2\frac{\Cov{Y,X}^2}{\Var{X}}\\
& = & \Var{Y} - \frac{\Cov{Y,X}^2}{\Var{X}}\\
& < & \Var{Y} \text{ unless } \Cov{Y,X} = 0
\end{eqnarray}\]
\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
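A numerical check of this identity (a sketch with simulated data, using sample quantities in place of the true ones; nothing here is linear or Gaussian on purpose):

```r
# Errors of the (plug-in) optimal linear predictor: mean ~ 0 and
# variance = Var(Y) - Cov(Y,X)^2 / Var(X)
set.seed(462)
n <- 1e5
x <- rexp(n)
y <- log(1 + x^2) + rt(n, df = 5)           # nonlinear truth, heavy-tailed noise
beta <- cov(y, x) / var(x)
alpha <- mean(y) - beta * mean(x)
err <- y - (alpha + beta * x)
mean(err)                                   # essentially 0
var(err)                                    # matches the formula:
var(y) - cov(y, x)^2 / var(x)
```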
Multivariate case
- We try to predict \(Y\) from a whole bunch of variables
- Bundle those predictor variables into \(\vec{X}\)
- We need a vector of slope coefficients \(\vec{\beta}\)
- Solution:
\[
\OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X} = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
and
\[
\Var{Y-\OptLinPred(\vec{X})} = \Var{Y} - \Cov{Y,\vec{X}}^T \Var{\vec{X}}^{-1} \Cov{Y,\vec{X}}
\]
(Gory details in the back-up slides)
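A sample-based check of these two formulas (a sketch; the data-generating process is invented, and deliberately not linear):

```r
# beta = Var(X)^{-1} Cov(X,Y) and the error-variance formula, with sample
# variances/covariances standing in for the true ones
set.seed(462)
n <- 1e5
X <- cbind(x1 = rnorm(n), x2 = runif(n), x3 = rexp(n))
y <- exp(0.3 * X[, "x1"]) - X[, "x2"] * X[, "x3"] + rnorm(n)
beta <- solve(var(X), cov(X, y))            # Var(X)^{-1} Cov(X, Y)
m <- mean(y) + scale(X, scale = FALSE) %*% beta
var(y - m)                                  # matches:
var(y) - t(cov(X, y)) %*% beta
```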
What we don’t assume, again
- Anything about the distributions of \(Y\) or \(\vec{X}\)
- That the linear predictor is correct
- That anything is Gaussian
Estimation I: “plug-in”
- We don’t see the true expectations, variances, covariances
- But we can have sample/empirical values
- One estimate of the optimal linear predictor: plug in the sample values
so for univariate \(X\),
\[
\EstLinPred(x) = \overline{y} + \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}}(x-\overline{x})
\]
Estimation II: ordinary least squares
- We don’t see the true expected squared error, but we do have the sample mean
- Minimize that
- Leads to exactly the same results as plug-in approach!
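A quick check of that claim in the univariate case (a sketch with invented data; the square-root truth is there just to emphasize that the linear model is wrong here):

```r
# Plug-in coefficients vs. ordinary least squares: the same numbers
set.seed(462)
x <- runif(200, 0, 10)
y <- sqrt(x) + rnorm(200, sd = 0.3)         # the true relationship is not linear
beta.hat <- cov(y, x) / var(x)
alpha.hat <- mean(y) - beta.hat * mean(x)
c(intercept = alpha.hat, slope = beta.hat)
coef(lm(y ~ x))                             # identical
```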
When does OLS/plug-in work?
- “Work” = converge on the optimal linear predictor
- Jointly sufficient conditions:
- Sample means converge on expectation values
- Sample covariances converge on true covariance
- Sample variances converge on true, invertible variance
- Then by continuity OLS coefficients converge on true \(\beta\)
- None of this requires that the linear model is right, that anything is Gaussian, etc.
What do the estimates look like?
- Bundle all the regressor values into an \(n\times (p+1)\) matrix \(\mathbf{x}\), with an extra column of 1s
- Bundle all the regressand values into an \(n\times 1\) matrix \(\mathbf{y}\)
- Then the \((p+1)\times 1\) matrix of least-squares coefficients \(\hat{\mathbf{\beta}}\) is \[
\hat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\] (as you learned in linear models)
- This is really \[\begin{eqnarray}
\hat{\mathbf{\beta}} & = & {\left(\frac{1}{n}\mathbf{x}^T\mathbf{x}\right)}^{-1} \left(\frac{1}{n}\mathbf{x}^T \mathbf{y}\right)\\
& = & (\mathrm{sample\ variance\ matrix\ of\ regressors})^{-1} (\mathrm{sample\ covariances\ between\ regressors\ and\ response})
\end{eqnarray}\]
What do the predictions look like?
- The prediction at an arbitrary point \(\vec{x} = (x_1, x_2, \ldots x_p)\) is \[
\EstLinPred(\vec{x}) = \begin{bmatrix} 1 & x_1 & x_2 & \ldots & x_p \end{bmatrix} \hat{\beta} = \begin{bmatrix} 1 & x_1 & x_2 & \ldots & x_p \end{bmatrix} \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\right)
\]
- The \(n\times 1\) matrix of fitted values, at the training points, is \[
\mathbf{\EstLinPred} = \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\]
- The matrix \(\mathbf{h} \equiv \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\) is the weight matrix, influence matrix or hat matrix and says how much \(y_j\) matters to the prediction of \(y_i\)
Fitted values and other predictions are weighted sums of the observations
\[\begin{eqnarray}
\EstLinPred(\vec{x}) & = & \vec{x} \hat{\beta} = \vec{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\
\mathbf{\EstLinPred} & = & \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\end{eqnarray}\]
- Every prediction we make is linear in \(\mathbf{y}\): it’s a weighted sum of all the observed values of the response
- How much weight do we put on \(y_i\) when we’re predicting at \(\vec{x}\)?
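A small sketch of these weights in R (the data, the regressors, and the new point are all made up for illustration):

```r
# The hat matrix, and the weights behind a prediction at a new point
set.seed(462)
n <- 20
x <- cbind(1, runif(n), rnorm(n))           # design matrix, including the column of 1s
y <- x[, 2] - 2 * x[, 3] + rnorm(n)
h <- x %*% solve(t(x) %*% x) %*% t(x)       # the hat matrix
all.equal(as.vector(h %*% y),               # fitted values are weighted sums of the y's
          unname(fitted(lm(y ~ x[, 2] + x[, 3]))))
w <- c(1, 0.5, -1) %*% solve(t(x) %*% x) %*% t(x)  # weights for predicting at (1, 0.5, -1)
sum(w * y)                                  # that prediction, as a weighted sum of the y's
```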
Generalizing: linear smoothers
- Linear regression is a special case of a linear smoother \[
\EstRegFunc(\vec{x}) = \sum_{i=1}^{n}{w(\vec{x}, \vec{x}_i) y_i}
\]
- Notice: Linear in the \(y\)’s, not in \(\vec{x}\)
- General idea: predict at \(\vec{x}\) by finding similar \(\vec{x}_i\)’s and averaging their \(y_i\)’s
- Different choices of \(w\) are different ideas about “similar” and about weighted averaging (see the sketch after this list)
- Most prediction methods used in practice are linear smoothers
- Nearest neighbors (next week)
- Trees and forests (week after next)
- Kernels (two weeks from now)
- Neural networks, a.k.a. “deep learning”, a.k.a. “artificial intelligence” (take another course)
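As a sketch of this template (not the lecture’s own code; the function names are invented), here is \(k\)-nearest-neighbor regression written explicitly as a linear smoother, with weight \(1/k\) on the \(k\) training points closest to \(x\) and weight 0 on the rest:

```r
# k-nearest-neighbor regression as a linear smoother:
# prediction at x0 is sum_i w(x0, x_i) y_i, with weight 1/k on the k nearest x_i
knn.weights <- function(x0, x, k = 5) {
  w <- numeric(length(x))
  w[order(abs(x - x0))[1:k]] <- 1 / k       # equal weight on the k nearest neighbors
  w
}
knn.predict <- function(x0, x, y, k = 5) sum(knn.weights(x0, x, k) * y)
# Tiny usage example on simulated data
set.seed(462)
x <- sort(runif(100, 0, 10)); y <- sin(x) + rnorm(100, sd = 0.2)
knn.predict(5, x, y, k = 7)                 # roughly sin(5)
```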
What about the rest of your linear models course?
- Say you want to test whether \(\beta_{37} = 1\), or to give a confidence set for \((\alpha, \beta_{12})\)
- That’s hard to do explicitly without more assumptions
- Generally, you need to know the sampling distribution of \(\mathbf{\hat{\beta}}\)
What about the rest of your linear models course? (cont’d)
- The usual (“401”) assumptions give nice formulas for the sampling distribution and so for inference:
1. The true regression function is exactly linear.
2. \(Y=\alpha + \vec{X} \cdot \vec{\beta} + \epsilon\), where \(\epsilon\) is independent of \(\vec{X}\).
3. \(\epsilon\) is independent across observations.
4. \(\epsilon \sim \mathcal{N}(0,\sigma^2)\).
- Assuming (1)–(4), the usual formulas (embedded in R) are exactly right
- Gaussian distribution for \(\hat{\mathbf{\beta}}\), centered at true \(\mathbf{\beta}\)
- Assuming (1)–(3), with possibly non-Gaussian noise, and assuming the influence of any one observation \(\rightarrow 0\), the usual formulas hold as \(n\rightarrow\infty\)
- Get back the Gaussian distribution by a (complicated) central limit theorem
- Do not use those formulas without checking those assumptions
The most important assumption to check
- If the true regression function is exactly linear, then \[
\Expect{Y-(\alpha + \vec{X} \cdot \vec{\beta})|\vec{X}=\vec{x}} = 0
\] for all \(\vec{x}\)
- So \[
\Expect{Y-(\hat{\alpha} + \vec{X} \cdot \hat{\vec{\beta}})|\vec{X}=\vec{x}} \approx 0
\]
- The residuals should have conditional mean 0 everywhere if the linear model is right
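One way to do the check in practice (a sketch; the quadratic truth and all names are illustrative): plot the residuals against the fitted values, or against each regressor, and add a smooth curve, which should stay close to zero if the linear model is right.

```r
# Residual check: does the conditional mean of the residuals look like 0 everywhere?
set.seed(462)
x <- runif(500, 0, 4)
y <- x^2 + rnorm(500)                       # the truth is deliberately not linear
fit <- lm(y ~ x)
scatter.smooth(fitted(fit), residuals(fit),
               xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = "dashed")               # here the smooth curve bends away from 0
```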
Summing up
- We can always decide to use a linear predictor, \(\OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X}\)
- The optimal linear predictor of \(Y\) from \(\vec{X}\) always takes the same form: \[
\OptLinPred(\vec{X}) = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
- Doing linear prediction requires finding those covariances
- We usually estimate them (implicitly) by ordinary least squares, which converges pretty robustly to the optimal linear predictor (not to the truth)
- Once we use OLS, all our predictions are weighted sums of the observations
- The weights for linear regression are weird and implausible
- We’re going to explore other ways of weighting
- The usual “401” assumptions are needed for inference, not prediction
Backup: Further reading
- See Berk (2008), chapter 1, and Hastie, Tibshirani, and Friedman (2009), sections 2.3.1 and 2.6 and chapter 3 (especially through section 3.4), for more on linear regression as a prediction method
- If you want more from the point of view taken here, see Shalizi (n.d.), chapter 2; if you want a lot more, see Shalizi (2015)
- If you need to do inference about the coefficients but the usual assumptions don’t hold, your best bet is the bootstrap (covered in 402, later in this course, and in some excellent textbooks like Davison and Hinkley (1997)); some of the complicated procedures developed in econometrics for “robust standard errors” in linear models turn out to be equivalent to simple bootstraps (Buja et al. 2014)
Backup: Gory details for multivariate predictors
\[\begin{eqnarray}
\OptLinPred(\vec{X}) & = & a + \vec{b} \cdot \vec{X}\\
(\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{X}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{X}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{X})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{X})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{X}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{X} \otimes \vec{X}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{X}} + 2a\vec{b}\cdot \Expect{\vec{X}}
\end{eqnarray}\]
Backup: Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0:
\[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{X}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}
\end{eqnarray}\]
just like when \(X\) was univariate
Backup: Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{X}} + 2 \Expect{\vec{X} \otimes \vec{X}} \vec{\beta} + 2 \alpha \Expect{\vec{X}} & = & 0\\
\Expect{Y\vec{X}} - \alpha\Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Expect{Y\vec{X}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}) \Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Cov{Y,\vec{X}} & = & \Var{\vec{X}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{X}})^{-1} \Cov{Y,\vec{X}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,X}/\Var{X}\) when \(X\) is univariate
Backup: Gory details: the PCA view
The factor of \(\Var{\vec{X}}^{-1}\) rotates and scales \(\vec{X}\) to uncorrelated, unit-variance variables
\[\begin{eqnarray}
\Var{\vec{X}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T\\
\vec{U} & \equiv & \vec{X} \Var{\vec{X}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\vec{X}\cdot\vec{\beta} & = & \vec{X} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\
& = & \vec{X} \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T \Cov{\vec{X}, Y}\\
& = & \vec{U} \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
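A numerical version of the argument above, using sample quantities (a sketch; the correlated regressors and the response are invented):

```r
# Whitening X with a square root of Var(X)^{-1}: Var(U) ~ I and X.beta = U . Cov(U,Y)
set.seed(462)
n <- 1e5
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X <- cbind(X, x3 = 0.7 * X[, "x1"] + 0.5 * rnorm(n))      # correlated columns
y <- X[, "x1"]^2 - X[, "x2"] + rnorm(n)
eig <- eigen(var(X), symmetric = TRUE)
V.inv.sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values))  # a square root of Var(X)^{-1}
U <- X %*% V.inv.sqrt
round(var(U), 3)                            # ~ identity matrix
beta <- solve(var(X), cov(X, y))
all.equal(as.vector(X %*% beta), as.vector(U %*% cov(U, y)))
```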
Backup: Square root of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
- Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
- Just like every real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\)
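A tiny numerical illustration (the matrices here are invented):

```r
# A square root of a symmetric positive-definite matrix via its eigendecomposition,
# and the fact that rotating a square root gives another square root
c.mat <- matrix(c(2, 1, 1, 3), 2, 2)
eig <- eigen(c.mat, symmetric = TRUE)
d <- eig$vectors %*% diag(sqrt(eig$values)) # one square root: w Lambda^{1/2}
all.equal(d %*% t(d), c.mat)                # recovers c.mat
o <- matrix(c(cos(1), sin(1), -sin(1), cos(1)), 2, 2)  # an orthogonal (rotation) matrix
all.equal((d %*% o) %*% t(d %*% o), c.mat)  # still a square root
```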
Backup/Aside: \(R^2\) is useless
- \(R^2 \equiv \Var{\mathrm{fitted\ values}}/\Var{Y}\)
- Suppose we know \(Y=\beta X+\epsilon\), with \(\Var{\epsilon} = \sigma^2\) and \(\epsilon\) independent of \(X\)
- Then \(R^2 = \frac{\beta^2 \Var{X}}{\beta^2 \Var{X} + \sigma^2}\)
- Consequence 1: \(R^2\) can be arbitrarily close to 0, even for the completely correct model (shrink \(\Var{X}\) and/or grow \(\sigma^2\))
- Consequence 2: If we make \(\Var{X}\) bigger, and the linear model is right, \(R^2\) will grow towards 1 (unless \(\beta = 0\)); \(R^2\) measures the range of the regressor, not goodness of fit (see the demo after this list)
- Suppose the linear model is wrong but we use it anyway; \(R^2 = \beta^2\Var{X}/\Var{Y}\)
- Consequence 3: \(R^2\) can be arbitrarily close to 1 even if every single assumption of the linear model is badly, obviously broken
- Conclusion: \(R^2\) tells us nothing about how good or bad our model is that we didn’t already know from the mean squared error; you’d be better off forgetting about it
- More: Shalizi (2015), chapter 10
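A quick simulation of Consequences 1 and 2 (a sketch; the numbers are invented): the model is exactly right every time, with the same coefficients and the same noise level, and only the spread of \(X\) changes.

```r
# Same correct linear model, same noise; R^2 is driven by the spread of X
set.seed(462)
r.squared <- function(x.sd) {
  x <- rnorm(1e4, sd = x.sd)
  y <- 2 + 3 * x + rnorm(1e4, sd = 5)       # exactly the model assumed by lm()
  summary(lm(y ~ x))$r.squared
}
sapply(c(0.1, 1, 10), r.squared)            # roughly 0.004, 0.26, 0.97
```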
References
Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.
Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2014. “Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression.” arxiv:1404.1578. http://arxiv.org/abs/1404.1578.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.