\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \]
Our data will generally be a realization of a stochastic process, i.e., a collection of random variables spread over time and space. The random variable we observe at spatial coordinate \(r\) and time \(t\) will be, generically, \(X(r,t)\). If we just care about spatial data, we’ll just have \(X(r)\), and purely temporal data (like the Kyoto data) will be \(X(t)\). In general, the random variables at different points will all be dependent on each other.
Because they are random variables, it makes sense to think about their central tendency, and the expected value or mean is often a convenient one: \[ \TrueRegFunc(r,t) = \Expect{X(r,t)} \] This is a deterministic function of the coordinates; it’s what we mean by the trend.
Mathematically, we can always decompose the random variables into the deterministic trend plus random or stochastic fluctuations, which always have expectation zero: \[\begin{eqnarray} X(r,t) & = & \TrueRegFunc(r,t) + \TrueNoise(r,t)\\ \Expect{\TrueNoise(r,t)} & = & 0 \end{eqnarray}\] We can’t assert anything more about \(\TrueNoise\) without further assumptions. In particular, we can’t make the usual assumptions that are made in regression courses, that \(\TrueNoise\) is uncorrelated, that it has constant variance, that it has the same distribution at each point, that it’s Gaussian, etc. But we can, without any loss of generality, say \(\Expect{\TrueNoise(r,t)} = 0\) everywhere.
We would like to learn the trend function \(\TrueRegFunc\).
If we have many independent realizations of the process, we could just average them. Say we can re-run some experiment many times, getting independent realizations \(X^{(1)}, X^{(2)}, \ldots X^{(m)}\). Then \[ \frac{1}{m}\sum_{i=1}^{m}{X^{(i)}(r,t)} \rightarrow \TrueRegFunc(r,t) \] by the ordinary law of large numbers. As my mention of “experiment” suggests, this idea can be applied in the experimental sciences. In neuroscience, for instance, we might repeatedly measure what happens to a particular neuron (or group of neurons, or brain region) after we apply some stimulus to an experimental animal1. This is, however, not available in non-experimental contexts.
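Here is a toy simulation of that averaging idea; the trend, the noise level, and all the variable names below are invented purely for illustration.
t.grid <- seq(0, 1, length.out = 50)
mu <- sin(2*pi*t.grid)                    # a made-up smooth trend
runs <- replicate(1000, mu + rnorm(50))   # 1000 independent realizations of trend plus noise
max(abs(rowMeans(runs) - mu))             # pointwise averages land close to the trend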
More typically, we have only one realization of the \(X\) process — only one history of cherry-blossoms in Kyoto, only one map of ground-water under Wisconsin, etc. We will only be able to learn about \(\TrueRegFunc\) under assumptions.
In some situations, we might have a well-confirmed scientific theory which tells us exactly what the trend should be, say that \(\TrueRegFunc(r,t) = f(r,t)\) for some completely-specified function \(f\). There are very few situations where this applies — perhaps in parts of astronomy and physics.
More common is the situation where theory pins down a specific functional form for the trend, but there are some unknown coefficients in the formula. That is, \(\TrueRegFunc(r,t) = f(r,t,\theta)\) for one or more parameters \(\theta\). One would then need to estimate \(\theta\) from the data \(X\) to obtain an estimate of the trend. The dependence of the observations on each other makes the statistical inference more complicated, which we’ll come back to later in the course, but in principle this is just about estimating parameters.
Of course, for these approaches to be good ideas, you have to actually have reason to think your theory really does give you a good idea of the trend. There are domains where this is true — physics, chemistry, some parts of biology (especially genetics, evolution and ecology) — where we have very well-supported theories which are detailed enough to pin down specific functional forms. (Some economists think this is true in economics as well, but I think they’re fooling themselves.) Beyond that, though, scientific theories are much looser – they might tell us that one variable tends to go up with another, but just don’t have the detail to give a functional form. This might not be a permanent condition — it used to be true even of physics! — but for now, in lots of scientific areas, let alone in business or policy, we just don’t have credible theories which tell us about trends.
We might, however, make a different sort of assumption, which is just that \(\TrueRegFunc\) doesn’t change very rapidly — that it is a smooth function. The basic idea we need is that \(\TrueRegFunc(t-h) \approx \TrueRegFunc(t) \approx \TrueRegFunc(t+h)\), at least for sufficiently small \(h\), at least at most \(t\). (You can write out the corresponding idea for spatio-temporal trends, with some more notation.) Each \(X(t)\) is a noisy observation of \(\TrueRegFunc(t)\), but if \(\TrueRegFunc(t-h)\) and \(\TrueRegFunc(t+h)\) are close to \(\TrueRegFunc(t)\), then \(X(t-h)\) and \(X(t+h)\) will be noisy observations of things close to \(\TrueRegFunc(t)\), giving us more information.
To make this concrete, and see how it can work, let’s think of a very crude estimate: \[ \EstRegFunc(t_i) = \frac{1}{3}\sum_{k=-1}^{k=+1}{X(t_{i+k})} \] Each \(X\) is trend plus fluctuation, so \[\begin{eqnarray} \EstRegFunc(t_i) & = & \frac{1}{3}\sum_{k=-1}^{k=+1}{\TrueRegFunc(t_{i+k})} + \frac{1}{3}\sum_{k=-1}^{k=+1}{\TrueNoise(t_{i+k})}\\ & = & \TrueRegFunc(t_i) + \frac{1}{3}\sum_{k=-1}^{k=+1}{(\TrueRegFunc(t_{i+k}) - \TrueRegFunc(t_i))} + \frac{1}{3}\sum_{k=-1}^{k=+1}{\TrueNoise(t_{i+k})}\\ & = & \text{truth} + \text{bias} + \text{noise} \end{eqnarray}\]
The first term is exactly what we want. The third term is noise, and the average of many noise terms is generally smaller than the individual noises. (If the \(\TrueNoise\) were uncorrelated with equal variance \(\sigma^2\), this would have variance \(\sigma^2/3\).) The middle term is bias, systematic error. The more rapidly \(\TrueRegFunc\) changes, and the further apart our observations are located, the bigger this bias gets. Conversely, if \(\TrueRegFunc\) changes slowly and our observations are closely spaced, the bias will be small.
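Here is a small simulated sketch of that truth-plus-bias-plus-noise decomposition; the trend, the noise level, and the variable names are all made up for illustration.
t <- 1:100
mu <- sin(2*pi*t/100)                        # a slowly-varying trend
x <- mu + rnorm(100, sd = 0.5)               # one noisy realization
mu.hat <- (x[1:98] + x[2:99] + x[3:100])/3   # average each interior point with its two neighbors
mean((x[2:99] - mu[2:99])^2)                 # error of the raw observations, about 0.25
mean((mu.hat - mu[2:99])^2)                  # error of the averages: roughly a third of that, plus
                                             # a small bias from the curvature of the trend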
At this point, we could go into considerable detail about exactly what the bias is, under various precise mathematical characterizations of “smooth function”. (For instance, we could use Taylor approximation2 to relate the bias to the first and second derivatives of \(\TrueRegFunc\).) But it’ll be more productive to think about the more general type of estimator which includes the simple averaging of near-by values. There is, after all, no reason to just use the immediate neighbors, nor to use equal weights.
We have data points \(x(t_1), \ldots x(t_n)\); for conciseness I will sometimes abbreviate these as \(x_1, \ldots x_n\).
The general form of a linear smoother is \[ \EstRegFunc(t) = \sum_{j=1}^{n}{w(t, t_j) x_j} \] That is, the predictions or fitted values, the estimates of the trend \(\TrueRegFunc\), are always linear combinations of the observed values of \(x\). Notice that the smoother is linear in \(x\), not in \(t\) — the resulting curve can be very nonlinear in \(t\).
Most of the smoothing, regression and estimation techniques used in practice fall in this class.
The fact that lots of statistical methods can be seen as linear smoothers is useful because it gives us a unified way of understanding how these methods work, and in particular of when they will do a good job at recovering trends. Before establishing these properties, I need to introduce one more bit of notation.
At each \(t_i\) where we took an observation, we have a fitted value \(\EstRegFunc(t_i)\), or for short \(\EstRegFunc_i\). Group these together into a vector, or \(n\times 1\) matrix, \[ \mathbf{\EstRegFunc} = \left[\begin{array}{c} \EstRegFunc(t_1) \\ \EstRegFunc(t_2) \\ \vdots \\ \EstRegFunc(t_n) \end{array} \right] \] Similarly, group all the observations of \(x\) into a vector \(\mathbf{x}\). Then the fitted vectors are the observations times a matrix: \[ \mathbf{\EstRegFunc} = \mathbf{w} \mathbf{x} \] where \(w_{ij} = w(t_i, t_j)\).
The first thing this lets us get at is the expected value of the fitted values: \[\begin{eqnarray} \Expect{\mathbf{\EstRegFunc}} & = & \Expect{\mathbf{w}\mathbf{X}}\\ & = & \mathbf{w}\Expect{\mathbf{X}}\\ & = & \mathbf{w}\Expect{\mathbf{\TrueRegFunc} + \mathbf{\TrueNoise}}\\ & = & \mathbf{w}\Expect{\mathbf{\TrueRegFunc}} + \mathbf{w}\Expect{\mathbf{\TrueNoise}}\\ & = & \mathbf{w}\mathbf{\TrueRegFunc} \end{eqnarray}\] because \(\mathbf{\TrueRegFunc}\) is a constant vector, and the noise \(\TrueNoise\) always has expectation 0.
The bias of any estimator is the difference between its expected value and the truth. The estimator is unbiased when that difference is zero. So we have a simple equation for when a smoother gives us an unbiased estimate of the trend: \[\begin{equation} \text{unbiased trend estimates} \Leftrightarrow \mathbf{\TrueRegFunc} = \mathbf{w}\mathbf{\TrueRegFunc} \end{equation}\] This says that we get an unbiased estimate if, and only if, \(\mathbf{\TrueRegFunc}\) is an eigenvector of \(\mathbf{w}\) with eigenvalue 1. (See below for some reminders of the necessary linear algebra.) Otherwise, we get some bias in our estimate.
Suppose (what is generally true) that \(\mathbf{w}\) has \(n\) linearly independent eigenvectors, say \(\mathbf{v}_1, \ldots \mathbf{v}_n\), with corresponding eigenvalues \(\lambda_1, \ldots \lambda_n\). These eigenvectors form a basis for the space of vectors, so for an arbitrary vector \(\mathbf{x}\), \[ \mathbf{x} = \sum_{i=1}^{n}{c_i \mathbf{v}_i} \] for some coordinates \(c_1, \ldots c_n\). These coordinates say how much \(\mathbf{x}\) projects on to the eigenvectors, i.e., how similar it is to each of the patterns those eigenvectors represent.
Now what happens when we multiply by \(\mathbf{w}\), i.e., when we apply our favorite linear smoother to \(\mathbf{x}\)? \[ \mathbf{w}\mathbf{x} = \sum_{i=1}^{n}{\mathbf{w} c_i \mathbf{v}_i} = \sum_{i=1}^{n}{c_i \lambda_i \mathbf{v}_i} \] The components of \(\mathbf{x}\) which match eigenvectors \(\mathbf{v}_i\) with large eigenvalues \(\lambda_i\) get enhanced, and those which match eigenvectors with small eigenvalues get shrunk.
Most of the time, when we’re dealing with smoothers, the largest eigenvalue is exactly 1. The eigenvectors with eigenvalue 1 define an invariant subspace, one which is left alone by the smoother. Because all the other eigenvalues are smaller, smoothing shrinks the part of the data which doesn’t conform to the invariant subspace. If the eigenvalues which are less than 1 are in fact exactly 0, every component of the data which doesn’t fall in the invariant subspace is projected away, and the vector of fitted values \(\mathbf{\EstRegFunc}\) has to fall in the span of the leading eigenvectors. If the smaller eigenvalues aren’t quite 0, the shrinkage isn’t quite so extreme, and \(\mathbf{\EstRegFunc}\) doesn’t have to be in the invariant subspace, but it is always closer to that space than the data was.
If all the weights are non-negative, \(w_{ij} \geq 0\), and the weights for each fitted value sum to 1, \(\sum_{j}{w_{ij}} = 1\) for all \(i\), then the largest eigenvalue of \(\mathbf{w}\) is exactly 1, and all the other eigenvalues are \(\leq 1\) in magnitude. So when the linear smoother really does proper averages to get each fitted value, the vector of fitted values is shrunk towards the invariant subspace.
For linear regression, every eigenvalue of \(\mathbf{w}\) is either 0 or 1. Hence the vector of fitted values is always in the invariant subspace.
Let’s go back to the idea of averaging each observation with the one ahead and behind it. We need to decide what to do with the first and last observations; I’ll just average each of them with its single neighbor, using weights of 1/2:
n <- 10
w <- matrix(0, nrow=n, ncol=n)
diag(w) <- 1/3              # each point gets weight 1/3 on itself...
for (i in 2:(n-1)) {
  w[i,i+1] <- 1/3           # ...and 1/3 on each of its two neighbors
  w[i,i-1] <- 1/3
}
w[1,1] <- 1/2               # the first and last points just average themselves
w[1,2] <- 1/2               # with their single neighbor, weights 1/2 each
w[n,n-1] <- 1/2
w[n,n] <- 1/2
(You can check that the w this produces has the right entries.)
Here are the eigenvalues:
eigen(w)$values
## [1] 1.00000000 0.96261129 0.85490143 0.68968376 0.48651845
## [6] -0.31012390 0.26920019 -0.23729622 -0.11137134 0.06254301
Notice, as promised, that 1 is an eigenvalue, and all the others are smaller in magnitude. The corresponding eigenvector is
eigen(w)$vectors[,1]
## [1] 0.3162278 0.3162278 0.3162278 0.3162278 0.3162278 0.3162278 0.3162278
## [8] 0.3162278 0.3162278 0.3162278
i.e., constant. This should make sense — if every entry in \(\mathbf{x}\) was the same, averaging wouldn’t change anything.
The next few eigenvectors are better grasped by visualization than by looking at the numbers.
plot(1:10, eigen(w)$vectors[,2], lty="solid", type="l",
     xlab="Position", ylab="Eigenvectors of smoothing matrix")
lines(1:10, eigen(w)$vectors[,3], lty="dashed")
lines(1:10, eigen(w)$vectors[,4], lty="dotted")
legend("bottomleft", legend=c(expression(j==2), expression(j==3),
                              expression(j==4)),
       lty=c("solid","dashed","dotted"))
These look somewhat like sine waves, because they are somewhat like sine waves3. Because the eigenvalues which go with these waves are pretty close to 1, the components of the original data which match these sine waves will be shrunk just a little, while the other components of the data will be shrunk a lot. So these (plus the constant) are the sorts of patterns which this moving average will preserve; everything else it tends to ignore.
Because the data contains fluctuations, which are random, it makes sense to worry about the variance and covariance of the fitted values. These are easy: \[\begin{eqnarray} \Var{\mathbf{\EstRegFunc}} & = & \Var{\mathbf{w}\mathbf{X}}\\ & = & \mathbf{w}\Var{\mathbf{X}}\mathbf{w}^T\\ & = & \mathbf{w}\Var{\mathbf{\TrueRegFunc} + \mathbf{\TrueNoise}}\mathbf{w}^T\\ & = & \mathbf{w}\Var{\mathbf{\TrueNoise}}\mathbf{w}^T \end{eqnarray}\]
Of course, we don’t necessarily know what \(\Var{\mathbf{\TrueNoise}}\) is. In other classes, we often assume that all of the noise terms have equal variance \(\sigma^2\), and are uncorrelated with each other. That means \(\Var{\mathbf{\TrueNoise}} = \sigma^2 \mathbf{I}\), and we get the special case \[ \Var{\EstRegFunc} = \sigma^2 \mathbf{w}\mathbf{w}^T \]
As the data randomly vary, the fitted values are going to randomly vary in response. One way to get a sense of how sensitive the fitted values are to the data is to look at the sum of the covariances, \[\begin{eqnarray} \sum_{i=1}^{n}{\Cov{\EstRegFunc_i, X_i}} & = & \sum_{i=1}^{n}{\Cov{\sum_{j=1}^{n}{w_{ij} X_j}, X_i}}\\ & = & \sum_{i=1}^{n}{\sum_{j=1}^{n}{w_{ij} \Cov{X_i, X_j}}}\\ & = & \sum_{i=1}^{n}{\sum_{j=1}^{n}{w_{ij} \Cov{\TrueNoise_i, \TrueNoise_j}}} \end{eqnarray}\]
Again, in the special case where \(\Cov{\TrueNoise_i, \TrueNoise_j} = 0\) if \(i\neq j\), and \(=\sigma^2\) if \(i=j\), this simplifies to \[ \sum_{i=1}^{n}{\Cov{\EstRegFunc_i, X_i}} = \sigma^2 \tr{\mathbf{w}} \] Remembering that the trace of a matrix equals the sum of its eigenvalues, we use this to define the effective number of degrees of freedom: \[ \dof{\mathbf{w}} \equiv \tr{\mathbf{w}} \]
If \(\mathbf{w}\) is an idempotent projection matrix, this is exactly equal to the number of dimensions we’re projecting down to. If \(\mathbf{w}\) isn’t a projection matrix, we pick up the dimension of the invariant space (= the number of \(\lambda=1\) eigenvalues), plus a bit more.
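As a quick check on this, we can compute the effective degrees of freedom of the three-point moving average built above (this assumes the w and n from that code chunk are still in the workspace).
sum(diag(w))           # 8*(1/3) + 2*(1/2) = 11/3, about 3.67 effective degrees of freedom
sum(eigen(w)$values)   # the same number, since the trace is the sum of the (here real) eigenvalues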
I have teased you several times by now by mentioning splines, or more formally smoothing splines4. These are smooth curves which we find by balancing coming close to the data points, against not having too much wiggliness or curvature. More specifically, we pick a \(\lambda > 0\), and then solve the optimization problem
\[ \EstRegFunc = \argmin_{m}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - m(t_i))^2} + \lambda\int{(m^{\prime\prime}(t))^2 dt}} \]
The first term we’re optimizing over is the mean squared error. The second term is the average curvature of the function. It would be zero for any linear function (whose second derivative is 0 everywhere), and it grows the more the function bends and twists around. One way to read this is that we are penalizing curvature. In fact, you can think of \(\lambda\) as a price — we’re willing to trade one unit more of curvature for at least \(\lambda\) units of reduced MSE. (Conversely, we’re willing to increase the MSE by one unit, if in doing so we reduce the curvature by at least \(1/\lambda\) units.) As \(\lambda \rightarrow 0\), we get incredibly wiggly functions which go through the data points exactly. As \(\lambda \rightarrow \infty\), we approach the best-fitting straight line, i.e., linear regression.
The optimization here is over all possible functions with second derivatives. You might well ask how we could possibly do this, since it’s an infinite-dimensional space. The answer is that the solution is always a piecewise cubic polynomial, with pieces at the data points \(t_i\), and that the solution is also continuous, with continuous first and second derivatives. (See the optional section below, if you care.) All of this means that we only have \(n\) unknown coefficients to find, rather than having to search over all possible functions. (Again, see the optional section below.)
One upshot of being a piecewise polynomial is that the spline is a linear smoother, with weights which do a kind of local averaging. Another is that there is a direct correspondence between \(\lambda\) and the effective degrees of freedom, where increasing \(\lambda\) lowers the degrees of freedom and vice versa. (The exact relationship depends on the values of \(t_i\) — again, see the optional section.)
All of the above is for one-dimensional smoothing. If we’re dealing with data over two dimensions, so it’s really \(x(r,t)\) rather than just \(x(t)\), we can define the analogous problem: \[ \argmin_{m}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - m(r_i, t_i))^2} + \lambda\int{\left[ {\left(\frac{\partial^2 m}{\partial r^2}\right)}^2 + 2{\left(\frac{\partial^2 m}{\partial r \partial t}\right)}^2 + {\left(\frac{\partial^2 m}{\partial t^2}\right)}^2\right] dr dt}} \] (Three or more coordinates is similar, but with even more notation.) These are called thin-plate splines.
An alternative approach is to use what are called “tensor-product splines”. If you dive into the math (see the optional section), you find that one-dimensional splines are linear combinations of “basis functions” \(B_1(t), \ldots B_n(t)\) — the exact shape of the basis functions depends on the \(t_i\) but not on the \(x_i\). In two dimensions, we can find the basis functions for the two coordinates, say \(B_i\) for \(r\) and \(C_j\) for \(t\), and take products, so \[ m(r,t) = \sum_{i=1}^{n}{\sum_{j=1}^{n}{\beta_{ij} B_i(r) C_j(t)}} \]
The simplest way to work with splines in R is to use the smooth.spline command. This takes two vectors, x (corresponding to our \(t\) variable) and y (corresponding to our \(x\) variable), and returns a slightly complicated object. The x and y components of that object contain the x values and the fitted y values, so it can be used for plotting. (See the example in the slides for lecture 1.) You can also calculate predictions at new values, not corresponding to the data.
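Here is a minimal example of using smooth.spline on simulated data; the data and variable names are mine, not from the slides.
t <- seq(0, 10, length.out = 200)
x <- sin(t) + rnorm(200, sd = 0.3)
fit <- smooth.spline(x = t, y = x)   # generalized cross-validation picks lambda by default
plot(t, x)
lines(fit, lwd = 2)                  # fit$x and fit$y trace out the estimated trend
predict(fit, x = c(2.5, 7.5))$y      # predictions at new values of t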
For more sophisticated uses, especially multi-dimensional splines, you need to use a package. I recommend the mgcv package (Wood 2004, 2006), which is primarily designed for smooth, additive models, and as such includes a lot of different smoothers. A command like
library(mgcv)
my.fit <- gam(x ~ s(t), data=df)
estimates a regression model where x (drawn from the data frame df) is a smooth spline function of t (drawn from the same source), and stores the result in the object my.fit. On the other hand,
gam(x ~ s(r,t), data=df)
models x as a thin-plate spline over the two variables r and t, while
gam(x ~ te(r, t), data=df)
uses a tensor-product spline. If you want to model x as the sum of two smooth functions, that would be
gam(x ~ s(r) + s(t), data=df)
There are also a large number of more advanced packages exclusively for dealing with splines — too many for me to list here.
In econometrics, it is very common to use what is called a “Hodrick-Prescott filter” to extract the trend of a time series. (The canonical reference is Hodrick and Prescott (1997), but this circulated as a working paper for many years before being formally published.) As shown by Paige and Trindade (2010), this is exactly the same as applying a smoothing spline to the data. The only difference is that Hodrick and Prescott supplied a (fallacious) argument for why \(\lambda\) should always be exactly 1600.
This section is optional, but a lot of it is cool. The text is largely ripped off from Shalizi (n.d.).
The spline problem asks us to find the function which minimize the sum of the MSE and a certain integral. Even the MSE can be brought inside the integral, using Dirac delta functions5: \[\begin{equation} \mathcal{L} = \int{\left[\lambda(m^{\prime\prime}(t))^2 + \frac{1}{n} \sum_{i=1}^{n}{(x_i - m(t_i))^2 \delta(t-t_i)}\right] dt} \end{equation}\] In what follows, without loss of generality, assume that the \(t_i\) are ordered, so \(t_1 \leq t_2 \leq \ldots t_i \leq t_{i+1} \leq \ldots t_n\). With some loss of generality but a great gain in simplicity, assume none of the \(t_i\) are equal, so we can make those inequalities strict.
The subject which deals with maximizing or minimizing integrals of functions is the calculus of variations, and one of its basic tricks is to write the integrand as a function of \(t\), the function \(m\), and its derivatives: \[\begin{equation} \mathcal{L} = \int{L(t,m,m^{\prime},m^{\prime\prime}) dt} \end{equation}\] where, in our case, \[\begin{equation} L = \lambda(m^{\prime\prime}(t))^2 + \frac{1}{n} \sum_{i=1}^{n}{(x_i - m(t_i))^2 \delta(t-t_i)} \end{equation}\] This sets us up to use a general theorem of the calculus of variations, to the effect that any function \(\hat{m}\) which minimizes \(\mathcal{L}\) must also solve \(L\)’s Euler-Lagrange equation: \[ \left.{\frac{\partial L}{\partial m} - \frac{d}{dt}\frac{\partial L}{\partial m^{\prime}} + \frac{d^2}{dt^2}\frac{\partial L}{\partial m^{\prime\prime}}}\right|_{m=\hat{m}} = 0 \] In our case, the Euler-Lagrange equation reads \[ -\frac{2}{n}\sum_{i=1}^{n}{(x_i-\hat{m}(t_i))\delta(t-t_i)} + 2\lambda\frac{d^2}{dt^2}\hat{m}^{\prime\prime}(t) = 0 \] Remembering that \(\hat{m}^{\prime\prime}(t) = d^2\hat{m}/dt^2\), \[ \frac{d^4}{dt^4}\hat{m}(t) = \frac{1}{n\lambda}\sum_{i=1}^{n}{(x_i-\hat{m}(t_i))\delta(t-t_i)} \] The right-hand side is zero at any point \(t\) other than one of the \(t_i\), so the fourth derivative has to be zero in between the \(t_i\). This in turn means that the function must be piecewise cubic. Now fix a \(t_i\), and pick any two points which bracket it, but are both greater than \(t_{i-1}\) and less than \(t_{i+1}\); call them \(l\) and \(u\). Integrate our Euler-Lagrange equation from \(l\) to \(u\): \[\begin{eqnarray} \int_{l}^{u}{\frac{d^4}{dt^4}\hat{m}(t) dt} & = & \int_{l}^{u}{\frac{1}{n\lambda}\sum_{i=1}^{n}{(x_i-\hat{m}(t_i))\delta(t-t_i)} dt}\\ \hat{m}^{\prime\prime\prime}(u) - \hat{m}^{\prime\prime\prime}(l) & = & \frac{x_i - \hat{m}(t_i)}{n\lambda} \end{eqnarray}\] That is, the third derivative makes a jump when we move across \(t_i\), though (since the fourth derivative is zero), it doesn’t matter which pair of points above and below \(t_i\) we compare third derivatives at. Integrating the equation again, \[\begin{equation} \hat{m}^{\prime\prime}(u) - \hat{m}^{\prime\prime}(l) = (u-l)\frac{x_i - \hat{m}(t_i)}{n\lambda} \end{equation}\] Letting \(u\) and \(l\) approach \(t_i\) from either side, so \(u-l \rightarrow 0\), we see that \(\hat{m}^{\prime\prime}\) makes no jump at \(t_i\). Repeating this trick twice more, we conclude the same about \(\hat{m}^{\prime}\) and \(\hat{m}\) itself. In other words, \(\hat{m}\) must be continuous, with continuous first and second derivatives, and a third derivative that is constant on each \((t_i, t_{i+1})\) interval. Since the fourth derivative is zero on those intervals (and undefined at the \(t_i\)), the function must be a piecewise cubic, with the piece boundaries at the \(t_i\), and continuity (up to the second derivative) across pieces.
To see that the optimal function must be linear below \(t_1\) and above \(t_n\), suppose that it wasn’t. Clearly, though, we could reduce the curvature as much as we want in those regions, without altering the value of the function at the boundary, or even its first derivative there. This would yield a better function, i.e., one with a lower value of \(\mathcal{L}\), since the MSE would be unchanged and the average curvature would be smaller. Taking this to the limit, then, the function must be linear outside the observed data range.
Splines, I said, are piecewise cubic polynomials. To see how to fit them, let’s think about how to fit a global cubic polynomial. We would define four basis functions, \[\begin{eqnarray} B_1(t) & = & 1\\ B_2(t) & = & t\\ B_3(t) & = & t^2\\ B_4(t) & = & t^3 \end{eqnarray}\] and choose to only consider regression functions that are linear combinations of the basis functions, \[\begin{equation} m(t) = \sum_{j=1}^{4}{\beta_j B_j(t)} \end{equation}\] Such regression functions would be linear in the transformed variables \(B_1(t), \ldots B_4(t)\), even though they are nonlinear in \(t\).
To estimate the coefficients of the cubic polynomial, we would apply each basis function to each data point \(t_i\) and gather the results in an \(n\times 4\) matrix \(\mathbf{b}\), \[\begin{equation} b_{ij} = B_j(t_i) \end{equation}\] Then we would do ordinary least squares using the \(\mathbf{b}\) matrix in place of the usual data matrix \(\mathbf{t}\): \[\begin{equation} \hat{\beta} = (\mathbf{b}^T\mathbf{b})^{-1}\mathbf{b}^T \mathbf{x} \end{equation}\]
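As a sketch of that recipe on made-up data (all names here are mine), fitting a global cubic through the basis-function matrix gives the same coefficients as lm.
t <- runif(50, min = -2, max = 2)
x <- t^3 - t + rnorm(50, sd = 0.5)
b <- cbind(1, t, t^2, t^3)                           # b[i,j] = B_j(t_i)
beta.hat <- solve(t(b) %*% b, t(b) %*% x)            # (b^T b)^{-1} b^T x
cbind(beta.hat, coef(lm(x ~ t + I(t^2) + I(t^3))))   # the two sets of coefficients agree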
Since splines are piecewise cubics, things proceed similarly, but we need to be a little more careful in defining the basis functions. Recall that we have \(n\) values of \(t\), \(t_1, t_2, \ldots t_n\), and we can assume they’re in increasing order. These \(n\) knots define \(n+1\) pieces or segments: \(n-1\) of them between the knots, one from \(-\infty\) to \(t_1\), and one from \(t_n\) to \(+\infty\). A third-order polynomial on each segment would seem to need a constant, linear, quadratic and cubic term per segment. So the segment running from \(t_i\) to \(t_{i+1}\) would need the basis functions \[\begin{equation} \mathbf{1}_{(t_i,t_{i+1})}(t), ~ (t-t_i)\mathbf{1}_{(t_i,t_{i+1})}(t), ~ (t-t_i)^2\mathbf{1}_{(t_i,t_{i+1})}(t), ~ (t-t_i)^3\mathbf{1}_{(t_i,t_{i+1})}(t) \end{equation}\] where as usual the indicator function \(\mathbf{1}_{(t_i,t_{i+1})}(t)\) is 1 if \(t \in (t_i,t_{i+1})\) and \(0\) otherwise. This makes it seem like we need \(4(n+1) = 4n+4\) basis functions.
However, we know from linear algebra that the number of basis vectors we need is equal to the number of dimensions of the vector space. The number of adjustable coefficients for an arbitrary piecewise cubic with \(n+1\) segments is indeed \(4n+4\), but splines are constrained to be smooth. The spline must be continuous, which means that at each \(t_i\), the value of the cubic from the left, defined on \((t_{i-1}, t_i)\), must match the value of the cubic from the right, defined on \((t_{i}, t_{i+1})\). This gives us one constraint per data point, reducing the number of adjustable coefficients to at most \(3n+4\). Since the first and second derivatives are also continuous, we are down to just \(n+4\) coefficients. Finally, we know that the spline function is linear outside the range of the data, i.e., on \((-\infty,t_1)\) and on \((t_n,\infty)\), lowering the number of coefficients to \(n\). There are no more constraints, so we end up needing only \(n\) basis functions. And in fact, from linear algebra, any set of \(n\) piecewise cubic functions which are linearly independent can be used as a basis. One common choice is \[\begin{eqnarray} B_1(t) & = & 1\\ B_2(t) & = & t\\ B_{i+2}(t) & = & \frac{(t-t_i)_{+}^3 - (t-t_n)_{+}^3}{t_n - t_i} - \frac{(t-t_{n-1})_{+}^3 - (t-t_n)_{+}^3}{t_n - t_{n-1}} \end{eqnarray}\] where \((a)_{+} = a\) if \(a > 0\), and \(=0\) otherwise. This rather unintuitive-looking basis has the nice property that the second and third derivatives of each \(B_j\) are zero outside the interval \((t_1, t_n)\).
Now that we have our basis functions, we can once again write the spline as a weighted sum of them, \[\begin{equation} m(t) = \sum_{j=1}^{n}{\beta_j B_j(t)} \end{equation}\] and put together the matrix \(\mathbf{b}\) where \(b_{ij} = B_j(t_i)\). We can write the spline objective function in terms of the basis functions, \[\begin{equation} n\mathcal{L} = (\mathbf{x}-\mathbf{b}\beta)^T (\mathbf{x}-\mathbf{b}\beta) + n\lambda \beta^T \Omega \beta \end{equation}\] where the matrix \(\Omega\) encodes information about the curvature of the basis functions: \[\begin{equation} \Omega_{jk} = \int{B^{\prime\prime}_j(t) B^{\prime\prime}_k(t) dt} \end{equation}\] Notice that only the quadratic and cubic basis functions will make non-zero contributions to \(\Omega\). With the choice of basis above, the second derivatives are non-zero on, at most, the interval \((t_1, t_n)\), so each of the integrals in \(\Omega\) is going to be finite. \(\Omega\) is something we (or, realistically, R) can calculate once, no matter what \(\lambda\) is. Now we can find the smoothing spline by differentiating with respect to \(\beta\): \[\begin{eqnarray} 0 & = & -2\mathbf{b}^T\mathbf{x} + 2 \mathbf{b}^T\mathbf{b} \hat{\beta} + 2n\lambda\Omega\hat{\beta}\\ \mathbf{b}^T\mathbf{x} & = & \left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)\hat{\beta}\\ \hat{\beta} & = & {\left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)}^{-1} \mathbf{b}^T\mathbf{x} \label{eqn:spline-coefficients-as-ridge-regression} \end{eqnarray}\] Notice, incidentally, that we can now show splines are linear smoothers: \[\begin{eqnarray} \mathbf{\EstRegFunc} & = & \mathbf{b}\hat{\beta}\\ & = & \mathbf{b}{\left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)}^{-1} \mathbf{b}^T\mathbf{x} \end{eqnarray}\]
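In code, the penalized coefficients and the smoother matrix could be computed along the following lines. This is only a sketch: it assumes we already have the basis matrix b and the penalty matrix Omega from somewhere (say, a spline package), and the function name is my own invention.
spline.smoother <- function(b, Omega, x, lambda) {
  n <- length(x)
  M <- solve(t(b) %*% b + n * lambda * Omega)
  beta.hat <- M %*% t(b) %*% x           # (b^T b + n lambda Omega)^{-1} b^T x
  w <- b %*% M %*% t(b)                  # the linear-smoother (weight) matrix
  list(coefficients = beta.hat,
       fitted = w %*% x,
       df = sum(diag(w)))                # effective degrees of freedom, tr(w), used below
}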
Once again, if this were ordinary linear regression, the OLS estimate of the coefficients would be \((\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \mathbf{x}\). In comparison to that, we’ve made two changes. First, we’ve substituted the basis-function matrix \(\mathbf{b}\) for the original matrix of predictor variables, \(\mathbf{t}\) — a change we’d have made already for a polynomial regression. Second, the “denominator” is not \(\mathbf{t}^T\mathbf{t}\), or even \(\mathbf{b}^T \mathbf{b}\), but \(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\). Since \(\mathbf{t}^T\mathbf{t}\) is \(n\) times the covariance matrix of the independent variables, we are taking the covariance matrix of the spline basis functions and adding some extra covariance — how much depends on the shapes of the functions (through \(\Omega\)) and how much smoothing we want to do (through \(\lambda\)). The larger we make \(\lambda\), the less the actual data matters to the fit.
In addition to explaining how splines can be fit quickly (do some matrix arithmetic), this illustrates two important tricks. One, which we won’t explore further here, is to turn a nonlinear regression problem into one which is linear in another set of basis functions. This is like using not just one transformation of the input variables, but a whole library of them, and letting the data decide which transformations are important. There remains the issue of selecting the basis functions, which can be quite tricky. In addition to the spline basis, most choices are various sorts of waves — sine and cosine waves of different frequencies, various wave-forms of limited spatial extent (“wavelets”), etc. The ideal is to choose a function basis where only a few non-zero coefficients would need to be estimated, but this requires some understanding of the data…
The other trick is that of stabilizing an unstable estimation problem by adding a penalty term. This reduces variance at the cost of introducing some bias.
Since the effective degrees of freedom of a linear smoother is \(\tr{\mathbf{w}}\), we’ve just shown that a spline has \[ \tr{\left(\mathbf{b}{\left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)}^{-1} \mathbf{b}^T\right)} \] degrees of freedom. Holding the data (and so \(\mathbf{b}\) and \(\Omega\)) fixed, as \(\lambda\) goes up, this will go down, and vice versa.
Pretty much any smoothing method we might use has some adjustable control settings6 — the size of the moving-average window, say, or the “bandwidth” of a kernel — which we need to set. How should we tune these control settings?
One of the simplest, but also most practical, ways of picking control settings is to make the smoother predict well on data it didn’t see. The uses to which we’re putting our smoother — interpolating to fill in gaps in the data, extrapolating beyond the range of the data, filtering to remove noise from the data — are all, in the end, about predicting unseen values. The most natural way to make sure the smoother can predict unseen values is to have it predict unseen values.
This brings us to the idea of cross-validation. There are many variants, but the key notion is this: set aside part of the data as a testing set; fit each candidate smoother to the rest of the data, the training set; use the fitted smoother to predict the held-out values; and record how large those prediction errors are.
This gives a score for each smoother, which says how well it was able to predict data which it didn’t get to see during the fitting or training process. This is exactly what we want.
The most straight-forward form of this is leave-one-out cross-validation (LOOCV).7 We do the following for each smoother we’re considering: for each data point \(i\), fit the smoother to all the data except observation \(i\); use that fit to predict at \(t_i\); and record the error \(x_i - \EstRegFunc^{(-i)}(t_i)\). The score for the smoother is then the average of the squared errors.
This makes it sound as though, with LOOCV, we need to estimate each possible smoother \(n\) times, which would be slow for a large data set. There is in fact a trick for linear smoothers which lets us fit the smoother just once. What happens when we hold back data point \(i\), and then make a prediction at \(t_i\)? The observed response at \(i\) can’t contribute to the prediction, but otherwise the linear smoother should work as before, so \[ \EstRegFunc^{(-i)}(t_i)= \frac{{(\mathbf{w} \mathbf{x})}_i - w_{ii} x_i}{1-w_{ii}} \] The numerator just removes the contribution to \(\EstRegFunc(t_i)\) that came from \(x_i\), and the denominator just re-normalizes the weights in the smoother. Now a little algebra says that \[ x_i - \EstRegFunc^{(-i)}(t_i) = \frac{x_i - \EstRegFunc(t_i)}{1-w_{ii}} \] The quantity on the left of that equation is what we want to square and average to get the leave-one-out CV score, but everything on the right can be calculated from the fit we did to the whole data. The leave-one-out CV score is therefore \[\begin{equation} \frac{1}{n}\sum_{i=1}^{n}{\left(\frac{x_i-\EstRegFunc(t_i)}{1-w_{ii}}\right)^2} \end{equation}\] Again, this formula requires both leave-one-out CV and a linear smoother.
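For a linear smoother whose weight matrix we have in hand, the short-cut takes one line of R. Here I re-use the moving-average matrix w (and n) built earlier, and invent some data to go with it.
x <- sin(2*pi*(1:n)/n) + rnorm(n, sd = 0.2)   # made-up observations at the n = 10 time points
mu.hat <- w %*% x                             # fitted values from the full data
mean(((x - mu.hat)/(1 - diag(w)))^2)          # leave-one-out CV score, with no re-fitting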
If we had independent observations, you might think that LOOCV would give us \(n\) independent, unbiased estimates of how well a smoother predicts new data, and so an unbiased, low-variance estimate of model performance. The problem is that even if the data points are independent, \(\EstRegFunc^{(-i)}\) and \(\EstRegFunc^{(-j)}\) are based on \(n-2\) shared data points, so the leave-one-out errors for each data point are pretty correlated with each other. (Can you work out the covariance between \(x_i - \EstRegFunc^{(-i)}(t_i)\) and \(x_j - \EstRegFunc^{(-j)}(t_j)\)?) One consequence of this is that LOOCV tends to over-fit — to pick models with more degrees of freedom than are strictly necessary.
We sometimes therefore prefer to use many data points in each testing set, so the predictions are less correlated with each other. The commonest approach is \(k\)-fold CV, where we divide all the data points into \(k\) equal groups, or “folds”, and each fold gets to be the testing set in turn. (Typically \(k=5\) or \(10\).) This is much less prone to over-fitting than leave-one-out.
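A minimal sketch of k-fold CV for choosing the effective degrees of freedom of a smoothing spline might look like the following; the function and its arguments are my own invention, not a standard routine.
kfold.cv.df <- function(x, t, df.grid, k = 5) {
  folds <- sample(rep(1:k, length.out = length(x)))    # randomly assign each point to a fold
  scores <- sapply(df.grid, function(df.try) {
    mean(sapply(1:k, function(fold) {
      train <- (folds != fold)
      fit <- smooth.spline(t[train], x[train], df = df.try)
      mean((x[!train] - predict(fit, t[!train])$y)^2)  # test-set MSE on the held-out fold
    }))
  })
  df.grid[which.min(scores)]                           # the setting with the best CV score
}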
These forms of cross-validation work very, very well for independent data, to the point where they’re pretty much the standard for any situation involving statistical modeling of independent data. But what about dependent data? If each data point tends to be like its neighbors, you might well worry that a smoother which just copies the neighbors will have an unfair advantage in cross-validation. At the very least, this would seem to lead to more noise in our estimate of how well each smoother predicts.
A natural way out of this is to do leave-one-out cross-validation, but to leave out a buffer of size (say) \(h\) around each point (Burman, Chow, and Nolan 1994). This buffer is not part of the training set, but we also don’t use it in evaluating the prediction of the smoother. (We do use values in the buffer when we calculate the prediction at the held-out point, however.) If remote observations are nearly independent of each other, and the buffer is big enough, then the testing set is nearly independent of the training set, and everything is well-behaved. You can combine this idea with \(k\)-fold cross-validation, where you omit a buffer around each block (Racine 2000).
The obvious question is “how big a buffer?”, to which the obvious answer is “big enough to make the training and test points nearly independent”. Since we usually don’t know that, we typically rely on rules of thumb. Burman, Chow, and Nolan (1994) suggest using a constant fraction of \(n\), say \(h=0.5 n\). Racine (2000) suggests that this rule is prone to over-fitting, and that a better one would be \(h = n^{0.25}\). I don’t know of a really great, automatic resolution of this issue.
In practice, it can often be easiest to just live with the extra imprecision and over-fitting that come from ignoring dependence, but we’ll come back to this later when we look in more detail at time-series forecasting.
Before moving on from smoothers, I should mention one last special consideration. When our data consists of a ratio or proportion, say \(X(t) = Y(t)/Z(t)\), we could smooth it the same way we’d smooth any other numerical data. However, if the denominator is small, any noise in it can have a massive effect on the ratio. It can make a lot more sense to separately smooth \(Y\) and \(Z\), and then take the ratio of the smoothed values. Kafadar (1996), from whom I learned this point, explores how best to do this with splines and related smoothers.
After we have chosen how to smooth the data, and gotten an estimate of the trend, we also have an estimate of the fluctuations. Since we can always write \[ X(r,t) = \TrueRegFunc(r,t) + \TrueNoise(r,t) \] where \(\TrueRegFunc\) is a deterministic (but unknown) function and \(\TrueNoise\) is a stochastic process, the natural estimate of the latter is \[ \EstNoise(r,t) \equiv X(r,t) - \EstRegFunc(r,t) \] These natural estimates of the noise are the residuals of the smoother.
When we use a linear smoother, we can find the residuals in matrix form: \[\begin{eqnarray} \mathbf{\EstNoise} & = & \mathbf{x} - \mathbf{\EstRegFunc}\\ & = & \mathbf{x} - \mathbf{w}\mathbf{x}\\ & = & (\mathbf{I} - \mathbf{w})\mathbf{x} \end{eqnarray}\] Thus the residuals, like the fitted values, are also linear in \(\mathbf{x}\). We can therefore work out properties of the residuals much as we did properties of the fitted values.
The expected value of the residual vector is \[\begin{eqnarray} \Expect{\mathbf{\EstNoise}} & = & \Expect{(\mathbf{I}-\mathbf{w})\mathbf{X}}\\ & = & (\mathbf{I}-\mathbf{w})\mathbf{\TrueRegFunc} \end{eqnarray}\] This will be zero — i.e., match the true expectation of the noise — if and only if \(\TrueRegFunc\) is an eigenvector of \(\mathbf{I}-\mathbf{w}\) with eigenvalue 0.
— It should not be hard to convince yourself that if \(\mathbf{v}\) is an eigenvector of \(\mathbf{w}\) with eigenvalue \(\lambda\), then \(\mathbf{v}\) is also an eigenvector of \(\mathbf{I}-\mathbf{w}\), with eigenvalue \(1-\lambda\). It should also not be hard to check that if \(\mathbf{w}\) is symmetric and idempotent, then so is \(\mathbf{I} - \mathbf{w}\).
The variance of the residuals is \[ \Var{\mathbf{\EstNoise}} = (\mathbf{I}-\mathbf{w}) \Var{\mathbf{\epsilon}} (\mathbf{I}-\mathbf{w})^T \] Under the special assumption that the true noise has constant variance and is uncorrelated, \(\Var{\mathbf{\epsilon}} = \sigma^2 \mathbf{I}\), and this simplifies to \[ \sigma^2 (\mathbf{I}-\mathbf{w})(\mathbf{I}-\mathbf{w})^T \] Notice that in general this is not \(\sigma^2 \mathbf{I}\). That is, the residuals generally have somewhat different variances, and correlation structure, than the true noise. Homework 2 will, among other things, explore some consequences of this.
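To see this concretely, take the moving-average matrix w from before and assume, purely for illustration, uncorrelated noise with \(\sigma^2 = 1\).
resid.var <- (diag(n) - w) %*% t(diag(n) - w)   # sigma^2 (I-w)(I-w)^T with sigma^2 = 1
round(diag(resid.var), 3)                       # the residual variances differ across positions
round(resid.var[1, 2], 3)                       # and neighboring residuals are correlated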
This section can be skipped, if you’re willing to take the fact that linear regression is really, or also, a linear smoother on trust. We will be coming back to some of this stuff later, when we look at linear predictors more generally.
Think of the simple linear regression model: \[ m(t) = \beta_0 + \beta_1 t \]
We usually estimate this by minimizing the mean squared error (MSE) — what’s called the ordinary least squares or OLS estimate: \[\begin{eqnarray} (\hat{\beta}_0, \hat{\beta}_1) = \argmin_{b_0, b_1}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - (b_0 + b_1 t_i))^2}} \end{eqnarray}\] Take the derivatives with respect to \(b_0\) and \(b_1\), and set them to zero. \[\begin{eqnarray} \frac{1}{n}\sum_{i=1}^{n}{2(x_i - (\hat{\beta}_0 + \hat{\beta}_1 t_i))(-1)} & = & 0\\ \frac{1}{n}\sum_{i=1}^{n}{2(x_i - (\hat{\beta}_0 + \hat{\beta}_1 t_i))(-t_i)} & = & 0 \end{eqnarray}\] We can drop the over-all factor of \(-2\) in what follows.
First, solve for \(\hat{\beta}_0\) in terms of \(\hat{\beta}_1\): \[\begin{eqnarray} \overline{x} - \hat{\beta}_0 - \hat{\beta}_1 \overline{t} & = & 0\\ \overline{x} - \hat{\beta}_1 \overline{t} & = & \hat{\beta}_0 \end{eqnarray}\] Now solve for \(\hat{\beta}_1\): \[\begin{eqnarray} \overline{xt} - \hat{\beta}_0 \overline{t} - \hat{\beta}_1 \overline{t^2} & = & 0\\ \overline{xt} - \overline{x}\overline{t} + \hat{\beta}_1 (\overline{t})^2 - \hat{\beta}_1 \overline{t^2} & = & 0\\ \frac{\overline{xt} - \overline{x}\overline{t}}{\overline{t^2} - \overline{t}^2} & = & \hat{\beta}_1 \end{eqnarray}\] It’s convenient to abbreviate the denominator as \(s_t^2\).
Putting this together, \[ \EstRegFunc(t) = \overline{x} + \frac{t - \overline{t}}{s_t^2}(\overline{xt} - \overline{x}\overline{t}) \] This is linear in \(x\), so we know it can be put in the linear-smoother form. In fact, \[ \EstRegFunc(t) = \sum_{j=1}^{n}{\left(\frac{1}{n} + \frac{(t-\overline{t})(t_j - \overline{t})}{n s_t^2}\right) x_j} \]
This particular form of weights ensures both that the estimated curve is a straight line, and that it comes as close to the data as possible, on average, among all straight lines.
Beyond that, however, the weights are really weird. The first part is just the sample mean, but the variable part is \(\propto (t-\overline{t})(t_j-\overline{t})\). This means we give more weight to data points which are far from the center of the data, no matter what. We also ramp up all the weights as the point \(t\) where we’re making a prediction moves away from the center of the data.
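As a sanity check (toy data, with names of my own choosing), we can build this weight matrix directly and compare it with lm's fitted values.
nn <- 20
tt <- runif(nn)
xx <- 3 - 2*tt + rnorm(nn)
s2 <- mean((tt - mean(tt))^2)    # this is s_t^2 in the notation above
w.ols <- outer(tt, tt, function(a, b) 1/nn + (a - mean(tt))*(b - mean(tt))/(nn*s2))
max(abs(w.ols %*% xx - fitted(lm(xx ~ tt))))   # numerically zero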
Suppose we want to make our prediction a linear combination of multiple variables: \[ m(t,z, \ldots u) = \beta_0 + \beta_1 t + \beta_2 z + \ldots \beta_p u \] We fit this by least squares again. It simplifies the math immensely if we define two matrices, \[ \mathbf{t} = \left[ \begin{array}{ccccc} 1 & t_1 & z_1 & \ldots & u_1\\ 1 & t_2 & z_2 & \ldots & u_2\\ \vdots & \vdots & \vdots & \ldots & \vdots\\ 1 & t_n & z_n & \ldots & u_n \end{array} \right] \] and \[ \mathbf{b} = \left[ \begin{array}{c} b_0 \\ b_1 \\ \vdots \\ b_p \end{array}\right] \] Then the fitted values are \[ \mathbf{m} = \mathbf{t}\mathbf{b} \] and the mean-squared error is \[ \frac{1}{n}\sum_{i=1}^{n}{(x_i - (b_0 + b_1 t_i + b_2 z_i + \ldots + b_p u_i))^2} = \frac{1}{n} (\mathbf{x} - \mathbf{t}\mathbf{b})^T (\mathbf{x} - \mathbf{t}\mathbf{b}) \] Taking the derivative and setting it to zero, \[ -\frac{2}{n}\mathbf{t}^T (\mathbf{x} - \mathbf{t}\mathbf{\widehat{\beta}}) = 0 \] Solving, \[ \mathbf{\widehat{\beta}} = (\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \mathbf{x} \] The fitted values come from \[ \mathbf{\EstRegFunc} = \mathbf{t} (\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \mathbf{x} \] so the weight matrix in this case is \[ \mathbf{w} = \mathbf{t} (\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \] You can check that this matrix is symmetric, and that it is idempotent. You can also check that when we just deal with one predictor, these matrix formulas are equivalent to the ones for simple linear regression above.
This formula handles all of the coefficients at once. You can check, by explicitly taking the partial derivative just with respect to \(b_0\) and setting that to 0, that it is still the case that \[ \hat{\beta}_0 = \overline{x} - (\hat{\beta}_1 \overline{t} + \hat{\beta}_2 \overline{z} + \ldots + \hat{\beta}_p \overline{u}) \] This tells us that the individual weights in \(\mathbf{w}\) still have the same form — they’re a constant, plus terms that give more weight to points far from the center of the data.
Nothing in the logic for the previous section requires that the extra predictors be new variables. If we want to do polynomial regression, \[ m(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \ldots \beta_p t^p \] then everything carries through exactly as before. The matrix \(\mathbf{t}\) would just have columns containing the constant, \(t_i\), \(t_i^2\) and so on.
What’s important is that we’re just taking an additive combination of the functions of \(t\) on the right-hand side of the model formula, and that all of the functions on the RHS are linearly independent of each other. (If there was linear dependence, then the matrix \(\mathbf{t}^T \mathbf{t}\) wouldn’t be invertible.)
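Here is a quick numerical check of the claims above, that \(\mathbf{w} = \mathbf{t}(\mathbf{t}^T\mathbf{t})^{-1}\mathbf{t}^T\) is symmetric and idempotent with eigenvalues 0 and 1, using a made-up quadratic design; all the names are mine.
tt <- runif(30)
tmat <- cbind(1, tt, tt^2)         # the matrix written t in the text
w.hat <- tmat %*% solve(t(tmat) %*% tmat) %*% t(tmat)
max(abs(w.hat - t(w.hat)))         # symmetric, up to rounding error
max(abs(w.hat %*% w.hat - w.hat))  # idempotent, up to rounding error
round(eigen(w.hat, symmetric = TRUE)$values, 6)   # eigenvalues are all (numerically) 0 or 1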
This section can be skipped if you are comfortable working with the eigenvalues and eigenvectors of matrices. Even if you are, it won’t hurt to at least skim it.
A (non-zero) vector or \(n\times 1\) matrix \(\mathbf{v}\) is an eigenvector of the \(n\times n\) matrix \(\mathbf{w}\), with eigenvalue \(\lambda\), when \[ \mathbf{w}\mathbf{v} = \lambda\mathbf{v} \] We can re-write this as \[ (\mathbf{w} - \lambda\mathbf{I})\mathbf{v} = 0 \] and demanding \(\mathbf{v} \neq 0\) tells us that the \(\lambda\) which make this true have to obey the characteristic polynomial or eigenpolynomial of the matrix \[ \det{(\mathbf{w} - \lambda\mathbf{I})} = 0 \]
In R, the command to find the eigenvalues and eigenvectors of a matrix is eigen:
theta <- runif(n=2)
# Why "mark"?
mark <- matrix(c(theta[1], 1-theta[1], 1-theta[2], theta[2]),
byrow=TRUE, nrow=2, ncol=2)
mark
## [,1] [,2]
## [1,] 0.1080034 0.8919966
## [2,] 0.6706769 0.3293231
eigen(mark)
## eigen() decomposition
## $values
## [1] 1.0000000 -0.5626735
##
## $vectors
## [,1] [,2]
## [1,] -0.7071068 -0.7992771
## [2,] -0.7071068 0.6009627
This returns a list with two components, one of them a vector of eigenvalues, the other a matrix whose columns are the eigenvectors.
There are a number of salient properties of eigenvalues and eigenvectors.
1. The determinant of a matrix is the product of its eigenvalues, “counted with multiplicity”: \[ \det{\mathbf{w}} = \prod_{j=1}^{n}{\lambda_j} \]
2. The trace of a matrix is the sum of its eigenvalues: \[ \tr{\mathbf{w}} = \sum_{j=1}^{n}{\lambda_j} \]
3. For any matrix \(\mathbf{w}\), and any power \(k > 0\), \(\mathbf{v}\) is an eigenvector of \(\mathbf{w}\) with eigenvalue \(\lambda\) if and only if \(\mathbf{v}\) is an eigenvector of \(\mathbf{w}^k\) with eigenvalue \(\lambda^k\). If \(\mathbf{w}\) is invertible, then \(\mathbf{v}\) is an eigenvector of \(\mathbf{w}^{-1}\) with eigenvalue \(1/\lambda\).
4. If \(\mathbf{v}\) is an eigenvector with eigenvalue \(\lambda\), then for any scalar \(a \neq 0\), \(a\mathbf{v}\) is also an eigenvector, with the same eigenvalue10. This means that the eigenvectors, unlike the eigenvalues, aren’t uniquely defined. Instead, we usually choose eigenvectors to have nice properties, e.g., length 1. Also, \(-\mathbf{v}\) is always another eigenvector, so two calculations of eigenvectors might end up with different signs.
5. If \(\mathbf{v}\) and \(\mathbf{u}\) are eigenvectors with the same eigenvalue \(\lambda\), then so is \(a\mathbf{v}+b\mathbf{u}\), for any scalars \(a, b\). (Check this yourself!)
6. For fixed \(\lambda\), the eigenvectors solve the equation \((\mathbf{w} - \lambda\mathbf{I})\mathbf{v} = 0\). This matrix equation is equivalent to a system of \(n\) linear equations in the \(n\) unknowns \((v_1, v_2, \ldots v_n)\). If there are no redundant (linearly dependent) equations in this system, then the solution is unique, up to the over-all scaling factor we just saw in (4). But if there are redundant equations, we can have a whole space of solutions, the eigenspace of \(\lambda\). The number of redundant equations gives the dimension of this eigenspace, a.k.a. the geometric multiplicity of \(\lambda\). When there are multiple, linearly-independent eigenvectors with eigenvalue \(\lambda\), we can choose orthogonal eigenvectors to represent the whole eigenspace.
7. If \(\mathbf{w}\) is symmetric, and \(\mathbf{v}\) and \(\mathbf{u}\) are eigenvectors with distinct eigenvalues \(\lambda_1 \neq \lambda_2\), then \(\mathbf{v}^T\mathbf{u} = 0\), and we say the vectors are orthogonal11. Remark: The orthogonality of eigenvectors actually holds for many asymmetric matrices as well, so long as they are normal, meaning \(\mathbf{w}^T\mathbf{w} = \mathbf{w}\mathbf{w}^T\). The proof is similar, but more involved. I am afraid I mis-spoke in lecture when I asserted that it is always true for all matrices12.
8. If \(\mathbf{w} = \mathbf{w}^T\), then all the eigenvalues are real. Because eigenvectors with distinct eigenvalues are orthogonal in this case, and we can always normalize the eigenvectors to unit length, the matrix \(\mathbf{e}\) whose columns are the eigenvectors satisfies \(\mathbf{e}^{-1} = \mathbf{e}^T\), and \(\mathbf{w} = \mathbf{e}\mathbf{\lambda}\mathbf{e}^T\), where \(\mathbf{\lambda}\) is the diagonal matrix of eigenvalues.
A matrix is called stochastic if \(w_{ij} \geq 0\) for all \(i, j\), and if \(\sum_{j=1}^{n}{w_{ij}} = 1\) for all \(i\). Some very important properties follow from this: 1 is always an eigenvalue, and the vector whose entries are all equal is a corresponding eigenvector; moreover, every eigenvalue is at most 1 in magnitude.
These are all consequences of a result in linear algebra called the Frobenius-Perron theorem, which we will take on trust.
A matrix is called idempotent13 when \(\mathbf{w}^2 = \mathbf{w} \mathbf{w} = \mathbf{w}\). It follows14 that the only possible eigenvalues are 0 and 1.
Projecting on to a linear subspace (line, plane, hyper-plane, etc.) always corresponds to an idempotent matrix. (Points already in the space aren’t affected by the projection, so projecting twice is the same as projecting once.) The dimension of the space corresponds to the number of eigenvalues equal to 1.
Burman, Prabir, Edmond Chow, and Deborah Nolan. 1994. “A Cross-Validatory Method for Dependent Data.” Biometrika 81:351–58. https://doi.org/10.1093/biomet/81.2.351.
Hodrick, Robert J., and Edward C. Prescott. 1997. “Postwar U.S. Business Cycles: An Empirical Investigation.” Journal of Money, Credit, and Banking 29:1–16.
Kafadar, Karen. 1996. “Smoothing Geographical Data, Particularly Rates of Disease.” Statistics in Medicine 15:2539–60. https://doi.org/10.1002/(SICI)1097-0258(19961215)15:23<2539::AID-SIM379>3.0.CO;2-B.
Kolmogorov, A. N., and S. V. Fomin. 1975. Introductory Real Analysis. New York: Dover.
Paige, Robert L., and A. Alexandre Trindade. 2010. “The Hodrick-Prescott Filter: A Special Case of Penalized Spline Smoothing.” Electronic Journal of Statistics 4:856–74. https://doi.org/10.1214/10-EJS570.
Racine, Jeff. 2000. “Consistent Cross-Validatory Model-Selection for Dependent Data: Hv-Block Cross-Validation.” Journal of Econometrics 99:39–61. https://doi.org/10.1016/S0304-4076(00)00030-0.
Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.
Wood, Simon N. 2004. “Stable and Efficient Multiple Smoothing Parameter Estimation for Generalized Additive Models.” Journal of the American Statistical Association 99:673–86. http://www.maths.bath.ac.uk/~sw283/simon/papers/magic.pdf.
———. 2006. Generalized Additive Models: An Introduction with R. Boca Raton, Florida: Chapman; Hall/CRC.
I once analyzed an experiment where a mechanical vibrator repeatedly gave very precise tweaks to a rat’s whiskers. (Rats devote a lot of their brain to processing sensations from their whiskers, which help them navigate in the dark.)↩
As you remember from calculus, the Taylor expansion of a function \(f\) around a point \(t_0\) is \(f(t) = f(t_0) + (t-t_0) f^{\prime}(t_0) + \frac{1}{2}(t-t_0)^2 f^{\prime\prime}(t_0) + \ldots\), where \(f^{\prime}\), \(f^{\prime\prime}\), etc., are the first, second, etc., derivatives. In a Taylor approximation, we truncate the series at a certain order. The error of a \(k^{\mathrm{th}}\) order approximation is \(O((t-t_0)^{k+1})\).↩
We will come back to this connection, between moving averages and sine waves, when we look at Fourier analysis later in the course. For now, you can take it on trust. — If you are unwilling to take it on trust, think about how you’d approximate a 2nd derivative with finite differences: \(\frac{\frac{x(t+h) - x(t)}{h} - \frac{x(t) - x(t-h)}{h}}{h} = \frac{x(t+h) - 2 x(t) + x(t-h)}{h^2}\). So replacing \(x(t)\) with the moving average is approximately the same as replacing \(x(t)\) with \(x(t) + \kappa x^{\prime\prime}(t)\) for some scaling factor \(\kappa\). And you can verify for yourself that every sine or cosine function is an eigenfunction of this differential operator, i.e., that \(\left(1+\kappa\frac{d^2}{dt^2}\right)\sin{\omega t} = \lambda\sin{\omega t}\). (Can you find \(\lambda\)?) So it should make some sense that the eigenvectors of the smoothing matrix are close to the eigenfunctions of the differential operator.↩
People have defined many variants on splines. When I use the word without an adjective, I mean smoothing splines, as defined by the optimization problem immediately following this footnote.↩
The Dirac delta function (named after the physicist P. A. M. Dirac) is defined by how it integrates: for any function \(f(t)\), \(\int{f(t) \delta(t) dt} = f(0)\). You can think of this as the limit of a sequence of Gaussian pdfs, all centered at 0, with variances shrinking towards zero. The limit would be something like a function \(\delta\) which was \(0\) except at the origin, but had an infinitely skinny, infinitely tall spike at the origin, with the total area under the curve being exactly 1. There isn’t really a function like that, but it’s definitely true that \(\lim_{\sigma\rightarrow 0}{\int{f(t) \frac{1}{\sigma}\phi(t/\sigma) dt}} = f(0)\), so \(\delta(t)\) makes sense in integrals. One can in fact build a whole theory of “generalized functions” which make sense in integrals in this way, including actual functions as a special case. If this interests you, Kolmogorov and Fomin (1975) is still one of the best ways in to the subject. (Dirac, being a physicist, didn’t let something like \(\delta\) not really being a function stop him from calculating.)↩
Some people call these “hyperparameters”, rather than “control settings”. I think this is a misleading name, because they’re not in any way parameters of the process we’re studying, they’re just aspects of our method.↩
Some of the text in this section is modified from Shalizi (n.d.).↩
The same idea works if you’d rather use the absolute error, a “robust” loss function, etc.↩
\(\det{(\mathbf{w} - \lambda\mathbf{I})} = 0\) is called the characteristic polynomial because working out the determinant gives us an \(n^{\mathrm{th}}\) order polynomial in \(\lambda\). This has at most \(n\) distinct roots, some of which might be complex; the same \(\lambda\) can appear multiple times as a root.↩
By the ordinary rules for matrix multiplication, \(\mathbf{w}(a\mathbf{v}) = a\mathbf{w}\mathbf{v} = a\lambda \mathbf{v}\), because \(\mathbf{v}\) is an eigenvector.↩
Since \(\mathbf{w}\) is symmetric, \(\mathbf{w} = \mathbf{w}^T\), we have \(\mathbf{u}^T \mathbf{w}\mathbf{v} = \mathbf{v}^T \mathbf{w}\mathbf{u}\). Because these are eigenvectors, this means \[\begin{eqnarray} \mathbf{u}^T \lambda_1 \mathbf{v} & = & \mathbf{v}^T \lambda_2 \mathbf{u}\\ \lambda_1 \mathbf{u}^T\mathbf{v} & = & \lambda_2 \mathbf{v}^T \mathbf{u} \end{eqnarray}\] but since \(\mathbf{v}^T\mathbf{u} = \mathbf{u}^T\mathbf{v}\) (it’s a \(1\times 1\) matrix, which is always symmetric), we get \[ (\lambda_1 - \lambda_2)( \mathbf{u}^T\mathbf{v}) = 0 \] and we assumed \(\lambda_1 \neq \lambda_2\).↩
Here’s a counter-example. The matrix \(\left[ \begin{array}{cc} 2 & 0 \\ 2 & 3 \end{array} \right]\) has eigenvalues \(3\) and \(2\), and the corresponding eigenvectors are \(\left[ 0, 1 \right]\) and \(\left[ 0.4472136, -0.8944272 \right]\), which aren’t orthogonal.↩
From Latin idem, “the same”, and potens, “power”, hence something which is the same as its powers.↩
If \(\mathbf{v}\) is an eigenvector of \(\mathbf{w}\) with eigenvalue \(\lambda\), then clearly \(\mathbf{w}^2 \mathbf{v} = \lambda^2\mathbf{v}\), so \(\lambda^2\) is an eigenvalue of \(\mathbf{w}^2\). But, since we’re assuming \(\mathbf{w}\) is idempotent, \(\mathbf{w}^2 = \mathbf{w}\), so all eigenvalues must obey the equation \(\lambda^2=\lambda\), and the only two solutions are 0 and 1.↩