Trends and Smoothing II
36-467/36-667
4 September 2018
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\dof}{DoF}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our last episode…
- Data \(X(t) = \TrueRegFunc(t) + \TrueNoise(t)\)
- \(\TrueRegFunc\) deterministic (=trend), \(\TrueNoise\) stochastic and mean-zero (=fluctuations)
- Wanted: estimates of \(\TrueRegFunc\) and/or \(\TrueNoise\) from one data set
- Hope: \(\TrueRegFunc\) is a smooth function \(\Rightarrow\) average nearby \(X\)’s
- Linear smoother: \(\EstRegFunc(t) = \sum_{j=1}^{n}{w(t, t_j) x_j}\)
- Fitted values on the data \(\mathbf{\EstRegFunc} = \mathbf{w}\mathbf{x}\)
- \(\mathbf{w}\) is the source of all knowledge
Expectation of the fitted values
\[\begin{eqnarray}
\Expect{\mathbf{\EstRegFunc}} & = & \Expect{\mathbf{w}\mathbf{X}}\\
& = & \mathbf{w}\Expect{\mathbf{X}}\\
& = & \mathbf{w} \mathbf{\mu}
\end{eqnarray}\]
Unbiased estimate \(\Leftrightarrow \mathbf{w} \mathbf{\mu} = \mathbf{\mu}\)
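A worked special case: if every row of \(\mathbf{w}\) sums to 1 (as in the moving-average example below), then a constant trend passes through unchanged, so the estimate is unbiased whenever \(\TrueRegFunc\) is constant:
\[
\TrueRegFunc(t) = c \Rightarrow {(\mathbf{w} \mathbf{\mu})}_i = c \sum_{j=1}^{n}{w_{ij}} = c
\]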
Expanding in eigenvectors
- Generally, \(\mathbf{w}\) has \(n\) linearly-independent eigenvectors \(\mathbf{e}_1, \ldots \mathbf{e}_n\), with eigenvalues \(\lambda_1, \ldots \lambda_n\)
- So \(\mathbf{x} = \sum_{j=1}^{n}{c_j \mathbf{e}_j}\)
- So \(\mathbf{w}\mathbf{x} = \mathbf{w}\sum_{j=1}^{n}{c_j \mathbf{e}_j} = \sum_{j=1}^{n}{c_j \lambda_j \mathbf{e}_j}\)
- Components of the data which match large-\(\lambda\) eigenvectors are enhanced
- Components of the data which match small-\(\lambda\) eigenvectors are shrunk
A little example
The weight matrix \(\mathbf{w}\) for a 3-point moving average on 10 time points, with the weights renormalized at the ends:
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.5000000 0.5000000 0.0000000 0.0000000 0.0000000 0.0000000
## [2,] 0.3333333 0.3333333 0.3333333 0.0000000 0.0000000 0.0000000
## [3,] 0.0000000 0.3333333 0.3333333 0.3333333 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000 0.3333333 0.3333333 0.3333333 0.0000000
## [5,] 0.0000000 0.0000000 0.0000000 0.3333333 0.3333333 0.3333333
## [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.3333333 0.3333333
## [7,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3333333
## [8,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [9,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [10,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [,7] [,8] [,9] [,10]
## [1,] 0.0000000 0.0000000 0.0000000 0.0000000
## [2,] 0.0000000 0.0000000 0.0000000 0.0000000
## [3,] 0.0000000 0.0000000 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000 0.0000000 0.0000000
## [5,] 0.0000000 0.0000000 0.0000000 0.0000000
## [6,] 0.3333333 0.0000000 0.0000000 0.0000000
## [7,] 0.3333333 0.3333333 0.0000000 0.0000000
## [8,] 0.3333333 0.3333333 0.3333333 0.0000000
## [9,] 0.0000000 0.3333333 0.3333333 0.3333333
## [10,] 0.0000000 0.0000000 0.5000000 0.5000000
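The code behind this output isn't shown above; here is a sketch in R that builds a weight matrix like this one (a 3-point moving average, with the two endpoints averaging their two in-range points) and takes its eigendecomposition:

```r
n <- 10
w <- matrix(0, n, n)
for (i in 1:n) {
    neighbors <- max(1, i - 1):min(n, i + 1)  # point i plus its in-range neighbors
    w[i, neighbors] <- 1 / length(neighbors)  # equal weights, so each row sums to 1
}
eigen.w <- eigen(w)
eigen.w$values        # eigenvalues, sorted by decreasing magnitude
eigen.w$vectors[, 1]  # leading eigenvector (constant, up to sign)
```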
A little example
The eigenvalues of \(\mathbf{w}\), in decreasing order of magnitude:
## [1] 1.00000000 0.96261129 0.85490143 0.68968376 0.48651845
## [6] -0.31012390 0.26920019 -0.23729622 -0.11137134 0.06254301
The leading eigenvector, with eigenvalue 1, is constant (\(1/\sqrt{10} \approx 0.316\) in every coordinate):
## [1] 0.3162278 0.3162278 0.3162278 0.3162278 0.3162278 0.3162278 0.3162278
## [8] 0.3162278 0.3162278 0.3162278
Variance of the fitted values
\[\begin{eqnarray}
\Var{\mathbf{\EstRegFunc}} & = & \Var{\mathbf{w}\mathbf{X}}\\
& = & \mathbf{w}\Var{\mathbf{X}}\mathbf{w}^T\\
& = & \mathbf{w}\Var{\mathbf{\TrueRegFunc} + \mathbf{\TrueNoise}}\mathbf{w}^T\\
& = & \mathbf{w}\Var{\mathbf{\TrueNoise}}\mathbf{w}^T
\end{eqnarray}\]
IF \(\Var{\mathbf{\TrueNoise}} = \sigma^2 \mathbf{I}\), THEN \(\Var{\mathbf{\EstRegFunc}} = \sigma^2\mathbf{w}\mathbf{w}^T\)
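With the moving-average \(\mathbf{w}\) from the little example, and taking \(\sigma^2 = 1\), this is one line to check (a sketch, reusing w from above):

```r
var.fitted <- w %*% t(w)  # Var of fitted values when sigma^2 = 1
diag(var.fitted)          # 1/2 at the ends, 1/3 inside: smoothing shrinks variance
```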
How much do the fitted values respond to the data?
\[\begin{eqnarray}
\sum_{i=1}^{n}{\Cov{\EstRegFunc_i, X_i}} & = & \sum_{i=1}^{n}{\Cov{\sum_{j=1}^{n}{w_{ij} X_j}, X_i}}\\
& = & \sum_{i=1}^{n}{\sum_{j=1}^{n}{w_{ij} \Cov{X_i, X_j}}}\\
& = & \sum_{i=1}^{n}{\sum_{j=1}^{n}{w_{ij} \Cov{\TrueNoise_i, \TrueNoise_j}}}
\end{eqnarray}\]
IF \(\Var{\mathbf{\TrueNoise}} = \sigma^2 \mathbf{I}\), THEN this \(= \sigma^2\tr{\mathbf{w}} = \sigma^2 \text{(sum of eigenvalues)}\)
\(\tr{\mathbf{w}} =\) (effective) degrees of freedom
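For the little example, the effective degrees of freedom comes out the same whether we take the trace or sum the eigenvalues (a sketch, reusing w from above):

```r
sum(diag(w))          # trace: 2*(1/2) + 8*(1/3) = 11/3, about 3.67
sum(eigen(w)$values)  # sum of the eigenvalues: the same number
```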
Data = trend + fluctuation
- \(X(t) = \TrueRegFunc(t) + \TrueNoise(t)\)
- \(\Rightarrow\) \(\TrueNoise(t) = X(t) - \TrueRegFunc(t)\)
- \(\Rightarrow\) \(\EstNoise(t) \equiv X(t) - \EstRegFunc(t) =\) residuals
\[\begin{eqnarray}
\mathbf{\EstNoise} & = & \mathbf{x} - \mathbf{\EstRegFunc}\\
& = & \mathbf{x} - \mathbf{w}\mathbf{x}\\
& = & (\mathbf{I} - \mathbf{w})\mathbf{x}
\end{eqnarray}\]
Convince yourself: \(\mathbf{I}-\mathbf{w}\) has same eigenvectors as \(\mathbf{w}\), but eigenvalues \(1-\lambda\)
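One line of algebra confirms it: if \(\mathbf{w}\mathbf{e}_j = \lambda_j \mathbf{e}_j\), then
\[
(\mathbf{I}-\mathbf{w})\mathbf{e}_j = \mathbf{e}_j - \mathbf{w}\mathbf{e}_j = (1-\lambda_j)\mathbf{e}_j
\]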
Expected residuals
\[\begin{eqnarray}
\Expect{\mathbf{\EstNoise}} & = & \Expect{(\mathbf{I}-\mathbf{w})\mathbf{X}}\\
& = & (\mathbf{I}-\mathbf{w})\mathbf{\TrueRegFunc}
\end{eqnarray}\]
Biased trend estimate \(\Leftrightarrow\) biased fluctuation estimate
Break for the in-class exercise
- \(X(t) = \TrueRegFunc(t) + \TrueNoise(t)\), with \(\Var{\TrueNoise(t)} = \sigma^2\) and \(\Cov{\TrueNoise(t_1), \TrueNoise(t_2)} = 0\) for \(t_1 \neq t_2\)
- Set \(\EstRegFunc(t) = \frac{1}{3}\sum_{s=t-1}^{t+1}{X(s)}\)
- Ignore the ends of the data where we don’t have neighbors on both sides
- What is \(\Cov{\EstRegFunc(t), \EstRegFunc(t+1)}\)?
- What is \(\Cov{\EstRegFunc(t), \EstRegFunc(t+2)}\)?
- What is \(\Cov{\EstRegFunc(t), \EstRegFunc(t+3)}\)?
- Why aren’t all of these 0?
Variance and covariance of the residuals
\[
\Var{\mathbf{\EstNoise}} = (\mathbf{I}-\mathbf{w}) \Var{\mathbf{\epsilon}} (\mathbf{I}-\mathbf{w})^T
\]
IF \(\Var{\mathbf{\epsilon}} = \sigma^2 \mathbf{I}\), THEN this \(= \sigma^2 (\mathbf{I}-\mathbf{w})(\mathbf{I}-\mathbf{w})^T\)
NB: Even when the noise is uncorrelated, the residuals are correlated, because of the off-diagonal entries in \(\mathbf{w}\)
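A sketch of the point, reusing the moving-average w from the little example: even with uncorrelated noise (\(\sigma^2 = 1\)), the residual covariance matrix has non-zero off-diagonal entries:

```r
resid.cov <- (diag(10) - w) %*% t(diag(10) - w)  # Var of residuals when sigma^2 = 1
round(resid.cov[1:4, 1:4], 3)                    # off-diagonal entries are not 0
```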
Splines
\[
\EstRegFunc = \argmin_{m}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - m(t_i))^2} + \lambda\int{(m^{\prime\prime}(t))^2 dt}}
\]
- This \(\lambda\) is not an eigenvalue (sorry)
- Trades off fitting the data points against over-all curvature
- Minimization is over all functions
- Solution is always a piecewise cubic polynomial, continuous and with continuous 1st and 2nd derivatives
- \(\lambda \rightarrow 0\) \(\Rightarrow\) Curve interpolates the data points exactly
- \(\lambda \rightarrow \infty\) \(\Rightarrow\) Global linear fit
- \(\downarrow\) degrees of freedom as \(\uparrow \lambda\)
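A sketch of the two limits in R, on simulated data (the sine trend and all names here are illustrative, not from the lecture):

```r
time <- 1:100
x.sim <- sin(2 * pi * time / 100) + rnorm(100, sd = 0.5)  # smooth trend + noise
fit.wiggly <- smooth.spline(time, x.sim, df = 50)  # small penalty: nearly interpolates
fit.stiff <- smooth.spline(time, x.sim, df = 2)    # large penalty: nearly a straight line
plot(time, x.sim)
lines(fit.wiggly, col = "grey")
lines(fit.stiff, col = "blue")
```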
How do we pick \(\lambda\)?
- Want trend to predict not-yet-seen stuff (interpolate, extrapolate, filter)
- A good \(\lambda\) predicts new stuff well
- Hold out part of the data and try to predict that from the rest
Leave-one-out cross-validation (LOOCV)
- For each of the \(n\) data points \(i\):
    - Fit using every data point except \(i\), getting \(\EstRegFunc^{(-i)}\);
    - Find the prediction \(\EstRegFunc^{(-i)}(t_i)\);
    - Find the squared error \((x_i - \EstRegFunc^{(-i)}(t_i))^2\).
- Average over all data points: \(n^{-1}\sum_{i=1}^{n}{(x_i - \EstRegFunc^{(-i)}(t_i))^2}\)
- Low LOOCV \(\Leftrightarrow\) good ability to predict new data
This is what smooth.spline does automatically
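By default smooth.spline picks \(\lambda\) by generalized cross-validation; setting cv = TRUE asks for ordinary leave-one-out CV instead (reusing the simulated data from above):

```r
fit <- smooth.spline(time, x.sim, cv = TRUE)  # leave-one-out CV over lambda
fit$lambda   # the selected penalty
fit$cv.crit  # the cross-validation score at that penalty
```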
Leave-one-out cross-validation (LOOCV)
Don’t have to re-fit linear smoothers \(n\) times
\[\begin{eqnarray}
\EstRegFunc^{(-i)}(t_i) & = & \frac{{(\mathbf{w}\mathbf{x})}_i - w_{ii} x_i}{1-w_{ii}}\\
x_i - \EstRegFunc^{(-i)}(t_i) & = & \frac{x_i - \EstRegFunc(t_i)}{1-w_{ii}}\\
\mathrm{LOOCV} & = & \frac{1}{n}\sum_{i=1}^{n}{\left(\frac{x_i-\EstRegFunc(t_i)}{1-w_{ii}}\right)^2}
\end{eqnarray}\]
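For any linear smoother with a known weight matrix, the shortcut is a one-liner; a sketch, reusing the 10×10 w from the little example with a placeholder data vector:

```r
x.small <- rnorm(10)     # placeholder data
mu.hat <- w %*% x.small  # fitted values from the smoother
mean(((x.small - mu.hat) / (1 - diag(w)))^2)  # LOOCV by the shortcut formula
```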
Many variants
- \(h\)-block CV: omit a buffer of radius \(h\) around the hold-out point from the training set
- \(k\)- or \(v\)-fold CV: divide data into \(k\) equal-sized “folds”, try to predict each fold using the rest of the data
- \(hv\)-block CV: \(v\)-fold with a buffer
- etc., etc.
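A sketch of \(v\)-fold CV for choosing the spline penalty (5 folds; the grid of penalties and all names are illustrative, reusing the simulated data from above):

```r
v <- 5
folds <- sample(rep(1:v, length.out = length(x.sim)))  # random fold labels
lambdas <- 10^seq(-8, 0, length.out = 9)               # candidate penalties
cv.score <- sapply(lambdas, function(lam) {
    mean(sapply(1:v, function(f) {
        fit <- smooth.spline(time[folds != f], x.sim[folds != f], lambda = lam)
        pred <- predict(fit, time[folds == f])$y
        mean((x.sim[folds == f] - pred)^2)  # held-out MSE for this fold
    }))
})
lambdas[which.min(cv.score)]  # penalty with the best held-out error
```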
The moral
- Never care about how good the in-sample fit is (\(R^2\), \(R^2_{adj}\), etc.)
- Always care about ability to predict new data
Summing up
- If the trend is smooth, we can estimate it by smoothing
- Every smoother is biased towards some patterns and against others
- Properties of the fitted values come from the weights
- Fluctuations are residuals after removing a trend
- De-trending can create correlations
- We decide how to smooth by cross-validation