Linear Prediction for Time Series
36-467/36-667
20 September 2018
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our last episode
- General approach to optimal linear prediction
- Predict \(Y\) from \(\vec{Z} = [Z_1, Z_2, \ldots Z_p]\)
- Prediction is \(\alpha + \vec{\beta} \cdot \vec{Z}\)
- Best \(\alpha = \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}\)
- Best \(\vec{\beta} = \Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\) (a quick R sketch of this recap follows below)
- Today: application to time series
- What can this do for us?
- How do we find the covariances?
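A minimal R sketch of the recap (the toy data and the names Z, y, alpha, beta are invented here, not from the lecture): compute \(\vec{\beta}\) and \(\alpha\) from sample moments, as a literal transcription of the formulas above.
set.seed(42)
Z <- matrix(rnorm(300), ncol = 3)           # three predictor variables, 100 cases
y <- 2 + Z %*% c(1, -1, 0.5) + rnorm(100)   # response with known coefficients
beta <- solve(var(Z)) %*% cov(Z, y)         # Var[Z]^{-1} Cov[Z, Y]
alpha <- mean(y) - colMeans(Z) %*% beta     # E[Y] - beta . E[Z]
predictions <- as.numeric(alpha) + Z %*% beta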
Optimal linear prediction for time series
- Given: \(X(t_1), X(t_2), \ldots X(t_n)\)
- Desired: prediction of \(X(t_0)\)
\[\begin{eqnarray}
\EstRegFunc(t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(t_1) \\ X(t_2) \\ \vdots \\ X(t_n) \end{array}\right]\\
\alpha & = & \Expect{X(t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(t_1)}\\ \Expect{X(t_2)} \\ \vdots \\ \Expect{X(t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\
\vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(t_1)} & \Cov{X(t_1), X(t_2)} & \ldots & \Cov{X(t_1), X(t_n)}\\
\Cov{X(t_1), X(t_2)} & \Var{X(t_2)} & \ldots & \Cov{X(t_2), X(t_n)}\\
\vdots & \vdots & \ldots & \vdots\\
\Cov{X(t_1), X(t_n)} & \Cov{X(t_2), X(t_n)} & \ldots & \Var{X(t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(t_0), X(t_1)}\\
\Cov{X(t_0), X(t_2)}\\ \vdots \\ \Cov{X(t_0), X(t_n)}\end{array}\right]
\end{eqnarray}\]
Interpolation
- Time series often have gaps
- Instruments fail, people mess up, circumstances…
- What happened at the times in between the observations?
Back to Kyoto
- A lot of this is just made up
What we didn’t tell R to make up
When did the cherries flower in 1015?
- We need \(\Cov{X(1015), X(t_i)}\) for every year \(t_i\) where we do have data
- We need \(\Expect{X(t_i)}\), \(\Var{X(t_i)}\) and \(\Cov{X(t_i), X(t_j)}\) ditto
- We need \(\Expect{X(1015)}\)
Getting the expectations and covariances
- We only see each \(X(t_i)\) once
- Maybe gives us an idea of \(\Expect{X(t_i)}\)
- but not \(\Var{X(t_i)}\) or \(\Cov{X(t_i), X(t_j)}\)
- let alone \(\Cov{X(t_0), X(t_i)}\)
- We could repeat the experiment many times
- We could make assumptions
Repeating the experiment
\[\begin{eqnarray}
\overline{x}(t) \equiv \frac{1}{n}\sum_{i=1}^{n}{x^{(i)}(t)} & \rightarrow & \Expect{X(t)}\\
\frac{1}{n}\sum_{i=1}^{n}{(x^{(i)}(t) - \overline{x}(t)) (x^{(i)}(s) - \overline{x}(s))} & \rightarrow & \Cov{X(t), X(s)}
\end{eqnarray}\]
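With genuine replicates, these estimates are one line each in R. A sketch with simulated replicates (the simulation and the names x.reps, xbar, gamma.hat are illustrative, not from the lecture):
n.rep <- 200                                            # number of repeated runs
x.reps <- matrix(rnorm(n.rep * 50), nrow = n.rep)       # one replicate series per row, 50 time points
x.reps <- x.reps + outer(rep(1, n.rep), sin((1:50)/5))  # add a common mean function
xbar <- colMeans(x.reps)   # estimates E[X(t)] at each time point
gamma.hat <- cov(x.reps)   # (t, s) entry estimates Cov[X(t), X(s)] (divisor n-1 rather than n)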
Making assumptions
- Assume some covariances (and expectations) are equal
- Weaker: assume some of them are similar
Stationarity
- A time series is weakly stationary when \(\Expect{X(t)} = \Expect{X(s)}\) (constant in time) and \(\Cov{X(t), X(s)} = \gamma(|t-s|)\)
- \(\gamma(h)\) is the autocovariance function at lag \(h\)
- a.k.a. “second-order” or “wide-sense” stationarity
- Assumes all the expectations are the same
- Assumes \(\Cov{X(t), X(t+h)} = \Cov{X(s), X(s+h)}\) so lots of covariances are the same
- Pool information (sketched in R below): \[\begin{eqnarray}
\overline{x} \equiv \frac{1}{n}\sum_{i=1}^{n}{x(t_i)} & \rightarrow & \Expect{X(0)}\\
\frac{1}{n}\sum_{i=1}^{n-h}{(x(t_i) - \overline{x})(x(t_i + h) - \overline{x})} & \rightarrow & \gamma(h)
\end{eqnarray}\]
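To make the pooled estimator concrete, a hand-rolled version for an evenly-spaced series with no missing values (the function names are mine; acf(), introduced below, does the same job):
gamma.hat <- function(x, h) {
    # sample autocovariance at lag h, dividing by n as acf() does
    n <- length(x)
    x.bar <- mean(x)
    sum((x[1:(n - h)] - x.bar) * (x[(1 + h):n] - x.bar))/n
}
rho.hat <- function(x, h) {
    gamma.hat(x, h)/gamma.hat(x, 0)
}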
The autocovariance function
- \(\gamma(0) = \Var{X(t)}\) (constant in \(t\))
- \(\rho(h) \equiv \frac{\gamma(h)}{\gamma(0)} =\) autocorrelation function
Assuming stationarity…
\[\begin{eqnarray}
\EstRegFunc(t_0) & = & \overline{x} + {\left[\begin{array}{c} X(t_1) -\overline{x}\\
X(t_2) - \overline{x} \\ \vdots \\ X(t_n) -\overline{x}\end{array}\right]}^{T}
{\left[\begin{array}{cccc} \gamma(0) & \gamma(|t_1 - t_2|) & \ldots & \gamma(|t_1 - t_n|)\\
\gamma(|t_2-t_1|) & \gamma(0) & \ldots & \gamma(|t_2-t_n|)\\
\vdots & \vdots & \ldots & \vdots\\
\gamma(|t_n-t_1|) & \gamma(|t_n-t_2|) & \ldots & \gamma(0)\end{array}\right]}^{-1}
\left[\begin{array}{c} \gamma(|t_0 - t_1|)\\ \gamma(|t_0-t_2|)\\ \vdots\\ \gamma(|t_0 -t_n|)\end{array}\right]\\
& = & \overline{x} + {\left[\begin{array}{c} X(t_1) -\overline{x}\\
X(t_2) - \overline{x} \\ \vdots \\ X(t_n) -\overline{x}\end{array}\right]}^{T} \mathbf{v}^{-1} \mathbf{c}\\
\Var{X(t_0) - \EstRegFunc(t_0)} & = & \gamma(0) - \mathbf{c}^T \mathbf{v}^{-1} \mathbf{c}
\end{eqnarray}\]
In-class exercise
\[\begin{eqnarray}
\EstRegFunc(t_0) & = & \overline{x} + {\left[\begin{array}{c} X(t_1) -\overline{x}\\
X(t_2) - \overline{x} \\ \vdots \\ X(t_n) -\overline{x}\end{array}\right]}^{T}
{\left[\begin{array}{cccc} \gamma(0) & \gamma(|t_1 - t_2|) & \ldots & \gamma(|t_1 - t_n|)\\
\gamma(|t_2-t_1|) & \gamma(0) & \ldots & \gamma(|t_2-t_n|)\\
\vdots & \vdots & \ldots & \vdots\\
\gamma(|t_n-t_1|) & \gamma(|t_n-t_2|) & \ldots & \gamma(0)\end{array}\right]}^{-1}
\left[\begin{array}{c} \gamma(|t_0 - t_1|)\\ \gamma(|t_0-t_2|)\\ \vdots\\ \gamma(|t_0 -t_n|)\end{array}\right]\\
\gamma(h) & = & \gamma(0) \rho(h)
\end{eqnarray}\]
Re-write the predictor in terms of \(\rho\) — do we need \(\gamma(0)\)?
In-class exercise — solution
\[\begin{eqnarray}
\EstRegFunc(t_0) & = & \overline{x} + {\left[\begin{array}{c} X(t_1) -\overline{x}\\
X(t_2) - \overline{x} \\ \vdots \\ X(t_n) -\overline{x}\end{array}\right]}^{T}
{\left[\begin{array}{cccc} \gamma(0) & \gamma(|t_1 - t_2|) & \ldots & \gamma(|t_1 - t_n|)\\
\gamma(|t_2-t_1|) & \gamma(0) & \ldots & \gamma(|t_2-t_n|)\\
\vdots & \vdots & \ldots & \vdots\\
\gamma(|t_n-t_1|) & \gamma(|t_n-t_2|) & \ldots & \gamma(0)\end{array}\right]}^{-1}
\left[\begin{array}{c} \gamma(|t_0 - t_1|)\\ \gamma(|t_0-t_2|)\\ \vdots\\ \gamma(|t_0 -t_n|)\end{array}\right]\\
\gamma(h) & = & \gamma(0) \rho(h)\\
\EstRegFunc(t_0) & = & \overline{x} + {\left[\begin{array}{c} X(t_1) -\overline{x}\\
X(t_2) - \overline{x} \\ \vdots \\ X(t_n) -\overline{x}\end{array}\right]}^{T}
{\left[\begin{array}{cccc} 1 & \rho(|t_1 - t_2|) & \ldots & \rho(|t_1 - t_n|)\\
\rho(|t_2-t_1|) & 1 & \ldots & \rho(|t_2-t_n|)\\
\vdots & \vdots & \ldots & \vdots\\
\rho(|t_n-t_1|) & \rho(|t_n-t_2|) & \ldots & 1\end{array}\right]}^{-1} \frac{1}{\gamma(0)} \gamma(0)
\left[\begin{array}{c} \rho(|t_0 - t_1|)\\ \rho(|t_0-t_2|)\\ \vdots\\ \rho(|t_0 -t_n|)\end{array}\right]
\end{eqnarray}\]
The \(\gamma(0)\) factors cancel, so the point prediction needs only the autocorrelations \(\rho\) (and \(\overline{x}\)); \(\gamma(0)\) enters only through the prediction variance.
In R
acf(x, lag.max, type, plot, na.action, ...)
- x = a time series, or a data frame of multiple time series
- lag.max = maximum lag \(h\) to compute
- type = "correlation" (the default) or "covariance"
- plot = make a plot? (defaults to TRUE)
- na.action = how to handle NAs; the default (na.fail) gives up, na.pass will use complete pairs
In R
kyoto.acf <- acf(kyoto$Flowering.DOY, lag.max = 100, type = "covariance", na.action = na.pass)
In R
kyoto.acf[0:5]
##
## Autocovariances of series 'kyoto$Flowering.DOY', by lag
##
## 0 1 2 3 4 5
## 41.00 7.96 7.47 7.74 5.96 6.07
kyoto.acf[0:5]$acf
## , , 1
##
## [,1]
## [1,] 41.000475
## [2,] 7.963781
## [3,] 7.470460
## [4,] 7.744336
## [5,] 5.960905
## [6,] 6.070874
When was the flowering in 1015?
available.years <- with(na.omit(kyoto), Year.AD[Year.AD > 1015 - 49 & Year.AD <
1015 + 49])
historical.mean <- mean(kyoto$Flowering.DOY, na.rm = TRUE)
CovYZ <- matrix(kyoto.acf[abs(available.years - 1015)]$acf, ncol = 1)
year.lags <- outer(available.years, available.years, "-")
year.lags <- abs(year.lags)
VarZ <- kyoto.acf[year.lags]$acf
VarZ <- matrix(VarZ, ncol = length(available.years), byrow = FALSE)
Z <- with(kyoto, Flowering.DOY[Year.AD %in% available.years])
(fitted.value <- historical.mean + (Z - historical.mean) %*% solve(VarZ) %*%
CovYZ)
## [,1]
## [1,] 106.1969
(fitted.value.se <- sqrt(as.matrix(kyoto.acf[0]$acf) - t(CovYZ) %*% solve(VarZ) %*%
CovYZ))
## [,1]
## [1,] 5.886278
When was the flowering in 1015?
This is a lot of work…
kyoto.prediction <- function(times) {
    historical.mean <- mean(kyoto$Flowering.DOY, na.rm = TRUE)
    historical.variance <- as.matrix(kyoto.acf[0]$acf)
    fits <- matrix(0, nrow = length(times), ncol = 2)
    colnames(fits) <- c("fit", "se")
    rownames(fits) <- paste(times)
    for (t in times) {
        available.years <- with(na.omit(kyoto), Year.AD[Year.AD > t - 49 & Year.AD < t + 49])
        available.years <- setdiff(available.years, t)
        CovYZ <- matrix(kyoto.acf[abs(available.years - t)]$acf, ncol = 1)
        year.lags <- outer(available.years, available.years, "-")
        year.lags <- abs(year.lags)
        VarZ <- kyoto.acf[year.lags]$acf
        VarZ <- matrix(VarZ, ncol = length(available.years), byrow = FALSE)
        Z <- with(kyoto, Flowering.DOY[Year.AD %in% available.years])
        fits[paste(t), "fit"] <- historical.mean + (Z - historical.mean) %*% solve(VarZ) %*% CovYZ
        fits[paste(t), "se"] <- sqrt(max(0, historical.variance - t(CovYZ) %*% solve(VarZ) %*% CovYZ))
    }
    return(data.frame(fits))
}
Finally, some interpolation
plot(Flowering.DOY ~ Year.AD, data = kyoto, type = "p", pch = 16, cex = 0.5,
ylab = "Day in year of full flowering", xlab = "Year (AD)", xlim = c(1000,
1050))
all.fits <- kyoto.prediction(801:2015)
lines(801:2015, all.fits$fit, col = "blue")
lines(801:2015, all.fits$fit - all.fits$se, col = "blue", lty = "dashed")
lines(801:2015, all.fits$fit + all.fits$se, col = "blue", lty = "dashed")
Stationarity and trend-plus-fluctuation
- We can always say \(X(t) = \TrueRegFunc(t) + \TrueNoise(t)\), with \(\Expect{X(t)} =\TrueRegFunc(t)\) and \(\Expect{\TrueNoise(t)} = 0\)
- If \(X\) is (weakly) stationary, then
- \(\TrueRegFunc(t) = \mu(0)\), a constant
- \(\Cov{\TrueNoise(t), \TrueNoise(s)} = \Cov{X(t), X(s)} = \gamma(|t-s|)\)
- Assuming a flat trend is pretty strong…
Stationary fluctuations around a non-stationary trend
- If \(\TrueNoise\) is weakly stationary, then \(X - \TrueRegFunc\) is weakly stationary
- so \(X - \EstRegFunc\) should be (approximately) weakly stationary
- Recipe:
- Use smoothing to estimate \(\Expect{X(t)}\) as \(\EstRegFunc(t)\)
- Find ACF from \(\EstNoise(t) = X(t) - \EstRegFunc(t)\)
- Now get the coefficients and predict (continued in the sketch after the detrending code below)
Detrending the cherry blossoms with a spline
kyoto.spline <- with(na.omit(kyoto), smooth.spline(x = Year.AD, y = Flowering.DOY))
residuals.with.NAs <- rep(NA, times = nrow(kyoto))
residuals.with.NAs[!is.na(kyoto$Flowering.DOY)] <- na.omit(kyoto$Flowering.DOY) -
kyoto.spline$y
kyoto$fluctuation <- residuals.with.NAs
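To finish the recipe, a sketch (the object names fluct.acf, trend.1015 and fluct.1015 are new here, not from the lecture): estimate the autocovariance of the fluctuations, predict the 1015 fluctuation exactly as before, and add the spline trend back.
fluct.acf <- acf(kyoto$fluctuation, lag.max = 100, type = "covariance", na.action = na.pass,
    plot = FALSE)
available.years <- with(na.omit(kyoto), Year.AD[Year.AD > 1015 - 49 & Year.AD < 1015 + 49])
CovYZ <- matrix(fluct.acf[abs(available.years - 1015)]$acf, ncol = 1)
year.lags <- abs(outer(available.years, available.years, "-"))
VarZ <- matrix(fluct.acf[year.lags]$acf, ncol = length(available.years))
Z <- with(kyoto, fluctuation[Year.AD %in% available.years])
trend.1015 <- predict(kyoto.spline, x = 1015)$y   # estimated trend at the missing year
fluct.1015 <- Z %*% solve(VarZ, CovYZ)            # fluctuations are (approximately) mean zero
trend.1015 + fluct.1015                           # interpolated flowering day for 1015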
Covariance estimation
- acf uses the sample covariance
- Estimates covariance at each lag separately
- Sometimes inefficient
- Covariances might have a known form, e.g., \(\gamma(h) = \gamma(0)e^{-h/\tau}\) (a fitting sketch follows this list)
- Fancier tricks (when we do Fourier methods)
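For example, a sketch of fitting the exponential form to the sample autocovariances in kyoto.acf by nonlinear least squares (the names lags, gamma.sample, exp.fit and the parameters g0, tau are mine; whether this form actually fits the Kyoto series is a separate question):
lags <- 0:100                            # kyoto.acf was computed with lag.max = 100
gamma.sample <- as.vector(kyoto.acf$acf)
exp.fit <- nls(gamma.sample ~ g0 * exp(-lags/tau),
    start = list(g0 = gamma.sample[1], tau = 10))
coef(exp.fit)                      # fitted gamma(0) and decay time tau
gamma.smooth <- predict(exp.fit)   # smoothed gamma(h), usable in place of the raw estimates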
Checking stationarity
- Divide the data into intervals
- Re-estimate the ACF on each interval (a sketch follows this list)
- Shouldn’t change much
- How much is too much? (Will answer later with simulation)
- Some inherent pitfalls:
- Very slight non-stationarity looks pretty stationary
- Very slowly decaying correlations look pretty non-stationary
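A sketch of the check on the detrended Kyoto series (splitting into four consecutive blocks is an arbitrary choice of mine, not from the lecture):
blocks <- split(kyoto$fluctuation, cut(seq_along(kyoto$fluctuation), breaks = 4))
par(mfrow = c(2, 2))                      # one ACF plot per block
for (b in blocks) acf(b, lag.max = 20, na.action = na.pass)
par(mfrow = c(1, 1))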
Multiple time series
- \(X\) and \(U\) are jointly weakly stationary when
- \(X\) is weakly stationary, ACF \(\gamma_X\)
- \(U\) is weakly stationary, ACF \(\gamma_U\)
- \(\Cov{X(t), U(s)} = \gamma_{UX}(t-s)\), a function of the lag only
- ccf in R will calculate the cross-covariance (or cross-correlation) function (a simulated example follows this list)
- So will acf if it's given a matrix or data frame
- Plug and chug: if predicting \(X\) from \(U\)
- \(\gamma_U\) goes into the variance matrix \(\mathbf{v}\)
- \(\gamma_{UX}\) goes into the covariance matrix \(\mathbf{c}\)
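A simulated illustration (invented for this note, not part of the lecture): u copies x with a two-step delay plus noise, and ccf() picks up the cross-correlation at that lag.
set.seed(2018)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 500))
u <- 0.5 * c(0, 0, x[1:498]) + rnorm(500, sd = 0.5)  # u lags x by two time steps
ccf(x, u, lag.max = 10)   # spike near lag -2: ccf(x, u) estimates cor(x[t+k], u[t])
ccf(x, u, lag.max = 10, type = "covariance", plot = FALSE)  # the cross-covariances themselves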
Where did all this come from?
- Norbert Wiener (1894–1964)
- One of the great mathematicians of the 20th century
- Worked very closely with engineers
- Why we talk about “feedback”, “information”, “cyber-”, etc.
- (Photo from Kline (2017))
Wiener’s research in 1942
- Anti-aircraft fire control
- Aim at where the target will be
- The target moves erratically
- \(\Rightarrow\) Need to extrapolate a random process
- Wiener’s solution: basically what we’ve just done
- Plus the continuous-time version
- Plus implementation in 1940s electronics
- Declassified after the war as Wiener (1949)
- Related work by Kolmogorov (1941)
- Wiener and Kolmogorov were the first to realize optimal linear prediction didn’t need the usual regression assumptions (Gaussian noise, independent variables, etc.)
Summing up
- Optimal linear prediction needs to know the trend and the autocovariance function
- Estimating them from one time series needs assumptions
- Remove trend
- Find ACF from residuals / estimated fluctuations
- Once we have the ACF, finding coefficients is just linear algebra
Backup: Not inverting the variance matrix
- \(\vec{\beta} = \Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
- But inverting an \(n\times n\) matrix takes \(O(n^3)\) operations
- Solving the linear system \(\Var{\vec{Z}} \vec{\beta} = \Cov{\vec{Z}, Y}\) for a fixed right-hand side is cheaper and more stable than forming the inverse; for the Toeplitz variance matrices of regularly-spaced stationary series it can be done in \(O(n^2)\) operations (e.g., by the Levinson-Durbin recursion)
- Details left to numerical linear algebra texts; in R, write solve(v, c) rather than solve(v) %*% c (sketch below)
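A sketch, reusing VarZ, CovYZ, Z and historical.mean from the 1015 example above:
beta <- solve(VarZ, CovYZ)   # solves VarZ %*% beta = CovYZ without forming the explicit inverse
historical.mean + (Z - historical.mean) %*% beta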
Backup: How good is the prediction standard error?
all.fits$std.residuals <- (kyoto$Flowering.DOY - all.fits$fit)/all.fits$se
mean(all.fits$std.residuals, na.rm = TRUE)
## [1] -0.008311344
sd(all.fits$std.residuals, na.rm = TRUE)
## [1] 1.00777
Ideally: mean 0, standard deviation 1
References
Kline, Ronald R. 2017. The Cybernetics Moment: Or Why We Call Our Age the Information Age. Baltimore, Maryland: Johns Hopkins University Press.
Wiener, Norbert. 1949. Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications. Cambridge, Massachusetts: The Technology Press of the Massachusetts Institute of Technology.