\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

The data set for this exam is http://www.stat.cmu.edu/~cshalizi/dst/18/exams/2/sial.csv. Note that, compared to the version from homework 7, it eliminates several countries which had no slave exports, and adds two new rows: one gives the “final” population in 1890, and the other gives the relative error in the slave export figures (so a value of 0.14 would indicate a relative error of \(\pm 14\%\)).

In the interest of time, we consider a drastically simplified model where \(S_i(t)\) is the logarithm of the total population of territory \(i\) in decade \(t\) (so \(S_i(t) = \log{N_i(t)}\)), and similarly \(R_i(t)\) is the logarithm of the number of slaves exported.

We suppose that \[ S_i(t+1) = g + S_i(t) + \eta_i(t+1) \] and \[ R_i(t) = c + S_i(t) + \phi_i(t), \] but what we observe is \[ X_i(t) = R_i(t) + \epsilon_i(t), \] where \(\eta_i\), \(\phi_i\), and \(\epsilon_i\) are all IID, mean-zero random variables. Assume that \(\eta_i\) and \(\phi_i\) have the same variance for all \(i\), but that \(\epsilon_i\) can have a (potentially) different variance for each \(i\).

  1. Load the data, and make lsial a data frame containing the logged slave export figures and the logged final (1890) population, but the unlogged margin of error.

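A minimal sketch of one way to do this; the use of the first CSV column as row names, and the exact row layout, are assumptions to check against the actual file:

## Sketch only: adjust to the actual layout of sial.csv
sial <- read.csv("http://www.stat.cmu.edu/~cshalizi/dst/18/exams/2/sial.csv",
                 row.names = 1)
lsial <- log(sial)                                            # logs the exports and the 1890 population
lsial["exportErrorMargin", ] <- sial["exportErrorMargin", ]   # but keep the margin unlogged
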
  2. (5 points) Explain how the equation for \(S_i(t+1)\) in terms of \(S_i(t)\) represents a population which tends to grow exponentially over time. What is the growth rate, in terms of the parameters?

  3. (5 points) Explain how the equation for \(R_i(t)\) in terms of \(S_i(t)\) represents a situation where the trend is for a fixed proportion of the total population to be exported as slaves. What is that proportion, in terms of the parameters?

  4. (5 points) Explain how \(\sigma_\epsilon\) is related to the margin-of-error variable in the data set. Hint: For \(|a| \ll 1\), \(\log{b(1+a)} \approx \log{b} + a\).

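For the three preceding items, one possibly useful observation (a sketch, not a required derivation) is that exponentiating the model equations puts them back on the scale of counts: \[ N_i(t+1) = e^{g} N_i(t) e^{\eta_i(t+1)}, \qquad e^{R_i(t)} = e^{c} N_i(t) e^{\phi_i(t)}, \qquad e^{X_i(t)} = e^{R_i(t)} e^{\epsilon_i(t)}, \] so each relationship is multiplicative, perturbed by mean-zero noise on the log scale.
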
  5. We will first use linear smoothing to estimate \(S_i(t)\) from the observations \(X_i(t)\).
    1. (4 points) Show that \(\Cov{S_i(t), X_i(t)} = \Var{S_i(t)}\).
    2. (1 point) Show that \(\Expect{X_i(t)} = \Expect{S_i(t)} + c\)
    3. (5 points) Show that the optimal linear estimator of \(S_i(t)\) from \(X_i(t)\) is \(\Expect{S_i(t)} + \frac{\Var{S_i(t)}}{\Var{S_i(t)} + \sigma^2_{\phi} + \sigma^2_{\epsilon}}\left(X_i(t) - \Expect{S_i(t)} - c\right)\)
  6. Now consider a more general linear estimator of \(S_i(t)\), for arbitrary \(t\), which uses all of \(X_i(1), \ldots, X_i(n-1)\) together with a final observation of \(S_i(n)\). In the sub-parts below, assume \(h > 0\). (A useful starting identity is sketched just after the sub-parts.)
    1. (1 point) Show that \(\Cov{S_i(t), S_i(t+h)} = \Var{S_i(t)}\)
    2. (2 points) Show that \(\Cov{S_i(t), X_i(t+h)} = \Var{S_i(t)}\)
    3. (2 points) Show that \(\Cov{X_i(t), X_i(t+h)} = \Var{S_i(t)}\)
    4. (2 points) Show that \(\Var{S_i(t)} = \Var{S_i(0)} + t \sigma^2_{\eta}\)
    5. (1 point) Show that \(\Var{X_i(t)} = \Var{S_i(0)} + t \sigma^2_{\eta} + \sigma^2_{\phi} + \sigma^2_{\epsilon}\)
    6. (1 point) Show that \(\Expect{S_i(t)} = \Expect{S_i(0)} + gt\)
    7. (1 point) Show that \(\Expect{X_i(t)} = \Expect{S_i(0)} + gt + c\)
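
A sketch of a starting identity that may help here (assuming, as is implicit in the model, that the noise terms are independent of \(S_i(0)\) and of one another): iterating the population equation gives \[ S_i(t) = S_i(0) + g t + \sum_{u=1}^{t} \eta_i(u), \qquad X_i(t) = c + S_i(t) + \phi_i(t) + \epsilon_i(t), \] and the requested means, variances, and covariances then follow by linearity of expectation and the independence assumptions.
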
  7. (10 points) Explain what the following code does. Explain both how the code works, and what its purpose is.

smoother <- function(t, obs, vs0, sig.eta, sig.phi, sig.eps, g, c) {
    n <- length(obs)
    x <- head(obs, -1)
    sfinal <- tail(obs, 1)
    vs <- vs0 + (1:n)*sig.eta^2
    vx <- vs + sig.phi^2 + sig.eps^2
    vobs <- outer(1:n, 1:n, function(x,y) { vs[pmin(x,y)] })
    diag(vobs) <- vx
    diag(vobs)[n] <- vs[n]
    C <- matrix(vs[pmin(t,1:n)], nrow=n, ncol=1)
    beta <- solve(vobs) %*% C
    es <- t*g
    ex <- (1:n)*g + c
    eobs <- ex
    eobs[n] <- n*g
    alpha <- es - eobs %*% beta
    return(alpha + obs %*% beta)
}

simultaneous.smoother <- function(obs, vs0, sig.eta, sig.phi, sig.eps, g, c) {
    n <- length(obs)
    sapply(1:n, smoother, obs=obs, vs0=vs0, sig.eta=sig.eta, sig.phi=sig.phi,
           sig.eps=sig.eps, g=g, c=c)
}    
  8. (10 points) In the code above, obs and sig.eps can be obtained directly from the data set. Suppose the average population growth rate is \(0.2\)% per year, that on average \(1\%\) of the population is captured and exported, that \(\Var{S(0)} = 0\), that \(\sigma_{\eta} = 0.05\), and that \(\sigma_{\phi} = 0.15\). Report estimates for \(S_i(1), \ldots, S_i(n)\), where \(i=\) Akan. Plot \(e^{S_i(t)}\) against \(t\). Does the result look sensible? What happens if you change \(c\)?

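One possible setup is sketched below. The helper estimate.logpop() is hypothetical; the column name "Akan", the assumption that the two added rows sit at the bottom of lsial, and the conversion of the stated rates into g and c are all guesses to check against the file and the intended reading of the problem.

## Sketch only: parameter values read off the problem statement
g <- 10 * log(1.002)      # 0.2% annual growth, compounded over a decade
c <- log(0.01)            # on average 1% of the population exported
sig.eta <- 0.05
sig.phi <- 0.15

## Hypothetical helper: smoothed log-population estimates for one territory,
## assuming the rows of lsial are the 24 export decades (1650 through 1880)
## followed by the 1890 population and exportErrorMargin
estimate.logpop <- function(territory, data = lsial, vs0 = 0,
                            growth = g, export.rate = c) {
    export.rows <- setdiff(rownames(data), "exportErrorMargin")
    obs <- data[export.rows, territory]   # exports, then the 1890 log population
    simultaneous.smoother(obs = obs, vs0 = vs0, sig.eta = sig.eta,
                          sig.phi = sig.phi,
                          sig.eps = data["exportErrorMargin", territory],
                          g = growth, c = export.rate)
}

akan.hat <- estimate.logpop("Akan")
plot(seq(1650, 1890, by = 10), exp(akan.hat), type = "b",
     xlab = "Decade", ylab = "Estimated population",
     main = "Smoothed population estimates, Akan")
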
  9. (5 points) The following code estimates \(\Var{S_i(0)}\) and \(\sigma_{\eta}\). What does it need to assume about \(\Var{S_i(0)}\) for this to be a reasonable estimate (beyond the correctness of the model)?

## Variance across territories, in each decade, of the log exports minus each territory's error margin
vst <- apply(lsial[1:24, ], 1, function(x) {
    var(as.numeric(x) - as.numeric(lsial["exportErrorMargin", ]))
})
decades <- seq(from = 1650, to = 1880, by = 10)
## Regress the cross-sectional variances on time
vst.vs.t <- lm(vst ~ decades)
vs0 <- coefficients(vst.vs.t)[1]            # intercept, taken as the estimate of Var[S(0)]
sig.eta <- sqrt(coefficients(vst.vs.t)[2])  # square root of the slope
  10. (10 points) Make a scatter-plot to go with the regression from the previous problem. Explain why the linear regression may not be a very good model for the whole data, but might work better for the earlier part of the data. Re-estimate \(\Var{S_i(0)}\) and \(\sigma_{\eta}\) accordingly.

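A minimal sketch of the plot (the labels are only suggestions):

## Scatter-plot of the cross-sectional variances against time, with the fitted line
plot(decades, vst, xlab = "Decade",
     ylab = "Cross-sectional variance of adjusted log exports",
     main = "Variance of log exports across territories, by decade")
abline(vst.vs.t)
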
  11. (10 points) Plot estimates of the population (not the logged population) for Southern Dahomey, Sierra Leone, Akan, and Western Nigeria.

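One way to produce these plots, reusing the hypothetical estimate.logpop() helper sketched earlier; the column names are guesses, and vs0 and sig.eta would be the re-estimated values from the previous problem:

## Sketch only: one panel of estimated (unlogged) population per territory
territories <- c("Southern.Dahomey", "Sierra.Leone", "Akan", "Western.Nigeria")
par(mfrow = c(2, 2))
for (terr in territories) {
    plot(seq(1650, 1890, by = 10), exp(estimate.logpop(terr, vs0 = vs0)),
         type = "b", xlab = "Decade", ylab = "Estimated population", main = terr)
}
par(mfrow = c(1, 1))
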
  12. (10 points) Consider the following simulator code.

internal.jitter <- function(data) {
    jitter <- function(x) { rnorm(n=length(x)-2, mean=head(x,-2),
                                  sd=x["exportErrorMargin"]) }
    data[1:(nrow(data)-2),] <- apply(data, 2, jitter)
    return(data)
}

What part of the data-generating model does this implement? If you used it to generate new data sets and repeated the estimation of the population on them, would this give you a good idea of the uncertainty in the estimates, or would it over- or under-estimate the uncertainty in some way?
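
For concreteness, one way to wire this into a resampling loop, again leaning on the hypothetical estimate.logpop() helper from the earlier sketch (the number of replicates is arbitrary):

## Sketch only: re-estimate the Akan population on jittered copies of the logged data
akan.sims <- replicate(200, exp(estimate.logpop("Akan", data = internal.jitter(lsial))))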

Rubric (10 points)

The text is laid out cleanly, with clear divisions between problems and sub-problems. The writing itself is well-organized, free of grammatical and other mechanical errors, and easy to follow. Questions which ask for a plot or table are answered with both the figure itself and the command (or commands) used to make the plot. Plots are carefully labeled, with informative and legible titles, axis labels, and (if called for) sub-titles and legends; they are placed near the text of the corresponding problem. All quantitative and mathematical claims are supported by appropriate derivations, included in the text, or calculations in code. Numerical results are reported to appropriate precision. Code is properly integrated with a tool like R Markdown or knitr, and both the knitted file and the source file are submitted. The code is indented, commented, and uses meaningful names. All code is relevant, without dangling or useless commands. All parts of all problems are answered with coherent sentences, and raw computer code or output is only shown when explicitly asked for.