Inference for Markov Chains

36-467/667, Fall 2020

5 November 2020 (Lecture 19)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\transition}{\mathbf{q}} \newcommand{\loglike}{L_n} \newcommand{\TrueTransition}{\transition^*} \newcommand{\InitDist}{p_{\mathrm{init}}} \newcommand{\InvDist}{p^*} \]

In previous episodes…

Inference for Markov Chains: Two types of data

Likelihood for Markov Chains

The Sufficient Statistics

(sum over time vs. sum over state pairs)
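Writing \(N_{ij}\) for the number of observed \(i \rightarrow j\) transitions, the two sums are the same: \[ \loglike(\transition) = \log{\InitDist(x_1)} + \sum_{t=1}^{n-1}{\log{\transition_{x_t x_{t+1}}}} = \log{\InitDist(x_1)} + \sum_{ij}{N_{ij} \log{\transition_{ij}}} \] so the counts \(N_{ij}\), together with the initial state, carry all the information in the trajectory.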

The Maximum Likelihood Estimate
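In R, a minimal sketch of the estimator \(\hat{\transition}_{ij} = N_{ij}/\sum_{j}{N_{ij}}\), assuming the trajectory is a vector x of state labels and every state gets visited:

mle.transitions <- function(x) {
  counts <- table(head(x, -1), tail(x, -1))  # N_ij: count of i -> j transitions
  prop.table(counts, 1)                      # divide each row by its total, N_i
}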

Parameterized Markov Chains
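For instance (a toy illustration, not from the original slides): a symmetric two-state chain with a single parameter \(\theta \in (0,1)\), \[ \transition(\theta) = \left[ \begin{array}{cc} 1-\theta & \theta \\ \theta & 1-\theta \end{array} \right] \] has \(r = 1\), versus the two free probabilities of an unrestricted two-state chain.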

Inference for Parameterized Markov Chains

  1. The allowed transitions are the same for all \(\theta\)
    • (technical convenience)
  2. \(\transition_{ij}(\theta)\) has continuous \(\theta\)-derivatives up to order 3
    • (authorizes Taylor expansions to 2nd order)
    • (can sometimes get away with just 2nd partials)
  3. The matrix \(\partial \transition_{ij}/\partial \theta_u\) always has rank \(r\)
    • (no redundancy in the parameter space)
  4. The chain is ergodic without transients for all \(\theta\)
    • (trajectories are representative samples)

Inference for Parameterized Markov Chains

  1. MLE \(\hat{\theta}\) exists
  2. \(\hat{\theta} \rightarrow \theta^*\) (consistency)
  3. Asymptotic normality: \[ \sqrt{n}\left(\hat{\theta} - \theta^*\right) \rightsquigarrow \mathcal{N}(0,I^{-1}(\theta^*)) \] with expected or Fisher information matrix \[ I_{uv}(\theta) \equiv \sum_{ij}{\frac{p_i(\theta)}{\transition_{ij}(\theta)}\frac{\partial \transition_{ij}}{\partial \theta_u}\frac{\partial \transition_{ij}}{\partial \theta_v}} = -\sum_{ij}{p_i(\theta)\transition_{ij}(\theta)\frac{\partial^2 \log{\transition_{ij}(\theta)}}{\partial \theta_u \partial \theta_v}} \] (equality is not obvious)
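To see the equality: expanding the log gives \[ \frac{\partial^2 \log{\transition_{ij}}}{\partial \theta_u \partial \theta_v} = \frac{1}{\transition_{ij}}\frac{\partial^2 \transition_{ij}}{\partial \theta_u \partial \theta_v} - \frac{1}{\transition_{ij}^2}\frac{\partial \transition_{ij}}{\partial \theta_u}\frac{\partial \transition_{ij}}{\partial \theta_v} \] Multiply by \(-p_i(\theta)\transition_{ij}(\theta)\) and sum over \(i,j\): the second-derivative term vanishes, because \(\sum_{j}{\transition_{ij}(\theta)} = 1\) for all \(\theta\) forces \(\sum_{j}{\partial^2 \transition_{ij}/\partial \theta_u \partial \theta_v} = 0\), and what remains is the first form.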

Error of the MLE

Error estimates based on \(I(\theta^*)\) are weird: if you knew \(\theta^*\), why would you be calculating errors?

  1. Plug-in: Use \(I(\hat{\theta})\)
  2. Use the observed information matrix \[ J_{uv} = -\sum_{ij}{\frac{N_{ij}}{n}\frac{\partial^2 \log{\transition_{ij}(\hat{\theta})}}{\partial \theta_u \partial \theta_v}} \]
    • (Guttorp’s Eq. 2.207, but he’s missing the sum over state pairs)
  3. Use a model-based bootstrap: simulate the fitted chain, re-estimate, repeat (see the sketch after this list)
    • (Could use a block bootstrap to resample blocks if you think it’s stationary)
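A minimal sketch of option 3, with states coded \(1, \ldots, k\) and reusing mle.transitions from above; the names rmarkov and boot.se are illustrative, not from the original:

rmarkov <- function(n, q, x0) {
  # simulate n steps of a chain with transition matrix q, starting from x0
  x <- numeric(n)
  x[1] <- x0
  for (t in 2:n) x[t] <- sample(nrow(q), 1, prob = q[x[t - 1], ])
  x
}
boot.se <- function(q.hat, n, x0, b = 1000) {
  # simulate, re-estimate, repeat; assumes every state is visited in
  # every simulated run, so all the re-estimates have the same shape
  boots <- replicate(b, mle.transitions(rmarkov(n, q.hat, x0)))
  apply(boots, 1:2, sd)  # element-wise standard errors for the q-hats
}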

The Observed Information

\[ J_{uv} = -\frac{1}{n} \frac{\partial^2 \loglike(\hat{\theta})}{\partial \theta_u \partial \theta_v} \]
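If you can evaluate \(\loglike(\theta)\) numerically, the Hessian can come from numerical differentiation; a sketch using the numDeriv package, where loglike.fn (a function of \(\theta\)) and theta.hat are assumed to come from your own fitting code:

library(numDeriv)  # for numerical Hessians
# J = -(1/n) * Hessian of the log-likelihood at the MLE
J <- -hessian(func = loglike.fn, x = theta.hat) / n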

General Comment on Likelihood Inference for Markov Chains

Higher-order Markov Chains

First- vs. Higher-order Markov Chains
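A \(p\)-th order chain on \(k\) states is a first-order chain on \(k^p\) composite states, so the first-order machinery re-applies. A sketch for order 2, reusing mle.transitions from above (x is the observed trajectory):

# recode each consecutive pair (X(t-1), X(t)) as one composite state...
pair.states <- paste(head(x, -1), tail(x, -1), sep = "-")
# ...and fit a first-order chain on the composite states
mle2 <- mle.transitions(pair.states)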

Hypothesis Testing

Likelihood-ratio testing is simple for nested hypotheses:
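Concretely, with \(\Theta_{\mathrm{small}} \subset \Theta_{\mathrm{big}}\), \[ \Lambda = \loglike(\hat{\theta}_{\mathrm{big}}) - \loglike(\hat{\theta}_{\mathrm{small}}) \geq 0 \] and the classical Wilks approximation gives \(2\Lambda \rightsquigarrow \chi^2_d\) under the small model, with \(d\) the difference in the number of free parameters; the bootstrap on the next slide avoids leaning on that asymptotic approximation.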

Hypothesis Testing (cont’d)

  1. Calculate log likelihood ratio \(\Lambda\) on real data, call this \(\lambda\)
  2. Simulate from \(\hat{\theta}_{\mathrm{small}}\), get fake data \(Y(1), \ldots, Y(n)\)
  3. Estimate \(\tilde{\theta}_{\mathrm{small}}\), \(\tilde{\theta}_{\mathrm{big}}\) from \(Y(1), \ldots, Y(n)\)
  4. Calculate \(\tilde{\Lambda}\) from \(\tilde{\theta}_{\mathrm{small}}\), \(\tilde{\theta}_{\mathrm{big}}\)
  5. Repeat (2)–(4) \(b\) times to get sample of \(\tilde{\Lambda}\)
  6. \(p\)-value = \(\#\left\{\tilde{\Lambda} \geq \lambda\right\}/b\)
    • Some people add 1 to numerator and denominator to avoid ever reporting 0
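A sketch of steps (1)-(6), reusing rmarkov and the integer state coding from above; fit.small and fit.big are placeholders for whatever pair of nested estimators you are comparing, each returning a fitted transition matrix:

logLik.chain <- function(x, q) {
  # log-likelihood of the transitions in x under transition matrix q
  # (the initial-distribution term is dropped, as usual when it
  # doesn't involve the parameters being tested)
  sum(log(q[cbind(head(x, -1), tail(x, -1))]))
}
lrt.boot <- function(x, fit.small, fit.big, b = 1000) {
  q.small <- fit.small(x)
  lambda <- logLik.chain(x, fit.big(x)) - logLik.chain(x, q.small)  # step 1
  lambda.tilde <- replicate(b, {                                    # steps 2--5
    y <- rmarkov(length(x), q.small, x[1])
    logLik.chain(y, fit.big(y)) - logLik.chain(y, fit.small(y))
  })
  (1 + sum(lambda.tilde >= lambda)) / (1 + b)  # step 6, add-one form
}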

Calculating Degrees of Freedom
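A typical count (one worked case, not necessarily the slide's example): a first-order chain on \(k\) states has \(k(k-1)\) free transition probabilities, a second-order chain has \(k^2(k-1)\), so testing first- against second-order uses \[ d = k^2(k-1) - k(k-1) = k(k-1)^2 \] degrees of freedom.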

Aggregates (Kalbfleisch and Lawless 1984)

Continuous-Valued Processes in Discrete Time

Nonparametric Density Estimation

Nonparametric Conditional Density Estimation

Example: Lynxes

data(lynx)  # annual Canadian lynx trappings, 1821--1934
# pair each year's count with the following year's count
lynx.lagged <- data.frame(t0=head(lynx,-1), t1=tail(lynx,-1))
plot(t1~t0, data=lynx.lagged, xlab="X(t)", ylab="X(t+1)", main="Canadian Lynxes")

Example: Lynxes (cont’d)

library(np)
# kernel estimate of the conditional density of X(t+1) given X(t),
# with bandwidths picked by cross-validation (Hall, Racine, and Li 2004)
lynx.trans <- npcdens(t1 ~ t0, data=lynx.lagged)
# perspective plot of the estimated transition density
plot(lynx.trans, view="fixed", phi=45, theta=30,
     xlab="X(t)", ylab="X(t+1)", zlab="p(X(t+1)=y|X(t)=x)")

Summary

Backup: Smoothing the MLE for Large State Spaces
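One standard smoothing device for sparse counts (add-\(\alpha\), or Laplace, smoothing; an illustrative choice, not necessarily the one on this backup slide) shrinks the MLE toward the uniform distribution:

smooth.transitions <- function(x, alpha = 0.5) {
  # add alpha pseudo-counts to every cell before normalizing rows;
  # keeps rarely- or never-seen transitions away from probability 0
  counts <- table(head(x, -1), tail(x, -1)) + alpha
  prop.table(counts, 1)
}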

Backup: Smoothing the MLE for Language Models

Backup: Markov Chains with Countably-Infinite State Spaces

Backup: Markov Chains in Continuous Time

Backup: Further reading

References

Billingsley, Patrick. 1961. Statistical Inference for Markov Processes. Chicago: University of Chicago Press.

Guttorp, Peter. 1995. Stochastic Modeling of Scientific Data. London: Chapman and Hall.

Hall, Peter, Jeff Racine, and Qi Li. 2004. “Cross-Validation and the Estimation of Conditional Probability Densities.” Journal of the American Statistical Association 99:1015–26. http://www.ssc.wisc.edu/~bhansen/workshop/QiLi.pdf.

Kalbfleisch, J. D., and J. F. Lawless. 1984. “Least-Squares Estimation of Transition Probabilities from Aggregate Data.” The Canadian Journal of Statistics 12:169–82. http://www.jstor.org/stable/3314745.

Pereira, Fernando, Yoram Singer, and Naftali Z. Tishby. 1995. “Beyond Word \(n\)-Grams.” In Proceedings of the Third Workshop on Very Large Corpora, edited by David Yarowsky and Kenneth Church, 95–106. Columbus, Ohio: Association for Computational Linguistics. http://arxiv.org/abs/cmp-lg/9607016.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.