Inference for Markov Chains

36-467/36-667

27 November 2018

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\transition}{\mathbf{q}} \newcommand{\loglike}{L_n} \newcommand{\TrueTransition}{\transition^*} \newcommand{\InitDist}{p_{\mathrm{init}}} \newcommand{\InvDist}{p^*} \]

In previous episodes…

Inference for Markov Chains: Two Types of Data

Likelihood for Markov Chains

The Sufficient Statistics

The Maximum Likelihood Estimate

Parameterized Markov Chains

Either way, \(\transition\) is really \(\transition(\theta)\), with \(\theta\) the \(r\)-dimensional vector of parameters, so the chain rule gives the derivatives of the log-likelihood: \[ \frac{\partial \loglike}{\partial \theta_u} = \sum_{ij}{\frac{\partial \loglike}{\partial \transition_{ij}}\frac{\partial \transition_{ij}}{\partial \theta_u}} \]
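As a concrete check of the chain rule, here is a minimal sketch in R. It assumes a hypothetical one-parameter model (a symmetric two-state chain that switches state with probability \(\theta\)), made-up transition counts, and the usual log-likelihood \(\sum_{ij}{n_{ij}\log{\transition_{ij}}}\) with the initial state ignored, so that \(\partial \loglike/\partial \transition_{ij} = n_{ij}/\transition_{ij}\). The names `q_theta`, `dq_dtheta`, `loglike` and `score` are illustrative only.

```r
# Hypothetical model: symmetric two-state chain, switching probability theta
q_theta <- function(theta) {
  matrix(c(1 - theta, theta,
           theta, 1 - theta), nrow = 2, byrow = TRUE)
}

# Element-wise derivative of the transition matrix with respect to theta
# (constant here, because q is linear in theta)
dq_dtheta <- function(theta) {
  matrix(c(-1,  1,
            1, -1), nrow = 2, byrow = TRUE)
}

# Log-likelihood from transition counts n_ij, ignoring the initial state
loglike <- function(theta, counts) sum(counts * log(q_theta(theta)))

# Chain rule: dL/dtheta = sum_ij (n_ij / q_ij) * (dq_ij / dtheta)
score <- function(theta, counts) {
  sum((counts / q_theta(theta)) * dq_dtheta(theta))
}

# Made-up transition counts, for illustration only
counts <- matrix(c(40, 10,
                   12, 38), nrow = 2, byrow = TRUE)

# Compare the chain-rule derivative with a central finite difference
theta0 <- 0.3
h <- 1e-6
numeric_score <- (loglike(theta0 + h, counts) - loglike(theta0 - h, counts)) / (2 * h)
analytic_score <- score(theta0, counts)

# In this model the score vanishes at the fraction of switching transitions
theta_hat <- (counts[1, 2] + counts[2, 1]) / sum(counts)
```

The two derivatives should agree up to finite-difference error, and `score(theta_hat, counts)` should be zero up to rounding, which is the first-order condition for the MLE in this model.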

Inference for Parameterized Markov Chains

  1. The allowed transitions are the same for all \(\theta\)
    • (technical convenience)
  2. \(\transition_{ij}(\theta)\) has continuous \(\theta\)-derivatives up to order 3
    • (authorizes Taylor expansions to 2nd order)
    • (can sometimes get away with just 2nd partials)
  3. The matrix \(\partial \transition_{ij}/\partial \theta_u\) always has rank \(r\)
    • (no redundancy in the parameter space)
  4. The chain is ergodic without transients for all \(\theta\)
    • (trajectories are representative samples)

Inference for Parameterized Markov Chains (cont’d)

If all four assumptions above hold, then:

  1. MLE \(\hat{\theta}\) exists
  2. \(\hat{\theta} \rightarrow \theta^*\) (consistency)
  3. Asymptotic normality: \[ \sqrt{n}\left(\hat{\theta} - \theta^*\right) \rightsquigarrow \mathcal{N}(0,I^{-1}(\theta^*)) \] with \[ I_{uv}(\theta) \equiv \sum_{ij}{\frac{p_i(\theta)}{\transition_{ij}(\theta)}\frac{\partial \transition_{ij}}{\partial \theta_u}\frac{\partial \transition_{ij}}{\partial \theta_v}} = -\sum_{ij}{p_i(\theta)\transition_{ij}(\theta)\frac{\partial^2 \log{\transition_{ij}(\theta)}}{\partial \theta_u \partial \theta_v}} \] (2nd equality is not obvious)
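One way to see the second equality, sketched under the assumption that the state space is finite (so that sums over \(j\) and derivatives in \(\theta\) can be exchanged) and with all sums running over allowed transitions only: differentiate \(\log{\transition_{ij}}\) twice, \[ \frac{\partial^2 \log{\transition_{ij}}}{\partial \theta_u \partial \theta_v} = \frac{1}{\transition_{ij}}\frac{\partial^2 \transition_{ij}}{\partial \theta_u \partial \theta_v} - \frac{1}{\transition_{ij}^2}\frac{\partial \transition_{ij}}{\partial \theta_u}\frac{\partial \transition_{ij}}{\partial \theta_v} \] then multiply by \(-p_i(\theta)\transition_{ij}(\theta)\) and sum over \(i\) and \(j\): \[ -\sum_{ij}{p_i \transition_{ij}\frac{\partial^2 \log{\transition_{ij}}}{\partial \theta_u \partial \theta_v}} = -\sum_{i}{p_i \frac{\partial^2}{\partial \theta_u \partial \theta_v}\sum_{j}{\transition_{ij}}} + \sum_{ij}{\frac{p_i}{\transition_{ij}}\frac{\partial \transition_{ij}}{\partial \theta_u}\frac{\partial \transition_{ij}}{\partial \theta_v}} \] Every row of the transition matrix sums to one for all \(\theta\), so \(\sum_{j}{\transition_{ij}(\theta)}\) is constant in \(\theta\) and the first term on the right vanishes, leaving the first form of \(I_{uv}(\theta)\).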

Error of the MLE

Error estimates based on \(I(\theta^*)\) are weird: if you knew \(\theta^*\), why would you be calculating errors?

  1. Plug-in: Use \(I(\hat{\theta})\)
  2. Use the observed information \[ J_{uv} = -\sum_{ij}{\frac{n_{ij}}{n}\frac{\partial^2 \log{\transition_{ij}(\hat{\theta})}}{\partial \theta_u \partial \theta_v}} \]
    • (Guttorp’s Eq. 2.207, but he’s missing the sum over state pairs)
  3. Use a parametric bootstrap
    • (Could use a block bootstrap if you think it’s stationary)
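A minimal sketch of options 1 and 3 in R, assuming a hypothetical one-parameter model: a symmetric two-state chain that switches state with probability \(\theta\). For that model the Fisher information formula works out to \(I(\theta) = 1/(\theta(1-\theta))\), and the MLE is the fraction of steps on which the chain switches. The function names, sample size, seed and number of bootstrap replicates are illustrative choices, and the "data" are simulated.

```r
set.seed(467)  # arbitrary seed, for reproducibility only

# Hypothetical model: symmetric two-state chain, switching probability theta
q_theta <- function(theta) {
  matrix(c(1 - theta, theta,
           theta, 1 - theta), nrow = 2, byrow = TRUE)
}

# Simulate n transitions (n + 1 states) from transition matrix q, starting in x1
rchain <- function(n, q, x1 = 1) {
  x <- integer(n + 1)
  x[1] <- x1
  for (t in 1:n) {
    x[t + 1] <- sample.int(nrow(q), size = 1, prob = q[x[t], ])
  }
  x
}

# MLE for this model: fraction of steps on which the state changes
mle_theta <- function(x) mean(diff(x) != 0)

# Simulated stand-in for real data
n <- 500
theta_true <- 0.3
x <- rchain(n, q_theta(theta_true))
theta_hat <- mle_theta(x)

# Option 1, plug-in: standard error sqrt(I^{-1}(theta_hat) / n)
se_plugin <- sqrt(theta_hat * (1 - theta_hat) / n)

# Option 3, parametric bootstrap: simulate from the fitted chain,
# re-estimate on each fake trajectory, take the spread of the re-estimates
B <- 200
theta_boot <- replicate(B, mle_theta(rchain(n, q_theta(theta_hat), x1 = x[1])))
se_boot <- sd(theta_boot)
```

In a well-behaved model like this one the two standard errors should be close; the bootstrap earns its keep when the asymptotic approximation is doubtful or \(I(\theta)\) is awkward to compute.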

The Observed Information

\[ J_{uv} = -\frac{1}{n} \frac{\partial^2 \loglike(\hat{\theta})}{\partial \theta_u \partial \theta_v} \]
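Because \(J\) is just a rescaled Hessian of the log-likelihood, it can be obtained numerically without deriving any second partials by hand. The sketch below assumes a hypothetical two-parameter model (an unrestricted two-state chain with \(\theta = (a, b)\), the probabilities of leaving states 1 and 2) and made-up transition counts; it uses `optimHess` from base R's `stats` package for the numerical Hessian and compares it with the closed form available for this particular model.

```r
# Made-up transition counts n_ij, for illustration only
counts <- matrix(c(40, 10,
                   12, 38), nrow = 2, byrow = TRUE)
n <- sum(counts)

# Hypothetical model: theta[1] = P(1 -> 2), theta[2] = P(2 -> 1)
q_theta <- function(theta) {
  matrix(c(1 - theta[1], theta[1],
           theta[2], 1 - theta[2]), nrow = 2, byrow = TRUE)
}

# Negative log-likelihood per step, so that its Hessian at the MLE is J
neg_loglike_per_step <- function(theta) -sum(counts * log(q_theta(theta))) / n

# MLE for this model: row-wise fractions of transitions that leave the state
theta_hat <- c(counts[1, 2] / sum(counts[1, ]),
               counts[2, 1] / sum(counts[2, ]))

# Observed information via a finite-difference Hessian
J_numeric <- optimHess(theta_hat, neg_loglike_per_step)

# Closed form for this model: diagonal, with entries (n_i. / n) / (theta_i (1 - theta_i))
J_exact <- diag(rowSums(counts) / n / (theta_hat * (1 - theta_hat)))

# Approximate standard errors: sqrt of the diagonal of J^{-1} / n
se <- sqrt(diag(solve(J_numeric)) / n)
```

The off-diagonal entries of `J_numeric` should be near zero here, since each parameter affects only one row of the transition matrix.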

General Comment on Likelihood Inference for Markov Chains

Higher-order Markov Chains

Hypothesis Testing

Likelihood-ratio testing is simple, for nested hypotheses:

Hypothesis Testing (cont’d)

  1. Calculate log likelihood ratio \(\Lambda\) on real data, call this \(\lambda\)
  2. Simulate from \(\hat{\theta}_{\mathrm{small}}\), get fake data \(Y_1^n\)
  3. Estimate \(\tilde{\theta}_{\mathrm{small}}\), \(\tilde{\theta}_{\mathrm{big}}\) from \(Y_1^n\)
  4. Calculate \(\tilde{\Lambda}\) from \(\tilde{\theta}_{\mathrm{small}}\), \(\tilde{\theta}_{\mathrm{big}}\)
  5. Repeat (2)–(4) \(B\) times to get sample of \(\tilde{\Lambda}\)
  6. \(p\)-value = \(\#\left\{\tilde{\Lambda} \geq \lambda\right\}/B\)
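The six steps can be sketched in R as follows. This is an illustration under stated assumptions, not a general-purpose implementation: the small model is a hypothetical symmetric two-state chain (one switching probability), the big model is the unrestricted two-state chain, the "real" data are themselves simulated from the small model, and \(\Lambda\) is taken to be the big-model log-likelihood minus the small-model log-likelihood. All function names are illustrative.

```r
set.seed(667)  # arbitrary seed, for reproducibility only

# Simulate n transitions (n + 1 states) from transition matrix q, starting in x1
rchain <- function(n, q, x1 = 1) {
  x <- integer(n + 1)
  x[1] <- x1
  for (t in 1:n) {
    x[t + 1] <- sample.int(nrow(q), size = 1, prob = q[x[t], ])
  }
  x
}

# Small model: symmetric two-state chain with switching probability theta
q_sym <- function(theta) {
  matrix(c(1 - theta, theta,
           theta, 1 - theta), nrow = 2, byrow = TRUE)
}

# Transition counts n_ij from a trajectory on states 1 and 2
trans_counts <- function(x) {
  from <- factor(head(x, -1), levels = 1:2)
  to <- factor(tail(x, -1), levels = 1:2)
  unclass(table(from, to))
}

# Log-likelihood sum_ij n_ij log q_ij, skipping pairs that never occur
loglike <- function(counts, q) sum(ifelse(counts > 0, counts * log(q), 0))

# Fit both models to one trajectory; return Lambda and the small-model MLE
log_lr <- function(x) {
  counts <- trans_counts(x)
  theta_small <- (counts[1, 2] + counts[2, 1]) / sum(counts)
  q_big <- counts / rowSums(counts)  # unrestricted MLE: row-normalized counts
  list(Lambda = loglike(counts, q_big) - loglike(counts, q_sym(theta_small)),
       theta_small = theta_small)
}

# Step 1: log likelihood ratio on the (stand-in) real data
n <- 300
x <- rchain(n, q_sym(0.3))
obs <- log_lr(x)
lambda <- obs$Lambda

# Steps 2-5: simulate from the fitted small model, refit both models, repeat
B <- 200
Lambda_tilde <- replicate(
  B, log_lr(rchain(n, q_sym(obs$theta_small), x1 = x[1]))$Lambda
)

# Step 6: fraction of simulated ratios at least as large as the observed one
p_value <- mean(Lambda_tilde >= lambda)
```

Because the small model is nested in the big one, every \(\tilde{\Lambda}\) and \(\lambda\) should be non-negative up to rounding, which makes a quick sanity check on the fitting code.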

Examples

Aggregates

Aggregates (Kalbfleisch and Lawless 1984)

Continuous-Valued Processes in Discrete Time

Nonparametric Density Estimation

Nonparametric Conditional Density Estimation

Example: Lynxes

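# Annual Canadian lynx trappings, 1821-1934 (built-in R data set)
data(lynx)
# Pair each year's count X(t) with the following year's count X(t+1)
lynx.lagged <- data.frame(t0=head(lynx,-1), t1=tail(lynx,-1))
plot(t1~t0, data=lynx.lagged, xlab="X(t)", ylab="X(t+1)", main="Canadian Lynxes")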

Example: Lynxes (cont’d)

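library(np)
# Kernel estimate of the conditional density of X(t+1) given X(t);
# bandwidths are picked by np's default data-driven selector
lynx.trans <- npcdens(t1 ~ t0, data=lynx.lagged)
# Default is a rotating visualization, so fix the viewing angle instead
plot(lynx.trans, view="fixed", phi=45, theta=30,
     xlab="X(t)", ylab="X(t+1)", zlab="p(X(t+1)=y|X(t)=x)")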

Summary

Backup: Smoothing the MLE for Large State Spaces

Backup: Smoothing the MLE for Language Models

Backup: Markov Chains with Countably-Infinite State Spaces

Backup: Markov Chains in Continuous Time

Backup: Further reading

References

Billingsley, Patrick. 1961. Statistical Inference for Markov Processes. Chicago: University of Chicago Press.

Guttorp, Peter. 1995. Stochastic Modeling of Scientific Data. London: Chapman & Hall.

Hall, Peter, Jeff Racine, and Qi Li. 2004. “Cross-Validation and the Estimation of Conditional Probability Densities.” Journal of the American Statistical Association 99:1015–26. http://www.ssc.wisc.edu/~bhansen/workshop/QiLi.pdf.

Kalbfleisch, J. D., and J. F. Lawless. 1984. “Least-Squares Estimation of Transition Probabilities from Aggregate Data.” The Canadian Journal of Statistics 12:169–82. http://www.jstor.org/stable/3314745.

Pereira, Fernando, Yoram Singer, and Naftali Z. Tishby. 1995. “Beyond Word \(n\)-Grams.” In Proceedings of the Third Workshop on Very Large Corpora, edited by David Yarowsky and Kenneth Church, 95–106. Columbus, Ohio: Association for Computational Linguistics. http://arxiv.org/abs/cmp-lg/9607016.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.