Hidden Markov Models and State Estimation

36-467/36-667

19 November 2020 (Lecture 23)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

Housekeeping

Previously

Markov Property vs. Measurement Noise

A Concrete Example

The Current State “Screens Off” the Past from the Future

\(\Prob{X(t+1)=x(t+1)|S(t)=s, X(1:t)=x(1:t)}\) \[\begin{eqnarray} & = & \sum_{r}{\Prob{X(t+1)=x(t+1)|S(t+1)=r, S(t)=s, X(1:t)=x(1:t)}\Prob{S(t+1)=r|S(t)=s,X(1:t)=x(1:t)}}\\ & = & \sum_{r}{\Prob{X(t+1)=x(t+1)|S(t+1)=r}\Prob{S(t+1)=r|S(t)=s}} \end{eqnarray}\]

Relating \(X(t)\) to Its Past through the State

\(\Prob{X(t+1)=x(t+1)|X(1:t)=x(1:t)}\) \[\begin{eqnarray} & = & \sum_{s}{\Prob{X(t+1)=x(t+1)|S(t)=s,X(1:t)=x(1:t)}\Prob{S(t)=s|X(1:t)=x(1:t)}}\\ & = & \sum_{s}{\left(\sum_{r}{\Prob{X(t+1)=x(t+1)|S(t+1)=r}\Prob{S(t+1)=r|S(t)=s}}\right) \Prob{S(t)=s|X(1:t)=x(1:t)}} \end{eqnarray}\]

\(\Rightarrow\) Given \(X(t)\), the only way \(X(t-1)\) can matter for \(X(t+1)\) is by changing the conditional distribution of \(S(t)\)

Relating \(X(t)\) to \(S(t)\)

History Matters for \(X(t)\)

Hidden Markov Models / State-Space Models

(for continuous variables, replace PMFs by pdfs as needed)
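An HMM is easy to simulate directly from its two ingredients, the state transitions and the observation mechanism. A minimal sketch in R (the name `rhmm` and its arguments are illustrative, not the lecture's code):

```r
# Minimal HMM simulator (illustrative sketch, not the lecture's code)
# rinitial(): draws S(1); rtransition(s): draws S(t+1) given S(t) = s;
# robs(s): draws X(t) given S(t) = s
rhmm <- function(n, rinitial, rtransition, robs) {
    s <- numeric(n)
    s[1] <- rinitial()
    for (t in 2:n) {
        s[t] <- rtransition(s[t - 1])
    }
    x <- sapply(s, robs)
    return(list(s = s, x = x))
}

# A two-state chain like the running demo: states +/-1, stay with
# probability 0.75, Gaussian observation noise with sd 1
toy.sim <- rhmm(n = 100,
                rinitial = function() { sample(c(-1, +1), size = 1) },
                rtransition = function(s) { if (runif(1) < 0.75) { s } else { -s } },
                robs = function(s) { rnorm(1, mean = s, sd = 1) })
```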

Some Basic Calculations for HMMs

Why We Want the Probability of an Observation Sequence

\[ \Prob{X(1:n)=x(1:n)} = \sum_{s(1:n)}{ \Prob{S(1)=s(1)} \prod_{t=2}^{n}{q(s(t-1), s(t))}\prod_{t=1}^{n}{g(s(t), x(t))}} \]
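For a finite state space, this sum can be evaluated literally, though the number of state sequences grows like \(|\mathcal{S}|^n\). A brute-force sketch (function and argument names are mine, not the lecture's):

```r
# Brute-force evaluation of P(X(1:n) = x(1:n)) by summing over all state
# sequences; feasible only for tiny n, which motivates the forward algorithm
# states: vector of possible states; p1: initial probabilities (same order);
# q(s, r): transition probability; g(s, x): observation density/PMF
brute.force.likelihood <- function(x, states, p1, q, g) {
    n <- length(x)
    seqs <- expand.grid(rep(list(states), n))  # all |states|^n sequences
    total <- 0
    for (i in 1:nrow(seqs)) {
        s <- unlist(seqs[i, ])
        prob <- p1[match(s[1], states)]
        if (n > 1) { for (t in 2:n) { prob <- prob * q(s[t - 1], s[t]) } }
        for (t in 1:n) { prob <- prob * g(s[t], x[t]) }
        total <- total + prob
    }
    return(as.numeric(total))
}
```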

A Recursion Trick (a.k.a. “The Forward Algorithm”) [1]

A Recursion Trick (a.k.a. “The Forward Algorithm”) [2]

A Recursion Trick (a.k.a. “The Forward Algorithm”) [3]

The Forward Algorithm
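The recursion computes \(\alpha_t(s) \equiv \Prob{X(1:t)=x(1:t), S(t)=s}\), via \(\alpha_1(s) = \Prob{S(1)=s} g(s, x(1))\) and \(\alpha_{t+1}(r) = g(r, x(t+1)) \sum_{s}{\alpha_t(s) q(s,r)}\), so the likelihood is \(\sum_{s}{\alpha_n(s)}\). A minimal R sketch for a finite state space, with states indexed \(1, \ldots, k\) (the names `p1`, `Q` and the matrix representation are my assumptions, not the lecture's code):

```r
# Forward algorithm sketch: p1 = initial state probabilities, Q[s, r] =
# transition probability from s to r, g(s, x) = observation density
forward.algorithm <- function(x, p1, Q, g) {
    n <- length(x)
    k <- length(p1)
    alpha <- matrix(0, nrow = k, ncol = n)
    alpha[, 1] <- p1 * sapply(1:k, function(s) { g(s, x[1]) })
    if (n > 1) {
        for (t in 2:n) {
            predicted <- as.vector(t(Q) %*% alpha[, t - 1])  # sum_s alpha[s] Q[s, r]
            alpha[, t] <- predicted * sapply(1:k, function(r) { g(r, x[t]) })
        }
    }
    return(list(alpha = alpha, likelihood = sum(alpha[, n])))
}
```

In practice the columns of `alpha` are rescaled at each step (accumulating the log of the scaling factors) to avoid numerical underflow on long series.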

Back to the Likelihood

Actually Doing the Calculations

The Particle Filter: Words

The Particle Filter

  1. At \(t=1\), make \(m\) independent draws \(S^*_1, S^*_2, \ldots S^*_m\) from \(\Prob{S(1)}\)
    • These are the particles
  2. For each \(S^*_i\), calculate \(g(S^*_i, x(t)) \equiv w_{i}\)
    • \(\Prob{X(t)=x(t)|X(1:t-1)=x(1:t-1)} \approx m^{-1} \sum_{i=1}^{m}{w_i}\)
  3. Resample the particles: draw \(m\) of them, with replacement, from \(S^*\), with probabilities \(\propto w_i\); call these \(\tilde{S}_1, \tilde{S}_2, \ldots \tilde{S}_m\)
  4. Set \(\hat{F}_t\) to the sample distribution of the \(\tilde{S}\)
    • The distribution of the resampled particles approximates \(\Prob{S(t)|X(1:t)=x(1:t)}\)
  5. Increment \(t\) by 1, and evolve the particles forward in time: for each \(\tilde{S}_i\), draw a new \(S^*_i\) from \(q(\tilde{S}_i, \cdot)\)
    • Distribution of the new \(S^*\) approximates \(\Prob{S(t+1)|X(1:t)=x(1:t)}\).
  6. Go to (2)

The Particle Filter: Code

particle.filter <- function(x, rinitial, rtransition, dobs, m) {
    n <- length(x)
    # initial particles: m independent draws from the distribution of S(1)
    particles <- replicate(m, rinitial())
    particle.history <- matrix(0, nrow = m, ncol = n)
    for (t in 1:n) {
        # weight each particle by the likelihood of the current observation
        weights <- dobs(state = particles, observation = x[t])
        # resample with replacement, probabilities proportional to the weights
        particles <- sample(particles, size = m, replace = TRUE, prob = weights)
        particle.history[, t] <- particles
        # evolve each particle forward under the transition dynamics
        particles <- sapply(particles, rtransition)
    }
    return(particle.history)
}

The Particle Filter: For Our Little Demo

sim.pf <- particle.filter(x=sim$x,
                          rinitial=function() { sample(c(-1,+1), size=1) },
                          rtransition=function (x) {
                              stay <- sample(c(TRUE, FALSE), size=1, prob=c(0.75, 0.25))
                              if (stay) { return(x) } else { return(-x) }
                          },
                          dobs=function(state, observation) {
                              dnorm(x=observation, mean=state, sd=1)
                          },
                          m=100)
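Once `sim.pf` is in hand, the filtering probabilities can be read off from the particle history: the fraction of particles in a given state at time \(t\) approximates \(\Prob{S(t)=s|X(1:t)=x(1:t)}\). A small helper for the two-state demo (the function name is mine, not the lecture's):

```r
# Fraction of particles equal to +1 in each column of the particle history,
# approximating P(S(t) = +1 | X(1:t) = x(1:t)) for the two-state demo
filtered.prob.plus <- function(particle.history) {
    colMeans(particle.history == +1)
}
# e.g. plot(filtered.prob.plus(sim.pf), type = "l", xlab = "t",
#           ylab = "Pr(S(t) = +1 | data)")
```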

The Particle Filter: Run on Our Little Demo

Issues with the Particle Filter

  1. Monte Carlo error: really want \(m\) to be as large as possible
    • But it costs computing time
    • Error is \(O(1/\sqrt{m})\) so diminishing returns
  2. “Particle collapse”: if one \(w_i \gg w_j\) for all the other \(j\), then (almost) all new particles are copies of \(S^*_i\) and the approximation is bad
    • Lots of hacks around this
  3. What if we don’t know transitions \(q\) or observation probabilities \(g\)?
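A standard diagnostic for issue (2) is the effective sample size of the normalized weights: it equals \(m\) when all weights are equal and approaches 1 as a single weight dominates. (A sketch; resampling only when the ESS falls below some threshold, e.g. \(m/2\), is one common hack, though the threshold is a convention, not from the lecture.)

```r
# Effective sample size of a weight vector: m for equal weights,
# near 1 under particle collapse
effective.sample.size <- function(weights) {
    w <- weights / sum(weights)
    return(1 / sum(w^2))
}
```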

Parameter Estimation

What About Space? What About Time?

Summing Up

Many more details in the handout on the website

Backup: Applying Indirect Inference to Our Demo

Backup: Finding the Auxiliary Parameter Estimates

# Auxiliary statistics: coefficients of an AR(2) fitted to a series
aux.estimator <- function(x) {
    ar2.fit <- ar.ols(x, aic = FALSE, order.max = 2)
    return(ar2.fit$ar)
}
# Average the auxiliary estimates over s simulations of the generative model
aux.from.sim <- function(q, s, n = length(sim$x)) {
    estimates <- replicate(s, aux.estimator(noisy.rpmchain(n = n, q = q)$x))
    return(rowMeans(estimates))
}

Backup: Can We Go From Auxiliary Parameters to Generative Parameters?

Backup: Optimizing the Generative Model so Simulation-Auxiliaries Match Data-Auxiliaries

toy.indirect.inference <- function(x, starting.q, s) {
    n <- length(x)
    aux.data <- aux.estimator(x)
    # squared distance between data and simulation auxiliary estimates
    discrepancy <- function(param) {
        q <- param
        aux.mean <- aux.from.sim(q = q, s = s, n = n)
        return(sum((aux.data - aux.mean)^2))
    }
    fit <- optim(par = starting.q, fn = discrepancy, method = "Brent", lower = 0, 
        upper = 1)
    return(fit$par)
}
(toy.ii <- toy.indirect.inference(x = sim$x, starting.q = 0.5, s = 500))
## [1] 0.7092326
summary(replicate(30, toy.indirect.inference(x = sim$x, starting.q = 0.5, s = 500)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6929  0.7037  0.7108  0.7110  0.7167  0.7332
  • All the variation here is from simulation to simulation