Introduction to Hidden Markov Models

Application to movement ecology

Marie-Pierre Etienne

ENSAI - CREST

https://marieetienne.github.io/MAF/

December 9, 2024

Ecological context

Movement Ecology Paradigm

[Nat+08] presents individual movement as the result of:

Movement drivers by Nathan, Getz, Revilla et al. [Nat+08].
  • Motion capacities
  • Internal state
  • Environment

Movement informs us about internal states and habitat preferences.

From Movement to Movement Data

Figures: Movement Path 1, Movement Path 2, Movement Path 3.

Description

Movement description

Continuous Process and Sampling

A continuous process sampled at some discrete, potentially irregular times.

Time series with values in \(\mathbb{R}^2\) (on Earth…).

\[ \begin{array}{|c|c|c|c|} \hline \text{Time} & \text{Location} & \text{Turning Angle} & \text{Speed} \\ \hline t_{0} & (x_0, y_0) & \text{NA} & \text{NA} \\ t_{1} & (x_1, y_1) & \text{NA} & sp_1 \\ t_{2} & (x_2, y_2) & ang_2 & sp_2 \\ \vdots & \vdots & \vdots & \vdots \\ t_{n} & (x_n, y_n) & ang_n & sp_n \\ \hline \end{array} \]
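The speed and turning-angle columns of this table can be derived from the raw locations. A minimal Python sketch, assuming planar coordinates (e.g. already projected) stored in hypothetical arrays `times`, `x`, `y`:

```python
import numpy as np

# Hypothetical sampled track: timestamps (seconds) and planar coordinates.
times = np.array([0., 60., 120., 180., 240.])
x = np.array([0., 10., 25., 30., 28.])
y = np.array([0., 5., 7., 20., 35.])

dx, dy = np.diff(x), np.diff(y)
dt = np.diff(times)

step_length = np.hypot(dx, dy)       # metres, if x and y are in metres
speed = step_length / dt             # one value per step (undefined at t_0)

heading = np.arctan2(dy, dx)         # absolute direction of each step
turning_angle = np.diff(heading)     # change of direction between successive steps
turning_angle = (turning_angle + np.pi) % (2 * np.pi) - np.pi   # wrap into (-pi, pi]

print(speed, turning_angle)
```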

Ecological question: Identifying various movement patterns

Rarely in a supervised context

Artisanal fishing trips in Toliara, Madagascar (fishing points).

Mostly unsupervised

Peruvian booby data courtesy of Sophie Bertrand.

From Movement Data to Movement Model

Often analysed using discrete-time models, typically Hidden Markov Models.

Movement decomposition

Heterogeneity in movement patterns is interpreted as different internal states.

Peruvian booby data courtesy of Sophie Bertrand.

Accounting for internal states

  • Change point detection approach, based on [Pic+07]
Segmentation illustration

\[\left .\begin{array}{c} Y_{1k} \sim f(\theta_{1\ell})\\ Y_{2k} \sim f(\theta_{2\ell}) \end{array} \right\rbrace \mbox{ if } k \mbox{ in region }I_{\ell}=[\tau_{\ell-1}+1,\tau_{\ell}]\]

The best segmentation for a given number of change points \(L\): \[\arg\min_{0< \tau_1 < \ldots < \tau_L} C(\boldsymbol{Y}) \]

Brute-force complexity: \(O(N^L)\).

Dynamic programming complexity: \(O(N^2)\), or even less with a pruning strategy (a sketch follows below).

Limitation: the choice of \(f\) is restricted.
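To make the dynamic programming idea concrete, here is a minimal Python sketch for one-dimensional data with a Gaussian-mean (least-squares) cost. The function names and the cost choice are illustrative assumptions, not the implementation of [Pic+07].

```python
import numpy as np

def segment_cost(y):
    """Cost of every possible segment y[a:b+1]: sum of squared deviations
    from the segment mean (one possible choice for f)."""
    n = len(y)
    cost = np.full((n, n), np.inf)
    for a in range(n):
        s = s2 = 0.0
        for b in range(a, n):
            s += y[b]
            s2 += y[b] ** 2
            cost[a, b] = s2 - s * s / (b - a + 1)
    return cost

def best_segmentation(y, L):
    """Best positions of L change points by dynamic programming, O(L N^2)."""
    n, cost = len(y), segment_cost(y)
    dp = np.full((L + 1, n), np.inf)   # dp[l, b]: best cost of y[:b+1] with l change points
    arg = np.zeros((L + 1, n), dtype=int)
    dp[0] = cost[0]
    for l in range(1, L + 1):
        for b in range(l, n):
            # candidate last change points tau = l-1, ..., b-1
            cand = dp[l - 1, l - 1:b] + cost[l:b + 1, b]
            best = int(np.argmin(cand))
            arg[l, b], dp[l, b] = best + l - 1, cand[best]
    taus, b = [], n - 1                # backtrack the change points
    for l in range(L, 0, -1):
        taus.append(arg[l, b])
        b = arg[l, b]
    return dp[L, n - 1], sorted(taus)

y = np.concatenate([np.random.normal(0, 1, 50), np.random.normal(3, 1, 50)])
print(best_segmentation(y, L=1))       # change point expected near index 49
```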

Latent (Hidden) variable models

A simple latent variable model

  • A parametric model (\(M, \boldsymbol{\theta}\)) produces \({\boldsymbol{Y}}\) and \({\boldsymbol{Z}}\).

  • The only observed data are \({\boldsymbol{Y}}\) while \({\boldsymbol{Z}}\) are hidden variables.

The questions are:

  • Parameters: is it still possible to estimate \(\boldsymbol{\theta}\)?
  • Information on \({\boldsymbol{Z}}\): is it possible to “reconstruct” the unobserved data \({\boldsymbol{Z}}\)?

Bayes' formula is the key:

\[{\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}})={\mathbb{P}}({\boldsymbol{Y}}\vert {\boldsymbol{Z}}){\mathbb{P}}({\boldsymbol{Z}})={\mathbb{P}}({\boldsymbol{Z}}\vert {\boldsymbol{Y}}) {\mathbb{P}}({\boldsymbol{Y}})\]

Mixture model as a simple latent variable model

Assume that the number of states \(K\) is known.

  • Modelling \(Z\): \(\pi_k={\mathbb{P}}(Z_i=k), \quad k=1,\ldots, K, \quad \sum_k \pi_k=1\)
  • \(Z_i \overset{i.i.d}{\sim} \mathcal{M}(1, \boldsymbol{\pi}), \quad P(Z_{ik}=1)=\pi_k\)
  • Modelling \(Y\): The \(Y_i\)'s are assumed to be independent conditionally on \({\boldsymbol{Z}}\): \((Y_i\vert Z_i = k) \overset{i.i.d}{\sim} f_{\gamma_k}().\)

Model parameters \(\boldsymbol{\theta}=(\boldsymbol{\pi}, \boldsymbol{\gamma})\)

Mixture model as a simple latent variable model

We could assume that

  • A step \(Y_k\) is a descriptor of movement between position at time \(t_{k}\) and \(t_{k+1}\)
  • Conditionally on some hidden state \(Z_k = \ell\), the distribution of \(Y_k\) follows some parametric distribution, say Gaussian

Question: if \(K=2\), what are \(\boldsymbol{\pi}\) and \(\boldsymbol{\gamma}\)?

Mixture model Properties

  • Couples \(\{(Y_i, Z_i)\}\) are i.i.d.

  • Label switching: the model is invariant under any permutation of the labels \(\{1, \dots, K\}\) \(\Rightarrow\) the mixture model has \(K!\) equivalent definitions.

  • Distribution of a \(Y_i\): \[{\mathbb{P}}(Y_i)=\sum_{k=1}^K {\mathbb{P}}(Y_i, Z_i=k)= \sum_{k=1}^K {\mathbb{P}}(Z_i=k)\, {\mathbb{P}}(Y_i \vert Z_i=k)\]

  • Distribution of \({\boldsymbol{Y}}\):

\[\begin{align} {\mathbb{P}}({\boldsymbol{Y}}; \boldsymbol{\theta}) & = \prod_{i=1}^n \sum_{k=1}^K {\mathbb{P}}(Y_i , Z_i=k;\boldsymbol{\theta}) \\ &= \prod_{i=1}^n \sum_{k=1}^K {\mathbb{P}}(Z_i=k; \boldsymbol{\pi})\, {\mathbb{P}}(Y_i \vert Z_i=k; \boldsymbol{\gamma}) \\ &= \prod_{i=1}^n \sum_{k=1}^K \pi_k\, f_{\gamma_k}(Y_i) \end{align}\]
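As a quick check of this formula, here is a minimal Python sketch that evaluates the mixture log-likelihood for Gaussian components, using a log-sum-exp for numerical stability; the parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mixture_loglik(y, pi, mu, sigma):
    """log P(Y; theta) = sum_i log sum_k pi_k f_{gamma_k}(Y_i), Gaussian components."""
    # shape (n, K): log pi_k + log f_{gamma_k}(y_i)
    log_terms = np.log(pi) + norm.logpdf(y[:, None], loc=mu, scale=sigma)
    return logsumexp(log_terms, axis=1).sum()

y = np.array([0.1, 2.3, -0.5, 1.9])
print(mixture_loglik(y, pi=np.array([0.4, 0.6]),
                     mu=np.array([0.0, 2.0]), sigma=np.array([1.0, 1.0])))
```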

Adding temporal structure: Hidden Markov Model

Modelling the dependence in the state sequence: if an animal is feeding at time \(i\), it is more likely to be feeding at time \(i+1\) than if it was travelling at time \(i\). \[{\mathbb{P}}(Z_{i+1}=1 \vert Z_{i}=1) \ne {\mathbb{P}}(Z_{i+1}=1 \vert Z_{i}=2)\]

Markov Chain definition

\({\boldsymbol{Z}}\) is a Markov chain if \[P(Z_{i+1} \vert Z_{1:i}) = P(Z_{i+1} \vert Z_{i})\]

where \(Z_{1:i}\) stands for \(Z_{1}, \ldots, Z_i\).

\({\boldsymbol{Z}}\) is completely defined by the initial distribution \(\boldsymbol{\nu}\), \(\nu_k={\mathbb{P}}(Z_1=k)\) and the transition matrix \(\boldsymbol{\Pi}\)

  • Hidden States \({\boldsymbol{Z}}\) model: \({\boldsymbol{Z}}\) is assumed to follow a Markov Chain model with unknown initial distribution \(\boldsymbol{\nu}\) and transition matrix \(\boldsymbol{\Pi}\).

  • Observations \({\boldsymbol{Y}}\) model: The \(Y_i\)'s are assumed to be independent conditionally on \({\boldsymbol{Z}}\): \((Y_i\vert Z_i = k) \overset{i.i.d}{\sim} f_{\gamma_k}().\)

  • Model parameters are \(\boldsymbol{\theta}=(\boldsymbol{\nu}, \boldsymbol{\Pi}, \boldsymbol{\gamma})\)

Define all the key elements of a hidden Markov model with

  • \(K= 3\) hidden states
  • Conditionally on \({\boldsymbol{Z}}\), \({\boldsymbol{Y}}\) are normally distributed (with \(K\) different expectations and variances)
  • Simulate such a model with your preferred language (a sketch follows below)
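A minimal Python sketch of such a simulation; the particular values of \(\boldsymbol{\nu}\), \(\boldsymbol{\Pi}\) and of the Gaussian parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters for K = 3 hidden states.
nu = np.array([1/3, 1/3, 1/3])               # initial distribution
Pi = np.array([[0.90, 0.05, 0.05],           # transition matrix (rows sum to 1)
               [0.10, 0.80, 0.10],
               [0.05, 0.15, 0.80]])
mu = np.array([-2.0, 0.0, 3.0])              # state-specific means
sigma = np.array([0.5, 1.0, 1.5])            # state-specific standard deviations

def simulate_hmm(n, nu, Pi, mu, sigma):
    """Simulate the hidden Markov chain Z and the Gaussian emissions Y given Z."""
    K = len(nu)
    Z = np.empty(n, dtype=int)
    Z[0] = rng.choice(K, p=nu)
    for i in range(1, n):
        Z[i] = rng.choice(K, p=Pi[Z[i - 1]])
    Y = rng.normal(mu[Z], sigma[Z])
    return Z, Y

Z, Y = simulate_hmm(500, nu, Pi, mu, sigma)
```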

Statistical inference of incomplete data models

Goal:

\[(\widehat{\boldsymbol{\gamma}},\widehat{\boldsymbol{\Pi}}, \widehat{\boldsymbol{\nu}}) = \arg\max_{\boldsymbol{\gamma}, \boldsymbol{\Pi}, {\boldsymbol{\nu}}} \log P({\boldsymbol{Y}}; \boldsymbol{\gamma}, \boldsymbol{\Pi}, \boldsymbol{\nu})\]

\[\log {\mathbb{P}}({\boldsymbol{Y}}; \boldsymbol{\theta}) = \log \sum_{(k_1, \ldots, k_N) \in \{1,\ldots,K\}^N }{\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}}= (k_1, \ldots,k_N); \boldsymbol{\theta})\]

  • No analytical estimators.

  • This sum is not always computable, since it typically involves \(K^N\) terms: \(2^{100}\approx10^{30}\).

  • A brute-force algorithm is not the way to go.

A convenient notation: \(Z_{ik}=1\) if \(Z_i=k\), and \(Z_{ik}=0\) otherwise.

\[\begin{align} \log {\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}}; \boldsymbol{\theta}) & = \sum_k Z_{1k} \log \nu_k \\ &+ \sum_{i > 1} \sum_{k, \ell} Z_{i-1,k}Z_{i,\ell} \log \pi_{k\ell} \\ & + \sum_i \sum_k Z_{ik} \log f(y_i ; \gamma_k) \end{align}\]

Idea: replace \(Z_{ik}\) by its best guess.

\[\begin{align} {\mathbb{E}}\left( \log {\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}}; \boldsymbol{\theta})\vert Y_{1:N}\right) & = \sum_k{\mathbb{E}}\left( Z_{1k} \vert Y_{1:N}\right) \log \nu_k \\ & + \sum_{i > 1} \sum_{k, \ell} {\mathbb{E}}\left( Z_{i-1,k}Z_{i,\ell}\vert Y_{1:N}\right) \log \pi_{k\ell} \\ & + \sum_i \sum_k {\mathbb{E}}\left(Z_{ik}\vert Y_{1:N}\right) \log f(y_i ; \gamma_k) \end{align}\]

Expectation-Maximization (EM) algorithm


Bayes Formula

\[\begin{align} {\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}};\boldsymbol{\theta}) & = {\mathbb{P}}({\boldsymbol{Y}}\vert {\boldsymbol{Z}}; \boldsymbol{\theta}) P({\boldsymbol{Z}}; \boldsymbol{\theta}),\\ & = {\mathbb{P}}({\boldsymbol{Z}}\vert {\boldsymbol{Y}}; \boldsymbol{\theta}) {\mathbb{P}}({\boldsymbol{Y}}; \boldsymbol{\theta}). \end{align}\]

Therefore, \[\begin{align} \log {\mathbb{P}}({\boldsymbol{Y}}; \boldsymbol{\theta}) & = \log \left \lbrace {\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}};\boldsymbol{\theta}) / {\mathbb{P}}({\boldsymbol{Z}}\vert {\boldsymbol{Y}}; \boldsymbol{\theta}) \right\rbrace\\ & = \log {\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}};\boldsymbol{\theta}) - \log {\mathbb{P}}({\boldsymbol{Z}}\vert {\boldsymbol{Y}}; \boldsymbol{\theta}) \\ \end{align}\]

For a given \(\boldsymbol{\theta}_0\), we may compute \({\mathbb{P}}_{\boldsymbol{\theta}_0}({\boldsymbol{Z}}\vert {\boldsymbol{Y}})={\mathbb{P}}({\boldsymbol{Z}}\vert {\boldsymbol{Y}}; \boldsymbol{\theta}_0)\) and \[\begin{align} \log {\mathbb{P}}({\boldsymbol{Y}}; \boldsymbol{\theta}) &= {\mathbb{E}}_{\boldsymbol{\theta}_0}(\log {\mathbb{P}}({\boldsymbol{Y}}, {\boldsymbol{Z}};\boldsymbol{\theta})\vert {\boldsymbol{Y}}) - {\mathbb{E}}_{\boldsymbol{\theta}_0}(\log {\mathbb{P}}({\boldsymbol{Z}}\vert {\boldsymbol{Y}}; \boldsymbol{\theta})\vert {\boldsymbol{Y}})\\ & = Q(\boldsymbol{\theta}, \boldsymbol{\theta}_0) - H(\boldsymbol{\theta}, \boldsymbol{\theta}_0) \end{align}\]

Expectation-Maximization algorithm

  • E-step: compute \[Q(\boldsymbol{\theta},\boldsymbol{\theta}^{(k)})\] for every \(\boldsymbol{\theta}\).
  • M-step: set
    \[\boldsymbol{\theta}^{(k+1)}=\arg\max_{\boldsymbol{\theta}}\, Q(\boldsymbol{\theta},\boldsymbol{\theta}^{(k)})\]

A sketch of this alternation, for the simpler Gaussian mixture model of the previous section, follows below.
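A minimal Python sketch of the E/M alternation, written for the Gaussian mixture model (independent \(Z_i\)'s, so both steps are in closed form). The function name, initialisation scheme and number of iterations are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(y, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture on a 1-d sample y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pi = np.full(K, 1 / K)
    mu = rng.choice(y, size=K, replace=False)      # crude initialisation
    sigma = np.full(K, y.std())
    for _ in range(n_iter):
        # E-step: posterior probabilities tau[i, k] = P(Z_i = k | y_i; theta)
        dens = pi * norm.pdf(y[:, None], loc=mu, scale=sigma)
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of (pi, mu, sigma)
        w = tau.sum(axis=0)
        pi = w / n
        mu = (tau * y[:, None]).sum(axis=0) / w
        sigma = np.sqrt((tau * (y[:, None] - mu) ** 2).sum(axis=0) / w)
    return pi, mu, sigma
```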

Example: EM for Gaussian HMM

Let’s compute it together

Exercise

  • Simulate the following HMM

\(K=2\), \(\pi_{11}=0.9\), \(\pi_{22}=0.8\)

\[Y_i \vert Z_i = 1 \sim\mathcal{N}(2, 1)\]

\[Y_i \vert Z_i = 2 \sim\mathcal{N}(0, 1.5)\]

  • Derive an EM algorithm to estimate the parameters from the simulated dataset.

Baum-Welch algorithm

Initialisation of \(\boldsymbol{\theta}^{(0)}=(\boldsymbol{\nu}, \boldsymbol{\Pi}, \gamma_1, \ldots, \gamma_K)^{(0)}\).

While convergence is not reached:

  • [E-step] Calculation of

    \[\begin{align} \tau^{(\ell)}_{i}(k)&={\mathbb{E}}[Z_{ik}| {\boldsymbol{Y}}, \boldsymbol{\theta}^{(\ell-1)}] = {\mathbb{P}}(Z_i = k | {\boldsymbol{Y}}, \boldsymbol{\theta}^{(\ell-1)})\\ \xi^{(\ell)}_{i}(k,h) &= {\mathbb{E}}[Z_{i-1,k}Z_{ih}| {\boldsymbol{Y}}, \boldsymbol{\theta}^{(\ell-1)}]\\ \end{align}\]

    Smart algorithm: the Forward-Backward algorithm.

  • [M-step] Maximization in \(\boldsymbol{\theta}=(\boldsymbol{\nu}, \boldsymbol{\Pi}, \boldsymbol{\gamma})\) of

\[\sum_k \tau^{(\ell)}_{1}(k)\log \nu_k + \sum_{i > 1} \sum_{k, h} \xi^{(\ell)}_{i}(k,h) \log \pi_{kh}+ \sum_i \sum_k \tau^{(\ell)}_{i}(k) \log f(y_i; \gamma_k)\]

Transition estimation \[\hat{\Pi}_{kh} = \frac{\sum_{i=2}^{N}{\mathbb{E}}[Z_{i-1,k}Z_{ih}\vert Y_{1:N}]}{\sum_{i=2}^{N}{\mathbb{E}}[Z_{i-1,k}\vert Y_{1:N}]}= \frac{\sum_{i=2}^{N}\xi^{(\ell)}_{i}(k,h)}{\sum_{i=2}^{N}\tau^{(\ell)}_{i-1}(k)}\]

If Gaussian emissions: \[\begin{align} \hat{\mu}_k & = \frac{\sum_{i=1}^N{\mathbb{E}}[Z_{i,k}\vert Y_{1:N}] Y_i}{\sum_{i=1}^{N}{{\mathbb{E}}[Z_{i,k}\vert Y_{1:N}]}} = \frac{\sum_{i=1}^N \tau^{(\ell)}_{i}(k) Y_i}{\sum_{i=1}^{N}{\tau^{(\ell)}_{i}(k)}}\\ \hat{\sigma}_k^2 & = \frac{\sum_{i=1}^N \tau^{(\ell)}_{i}(k) (Y_i-\hat{\mu}_k)^2 }{\sum_{i=1}^{N}{\tau^{(\ell)}_{i}(k)}} \end{align}\]
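In code, these M-step updates are simple weighted averages. A minimal Python sketch, assuming the E-step quantities are stored as arrays `tau` of shape \((N, K)\) and `xi` of shape \((N-1, K, K)\) (names and layout are illustrative):

```python
import numpy as np

def m_step(Y, tau, xi):
    """M-step updates given tau[i, k] = E[Z_ik | Y] and
    xi[i, k, h] = E[Z_{i,k} Z_{i+1,h} | Y] (0-indexed transitions)."""
    nu_hat = tau[0]                               # initial distribution
    Pi_hat = xi.sum(axis=0)
    Pi_hat /= Pi_hat.sum(axis=1, keepdims=True)   # normalise each row
    w = tau.sum(axis=0)                           # total posterior weight per state
    mu_hat = (tau * Y[:, None]).sum(axis=0) / w
    sigma_hat = np.sqrt((tau * (Y[:, None] - mu_hat) ** 2).sum(axis=0) / w)
    return nu_hat, Pi_hat, mu_hat, sigma_hat
```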

Problem: computing \(\tau^{(\ell)}_{i}(k)= {\mathbb{E}}[Z_{i,k}\vert Y_{1:N}]\) and \(\xi^{(\ell)}_{i}(k,h)={\mathbb{E}}[Z_{i-1,k}Z_{ih}\vert Y_{1:N}]\).

The Forward-Backward algorithm provides a recursion to compute these conditional expectations,

\[{\mathbb{E}}(Z_{ik}\vert {\boldsymbol{Y}}) \quad \text{and} \quad {\mathbb{E}}(Z_{i-1,k}Z_{ih}\vert {\boldsymbol{Y}}).\]

Forward \[\alpha_n(i)={\mathbb{P}}_{\boldsymbol{\theta}}\left( Y_{0:n}, Z_n=i\right)\]

  • Initialisation \(\alpha_0(i)=\nu(i)f_{\gamma_i}(y_0)\)

  • for \(n=0,\ldots, N-1\) \[\alpha_{n+1}(i)= \left(\sum_{k=1}^K \alpha_n(k) \Pi_{ki}\right) f_{\gamma_i}(y_{n+1})\]

  • Termination \[{\mathbb{P}}(Y_{0:N})=\sum_{k=1}^K\alpha_N(k)\] (a code sketch follows below)
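A minimal Python sketch of this forward recursion for Gaussian emissions, with per-step scaling to avoid numerical underflow (the scaled \(\alpha\)'s follow the same recursion, and the log-likelihood is recovered from the scaling factors). Names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def forward(Y, nu, Pi, mu, sigma):
    """Scaled forward recursion: alpha[n, i] is proportional to P(Y_{0:n}, Z_n = i)."""
    N, K = len(Y), len(nu)
    emis = norm.pdf(Y[:, None], loc=mu, scale=sigma)   # emission densities f_{gamma_k}(y_n)
    alpha = np.empty((N, K))
    scale = np.empty(N)
    alpha[0] = nu * emis[0]                            # initialisation
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ Pi) * emis[n]       # sum_k alpha_{n-1}(k) Pi_{ki}
        scale[n] = alpha[n].sum()
        alpha[n] /= scale[n]
    loglik = np.log(scale).sum()                       # log P(Y_{0:N-1})
    return alpha, scale, loglik
```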

Backward \[\beta_n(i)={\mathbb{P}}_{\boldsymbol{\theta}}\left( Y_{n+1:N}\vert Z_n=i\right)\]

  • Initialisation \(\beta_N(i)=1\)

  • for \(n=N-1,\ldots, 0\) \[\beta_{n}(i)= \sum_{k=1}^K \Pi_{ik}\, f_{\gamma_k}(y_{n+1})\,\beta_{n+1}(k) \]

  • Termination \[{\mathbb{P}}(Y_{0:N})=\sum_{k=1}^K \nu(k)\, f_{\gamma_k}(y_0)\, \beta_0(k)\] (a companion sketch follows below)
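A companion Python sketch of the backward recursion and of the posterior quantities \(\tau_i(k)\) and \(\xi_i(k,h)\) needed in the E-step. It reuses the `forward` function (and its scaling factors) from the previous sketch; names are again illustrative.

```python
import numpy as np
from scipy.stats import norm

def backward(Y, Pi, mu, sigma, scale):
    """Scaled backward recursion, using the scaling factors returned by forward()."""
    N, K = len(Y), Pi.shape[0]
    emis = norm.pdf(Y[:, None], loc=mu, scale=sigma)
    beta = np.empty((N, K))
    beta[-1] = 1.0                                     # initialisation at the last observation
    for n in range(N - 2, -1, -1):
        beta[n] = (Pi @ (emis[n + 1] * beta[n + 1])) / scale[n + 1]
    return beta

def e_step(Y, nu, Pi, mu, sigma):
    """Posteriors tau[i, k] = E[Z_ik | Y] and xi[i, k, h] = E[Z_{i,k} Z_{i+1,h} | Y]."""
    alpha, scale, loglik = forward(Y, nu, Pi, mu, sigma)
    beta = backward(Y, Pi, mu, sigma, scale)
    emis = norm.pdf(Y[:, None], loc=mu, scale=sigma)
    tau = alpha * beta                                 # already normalised thanks to the scaling
    xi = (alpha[:-1, :, None] * Pi[None, :, :]
          * (emis[1:] * beta[1:])[:, None, :]) / scale[1:, None, None]
    return tau, xi, loglik
```

Alternating `e_step` with the `m_step` sketch above gives one Baum-Welch iteration; the log-likelihood returned by `forward` can serve as the convergence criterion.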

State decoding: Viterbi algorithm