Bayesian calculus

# Bayesian calculus
## Practical methods
### Marie-Pierre Etienne, Etienne Rivot
### <a href="https://marieetienne.github.io/bayesian_and_ecology" class="uri">https://marieetienne.github.io/bayesian_and_ecology</a>
### M2 Mode - Data science

---

---
name: intro
# Introduction

---
template: intro
## Previously in Bayes course

---

A posterior distribution is defined  by

$$ [\theta \vert y ] = \frac{[y \vert \theta ][\theta]}{[ y ]} \mbox{ with } [ y ] = \int_{\theta} [y \vert \theta ][\theta] d\theta.$$
--

`$$[\theta \vert y ] \varpropto [y \vert \theta ][\theta][ y ]$$`

### But nothing can stop us!!

`$$[\theta \vert y ] = ?$$`
---
name: analytic
# Analytical posterior: the magic
<img class="logopos_right" src="magic.png" style="width:4%">

## Binomial example

### Model presentation

.pull-left[
* Data model: `$$\class{bleu}{Y} \sim \mathcal{B}(\class{orange}{n}, \class{bleu}{p}), \quad n \mbox{ known }$$`
* Parameter: `$\class{bleu}{p}$`.
* Specifying an uniform prior `$$\class{bleu}{p} \sim \mathcal{U}(0,1)$$`
]

--
.pull-right[

]

`$$[p\vert y ] =  ?$$`
---
template: analytic

## Binomial example

### Posterior determination

`$$[p\vert y ] =  \frac{[y\vert p] [p]}{[y]}$$`
--

- Likelihood: `$[y\vert p] = {n \choose y} p^y (1-p)^{n-y}$`
- prior: `$[p] = 1_{0\leq p \leq 1}$`
- .rouge[predictive]: `$[y] = \int_{0}^1 {n \choose y} p^y (1-p)^{n-y}$`

`$[y]$` is a constant but not necessarily easy to compute.

---
template: analytic

## Binomial example

### Posterior identification

Posterior shape:

$$ [p\vert y] \propto K p^y (1-p)^{n-y} 1_{0\leq p \leq 1},$$
with `$K$` a constant (in terms of `$p$`) to be determined.

#### Beta distribution 
`$X$` follows a Beta distribution with shape parameter `$a$` and shape 2 parameter `$b$`, if  `$X$` admits a density `$d_X$` given by 
`$$d_X(t)=\left\lbrace
\begin{array}{l}
 \frac{\Gamma(a+b)  }{\Gamma(a)\Gamma(b)} t^{a-1} (1-t)^{b-1} \quad t\in [0,1] ,\cr
0 \quad \mbox{otherwise}\cr
\end{array}\right.$$`

See [Cheatsheet distribution](https://marieetienne.github.io/slides/LoiProbas.pdf)

---
template: analytic
## Binomial example

### Posterior identification

.pull-left[
* Data model: `$$\class{vert}{Y} \sim \mathcal{B}(\class{orange}{n}, \class{bleu}{p}), \quad n \mbox{ known }$$`
* Parameter: `$\class{bleu}{p}$`.
* Specifying an uniform prior `$$\class{bleu}{p} \sim \mathcal{B}eta(1,1)$$`
]

]

$$ p\vert y \sim \mathcal{B}eta(\class{vert}{y}+1, \class{orange}{n}-\class{vert}{y}+1)  $$
---
template: analytic

## Binomial example

### Information within the data

`$$\class{vert}{Y} \sim \mathcal{B}(\class{orange}{n}, \class{bleu}{p}), \quad n \mbox{ known ;} \quad \class{bleu}{p} \sim \mathcal{B}eta(1,1)$$`

`$n=10, y=4$`

$$  p\vert y \sim \mathcal{B}eta(\class{vert}{4}+1, \class{orange}{10}-\class{vert}{4}+1) $$

]

`$n=100, y=40$`

<img src="post_beta_100_40.png" alt="Posterior beta" style="width:70%" class = "centerimg" />
  
$$ p\vert y \sim \mathcal{B}eta(\class{vert}{40}+1, \class{orange}{100}-\class{vert}{40}+1)  $$

]

---
template: analytic
## Binomial example

### Some aditional remarks

- A uniform distribution is a `$\mathcal{B}eta(1,1)$` distribution.
- Conjugacy:
`$$p\sim \mathcal{B}eta(1,1),\quad Y\sim\mathcal{B}(n,p), \quad p\vert y \sim \mathcal{B}eta(y+1,n-y+1)$$`

--
* Interpretation of parameter `$a$` and `$b$` as a number of success and failures.
* What is the value of `$[y]$` in this case ?

---
template: analytic

## Regression model

--
.pull-left[
* Model : `$\class{vert}{Y_k}=\class{bleu}{\beta_0} + \class{bleu}{\beta_1} \class{orange}{x_k} +E_k, \quad E_k\overset{ind}{\sim}\mathcal{N}(0,\class{orange}{\sigma^2})$`
* Parameter: `$\class{bleu}{\beta_0}, \class{bleu}{\beta_1}$`.
* Parameter assumed to be known: `$\class{orange}{\sigma^2}$`
* Priors on  `$\class{bleu}{\theta}=(\class{bleu}{\beta_0},\class{bleu}{\beta_1})$`
`$$[\class{bleu}{\beta_0},\class{bleu}{\beta_1}] = \mathcal{N}( \class{orange}{\mu_{prior}},  \class{orange}{\Lambda_{prior}}),$$`
with `$\Lambda_{prior}$` denoting the precision matrix, i.e `$\Sigma_{prior}^{-1}$`.

]

--
.pull-right[
<img src="img/DAG_norm.png" width="70%" />
]

---
template: analytic

## Regression model

### Posterior distribution

Prior:

`$$[\class{bleu}{\theta}] \propto \exp\left\lbrace  -\frac{1}{2} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})^\intercal    \class{orange}{\Lambda_{prior}} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})\right\rbrace,$$`

Likelihood:
$$[\class{vert}{y}\vert \class{bleu}{\theta}] \propto \exp\left\lbrace  -\frac{1}{2  \class{orange}{\sigma^2}} ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta})^\intercal ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta}) \right\rbrace    $$

Posterior shape:

`$$[\class{bleu}{\theta} \vert  \class{vert}{y}] \propto\exp\left\lbrace  -\frac{1}{2} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})^\intercal    \class{orange}{\Lambda_{prior}} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})-\frac{1}{2  \class{orange}{\sigma^2}} ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta})^\intercal ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta})\right\rbrace ,$$`

---
template: analytic

## Regression model

### Posterior identification

After some calculus

`$$[\class{bleu}{\beta_0},\class{bleu}{\beta_1}\vert  \class{vert}{y}] \sim \mathcal{N}(\mu_{post}, \Lambda_{post})$$`
with

`$$\Lambda_{post} =  \left ( \frac{\class{orange}{X^{\intercal}}\class{orange}{X}}{\class{orange}{\sigma^2}}+\class{orange}{\Lambda_{prior}}\right )$$`

`$$\mu_{post} =\left (  \frac{\class{orange}{X^{\intercal}}\class{orange}{X}}{\class{orange}{\sigma^2}}+\class{orange}{\Lambda_{prior}}\right )^{-1}\left(\frac{\class{orange}{X^{\intercal}}\class{bleu}{Y}}{\class{orange}{\sigma^2}} +\class{orange}{\Lambda_{prior}}\class{orange}{\mu_{prior}} \right)$$`

Again a conjugacy property.

With a normal prior, the posterior is also normal.

---
name: sampling 
# Sampling from posterior distribution

--
## Why was Bayesian approach so uncommon

- Analytical posterior distributions are rarely accessible

- Computing the normalizing constant in general

- Deterministic numerical integration  is possible but tends to behave poorly when the number of parameters increase
  
  - Stochastic integration methods : Monte Carlo methods

- Avoiding the need of the normalizing constant : sampling instead of integrating
  - an old idea which became possible with increasing computing performance

---
template: sampling

## Why a sample is mostly enough ?

### Quantities of interest

.center[
`$$E[\class{bleu}{p}\vert  \class{vert}{y}] =  ?,  \quad CI_{0.95}(\class{bleu}{p}\vert  \class{vert}{y}) =  ?, \quad E[\class{bleu}{h(p)}\vert  \class{vert}{y}] =  ?$$`
]

<img src="figMC1.png" width="60%" />
]

---
template: sampling

## Why a sample is mostly enough ?

### Quantities of interest

.center[
`$$E[\class{bleu}{p}\vert  \class{vert}{y}] =  ?,  \quad CI_{0.95}(\class{bleu}{p}\vert  \class{vert}{y}) =  ?, \quad E[\class{bleu}{h(p)}\vert  \class{vert}{y}] =  ?$$`
]

--
.pull-left[

### Assuming we got a sample

* `$E[\class{bleu}{p}\vert  \class{vert}{y}] \approx \frac{1}{n} \sum_{k=1}^n p_i$`
* `$CI_{0.95}(\class{bleu}{p}\vert  \class{vert}{y}) \approx [p_{(n*0.025)}, p_{(n*0.975)} ] ?$`
* `$E[\class{bleu}{h(p)}\vert  \class{vert}{y}] \approx \frac{1}{n} \sum_{k=1}^n h(p_i)$`

]

--
.pull-reight[

```
##       Sum    Theory     MC100    MC1000
##      Mean 0.3571429 0.3504429 0.3541633
## 5%  CIInf 0.2166169 0.2182523 0.2087176
## 95% CISup 0.5094782 0.5189990 0.5085549
```

---
template: sampling

## Importance sampling

### Key idea : mathematical point of view

`$$E_{d_X}(\class{bleu}{h(X)})  =\int_u \class{bleu}{h(u)} \, d_{X}(u) du$$`
--

`\begin{align*}
E_{d_X}(\class{bleu}{h(X)}) & = \int_u \class{bleu}{h(u)} \, d_{X}(u) du \cr
                                & = \int_u  h(u)\, \frac{d_{X}(u)}{\class{vert}{d_Z(u)}  } \class{vert}{d_Z(u)} du\cr 
                                & =\int_u \class{rouge}{ h(u)\, \frac{d_{X}(u)}{d_Z(u)}  } d_Z(u) du \cr 
                                & = E_{\class{orange}{d_Z}}\left(\class{rouge}{ h(Z) \frac{d_{X}(Z)}{d_Z(Z)}}\right)
\end{align*}`

---

## Importance sampling algorithm

1. Sample from proposal distribution : `$\class{orange}{(z_i)_{i=1,\ldots N}}$`.

--
2. For every particle `$i$`, Compute weight 
`$$\class{rouge}{w_i} = {d_X(z_i)/d_Z(z_i)}$$`

--
3. Normalize weight
`$$\class{rouge}{\tilde{w}_i} = \frac{w_i}{\sum_{i=1}^{N} w_i}$$`

--
`$(\class{orange}{z_i},\class{rouge}{\tilde{w}_i})$` is a weighted sample from `$d_X$`.

--
4. Resample to get unweighted sample `$\class{bleu}{(x_i)_{1\leq i \leq n}}$`.

Sample in `$\class{oange}{(z_i)}$` with replacement with a probability `$\class{rouge}{\tilde{w}_i}$` to get  `$\class{bleu}{(x_i)_{1\leq i \leq n}}$`.

---

## Importance sampling algorithm remarks

* If `$d_X$` is known up to constant, (for instance . `$[p\vert y ]\propto [y\vert p] [p]$`)

`$d_X = K \tilde{d}_X,$`

Weights are defined as `$\omega_i = \tilde{d}_X(z_i) / d_Z(z_i)$`

Normalized weights are 
`$$\tilde{\omega}_i =\frac{\omega_i}{\sum_i\omega_i} = \frac{w_i}{\sum_{i=1}^{N} w_i}$$`

---
template: sampling 
### IS algorithm : graphical point of view 
1. Step 1 : sample from proposal ( `$N=100$` )

---
template: sampling 
### IS algorithm : graphical point of view 
2. Step 2 : compute weight ( `$N=100$` )

---
template: sampling 
### IS algorithm : graphical point of view 
3. Step 3 : Resample to get unweighted sample ( `$N=100$` )

---
template: sampling 
### IS algorithm : graphical point of view 
3. Step 3 : Resample to get unweighted sample ( `$N=100$` )

---
template: sampling 
### IS algorithm : graphical point of view  
3. Step 3 : Resample to get unweighted sample ( `$N=1000$` )

---
template: sampling 
### IS algorithm : graphical point of view  
3. Step 3 : Resample to get unweighted sample ( `$N=1000$` )

---
template: sampling

### Importance sampling conclusion

* Simple and appealing

* .rouge[But,  it is crucial to have a good proposition distribution.]
Which is not so straightforward

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

###  Markov chain definition
A Markov chain is a sequence of random variables `$(X_1, \ldots,X_n)$` verifying the Markov property. --
`$$[X_{i+1}\vert \class{rouge}{X_{1:i}}] = [X_{i+1}\vert \class{rouge}{X_i}].$$`
--

#### Some examples
- Ex1: Head or tail: +1 if tail, -1 if head. `$X_n$` being the fortune at time `$n$`, `$(X_n)$` is a Markov chain.
- Ex2: Movement on a grid: at time `$n$`, the player moves on one adjacent cell with equal probability. `$X_n$` being the position at time `$n$`, `$(X_n)$` is a Markov chain.

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)
### Markov chain example - Ex1

Random walk 
`$$X_{n+1} = X_{n} + E_{n+1}, \quad E_{n+1}\overset{ind}{\sim}\mathcal{U}(\{-1,1\})$$`
`$(X_n)$` is a Markov chain.

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Markov chain example - Counterexample

Random walk 
`$$X_{n+1} = X_{n} + E_{n+1}, \quad E_{n+1}\overset{ind}{\sim}\mathcal{U}(\{-1,1\})$$`

Supremum of a random walk 
`$Z_n = max_{k=1}^n X_k,$` `$(Z_n)$` is not a Markov chain.

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Markov chain vocabulary

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Markov chain example - Ex1

Random walk 
`$$X_0=0, \quad X_{n+1} = X_{n} + E_{n+1}, \quad E_{n+1}\overset{ind}{\sim}\mathcal{U}(\{-1,1\})$$`
`$(X_n)$` is a Markov chain.

with 
- State space 
--
 `$\mathcal{S}=\mathbb{Z},$`

- Initial distribution 
--
`$\nu_0=1_{\left\lbrace x=0\right \rbrace},$`

- Transition distribution
--
 `$\pi(x, x+1) = 1- \pi(x, x-1) = .5$`

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Markov chain properties

Let `$X_n$` be an homogeneous Markov chain on state space `$\mathcal{S}$` with initial distribution `$\nu_0$` and transition `$\pi$`.

`$$\begin{aligned}
\mathbb{P}(X_1 =x) & = \int_{x_0}    \mathbb{P}(X_0=x_0, X_1 = x) d_{x_0}\\
& = \int_{x_0} \pi(x_0, x) \nu_{0}(x_0) d_{x_0}\\
\end{aligned}$$`

On note `$\nu_0 \pi$` la loi de `$X_1$`.

`$$\begin{aligned}
\mathbb{P}(X_n =x) & = \int_{x_0} \int_{x_1}  \ldots \int_{x_{n-1}}\mathbb{P}(X_0=x_0, X_1 = x_1, \ldots, X_{n-1}=x_{n-1}, X_n = x) d_{x_0}d_{x_1} \ldots d_{x_{n-1}}\\
& =\int_{x_0} \int_{x_1}  \ldots \int_{x_{n-1}} \nu_0(x_0) \pi(x_0, x_1)  \ldots \pi(x_{n-1}, x)  d_{x_0}d_{x_1} \ldots d_{x_{n-1}}\\
\end{aligned}$$`

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Markov chain properties

.bleu[Definition]: `$\nu$` is a stationary distribution if and only if 
`$$X_i \sim \nu \Longrightarrow X_{i+1} \sim \nu$$`

.pull-left[
Example On/Off:
Let `$p = (p_0, p_1),$` and let `$X_n$` be defined by 
`$$X_0 \sim \mathcal{B}(p_{init}), \quad X_{n+1}\vert X_n \sim \mathcal{B}(p_{X_n})$$`
]

]

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

###  Markov chain properties

.rouge[Ergodic property]
If a Markov chain `$(X_i)$` is irreducible, aperiodic and recurrent then there is exists a unique stationnary distribution `$\pi$` and
$$[X_n] \underset{n\to \infty}{\longrightarrow } \pi . $$

If a Markov chain `$(X_i)$` is reversible ( `$[X_i][X_{i+1}\vert X_i] = [X_{i+1}] [X_i\vert X_{i+1}]$` ) then this Markov chain admits a stationary distribution.

### Consequences of the ergodic theorem

If `$(X_n)$` is a Markov chain with stationary distribution, for any initial `$\nu_0$`, `$[X_n]$` is close to the stationary distribution.

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Consequences of the ergodic theorem

]

---
template: sampling

## Monte Carlo Markov Chain algorithm (MCMC)

### Metropolis Hastings algorithm

.rouge[Key idea:] building a reversible Markov chain with `$[\theta\vert y]$` as stationnary distribution

1. Initialization   `$\theta^{(0)}$` an admissible initial value

2. For i in 1:nIter
  + Propose a new candidate value `$\theta_c^{(i)}$` sampled from a proposal distribution `$g(. \vert \theta^{(i-1)})$`
  + Compute Metropolis Hastings ratio
    $$ r_i=\class{bleu}{\frac{[y\vert \theta_c^{(i)}] [\theta_c^{(i)} ] }{[y\vert \theta^{(i-1)}] [\theta^{(i-1)} ] } } \class{rouge}{\frac{g(\theta^{(i-1)} \vert \theta_c^{(i)})}{g(\theta_c^{(i)} \vert \theta^{(i-1)})}}$$
    + Define 
    $$ \theta^{(i)} =\left \lbrace
    \begin{array}{l}
    \theta_c^{(i)} \mbox{ with probability } min( r_i,1) \cr
    \theta^{(i-1)} \mbox{ with probability } 1-min( r_i,1) \cr
    \end{array}\right.$$`

---
template: sampling

## Sampling from  a bivariate normal

Target distribution : \class{orange}{ A bivariate normal distribution}

![](index_files/figure-html/unnamed-chunk-6-1.png)

---
name: fish
# Finally an example
<img class="logopos_right" src="fish_logo.png" style="width:4%">

--
## Capture efficiency estimation

20 fishes have been released around a trap, and 5 of them are caught.

1. Propose a model of the experiment
2. Define the parameter(s) of interest
3. Characterize the prosterior distribution
  - analytically
  - by simulation
  
--

## Capture efficiency estimation using jags

see [capture.R](capture.R)

---
template: fish

## Capture Recapture

Experimental design:

- `$M=50$` fishes are marked and released in a lake.

- a trap is set in the lake
  - `$R=25$` fish caught are marked
  - `$R=133$` fish caught are unmarked
  
How to estimate the total size of the population ?