class: center, middle, inverse, title-slide # Bayesian calculus ## Practical methods ### Marie-Pierre Etienne, Etienne Rivot ###
https://marieetienne.github.io/bayesian_and_ecology
### M2 Mode - Data science --- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { extensions: ["color.js", "HTML.js"] }}); </script> <!-- F1D763 --> <!-- F7A913 --> <!-- C94326 --> <!-- 1F908E --> <!-- 33658A --> <!-- # Why do I enjoy research in statistics ? --> <!-- -- --> --- name: intro # Introduction --- template: intro ## Previously in Bayes course <img src="img/bayes_prior_post.png" width="70%" /> --- template: intro ## Lecture's goal A posterior distribution is defined by $$ [\theta \vert y ] = \frac{[y \vert \theta ][\theta]}{[ y ]} \mbox{ with } [ y ] = \int_{\theta} [y \vert \theta ][\theta] d\theta.$$ -- .care[ [y] is mostly unavailable.] `$$[\theta \vert y ] \varpropto [y \vert \theta ][\theta][ y ]$$` -- ### But nothing can stop us!! `$$[\theta \vert y ] = ?$$` --- name: analytic # Analytical posterior: the magic <img class="logopos_right" src="magic.png" style="width:4%"> -- ## Binomial example ### Model presentation .pull-left[ * Data model: `$$\class{bleu}{Y} \sim \mathcal{B}(\class{orange}{n}, \class{bleu}{p}), \quad n \mbox{ known }$$` * Parameter: `\(\class{bleu}{p}\)`. * Specifying an uniform prior `$$\class{bleu}{p} \sim \mathcal{U}(0,1)$$` ] -- .pull-right[ <img src="img/DAG_binom_1.png" width="70%" /> ] -- `$$[p\vert y ] = ?$$` --- template: analytic ## Binomial example ### Posterior determination .rouge[What we know] `$$[p\vert y ] = \frac{[y\vert p] [p]}{[y]}$$` -- - Likelihood: `\([y\vert p] = {n \choose y} p^y (1-p)^{n-y}\)` - prior: `\([p] = 1_{0\leq p \leq 1}\)` - .rouge[predictive]: `\([y] = \int_{0}^1 {n \choose y} p^y (1-p)^{n-y}\)` -- `\([y]\)` is a constant but not necessarily easy to compute. --- template: analytic ## Binomial example ### Posterior identification Posterior shape: $$ [p\vert y] \propto K p^y (1-p)^{n-y} 1_{0\leq p \leq 1},$$ with `\(K\)` a constant (in terms of `\(p\)`) to be determined. -- #### Beta distribution `\(X\)` follows a Beta distribution with shape parameter `\(a\)` and shape 2 parameter `\(b\)`, if `\(X\)` admits a density `\(d_X\)` given by `$$d_X(t)=\left\lbrace \begin{array}{l} \frac{\Gamma(a+b) }{\Gamma(a)\Gamma(b)} t^{a-1} (1-t)^{b-1} \quad t\in [0,1] ,\cr 0 \quad \mbox{otherwise}\cr \end{array}\right.$$` See [Cheatsheet distribution](https://marieetienne.github.io/slides/LoiProbas.pdf) --- template: analytic ## Binomial example ### Posterior identification .pull-left[ * Data model: `$$\class{vert}{Y} \sim \mathcal{B}(\class{orange}{n}, \class{bleu}{p}), \quad n \mbox{ known }$$` * Parameter: `\(\class{bleu}{p}\)`. * Specifying an uniform prior `$$\class{bleu}{p} \sim \mathcal{B}eta(1,1)$$` ] .pull-right[ <img src="img/DAG_binom_1.png" width="70%" /> ] -- $$ p\vert y \sim \mathcal{B}eta(\class{vert}{y}+1, \class{orange}{n}-\class{vert}{y}+1) $$ --- template: analytic ## Binomial example ### Information within the data `$$\class{vert}{Y} \sim \mathcal{B}(\class{orange}{n}, \class{bleu}{p}), \quad n \mbox{ known ;} \quad \class{bleu}{p} \sim \mathcal{B}eta(1,1)$$` -- .pull-left[ #### Example 1 `\(n=10, y=4\)` <img src="post_beta_10_4.png" alt="Posterior beta" style="width:70%" class = "centerimg" /> $$ p\vert y \sim \mathcal{B}eta(\class{vert}{4}+1, \class{orange}{10}-\class{vert}{4}+1) $$ ] -- .pull-right[ #### Example 2 `\(n=100, y=40\)` <img src="post_beta_100_40.png" alt="Posterior beta" style="width:70%" class = "centerimg" /> $$ p\vert y \sim \mathcal{B}eta(\class{vert}{40}+1, \class{orange}{100}-\class{vert}{40}+1) $$ ] --- template: analytic ## Binomial example ### Some aditional remarks - A uniform distribution is a `\(\mathcal{B}eta(1,1)\)` distribution. - Conjugacy: `$$p\sim \mathcal{B}eta(1,1),\quad Y\sim\mathcal{B}(n,p), \quad p\vert y \sim \mathcal{B}eta(y+1,n-y+1)$$` -- * Interpretation of parameter `\(a\)` and `\(b\)` as a number of success and failures. * What is the value of `\([y]\)` in this case ? --- template: analytic ## Regression model -- .pull-left[ * Model : `\(\class{vert}{Y_k}=\class{bleu}{\beta_0} + \class{bleu}{\beta_1} \class{orange}{x_k} +E_k, \quad E_k\overset{ind}{\sim}\mathcal{N}(0,\class{orange}{\sigma^2})\)` * Parameter: `\(\class{bleu}{\beta_0}, \class{bleu}{\beta_1}\)`. * Parameter assumed to be known: `\(\class{orange}{\sigma^2}\)` * Priors on `\(\class{bleu}{\theta}=(\class{bleu}{\beta_0},\class{bleu}{\beta_1})\)` `$$[\class{bleu}{\beta_0},\class{bleu}{\beta_1}] = \mathcal{N}( \class{orange}{\mu_{prior}}, \class{orange}{\Lambda_{prior}}),$$` with `\(\Lambda_{prior}\)` denoting the precision matrix, i.e `\(\Sigma_{prior}^{-1}\)`. ] -- .pull-right[ <img src="img/DAG_norm.png" width="70%" /> ] --- template: analytic ## Regression model ### Posterior distribution Prior: `$$[\class{bleu}{\theta}] \propto \exp\left\lbrace -\frac{1}{2} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})^\intercal \class{orange}{\Lambda_{prior}} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})\right\rbrace,$$` Likelihood: $$[\class{vert}{y}\vert \class{bleu}{\theta}] \propto \exp\left\lbrace -\frac{1}{2 \class{orange}{\sigma^2}} ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta})^\intercal ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta}) \right\rbrace $$ -- Posterior shape: `$$[\class{bleu}{\theta} \vert \class{vert}{y}] \propto\exp\left\lbrace -\frac{1}{2} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})^\intercal \class{orange}{\Lambda_{prior}} (\class{bleu}{\theta}- \class{orange}{\mu_{prior}})-\frac{1}{2 \class{orange}{\sigma^2}} ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta})^\intercal ( \class{vert}{y}- \class{orange}{X}\class{bleu}{\theta})\right\rbrace ,$$` --- template: analytic ## Regression model ### Posterior identification After some calculus `$$[\class{bleu}{\beta_0},\class{bleu}{\beta_1}\vert \class{vert}{y}] \sim \mathcal{N}(\mu_{post}, \Lambda_{post})$$` with `$$\Lambda_{post} = \left ( \frac{\class{orange}{X^{\intercal}}\class{orange}{X}}{\class{orange}{\sigma^2}}+\class{orange}{\Lambda_{prior}}\right )$$` `$$\mu_{post} =\left ( \frac{\class{orange}{X^{\intercal}}\class{orange}{X}}{\class{orange}{\sigma^2}}+\class{orange}{\Lambda_{prior}}\right )^{-1}\left(\frac{\class{orange}{X^{\intercal}}\class{bleu}{Y}}{\class{orange}{\sigma^2}} +\class{orange}{\Lambda_{prior}}\class{orange}{\mu_{prior}} \right)$$` -- Again a conjugacy property. With a normal prior, the posterior is also normal. --- name: sampling # Sampling from posterior distribution <img class="logopos_right" src="compute.png" style="width:5%"> -- ## Why was Bayesian approach so uncommon - Analytical posterior distributions are rarely accessible -- - Computing the normalizing constant in general - Deterministic numerical integration is possible but tends to behave poorly when the number of parameters increase - Stochastic integration methods : Monte Carlo methods -- - Avoiding the need of the normalizing constant : sampling instead of integrating - an old idea which became possible with increasing computing performance --- template: sampling ## Why a sample is mostly enough ? ### Quantities of interest .center[ `$$E[\class{bleu}{p}\vert \class{vert}{y}] = ?, \quad CI_{0.95}(\class{bleu}{p}\vert \class{vert}{y}) = ?, \quad E[\class{bleu}{h(p)}\vert \class{vert}{y}] = ?$$` ] -- .pull-left[ <img src="figMC1.png" width="60%" /> ] --- template: sampling ## Why a sample is mostly enough ? ### Quantities of interest .center[ `$$E[\class{bleu}{p}\vert \class{vert}{y}] = ?, \quad CI_{0.95}(\class{bleu}{p}\vert \class{vert}{y}) = ?, \quad E[\class{bleu}{h(p)}\vert \class{vert}{y}] = ?$$` ] .center[ .rouge[Monte Carlo methods]] -- .pull-left[ ### Assuming we got a sample <img src="figMC2.png" width="60%" /> * `\(E[\class{bleu}{p}\vert \class{vert}{y}] \approx \frac{1}{n} \sum_{k=1}^n p_i\)` * `\(CI_{0.95}(\class{bleu}{p}\vert \class{vert}{y}) \approx [p_{(n*0.025)}, p_{(n*0.975)} ] ?\)` * `\(E[\class{bleu}{h(p)}\vert \class{vert}{y}] \approx \frac{1}{n} \sum_{k=1}^n h(p_i)\)` ] -- .pull-reight[ .center[ .rouge[ Monte Carlo Approximation ] ] ] ``` ## Sum Theory MC100 MC1000 ## Mean 0.3571429 0.3504429 0.3541633 ## 5% CIInf 0.2166169 0.2182523 0.2087176 ## 95% CISup 0.5094782 0.5189990 0.5085549 ``` --- template: sampling ## Importance sampling ### Key idea : mathematical point of view `$$E_{d_X}(\class{bleu}{h(X)}) =\int_u \class{bleu}{h(u)} \, d_{X}(u) du$$` -- `\begin{align*} E_{d_X}(\class{bleu}{h(X)}) & = \int_u \class{bleu}{h(u)} \, d_{X}(u) du \cr & = \int_u h(u)\, \frac{d_{X}(u)}{\class{vert}{d_Z(u)} } \class{vert}{d_Z(u)} du\cr & =\int_u \class{rouge}{ h(u)\, \frac{d_{X}(u)}{d_Z(u)} } d_Z(u) du \cr & = E_{\class{orange}{d_Z}}\left(\class{rouge}{ h(Z) \frac{d_{X}(Z)}{d_Z(Z)}}\right) \end{align*}` --- template: sampling ## Importance sampling algorithm 1. Sample from proposal distribution : `\(\class{orange}{(z_i)_{i=1,\ldots N}}\)`. -- 2. For every particle `\(i\)`, Compute weight `$$\class{rouge}{w_i} = {d_X(z_i)/d_Z(z_i)}$$` -- 3. Normalize weight `$$\class{rouge}{\tilde{w}_i} = \frac{w_i}{\sum_{i=1}^{N} w_i}$$` -- `\((\class{orange}{z_i},\class{rouge}{\tilde{w}_i})\)` is a weighted sample from `\(d_X\)`. -- 4. Resample to get unweighted sample `\(\class{bleu}{(x_i)_{1\leq i \leq n}}\)`. Sample in `\(\class{oange}{(z_i)}\)` with replacement with a probability `\(\class{rouge}{\tilde{w}_i}\)` to get `\(\class{bleu}{(x_i)_{1\leq i \leq n}}\)`. --- template: sampling ## Importance sampling algorithm remarks * If `\(d_X\)` is known up to constant, (for instance . `\([p\vert y ]\propto [y\vert p] [p]\)`) `\(d_X = K \tilde{d}_X,\)` Weights are defined as `\(\omega_i = \tilde{d}_X(z_i) / d_Z(z_i)\)` Normalized weights are `$$\tilde{\omega}_i =\frac{\omega_i}{\sum_i\omega_i} = \frac{w_i}{\sum_{i=1}^{N} w_i}$$` --- template: sampling ### IS algorithm : graphical point of view 1. Step 1 : sample from proposal ( `\(N=100\)` ) <img src="IS1.png" alt="IS1" style="height:60%" class = "centerimg" /> --- template: sampling ### IS algorithm : graphical point of view 2. Step 2 : compute weight ( `\(N=100\)` ) <img src="IS2.png" alt="IS2" style="height:60%" class = "centerimg" /> --- template: sampling ### IS algorithm : graphical point of view 3. Step 3 : Resample to get unweighted sample ( `\(N=100\)` ) <img src="IS3.png" alt="IS3" style="height:60%" class = "centerimg" /> --- template: sampling ### IS algorithm : graphical point of view 3. Step 3 : Resample to get unweighted sample ( `\(N=100\)` ) <img src="IS4.png" alt="IS4" style="height:60%" class = "centerimg" /> --- template: sampling ### IS algorithm : graphical point of view 3. Step 3 : Resample to get unweighted sample ( `\(N=1000\)` ) <img src="IS3_1000.png" alt="IS3_1000" style="height:60%" class = "centerimg" /> --- template: sampling ### IS algorithm : graphical point of view 3. Step 3 : Resample to get unweighted sample ( `\(N=1000\)` ) <img src="IS4_1000.png" alt="IS4_1000" style="height:60%" class = "centerimg" /> --- template: sampling ### Importance sampling conclusion * Simple and appealing * .rouge[But, it is crucial to have a good proposition distribution.] Which is not so straightforward --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain definition A Markov chain is a sequence of random variables `\((X_1, \ldots,X_n)\)` verifying the Markov property. -- `$$[X_{i+1}\vert \class{rouge}{X_{1:i}}] = [X_{i+1}\vert \class{rouge}{X_i}].$$` -- #### Some examples - Ex1: Head or tail: +1 if tail, -1 if head. `\(X_n\)` being the fortune at time `\(n\)`, `\((X_n)\)` is a Markov chain. - Ex2: Movement on a grid: at time `\(n\)`, the player moves on one adjacent cell with equal probability. `\(X_n\)` being the position at time `\(n\)`, `\((X_n)\)` is a Markov chain. --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain example - Ex1 Random walk `$$X_{n+1} = X_{n} + E_{n+1}, \quad E_{n+1}\overset{ind}{\sim}\mathcal{U}(\{-1,1\})$$` `\((X_n)\)` is a Markov chain. <img src="RW1.png" alt="RW1" style="width:40%" class = "centerimg" /> --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain example - Counterexample Random walk `$$X_{n+1} = X_{n} + E_{n+1}, \quad E_{n+1}\overset{ind}{\sim}\mathcal{U}(\{-1,1\})$$` Supremum of a random walk `\(Z_n = max_{k=1}^n X_k,\)` `\((Z_n)\)` is not a Markov chain. <img src="SuppRW1.png" alt="SuppRW1" style="width:40%" class = "centerimg" /> --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain vocabulary .rouge[State space] `\(\mathcal{S}\)`: the set of possible values for `\((X_n)\)`. .rouge[Initial distribution] `\(\nu_0\)`: the distribution for `\(X_0.\)` .rouge[Transition distribution] `\(\pi^{(i)}\)`: the distribution for `\(X_{i+1} \vert X_{i}\)` .rouge[Homogeneity] if `\(\pi^{(i)}\)` does not depend on `\(i\)` the MC is said homogeneous. --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain example - Ex1 Random walk `$$X_0=0, \quad X_{n+1} = X_{n} + E_{n+1}, \quad E_{n+1}\overset{ind}{\sim}\mathcal{U}(\{-1,1\})$$` `\((X_n)\)` is a Markov chain. with - State space -- `\(\mathcal{S}=\mathbb{Z},\)` - Initial distribution -- `\(\nu_0=1_{\left\lbrace x=0\right \rbrace},\)` - Transition distribution -- `\(\pi(x, x+1) = 1- \pi(x, x-1) = .5\)` --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain properties Let `\(X_n\)` be an homogeneous Markov chain on state space `\(\mathcal{S}\)` with initial distribution `\(\nu_0\)` and transition `\(\pi\)`. .rouge[ Distribution of `\(X_1\)`?] -- `$$\begin{aligned} \mathbb{P}(X_1 =x) & = \int_{x_0} \mathbb{P}(X_0=x_0, X_1 = x) d_{x_0}\\ & = \int_{x_0} \pi(x_0, x) \nu_{0}(x_0) d_{x_0}\\ \end{aligned}$$` -- On note `\(\nu_0 \pi\)` la loi de `\(X_1\)`. -- .rouge[ Distribution of `\(X_n\)`?] `$$\begin{aligned} \mathbb{P}(X_n =x) & = \int_{x_0} \int_{x_1} \ldots \int_{x_{n-1}}\mathbb{P}(X_0=x_0, X_1 = x_1, \ldots, X_{n-1}=x_{n-1}, X_n = x) d_{x_0}d_{x_1} \ldots d_{x_{n-1}}\\ & =\int_{x_0} \int_{x_1} \ldots \int_{x_{n-1}} \nu_0(x_0) \pi(x_0, x_1) \ldots \pi(x_{n-1}, x) d_{x_0}d_{x_1} \ldots d_{x_{n-1}}\\ \end{aligned}$$` --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Markov chain properties .bleu[Definition]: `\(\nu\)` is a stationary distribution if and only if `$$X_i \sim \nu \Longrightarrow X_{i+1} \sim \nu$$` -- .pull-left[ Example On/Off: Let `\(p = (p_0, p_1),\)` and let `\(X_n\)` be defined by `$$X_0 \sim \mathcal{B}(p_{init}), \quad X_{n+1}\vert X_n \sim \mathcal{B}(p_{X_n})$$` ] .pull-right[ <img src="sim1.png" alt="Simulation" style="width:90%" class = "centerimg" /> ] --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) .rouge[Distribution of `\(X_1\)`, `\(X_2\)`, ... ?] ### Markov chain properties .rouge[Ergodic property] If a Markov chain `\((X_i)\)` is irreducible, aperiodic and recurrent then there is exists a unique stationnary distribution `\(\pi\)` and $$[X_n] \underset{n\to \infty}{\longrightarrow } \pi . $$ If a Markov chain `\((X_i)\)` is reversible ( `\([X_i][X_{i+1}\vert X_i] = [X_{i+1}] [X_i\vert X_{i+1}]\)` ) then this Markov chain admits a stationary distribution. ### Consequences of the ergodic theorem If `\((X_n)\)` is a Markov chain with stationary distribution, for any initial `\(\nu_0\)`, `\([X_n]\)` is close to the stationary distribution. --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) .rouge[Back to the example on/off:] <img src="sim1_tot.png" alt="Simulation" style="height:30%" class = "centerimg" /> --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) .rouge[Back to the example on/off:] ### Consequences of the ergodic theorem .pull-left[ <img src="distrib_90.png" alt="Simulation" style="width:90%" class = "centerimg" /> ] .pull-right[ <img src="distrib_100.png" alt="Simulation" style="width:90%" class = "centerimg" /> ] --- template: sampling ## Monte Carlo Markov Chain algorithm (MCMC) ### Metropolis Hastings algorithm .rouge[Key idea:] building a reversible Markov chain with `\([\theta\vert y]\)` as stationnary distribution -- 1. Initialization `\(\theta^{(0)}\)` an admissible initial value -- 2. For i in 1:nIter + Propose a new candidate value `\(\theta_c^{(i)}\)` sampled from a proposal distribution `\(g(. \vert \theta^{(i-1)})\)` + Compute Metropolis Hastings ratio $$ r_i=\class{bleu}{\frac{[y\vert \theta_c^{(i)}] [\theta_c^{(i)} ] }{[y\vert \theta^{(i-1)}] [\theta^{(i-1)} ] } } \class{rouge}{\frac{g(\theta^{(i-1)} \vert \theta_c^{(i)})}{g(\theta_c^{(i)} \vert \theta^{(i-1)})}}$$ + Define $$ \theta^{(i)} =\left \lbrace \begin{array}{l} \theta_c^{(i)} \mbox{ with probability } min( r_i,1) \cr \theta^{(i-1)} \mbox{ with probability } 1-min( r_i,1) \cr \end{array}\right.$$` --- template: sampling ## Sampling from a bivariate normal Target distribution : \class{orange}{ A bivariate normal distribution} ![](index_files/figure-html/unnamed-chunk-6-1.png)<!-- --> --- name: fish # Finally an example <img class="logopos_right" src="fish_logo.png" style="width:4%"> -- ## Capture efficiency estimation 20 fishes have been released around a trap, and 5 of them are caught. 1. Propose a model of the experiment 2. Define the parameter(s) of interest 3. Characterize the prosterior distribution - analytically - by simulation -- ## Capture efficiency estimation using jags see [capture.R](capture.R) --- template: fish ## Capture Recapture Experimental design: - `\(M=50\)` fishes are marked and released in a lake. - a trap is set in the lake - `\(R=25\)` fish caught are marked - `\(R=133\)` fish caught are unmarked How to estimate the total size of the population ?