---
title: "STAT 491 - Lecture 7"
output: pdf_document
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
set.seed(02112018)
```

# Markov Chain Monte Carlo (MCMC) and JAGS

In the previous section, we saw that a beta distribution was a conjugate prior for a Bernoulli sampling model, meaning that the posterior was also a beta distribution. This prior specification allows easy posterior computations because the integration in the denominator of Bayes rule
\vfill

In many situations, this type of prior is not available and we need to use other means to understand the posterior distribution $p(\theta|y)$. MCMC is a tool for taking samples from the posterior when
\vfill

## Approximating a Distribution with a Large Sample

Previously we computed the mean and variance of distributions using Monte Carlo techniques. Now consider taking samples to visualize an entire distribution.

```{r, echo=F}
x <- seq(0, 1, by = .01)
a <- 10
b <- 27
par(mfrow=c(2,2))
x.max <- max(dbeta(x,a,b)) + 1
plot(x, dbeta(x,a,b), type='l', lwd=2, ylim = c(0, x.max), ylab='', xlab='',
     main = paste('Beta(',a,',',b,')'))
sample.10 <- rbeta(10,a,b)
hist(sample.10, freq = F, breaks='FD', ylim=c(0,x.max), xlab='', ylab='',
     main = "Histogram with 10 samples", xlim=c(0,1))
sample.100 <- rbeta(100,a,b)
hist(sample.100, freq = F, breaks='FD', ylim=c(0,x.max), xlab='', ylab='',
     main = "Histogram with 100 samples", xlim=c(0,1))
sample.1000 <- rbeta(1000,a,b)
hist(sample.1000, freq = F, breaks='FD', ylim=c(0,x.max), xlab='', ylab='',
     main = "Histogram with 1000 samples", xlim=c(0,1))
```

As the sample size gets larger, the histogram representation begins to look more like the true distribution. Additionally, the moments of the sampled distribution approach those of the true distribution. The true mean is $\frac{a}{a+b}$ = `r round(a / (a+b),3)`, and quantiles can be computed using `qbeta(.025,a,b)` = `r round(qbeta(.025,a,b),3)` and `qbeta(.975,a,b)` = `r round(qbeta(.975,a,b),3)`.

- with 10 samples: the mean is \vfill
- with 100 samples: the mean is \vfill
- with 1000 samples: the mean is \vfill
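These summaries can be computed directly from the stored draws. A minimal sketch, reusing the `sample.10`, `sample.100`, and `sample.1000` vectors created in the plotting chunk above:

```{r}
# Monte Carlo estimates of the mean and a central 95% interval
round(c(mean(sample.10),   quantile(sample.10,   c(.025, .975))), 3)
round(c(mean(sample.100),  quantile(sample.100,  c(.025, .975))), 3)
round(c(mean(sample.1000), quantile(sample.1000, c(.025, .975))), 3)
```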
\newpage

## The Metropolis Algorithm

In many cases we cannot sample directly from a posterior distribution using, for example, `rbeta()`, so we need to use Markov chain Monte Carlo (MCMC).
\vfill

### Politician stumbles across the Metropolis algorithm

DBDA provides a nice intuitive overview of a special kind of MCMC algorithm called the Metropolis algorithm.

- The elected politician lives on a chain of islands and wants to stay in the public eye by traveling from island to island.
\vfill
- At the end of each day the politician needs to decide whether to:
    1. \vfill
    2. \vfill
    3. \vfill
- The politician's goal is to visit all islands proportional to their population, so most time is spent on the most populated islands. Unfortunately, his office doesn't know the total population of the island chain. When visiting (or proposing to visit) an island, the politician can ask the local mayor for the population of the island.
\vfill
- The politician has a simple heuristic for deciding whether to travel to a proposed island,
\vfill
- *Q:* after flipping the coin, what criteria should the politician use to determine whether to visit a neighboring island or stay at the current island?
\vfill
- *Q:* now consider two cases and determine what decision the politician should make:
    1. the neighboring island \vfill
    2. the neighboring island \vfill
- The politician calculates $prob.move =$ \vfill
- This heuristic actually works in the long run, as the probability that the politician is on any one island is exactly the relative population of that island.
\vfill

\newpage

```{r}
par(mfcol=c(2,2))
# set up islands and relative population
num.islands <- 5
relative.population <- c(.1, .1, .4, .3, .1)
barplot(relative.population, names.arg = as.character(1:5), main='Relative Population')

# initialize politician
num.steps <- 10000
island.location <- rep(1, num.steps) # start at first island

# algorithm
for (i in 2:num.steps){
  direction <- sample(c('right','left'), 1)
  if (direction == 'right'){
    proposed.island <- island.location[i-1] + 1
    if (proposed.island == 6) {
      island.location[i] <- island.location[i-1] # no island 6 exists, stay at island 5
    } else {
      prob.move <- relative.population[proposed.island] / relative.population[island.location[i-1]]
      if (runif(1) < prob.move){ # move
        island.location[i] <- proposed.island
      } else { # stay
        island.location[i] <- island.location[i-1]
      }
    }
  }
  if (direction == 'left'){
    proposed.island <- island.location[i-1] - 1
    if (proposed.island == 0) {
      island.location[i] <- island.location[i-1] # no island 0 exists, stay at island 1
    } else {
      prob.move <- relative.population[proposed.island] / relative.population[island.location[i-1]]
      if (runif(1) < prob.move){ # move
        island.location[i] <- proposed.island
      } else { # stay
        island.location[i] <- island.location[i-1]
      }
    }
  }
}
barplot(table(island.location) / num.steps, names.arg = as.character(1:5),
        main='10,000 Politician Steps')
plot(island.location[1:15], ylim=c(1,5), ylab='Island Number', xlab='Step Number',
     pch=16, type='b', main='First 15 steps')
plot(island.location[1:100], ylim=c(1,5), ylab='Island Number', xlab='Step Number',
     pch=16, type='b', main='First 100 steps')
```
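As a quick numerical check of the long-run claim (a sketch reusing the `island.location`, `num.islands`, `num.steps`, and `relative.population` objects from the chunk above), the empirical visit frequencies can be placed alongside the target relative populations:

```{r}
# empirical visit frequencies vs. target relative populations
visit.freq <- tabulate(island.location, nbins = num.islands) / num.steps
round(cbind(island = 1:num.islands, empirical = visit.freq,
            target = relative.population), 3)
```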
### More details about the Markov chain

The procedure we have described is a Markov chain. As such, we can consider a few probabilities. Let $l(i)$ be the politician's location at time $i$ and assume the politician begins at island 5:

- $Pr[l(1) = 5] =$ \vfill
- $Pr[l(2) = 1] =$ \vfill
- $Pr[l(2) = 2] =$ \vfill
- $Pr[l(2) = 3] =$ \vfill
- $Pr[l(2) = 4] =$ \vfill
- $Pr[l(2) = 5] =$ \vfill
\vfill

We can also think about the transition probabilities from state $i$ to state $j$.

```{r, echo=F}
current.state <- paste('at.', 1:5, sep='')
to.1 <- c(1 - 1/2 * min(relative.population[2] / relative.population[1], 1),
          1/2 * min(relative.population[1] / relative.population[2], 1), 0, 0, 0)
to.2 <- c(1/2 * min(relative.population[2] / relative.population[1], 1),
          1 - 1/2 * min(relative.population[1] / relative.population[2], 1) -
            1/2 * min(relative.population[3] / relative.population[2], 1),
          1/2 * min(relative.population[2] / relative.population[3], 1), 0, 0)
to.3 <- c(0, 1/2 * min(relative.population[3] / relative.population[2], 1),
          1 - 1/2 * min(relative.population[2] / relative.population[3], 1) -
            1/2 * min(relative.population[4] / relative.population[3], 1),
          1/2 * min(relative.population[3] / relative.population[4], 1), 0)
to.4 <- c(0, 0, 1/2 * min(relative.population[4] / relative.population[3], 1),
          1 - 1/2 * min(relative.population[3] / relative.population[4], 1) -
            1/2 * min(relative.population[5] / relative.population[4], 1),
          1/2 * min(relative.population[4] / relative.population[5], 1))
to.5 <- c(0, 0, 0, 1/2 * min(relative.population[5] / relative.population[4], 1),
          1 - 1/2 * min(relative.population[4] / relative.population[5], 1))
# blank out the computed values so the table can be filled in by hand
to.1 <- to.2 <- to.3 <- to.4 <- to.5 <- rep('', 5)
kable(data.frame(current.state, to.1, to.2, to.3, to.4, to.5), digits = 3,
      caption = 'transition probabilities for 5 island traveling politician example')
```

\newpage

### More details about Metropolis

- The process to determine the next location to propose is known as
\vfill
- Given a proposed site,
\vfill
- If a proposed site is lower than the current position,
\vfill

The key elements of this process are:

1. Generate a random value \vfill
2. Evaluate the target distribution \vfill
3. Generate a random variable \vfill

The ability to complete these three steps allows indirect sampling from the target distribution, even if it cannot be sampled directly (e.g. with `rnorm()`).
\vfill

Generally our target distribution will be the posterior distribution, $p(\theta|\mathcal{D})$.
\vfill

Furthermore, this process does not require a normalized distribution, which means we do not have to compute $p(\mathcal{D})$ in the denominator of Bayes rule, as it is the same for any $\theta$ and $\theta'$. Hence evaluating the target distribution amounts to evaluating $p(\mathcal{D}|\theta) \times p(\theta)$.
\vfill
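To make this point concrete, here is a minimal sketch (the function `log.post` and the data values are hypothetical, not part of the lecture example) showing that the Metropolis decision only requires the unnormalized log posterior at the current and proposed values of $\theta$:

```{r}
# unnormalized log posterior: log p(D | theta) + log p(theta)
# hypothetical data: z successes in N Bernoulli trials, with a Beta(a, b) prior
log.post <- function(theta, z = 4, N = 10, a = 1, b = 1){
  dbinom(z, N, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}
# log acceptance ratio for proposed theta = 0.3 vs. current theta = 0.2;
# the normalizing constant p(D) cancels in this difference
log.post(0.3) - log.post(0.2)
```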
The traveling politician example has:

- \vfill
- \vfill
- \vfill

This procedure also works more generally for:

- \vfill
- \vfill
- \vfill

\newpage

### Metropolis Sampler for Beta Prior and Bernoulli Likelihood

We will soon learn about JAGS for fitting Bayesian models, but these algorithms can also be written directly in R code. This will be demonstrated on the willow tit dataset, and the MCMC results will be compared with the analytical solution. In most cases, analytical solutions for the posterior are not possible and MCMC is typically used to make inferences from the posterior.

```{r, fig.align='center', fig.width=5}
# set prior parameters for beta distribution
a.prior <- 1
b.prior <- 1

# read in data
birds <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat491/data/willowtit2013.csv')
y <- birds$birds
N <- nrow(birds) # count number of trials
z <- sum(birds$birds)

# initialize algorithm
num.sims <- 10000
sigma.propose <- .1 # standard deviation of normal random walk proposal distribution
theta.accept <- rep(0, num.sims)
theta.current <- rep(1, num.sims)
theta.propose <- rep(1, num.sims)

for (i in 2:num.sims){
  # Step 1, propose new theta
  while(theta.propose[i] <= 0 | theta.propose[i] >= 1){
    theta.propose[i] <- theta.current[i-1] + rnorm(n = 1, mean = 0, sd = sigma.propose)
  }
  # Step 2, compute p.move - note this is on a log scale
  log.p.theta.propose <- sum(dbinom(y, 1, theta.propose[i], log = T)) +
    dbeta(theta.propose[i], a.prior, b.prior, log = T)
  log.p.theta.current <- sum(dbinom(y, 1, theta.current[i-1], log = T)) +
    dbeta(theta.current[i-1], a.prior, b.prior, log = T)
  log.p.move <- log.p.theta.propose - log.p.theta.current
  # Step 3, accept with probability proportional to p.move - still on log scale
  if (log(runif(1)) < log.p.move){
    theta.current[i] <- theta.propose[i]
    theta.accept[i] <- 1
  } else {
    theta.current[i] <- theta.current[i-1]
  }
}
par(mfcol=c(1,1))
plot(theta.current[1:20], type = 'b', pch=18, ylim=c(0,1), ylab = expression(theta),
     main = 'First 20 proposals', xlab='step number')
points(theta.propose[1:20], pch=1, col='red', cex=2)
legend('topright', legend = c('propose','accept'), col=c('red','black'),
       lty = c(NA,1), pch = c(1,18))
```

Now after viewing the first twenty steps, consider all steps.

```{r, fig.align='center', fig.width=5}
plot(theta.current, type = 'l', ylim=c(0,1), ylab = expression(theta),
     main = 'Trace Plot', xlab='step number')
```

\newpage

Now look at a histogram depiction of the distribution.

```{r}
par(mfrow=c(1,1))
library(ggplot2)
library(gridExtra)
df <- data.frame(theta.current)
hist.mcmc <- ggplot(df) +
  geom_histogram(aes(x=theta.current, y=..density..), bins = 250) +
  xlab(expression(theta)) + ylab(expression(paste('p(',theta,')',sep=''))) +
  ggtitle('MCMC Distribution') + xlim(0,1) + ylim(0,15)
theta <- seq(0.01, 0.99, by = .01)
p.theta <- dbeta(theta, a.prior + z, b.prior + N - z)
true.df <- data.frame(theta, p.theta)
curve.true <- ggplot(true.df) + geom_polygon(aes(x=theta, y=p.theta)) +
  xlab(expression(theta)) + ylab(expression(paste('p(',theta,')',sep=''))) +
  ggtitle('True Distribution') + ylim(0,15)
grid.arrange(hist.mcmc, curve.true, nrow=2)
```

In this case, we see that the distributions look very similar. In general with MCMC there are three goals:

1. The values \vfill
2. The chain \vfill
3. The chain \vfill
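As a rough numerical check related to these goals (a sketch, not a formal diagnostic, reusing objects from the sampler chunk above), the acceptance rate and the MCMC summaries can be compared with the analytical Beta posterior:

```{r}
# acceptance rate of the random walk proposals
mean(theta.accept[-1])
# MCMC estimates of the posterior mean and a central 95% interval
round(c(mean(theta.current), quantile(theta.current, c(.025, .975))), 3)
# analytical values from the Beta(a.prior + z, b.prior + N - z) posterior
round(c((a.prior + z) / (a.prior + b.prior + N),
        qbeta(c(.025, .975), a.prior + z, b.prior + N - z)), 3)
```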
\newpage

### JAGS

JAGS is a software package for conducting MCMC. We will run it through R, but note that you also need to download JAGS to your computer; you will not be able to reproduce this code or run other JAGS examples if JAGS has not been installed. There are a few common steps for running JAGS code, which are illustrated below.

1. Load the data and place it in a list object. The list will eventually be passed to JAGS.

```{r}
library(rjags)
library(runjags)
birds <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat491/data/willowtit2013.csv')
y <- birds$birds
N <- nrow(birds) # count number of trials
z <- sum(birds$birds)
dataList = list(y = y, Ntotal = N)
```

2. Specify the model as a text variable. While the code looks vaguely familiar, it too is executed in JAGS, not R. The model statement contains the likelihood piece, $p(y|\theta)$, written as a loop through the $N$ Bernoulli observations, and the prior, $p(\theta)$. Finally, the model is written out as a .txt file.

```{r}
modelString = "
model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dbern( theta ) # likelihood
  }
  theta ~ dbeta( 1 , 1 ) # prior
}
"
writeLines( modelString, con='TEMPmodel.txt')
```

3. Initialize the chains by specifying a starting point. This is akin to stating which island the politician will start on. It is often advantageous to run a few chains with different starting points to verify that they have the same end results.

```{r}
initsList <- function(){
  # function for initializing starting place of theta
  # RETURNS: list with random start point for theta
  return(list(theta = runif(1)))
}
```

4. Generate MCMC chains. Now we call JAGS to run the MCMC. The `jags.model()` function takes:
    - a file containing the model specification,
    - the data list,
    - the list containing the initialized starting points,
    - the number of chains to run, `n.chains`, and
    - `n.adapt`, which works to tune the algorithm.

```{r}
jagsModel <- jags.model( file = "TEMPmodel.txt", data = dataList, inits = initsList,
                         n.chains = 3, n.adapt = 500)
update(jagsModel, n.iter = 500)
```

The `update` statement results in what is called the burn-in period, which is essentially tuning the algorithm; those samples are ultimately discarded. Now we can run the algorithm for a little longer (let the politician walk around).

```{r}
codaSamples <- coda.samples( jagsModel, variable.names = c('theta'), n.iter = 3334)
```

5. Examine the results. Finally, we can look at our chains to evaluate the results.

```{r, fig.align='center', fig.width=5.5, fig.height=8}
HPDinterval(codaSamples)
summary(codaSamples)
```

```{r, fig.align='center', fig.width=5.5, fig.height=6.5}
par(mfcol=c(2,1))
traceplot(codaSamples)
densplot(codaSamples)
```

\newpage

### Stan

Stan is an alternative to JAGS for fitting Bayesian models with MCMC. Stan implements a slightly different approach for proposing new locations, known as Hamiltonian Monte Carlo. We may look at Stan later in the class, but the programming is more involved than JAGS. Stan uses C++ as the base code and requires downloading a C++ compiler and linking it to R. With Stan code, the data and parameters are defined separately. Stan also permits vectorized operations, such as `y ~ bernoulli(theta)`.

```{r, eval=F}
library(rstan)

# specify model; note theta is declared with bounds so it stays in (0, 1)
modelString = "
data {
  int N;
  int y[N] ; // y is a length-N vector of integers
}
parameters {
  real<lower=0, upper=1> theta ;
}
model {
  theta ~ beta(1,1) ;    // prior
  y ~ bernoulli(theta) ; // likelihood
}
"
stanDSO <- stan_model(model_code = modelString)

# reuse bird dataset
dataList <- list(y = y, N = N)

# run code in stan
stanFit <- sampling(object = stanDSO, data = dataList, chains = 3,
                    iter = 1000, warmup = 200, thin = 1)

# convert to coda object
mcmcCoda <- mcmc.list( lapply(1:ncol(stanFit),
                              function(x) { mcmc(as.array(stanFit)[, x, ]) } ) )
```
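Once the samples are converted to a coda object, the same diagnostics used for the JAGS output apply directly. A minimal sketch (left as `eval=F` since it assumes the Stan chunk above has been run):

```{r, eval=F}
# same coda-based summaries and plots as with the JAGS samples
HPDinterval(mcmcCoda)
summary(mcmcCoda)
traceplot(mcmcCoda)
densplot(mcmcCoda)
```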