--- title: "Lecture 8" output: pdf_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(dplyr) library(ggplot2) ggplotColours <- function(n=6, h=c(0, 360) +15){ if ((diff(h)%%360) < 1) h[2] <- h[2] - 360/n hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65) } ``` ## Hierarchical Modeling This chapter focuses on comparison of means across groups and more generally Bayesian hierarchical modeling. Hierarchical modeling is defined by datasets with a multilevel structure, such as: \vfill \vfill The most basic form of this type of data consists of two-levels, groups and individuals within groups. \vfill Recall, observations are exchangeable if $p(y_1, \dots, y_n) = p(y_{\pi_1}, \dots, y_{\pi_n})$. Consider where $Y_1, \dots, Y_n$ are test scores from randomly selected students from a given STAT 216 instructor/course. If exchangeability holds for these values, then: \vfill The exchangeability can be interpreted that the random variables are independent samples from a population with a parameter, $\phi$. For instance in a normal model, $\phi = \{\theta,\sigma^2\}$ and the data are conditionally independent from a normal distribution $N(\theta,\sigma^2).$ \vfill In a hierarchical framework this can be extended to include the group number: \vfill The question now is how to we characterize the information between $\phi_1, \dots, \phi_m$? \vfill Is it reasonable to assume that the values are independent, \vfill \newpage Now consider the groups as samples from a larger population, then using the idea of exchangeability with group-specific parameters gives: \begin{eqnarray*} \phi_1, \dots, \phi_m | \psi \sim \text{i.i.d. } p(\phi|\psi). \end{eqnarray*} \vfill This is similar to the idea of a random effect model and gives the following hierarchical probability model: \begin{eqnarray*} y_{1,j}, \dots, y_{n_j,j} |\phi_j &\sim& p(y|\phi_j) \hspace{3cm} \\ \phi_1, \dots, \phi_m |\psi &\sim& p(\phi|\psi) \hspace{3.1cm} \\ \psi &\sim & p(\psi) \hspace{3.43cm} \end{eqnarray*} \vfill The distributions $p(y|\phi)$ and $p(\phi|\psi)$ represent sampling variability: - $p(y|\phi)$ represents - $p(\phi|\psi)$ represents #### Hierarchical normal model The hierarchical normal model is often used for modeling differing means across a population. \begin{eqnarray*} \phi_j =\{ \theta_j, \sigma^2 \}, p(y|\phi_j) &=& \\ \psi = \{\mu,\tau\}, p(\theta_j|\psi) &=& \end{eqnarray*} \vfill Note this model specification assumes constant variance for each within-group model, but this assumption can be relaxed. \vfill This model contains three unknown parameters that need priors, we will use the standard semi-conjugate forms: \begin{eqnarray*} \sigma^2 & \sim & \\ \tau^2 & \sim & \\ \mu &\sim & \end{eqnarray*} \vfill Given these priors, we need to derive the full conditional distributions in order to make draws from the posterior distribution. Note the joint posterior distribution, can be expressed as: \begin{eqnarray*} p(\tilde{\theta}, \mu, \tau^2, \sigma^2|\tilde{y}_1, \dots, \tilde{y}_n)& \propto& p(\mu, \tau^2, \sigma^2) \times p(\tilde{\theta}|\mu,\tau^2,\sigma^2) \times p(\tilde{y}_1, \dots, \tilde{y}_m|\tilde{\theta},\mu,\tau^2,\sigma^2)\\ & \propto & p(\mu)p(\sigma^2)p(\tau^2) \times \left(\prod_{j=1}^m p(\theta_j|\mu,\tau^2) \right) \times \left(\prod_{j=1}^m \prod_{i=1}^{n_j} p(y_{i,j}|\theta_j,\sigma^2) \right). \end{eqnarray*} \vfill \newpage #### Posterior Samples 1. 
\newpage

#### Posterior Samples

1. Sampling $\mu$: $p(\mu|-) \propto p(\mu) \prod_{j=1}^m p(\theta_j|\mu,\tau^2).$ This is a familiar setting with two normal models; hence, the full conditional is also a normal distribution.
\vfill
$\mu|- \sim normal\left( \frac{m\bar{\theta}/\tau^2 + \mu_0/ \gamma_0^2}{m/ \tau^2 + 1/ \gamma_0^2}, \left[ m/\tau^2 + 1/ \gamma_0^2 \right]^{-1} \right)$
\vfill

2. Sampling $\tau^2$: $p(\tau^2|-) \propto p(\tau^2) \prod_{j=1}^m p(\theta_j|\mu,\tau^2).$ Again, this is similar to what we have seen before.
\vfill
$\tau^2| - \sim InvGamma \left(\frac{\eta_0 + m}{2},\frac{\eta_0 \tau_0^2 + \sum_j(\theta_j - \mu)^2}{2} \right)$
\vfill
Now what about $\theta_1, \dots, \theta_m$?
\vfill

3. Sampling $\theta_1, \dots, \theta_m$: Consider a single $\theta_j$; then $p(\theta_j|-) \propto p(\theta_j|\mu, \tau^2) \prod_{i=1}^{n_j} p(y_{i,j}|\theta_j,\sigma^2)$.
\vfill
$\theta_j|- \sim$
\vfill

4. Sampling $\sigma^2$: $p(\sigma^2|-) \propto p(\sigma^2) \prod_{j=1}^m \prod_{i=1}^{n_j} p(y_{i,j}|\theta_j,\sigma^2)$.
\vfill
$\sigma^2|- \sim InvGamma \left( \frac{1}{2} \left[ \nu_0 + \sum_{j=1}^m n_j\right], \frac{1}{2} \left[\nu_0 \sigma_0^2 + \sum_{j=1}^m \sum_{i=1}^{n_j} (y_{i,j} - \theta_j)^2 \right] \right).$

\newpage

#### Data Example

Consider the dataset outlined in Chapter 8, which focuses on math test scores for students spread across 100 schools. Using the Gibbs sampling procedure described above we can fit this model; code courtesy of the textbook.

```{r, eval = T}
Y <- dget("https://www.stat.washington.edu/~hoff/Book/Data/data/Y.school.mathscore")

### weakly informative priors
nu0 <- 1 ; sigmasq.0 <- 100
eta.0 <- 1 ; tausq.0 <- 100
mu.0 <- 50 ; gammasq.0 <- 25

### starting values
m <- length(unique(Y[,1])) # number of schools
n <- sv <- ybar <- rep(NA, m)
for(j in 1:m) {
  ybar[j] <- mean(Y[Y[,1] == j, 2])
  sv[j] <- var(Y[Y[,1] == j, 2])
  n[j] <- sum(Y[,1] == j)
}
theta <- ybar
sigma2 <- mean(sv)
mu <- mean(theta)
tau2 <- var(theta)

### setup MCMC
set.seed(1)
S <- 5000
THETA <- matrix(nrow = S, ncol = m)
MST <- matrix(nrow = S, ncol = 3)

### MCMC algorithm
for (s in 1:S){

  # sample new values of the thetas
  for (j in 1:m){
    vtheta <- 1/(n[j]/sigma2 + 1/tau2)
    etheta <- vtheta*(ybar[j]*n[j]/sigma2 + mu/tau2)
    theta[j] <- rnorm(1, etheta, sqrt(vtheta))
  }

  # sample a new value of sigma2
  nun <- nu0 + sum(n)
  ss <- nu0*sigmasq.0
  for(j in 1:m){
    ss <- ss + sum((Y[Y[,1] == j, 2] - theta[j])^2)
  }
  sigma2 <- 1/rgamma(1, nun/2, ss/2)

  # sample a new value of mu
  vmu <- 1/(m/tau2 + 1/gammasq.0)
  emu <- vmu*(m*mean(theta)/tau2 + mu.0/gammasq.0)
  mu <- rnorm(1, emu, sqrt(vmu))

  # sample a new value of tau2
  etam <- eta.0 + m
  ss <- eta.0*tausq.0 + sum((theta - mu)^2)
  tau2 <- 1/rgamma(1, etam/2, ss/2)

  # store results
  THETA[s,] <- theta
  MST[s,] <- c(mu, sigma2, tau2)
}
```
\vfill
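Before examining specific schools, it can be helpful to summarize the stored posterior draws. The following is a brief sketch using the `MST` and `THETA` objects created above; no burn-in or thinning is applied here.

```{r, eval = F}
# posterior summaries for (mu, sigma2, tau2), stored column-wise in MST
colnames(MST) <- c("mu", "sigma2", "tau2")
apply(MST, 2, quantile, probs = c(.025, .5, .975))

# posterior mean of each school-specific theta_j
theta.post.mean <- colMeans(THETA)
```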
```{r, echo = F}
# posterior draws for two schools: 46 and 82 (school 82 has fewer observations)
post.samples <- data.frame(post.mean = c(THETA[,46], THETA[,82]),
                           school = as.factor(c(rep(46, 5000), rep(82, 5000))))
Y_df <- Y %>% as_tibble()

ggplot(data = post.samples, aes(post.mean, color = school)) +
  geom_density() +
  # school 46: individual scores (small points), sample mean (large point),
  # posterior mean of theta_46 (plus sign)
  annotate("point", x = Y_df %>% filter(school == 46) %>% select(mathscore) %>% pull(),
           y = -.01, colour = ggplotColours(2)[1], size = 1.5) +
  annotate("point", x = Y_df %>% filter(school == 46) %>% select(mathscore) %>% pull() %>% mean(),
           y = -.01, colour = ggplotColours(2)[1], size = 5) +
  annotate("point", x = post.samples %>% filter(school == 46) %>% select(post.mean) %>% pull() %>% mean(),
           y = -.01, colour = ggplotColours(2)[1], size = 5, shape = 3) +
  # school 82: individual scores, sample mean, posterior mean of theta_82
  annotate("point", x = Y_df %>% filter(school == 82) %>% select(mathscore) %>% pull(),
           y = -.02, colour = ggplotColours(2)[2], size = 1.5) +
  annotate("point", x = Y_df %>% filter(school == 82) %>% select(mathscore) %>% pull() %>% mean(),
           y = -.02, colour = ggplotColours(2)[2], size = 5) +
  annotate("point", x = post.samples %>% filter(school == 82) %>% select(post.mean) %>% pull() %>% mean(),
           y = -.02, colour = ggplotColours(2)[2], size = 5, shape = 3) +
  xlab('')
```
\vfill

Interpret the figure created above. What are the small circles, large circles, and plus signs?
\vfill

Why does this happen, and is it a good thing?
\vfill
\newpage

#### Shrinkage

Recall the posterior mean can be represented as a weighted average, specifically:
\begin{equation}
E[\theta_j|\tilde{y}_j, \mu, \tau^2, \sigma^2] = \frac{\bar{y}_j n_j / \sigma^2 + \mu / \tau^2}{n_j/\sigma^2 + 1/ \tau^2}.
\end{equation}
In this case $\mu$ and $\tau^2$ are not fixed values chosen for prior distributions; rather, they come from the between-group model. So the posterior mean for the test scores at each school is pulled from the sample mean toward the overall mean across all of the schools. This phenomenon is known as *shrinkage*.
\vfill

Schools with more students taking the exam see less shrinkage, as more observations place more weight on the data. Hence the figure we discussed above shows more shrinkage for school 82, which had fewer observations.
\vfill

So what about shrinkage: does it make sense? Is it a good thing?
\vfill

We will soon see that shrinkage is an extremely powerful tool; the shrinkage estimator actually dominates the unbiased estimator (the MLE for each group mean). This surprising result is commonly known as Stein's Paradox.
\vfill

#### Hierarchical Modeling of Means and Variances

The model we just described and fit was somewhat restrictive in that each school was assumed to have a common variance. It is likely that schools with a more heterogeneous mix of students would have greater variance in the test scores. There are a couple of solutions. The first involves a set of i.i.d. priors on each $\sigma_j^2$,
\begin{equation}
\sigma_1^2, \dots, \sigma_m^2 \sim \text{i.i.d. } InvGamma(\nu_0/2, \nu_0 \sigma_0^2/2);
\end{equation}
however, this results in a full conditional distribution for $\sigma_j^2$ that only takes advantage of data from school $j$. In other words, no information from the other schools is used to estimate that variance.
\vfill

Another option is to consider $\nu_0$ and $\sigma_0^2$ as parameters to be estimated in the hierarchical model. A common prior for $\sigma_0^2$ would be $\sigma_0^2 \sim Gamma(a,b)$. Unfortunately, there is not a semi-conjugate prior distribution for $\nu_0$. The textbook suggests a geometric-style distribution on the positive integers, where $p(\nu_0) \propto \exp(-\alpha \nu_0)$. The full conditional distribution then allows a sampling procedure that enumerates over the domain of possible values; a sketch of this update is given below. This procedure allows shrinkage for the variance terms as well. It is worth noting that pooling variances is a common way to tackle this particular problem.
\vfill
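The following is a minimal sketch of the enumeration step for $\nu_0$ described above. The object names (`alpha`, `sigma0.sq`, `sigma2.j`) and the grid upper bound are hypothetical placeholders, not values from the textbook; `m` and `sv` come from the earlier Gibbs sampling code.

```{r, eval = F}
# sketch: one update of nu0 by enumerating its full conditional over a grid of
# positive integers (placeholder values; within a sampler these would be the
# current draws of the other parameters)
alpha <- 1          # prior: p(nu0) proportional to exp(-alpha * nu0)
sigma0.sq <- 100    # current value of sigma_0^2
sigma2.j <- sv      # stand-in for current draws of sigma_1^2, ..., sigma_m^2
nu0.grid <- 1:5000  # grid of candidate nu0 values

# log full conditional: p(nu0 | -) proportional to
#   exp(-alpha * nu0) * prod_j InvGamma(sigma_j^2; nu0/2, nu0 * sigma0^2 / 2)
lp.nu0 <- m * (nu0.grid / 2) * log(nu0.grid * sigma0.sq / 2) -
  m * lgamma(nu0.grid / 2) -
  (nu0.grid / 2 + 1) * sum(log(sigma2.j)) -
  nu0.grid * (alpha + sigma0.sq * sum(1 / sigma2.j) / 2)

# sample nu0 with probabilities proportional to the full conditional
nu0 <- sample(nu0.grid, 1, prob = exp(lp.nu0 - max(lp.nu0)))
```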