---
title: "Lecture: Advanced Bayesian Computing"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(fig.width = 4)
knitr::opts_chunk$set(fig.height = 4)
```

## Importance Sampling and Sequential Monte Carlo

#### Accept-Reject Sampling

In many scenarios, sampling from a distribution $g(x)$ can be challenging, in that we cannot simply call a built-in function (an `rg.function()`) to sample from it. The general idea of accept-reject sampling is to simulate observations from another distribution $f(x)$ and accept a draw if it falls under the (scaled) distribution $g(x)$. Formally, the algorithm for the Accept-Reject method follows as:

\begin{enumerate}
  \item Generate $X \sim f, U \sim Unif[0,1]$
  \item Accept $Y=X$ if $U \leq g(X)/(M f(X))$,
\end{enumerate}

where $f(x)$ and $g(x)$ are normalized probability distributions and $M$ is a constant such that $g(x) \leq M f(x)$ for all $x$ (which implies $M \geq 1$).

Suppose we want to draw samples from a normalized form of $g(x) = \frac{\sqrt{1-x^2}}{\pi/2}$. Then $f(x) = \frac{1}{2}$ for $x \in [-1,1]$ and $M = \frac{4}{\pi}$.

\newpage

```{r, echo = T}
num.sims <- 10000
x <- runif(num.sims, -1, 1) # simulate candidate draws from f
u.vals <- runif(num.sims)
M <- 4/pi
f.x <- 1/2
accept.ratio <- (sqrt(1 - x^2) / (pi/2)) / (M * f.x) # g / (M * f)
accept.index <- (1:num.sims)[u.vals < accept.ratio]
g.x <- x[accept.index]
hist(g.x, prob = T, ylim = c(0,1), breaks = 'FD')
```

Now we have samples from $g(x)$ and could, for example, compute $I=\int x g(x)\, dx$.

Without the use of functions such as `rtnorm()`, how would you draw a sample from a truncated normal distribution? One possibility is to use accept-reject sampling. Simulate points from a normal distribution with the same mean and variance. Let $M = \frac{1}{1 - \Phi(c; \mu,\sigma^2)}$, where $\Phi(\cdot;\mu,\sigma^2)$ is the cdf of a normal random variable and $c$ is the (lower) truncation point. Then all values of $x$ greater than $c$ are accepted and all values less than $c$ are rejected. (With this choice, $g(x)/(M f(x))$ equals 1 for $x > c$ and 0 otherwise, so the accept-reject step reduces to discarding draws below the truncation point.)

\newpage

#### Importance Sampling

Importance sampling is related to the idea of accept-reject sampling, but emphasizes focusing on the "important" parts of the distribution and uses all of the simulations. In accept-reject sampling, computing the value $M$ can be challenging in its own right. We can alleviate that problem with importance sampling.

Again the goal is to compute some integral, say $I=\int h(x) g(x)\, dx = E[h(X)]$. We cannot sample directly from $g(x)$, but we can evaluate the function. So the idea is to find a distribution that we can simulate observations from, $f(x)$, that ideally looks similar to $g(x)$. The importance sampling procedure follows as:

\begin{enumerate}
  \item Draw $x_1, \dots, x_n$ from trial distribution $f(x)$.
  \item Calculate the \emph{importance weights}:\\
  $w_j = g(x_j)/f(x_j)$, for $j = 1, \dots, n$.
  \item Approximate $I \approx \frac{w_1 h(x_1) + \dots + w_n h(x_n)}{w_1 + \dots + w_n}$.
\end{enumerate}

Example. Compute the mean of a mixture of truncated normal distributions. Let $X \sim (.4)N(4,3)_{+(0)} + (.6) N(1,.5)_{+(0)}$.

```{r}
x.seq <- seq(0, 10, by = .01)
library(truncnorm)
sample.dens <- .4 * dtruncnorm(x.seq, a = 0, b = Inf, mean = 4, sd = sqrt(3)) +
  .6 * dtruncnorm(x.seq, a = 0, b = Inf, mean = 1, sd = sqrt(.5))
plot(x.seq, sample.dens, type = 'l', xlab = 'x', ylab = '')
```

\newpage

Let the trial distribution be an Exponential(.5) distribution.
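To connect the general recipe to this example, the importance weight for a draw $x_j$ from the Exponential(.5) trial distribution can be written out explicitly; this is the same quantity computed in the code chunk that follows (here $\phi(\cdot;\mu,\sigma^2)$ and $\Phi(\cdot;\mu,\sigma^2)$ denote the normal pdf and cdf):

$$w_j = \frac{g(x_j)}{f(x_j)} = \frac{0.6\, \phi(x_j; 1, .5)/\left\{1 - \Phi(0; 1, .5)\right\} \; + \; 0.4\, \phi(x_j; 4, 3)/\left\{1 - \Phi(0; 4, 3)\right\}}{0.5\, e^{-0.5 x_j}}, \quad x_j > 0.$$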
Then the importance sampling follows as:

```{r, echo = T}
f <- rexp(100000, rate = .5)   # draws from the trial distribution
w <- ( .6 * dnorm(f, 1, sqrt(.5)) / (1 - pnorm(0, 1, sqrt(.5))) +
       .4 * dnorm(f, 4, sqrt(3)) / (1 - pnorm(0, 4, sqrt(3))) ) /
  dexp(f, .5)                  # importance weights g(x)/f(x)
sum(w * f) / sum(w)            # weighted estimate of the mean
```

Note that the `rtnorm()` function has a similar implementation, using an exponential trial distribution.

#### Sequential Importance Sampling

Sequential importance sampling is commonly used in time series settings with state-space models. For some context, consider the following model; describe the parameters in this model and explain what the model does. (This will look familiar to 536 students.)

\begin{eqnarray*}
y_t &\sim& Bernoulli(\pi_t)\\
logit(\pi_t) &=& x_{t} \beta_{t}\\
\beta_{t} &=& \beta_{t-1} + w_{t},
\end{eqnarray*}

where $w_{t} \sim N(0, W)$.

```{r, fig.cap = 'time varying probability in line and observed binary points as black dots'}
set.seed(11192018)
num.pts <- 500
beta <- rep(3, num.pts)
x <- runif(num.pts)
W <- .05
for (time.pts in 2:num.pts){
  beta[time.pts] <- beta[time.pts - 1] + rnorm(1, mean = 0, sd = sqrt(W))
}
pi.t <- exp(x * beta) / (1 + exp(x * beta))
y <- rbinom(num.pts, 1, pi.t)

# plot the time-varying probability (line) and observed binary outcomes (dots)
plot(1:num.pts, pi.t, type = 'l', xlab = 'time', ylab = expression(pi[t]), ylim = c(0, 1))
points(1:num.pts, y, pch = 16, cex = .5)
```

An algorithm known as a particle filter is implemented to learn the values of $\beta$. For simplicity we will assume that $W$ is known, but this is not necessary. A particle filter for time series data has a few common steps (summarized in symbols after this list):

1. Prior distribution: For a model of this sort, we only need a prior distribution on $\beta$ at time 0. Let $\beta_0 \sim N(0, 4^2)$.
\vfill
2. Evolution distribution: This controls the change in $\beta$ from time $t$ to $t+1$ and ultimately ends up as our proposal distribution.
\vfill
3. (Optionally) A predictive distribution for $y_{t+1}$ can be constructed here.
\vfill
4. Importance weights and resampling: Importance weights are calculated; that is, how well do the particles from the evolution distribution match the next data point? The particles are then resampled according to these weights, and the resampled distribution becomes the prior distribution for the next time point.
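In symbols, with $J$ particles ($J$ = `num.particles` in the code) and a tilde marking particles that have been propagated but not yet resampled (notation introduced here only for this summary), one pass of the filter, as implemented in the code that follows, is:

\begin{eqnarray*}
\mbox{prior at time 0:} && \beta_0^{(j)} \sim N(0, 4^2), \quad j = 1, \dots, J\\
\mbox{evolution:} && \tilde{\beta}_{t+1}^{(j)} \sim N(\beta_{t}^{(j)}, W)\\
\mbox{importance weights:} && \omega^{(j)} \propto \frac{p(y_{t+1} \mid \tilde{\beta}_{t+1}^{(j)})}{N(\tilde{\beta}_{t+1}^{(j)}; \, \beta_{t}^{(j)}, W)}\\
\mbox{resample:} && \beta_{t+1}^{(j)} \mbox{ drawn from } \{\tilde{\beta}_{t+1}^{(1)}, \dots, \tilde{\beta}_{t+1}^{(J)}\} \mbox{ with probabilities proportional to } \omega^{(j)},
\end{eqnarray*}

where $p(y_{t+1} \mid \tilde{\beta}_{t+1}^{(j)})$ is the Bernoulli likelihood with success probability $\tilde{\pi}_{t+1}^{(j)} = \exp(x_{t+1}\tilde{\beta}_{t+1}^{(j)})/\{1 + \exp(x_{t+1}\tilde{\beta}_{t+1}^{(j)})\}$ and the denominator is the evolution (proposal) density. This weight is the `dbinom(...) / dnorm(...)` ratio appearing in the code, in the spirit of $w_j = g(x_j)/f(x_j)$ from importance sampling.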
\newpage

##### prior distributions

```{r, fig.height=4, fig.width=4}
# Storage
num.particles <- 10000
beta.samples <- matrix(0, nrow = num.particles, ncol = num.pts + 1)

# initialize particles at time 0
beta.samples[,1] <- rnorm(num.particles, mean = 0, sd = 4)
hist(beta.samples[,1], main = 'Prior at time 0', xlab = expression(beta), breaks = 'FD')
```

##### evolution distributions

```{r}
# Storage
beta.evolution <- matrix(0, num.particles, num.pts)

# propagate particles from time 0 to time 1
beta.evolution[,1] <- beta.samples[,1] + rnorm(num.particles, mean = 0, sd = sqrt(W))
hist(beta.evolution[,1], main = 'Evolution Dist at time 1', xlab = expression(beta), breaks = 'FD')
```

\newpage

##### predictive distributions

```{r}
xb <- beta.evolution[,1] * x[1]
probs <- exp(xb) / (1 + exp(xb))
y1 <- rbinom(num.particles, 1, probs)
par(mfcol = c(1,2))
hist(probs, main = 'distribution for pi', xlab = expression(pi[1]), breaks = 'FD')
hist(y1, breaks = c(-.5, .5, 1.5), main = 'distribution for y',
     xlab = expression(y[1]), probability = TRUE)
```

##### compute importance weights and resample

```{r}
# importance weights: Bernoulli likelihood of y[1] relative to the evolution (proposal) density
w <- dbinom(y[1], 1, prob = probs) /
  dnorm(beta.evolution[,1], mean = beta.samples[,1], sd = sqrt(W))
w.norm <- w / sum(w)

# resample particles with probability proportional to the normalized weights
sample.pos <- base::sample(num.particles, prob = w.norm, replace = T)
beta.samples[,2] <- beta.evolution[sample.pos, 1]
hist(beta.samples[,2], main = 'Posterior after time 1', xlab = expression(beta), breaks = 'FD')
abline(v = beta[1], col = 'red')
```

\newpage

##### Example for all time points

```{r, fig.width = 6, fig.height = 6}
for (time.pts in 2:num.pts){
  # evolution
  beta.evolution[,time.pts] <- beta.samples[,time.pts] +
    rnorm(num.particles, mean = 0, sd = sqrt(W))

  # weights and resample
  xb <- beta.evolution[,time.pts] * x[time.pts]
  probs <- exp(xb) / (1 + exp(xb))
  w <- dbinom(y[time.pts], 1, prob = probs) /
    dnorm(beta.evolution[,time.pts], mean = beta.samples[,time.pts], sd = sqrt(W))
  w.norm <- w / sum(w)
  sample.pos <- base::sample(num.particles, prob = w.norm, replace = T)
  beta.samples[,time.pts + 1] <- beta.evolution[sample.pos, time.pts]
}

# pointwise 95% intervals for beta at each time point
beta.q <- apply(beta.samples, 2, quantile, prob = c(.025, .975))
par(mfcol = c(1,1))
plot(1:num.pts, beta, ylim = c(min(beta.samples), max(beta.samples)), pch = 16, xlab = 'time')
lines(1:num.pts, beta.q[1, -1], lty = 2)
lines(1:num.pts, beta.q[2, -1], lty = 2)

# implied intervals for the success probability at each time point
x.mat <- matrix(x, nrow = num.particles, ncol = num.pts, byrow = T)
prob.samples <- exp(beta.samples[,-1] * x.mat) / (1 + exp(beta.samples[,-1] * x.mat))
prob.q <- apply(prob.samples, 2, quantile, prob = c(.025, .975))
```

\newpage

## Variational Bayes

First, let's define some notation. Assume that $y = \{y_1, \dots, y_n\}$ are observations and $\theta = \{\theta_1, \dots, \theta_p\}$ are parameters.
\vfill

In general in Bayesian statistics, the goal is to learn the posterior distribution:
$$p(\theta|y) = \frac{p(y|\theta) p(\theta)}{\int p(y|\theta) p(\theta)\, d\theta}.$$
\vfill

In many situations, the posterior is difficult to compute, either analytically or computationally. With MCMC, we avoid evaluating the integral directly and instead approximate the posterior with samples.
\vfill

An alternative to MCMC is to use variational methods to approximate the posterior distribution. Define a new distributional family, say $q(\theta|\nu),$ with its own variational parameters $\nu$.
\vfill

Then, assuming $q$ is reasonably close to the posterior distribution $p$, $q$ is used in place of $p$.
\vfill

The Kullback-Leibler (KL) divergence is used to measure the similarity between the two distributions.
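The variational parameters $\nu$ are chosen to minimize this divergence. Writing out the KL divergence between $q$ and the posterior gives a standard identity (see the review article cited at the end of these notes):

$$KL\big(q(\theta|\nu)\,\|\,p(\theta|y)\big) = \log p(y) - \underbrace{\Big\{E_q[\log p(y, \theta)] - E_q[\log q(\theta|\nu)]\Big\}}_{\text{ELBO}(\nu)}.$$

Because $\log p(y)$ does not depend on $\nu$, minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), which only requires the joint density $p(y,\theta)$, not the intractable normalizing integral.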
\vfill

Typically with variational inference the variational distributions are assumed to factorize such that $q(\theta_1, \dots, \theta_p) = \prod_i q(\theta_i)$; that is, each parameter is treated as independent (the mean-field assumption). Typically the true posterior does not permit this factorization, as dependence between parameters is often part of what makes the posterior difficult to work with in the first place.
\vfill

Coordinate ascent algorithms iteratively update each variational distribution (similar to a Gibbs sampler); a small sketch is given below. In fact, the variational distributions often have the same distributional family as the full conditional distributions.
\vfill

Variational methods are typically considerably faster than MCMC and scale to larger data sets.
\vfill

For additional details, see the article by David Blei and coauthors, "Variational Inference: A Review for Statisticians" [https://arxiv.org/pdf/1601.00670.pdf](https://arxiv.org/pdf/1601.00670.pdf).
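As a small illustration of coordinate ascent (a minimal sketch, not tied to the particle filter example above), consider mean-field variational inference for a normal model with unknown mean $\mu$ and precision $\tau$: $y_i \sim N(\mu, \tau^{-1})$, with priors $\mu \mid \tau \sim N(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim Gamma(a_0, b_0)$, and factorization $q(\mu, \tau) = q(\mu)\, q(\tau)$. The simulated data, hyperparameter values, and variable names below are assumptions made only for this sketch. Each sweep updates $q(\mu)$ (normal) and then $q(\tau)$ (gamma), much like cycling through full conditionals in a Gibbs sampler.

```{r, echo = TRUE}
set.seed(1)
y <- rnorm(100, mean = 2, sd = 1.5)   # simulated data for illustration
n <- length(y); ybar <- mean(y)

# hyperparameters (chosen arbitrarily for the sketch)
mu0 <- 0; lambda0 <- 1; a0 <- 1; b0 <- 1

e.tau <- 1                            # initial guess for E[tau]
for (iter in 1:50){
  # update q(mu) = N(mu.n, 1 / lambda.n)
  mu.n <- (lambda0 * mu0 + n * ybar) / (lambda0 + n)
  lambda.n <- (lambda0 + n) * e.tau
  # update q(tau) = Gamma(a.n, b.n), using E_q[(y_i - mu)^2] and E_q[(mu - mu0)^2]
  a.n <- a0 + (n + 1) / 2
  b.n <- b0 + .5 * (sum((y - mu.n)^2) + n / lambda.n +
                      lambda0 * ((mu.n - mu0)^2 + 1 / lambda.n))
  e.tau <- a.n / b.n                  # update E[tau] for the next sweep
}
c(mu.n, sqrt(1 / lambda.n))           # variational mean and sd for mu
c(a.n, b.n)                           # variational shape and rate for tau
```

With these conjugate pieces each update has a closed form; in non-conjugate models the same coordinate-ascent idea applies, but the updates typically require additional approximation or optimization.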