---
title: "STAT 491 - Lecture 13: Regression for Binary and Count Data"
output: pdf_document
---

```{r setup, include=FALSE}
set.seed(04192018)
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(runjags)
library(rjags)
library(ggplot2)
```

## Dichotomous Predicted Variable
- This section focuses on dichotomous predicted variables, such as whether a basketball player will make a shot or whether a bird will be located in a spatial grid.
\vfill
- Traditionally, this type of data is modeled with logistic regression.
\vfill
- The model can be written as:
\begin{eqnarray*}
y &\sim& Bernoulli(\mu)\\
\mu &=& logistic(\beta_0 + \beta_1 x_1 + \beta_2 x_2),
\end{eqnarray*}
where the logistic function is $logistic(x) = 1 / (1 + exp(-x))$.
\vfill
- *Q:* What are the three components of a GLM and what are they in this specific setting?
    1. Sampling Model:  \vfill
    2. Linear Combination of Predictors:  \vfill
    3. (inverse) Link function: \vfill
    
- To fit this in a Bayesian framework, we need to specify priors. What parameters require priors and what distributions would be reasonable? \vfill
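
The inverse link can be checked numerically. The chunk below is a minimal sketch in base R: `logistic()` is a helper defined here only for illustration, and the values in `eta` are arbitrary. It compares the logistic function defined above with R's built-in `plogis()` and confirms that `qlogis()` (the logit) maps the probabilities back to the linear predictor.

```{r}
# Illustrative helper: the logistic (inverse logit) function from the model above
logistic <- function(x) 1 / (1 + exp(-x))

# Arbitrary values of the linear predictor, chosen only for illustration
eta <- c(-4, -1, 0, 1, 4)
mu <- logistic(eta)

# Every value of mu lies strictly between 0 and 1
range(mu)

# The hand-written function agrees with base R's plogis()
all.equal(mu, plogis(eta))

# The logit, qlogis(), maps the probabilities back to eta
all.equal(qlogis(mu), eta)
```
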
\newpage

### Lab Exercise
We will revisit the Swiss birds dataset for this lab to construct a logistic regression model for presence of the Willow Tit.

```{r}
swiss.birds <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat491/data/willowtit2013.csv')
kable(head(swiss.birds))
```

This dataset contains 242 sites and 6 variables:  

- siteID, a unique identifier for the site, some were not sampled during this period
- elev, mean elevation of the quadrant in meters
- rlength, the length of the route walked by the birdwatcher, in kilometers
- forest, percent forest cover
- birds, binary variable for whether a bird is observed, 1 = yes
- searchDuration, time birdwatcher spent searching the site, in minutes
    
#### 1. (5 points) Model Specification
Clearly write out the model, using proper notation for the variables in the bird dataset. You don't need to use all variables, but you should state which are included.
\vfill

#### 2. (5 points) Priors
Describe and justify the necessary priors for this model.
\vfill

#### 3. (5 points) Fit MCMC
Write the JAGS code for this model and fit it. You will have to assemble the code following the specification in the previous examples, but the following statement can be used for the sampling model portion.

```{r, eval = F}
model {
  for (i in 1:Ntotal) {
    y[i] ~ dbern(mu[i])
    mu[i] <- ilogit(beta0 + sum( beta[1:Nx] * x[i,1:Nx] ))
  }
  # priors inserted here
  
}
```
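
The model statement above expects four objects: `y`, `x`, `Ntotal`, and `Nx`. As a sketch of how such a data list might be assembled, the chunk below uses a small simulated data frame rather than the real data. The data frame `sim.birds`, the coefficient values, and the choice to standardize the predictors are all assumptions made for illustration.

```{r}
set.seed(491)  # arbitrary seed for this illustration
n <- 200

# Simulated stand-in for the bird data (hypothetical values, not the real dataset)
sim.birds <- data.frame(elev = runif(n, 250, 2750), forest = runif(n, 0, 100))

# Standardize the predictors so the coefficients are on a comparable scale
x <- scale(as.matrix(sim.birds[, c("elev", "forest")]))

# Simulate presence / absence from the logistic regression model
# using made-up coefficients: beta0 = -0.5 and beta = (1, 0.8)
mu <- plogis(-0.5 + as.vector(x %*% c(1, 0.8)))
y <- rbinom(n, size = 1, prob = mu)

# List with the names used in the JAGS model statement
data.list <- list(y = y, x = x, Ntotal = n, Nx = ncol(x))
str(data.list)
```

A list of this shape could then be supplied as the `data` argument of `run.jags()` or `jags.model()`, together with a model that also includes priors.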

#### 4. (5 points)  Summarize inferences from model
Summarize the fitted model and discuss which predictor variables influence the probability of observing a bird, and how.


\newpage

### Interpretation of Logistic Regression Coefficients
Recall this model can be written as: \vfill
- When $x_i$ increases or decreases by 1 unit,  \vfill
- The logit function for $\mu$ can be expressed as $logit(\mu) = \log(\frac{\mu}{1-\mu})$. \vfill
- In this setting, $\mu$ is the probability of y = 1, so  \vfill
- This ratio: \vfill
- Suppose the logistic regression for the Swiss birds had the following coefficients: $\beta_0 = -6$, $\beta_{elev} = .002$, and $\beta_{forest} = .06$. Compute the probability of observing a bird in a quadrant with an elevation of 1500 meters and forest cover of 60%. \vfill
    - How do we interpret the coefficient $\beta_{forest}$?
    \vfill
    - The log-odds are not the same as a probability. In this case, conditional on the elevation, raising forest cover from 60% to 70% increases the probability to:
    $1 / (1 + exp(- (-6 + .002 * 1500 + .06 * 70)))$ = \vfill
    - Note that a unit increase in the log-odds does not produce a unit increase in the probability.
    \newpage
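
The arithmetic in the example above can be checked in R. This is a sketch using the hypothetical coefficients stated in that example ($\beta_0 = -6$, $\beta_{elev} = .002$, $\beta_{forest} = .06$); the variable names are chosen here for illustration.

```{r}
# Hypothetical coefficients from the example above
beta0 <- -6
beta.elev <- 0.002
beta.forest <- 0.06

# Linear predictor (log-odds) at 1500 m elevation with 60% and 70% forest cover
eta60 <- beta0 + beta.elev * 1500 + beta.forest * 60
eta70 <- beta0 + beta.elev * 1500 + beta.forest * 70

# Convert the log-odds to probabilities with the logistic function
p60 <- 1 / (1 + exp(-eta60))
p70 <- 1 / (1 + exp(-eta70))
round(c(p60, p70), 3)

# Each additional percentage point of forest cover multiplies the odds by exp(beta.forest)
exp(beta.forest)

# The odds ratio between 70% and 60% cover equals exp(10 * beta.forest)
all.equal((p70 / (1 - p70)) / (p60 / (1 - p60)), exp(10 * beta.forest))
```

The log-odds rise by 0.6 between the two settings, while the probability rises from roughly 0.65 to roughly 0.77, which illustrates that changes on the two scales are not interchangeable.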
    
## Count Predicted Variable
- Now reconsider the willow tit dataset, modeling not just the presence or absence of birds but the number of birds observed in each spatial region.

```{r, fig.align='center', fig.width=4, fig.height=3}
swiss.birds <- read.csv('http://math.montana.edu/ahoegh/teaching/stat491/data/willowtit2013_count.csv')
kable(head(swiss.birds))
ggplot(swiss.birds, aes(bird.count)) +
  geom_bar() +
  ylab('Frequency') +
  xlab('Number of Birds Observed') +
  ggtitle('Bird Counts of Willow Tit')
```
\vfill
\newpage

1. Sampling Model: In general, the Poisson distribution is used as the sampling model for count data:
    - $y|\mu$. \vfill
    - the mean and variance of the Poisson \vfill
    - if this is not a reasonable assumption, the negative binomial distribution can be used
\vfill
2. Linear Combination of Predictors: 
\vfill
3. Link Function: The support for count data is the non-negative integers.
    - \vfill
    -     \vfill
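
The relationship between the mean and variance under the two sampling models mentioned above can be illustrated by simulation. This is a sketch with arbitrary parameters (a mean of 3 and, for the negative binomial, `size = 1`), not values estimated from the bird data.

```{r}
set.seed(491)  # arbitrary seed for this illustration
n <- 10000

# Poisson draws: the sample mean and variance should be close to each other
y.pois <- rpois(n, lambda = 3)
c(mean = mean(y.pois), var = var(y.pois))

# Negative binomial draws with the same mean but extra variance
# (variance = mu + mu^2 / size, so 3 + 9 = 12 here)
y.nb <- rnbinom(n, size = 1, mu = 3)
c(mean = mean(y.nb), var = var(y.nb))
```

Comparing the sample mean and variance of the observed counts in the same way is one informal check of whether the Poisson assumption is reasonable.
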
\newpage

### Homework Assignment

We will revisit the Swiss birds dataset for this assignment to construct a Poisson regression model for the abundance of the Willow Tit.

```{r}
#swiss.birds <- read.csv('http://math.montana.edu/ahoegh/teaching/stat491/data/willowtit2013_count.csv')
kable(head(swiss.birds))
```

This dataset contains 242 sites and 6 variables. The only difference from the lab is:

- bird.count, count variable for the abundance of birds observed, rather than a binary outcome

#### 1. (5 points) Model Specification
Clearly write out the model, using proper notation for the variables in the bird dataset. You don't need to use all variables, but you should state which are included.
\vfill

#### 2. (5 points) Priors
Describe and justify the necessary priors for this model.
\vfill

#### 3. (5 points) Fit MCMC
Write the JAGS code for this model and fit it. You will have to assemble the code following the specification in the previous examples, but the following statement can be used for the sampling model portion.

```{r, eval = F}
model {
  for (i in 1:Ntotal) {
    y[i] ~ dpois(mu[i])
    mu[i] <- exp(beta0 + sum( beta[1:Nx] * x[i,1:Nx] ))
  }
  # priors inserted here
  
}
```
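
Because the statement above uses `exp()` as the inverse link, coefficients act multiplicatively on the expected count. The chunk below is a small sketch of that property with a single predictor; the coefficient values are made up for illustration and are not estimates from the bird data.

```{r}
# Made-up coefficients for a single predictor (illustration only)
beta0 <- 0.5
beta1 <- 0.3

# Expected count under the log link: mu = exp(beta0 + beta1 * x)
x <- 0:4
mu <- exp(beta0 + beta1 * x)

# All expected counts are positive, as required for a Poisson mean
all(mu > 0)

# A one-unit increase in x multiplies the expected count by exp(beta1)
all.equal(mu[-1] / mu[-length(mu)], rep(exp(beta1), length(x) - 1))
```
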


#### 4. (5 points)  Summarize inferences from model
Summarize the fitted model and discuss which predictor variables influence the number of birds observed, and how. Furthermore, discuss the differences between this model and the logistic regression setting.