Please use D2L to turn in both the PDF or Word output and your R Markdown file in.

For this lab we will again use the Seattle Housing dataset http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv. Assume your goal is to model housing prices in Seattle using a normal distribution.

Q1.

a. (4 pts)

Write out the sampling model for this dataset, define all variables.

The housing prices are assumed to be normally distributed, so,

\[p(\{y_1, \dots, y_n\}|\mu, \sigma^2) = \prod_i p(y_i|\mu, \sigma^2)\] where \(p(y_i|\mu, \sigma^2) \sim N(\mu, \sigma^2)\), \(y_i\) is the sales price of house \(i\), \(\mu\) is the mean sales price of houses in Seattle, and \(\sigma^2\) is the variance of housing prices in Seattle.

b. (4 pts)

State and justify prior distributions for \(\mu\) and \(\sigma^2\). Plot both of these prior distributions.

Let \(\mu \sim N(500,000; 10,000,000,000)\). This distribution is centered at $500,000, the standard deviation is 100,000. This places most of the prior mass within \(\pm \$200,000\) from the prior mean.

set.seed(03012018)
M <- 500000
S.sq <- 1e10
S <- sqrt(S.sq)
x <- seq(0, 1000000, by = 10)
plot(x, dnorm(x, M, S), type='l', xlab = expression(mu), ylab = expression(paste('p(', mu,')')), main= expression(paste('Prior for ', mu)))
text(2e5,3e-6, 'Note scale on y-axis will greatly influence plot \n but all prior values are relatively close together', cex=.7)

Let \(\sigma \sim Unif(0,1e8)\) NOTE THIS IS A PRIOR ON \(\sigma\) not \(\sigma^2\). A standard deviation of 1,000,000 would imply that most of the homes are \(\pm\) $2,000,000 from the mean. So setting the upper threshold to be 10,000,000 is conservative.

C <- 1e8
x <- seq(0, C, length.out = 10000)
plot(x, dunif(x, min = 0, max = C), type='l', xlab = expression(sigma), ylab = expression(paste('p(', sigma,')')), main= expression(paste('Prior for ', sigma)))

c. (4 pts)

Using the priors above fit the model in JAGS. Include trace plots and summaries of your output.

Seattle <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv', stringsAsFactors = F)

## set up data
dataList = list(y = Seattle$price, Ntotal = nrow(Seattle), M = M, S = S, C = C)

## specify model
modelString = "model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dnorm(mu, 1/sigma^2) # sampling model
  }
  mu ~ dnorm(M,1/S^2)
  sigma ~ dunif(0,C)
} "
writeLines( modelString, con='NORMmodel.txt')

## Starting points
initsList <- function(){
  # function for initializing starting place of theta
  # RETURNS: list with random start point for theta
  return(list(mu = rnorm(1, mean = M, sd = S), sigma = runif(1,0,C)))
}

## Run JAGS

library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
library(runjags)
jagsModel <- jags.model( file = "NORMmodel.txt", data = dataList, 
                         inits =initsList, n.chains = 3, n.adapt = 10000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 869
##    Unobserved stochastic nodes: 2
##    Total graph size: 882
## 
## Initializing model
update(jagsModel, n.iter = 10000)
num.mcmc <- 10000
codaSamples <- coda.samples( jagsModel, variable.names = c('mu', 'sigma'), n.iter = num.mcmc)

par(mfcol=c(1,2))
traceplot(codaSamples)

densplot(codaSamples)

d. (4 pts)

Print and interpret an HDI for \(\mu\).

HPDinterval(codaSamples)
## [[1]]
##          lower    upper
## mu    585450.1 668952.9
## sigma 605489.1 665372.6
## attr(,"Probability")
## [1] 0.95
## 
## [[2]]
##          lower    upper
## mu    585795.1 668747.6
## sigma 606339.8 666007.7
## attr(,"Probability")
## [1] 0.95
## 
## [[3]]
##          lower    upper
## mu    585987.5 668793.1
## sigma 605819.3 666189.9
## attr(,"Probability")
## [1] 0.95

The highest density intervals contain the region that are most plausible based on the prior and sampling model (data). Thus the mean housing price in Seattle is likely between about $580,000 and $660,000, with 95% probability. Similarly the standard deviation of housing prices is likely between about 600,000 and 660,000. #### e. (2 pts) Calculate the \(Pr[\mu > 650,000]\).

prob <- mean(sapply(codaSamples, function(x) return(mean(x[,'mu'] > 650000))))

Across the three chains, the probability that \(\mu > 650,000\) is 0.1333667. This can be seen in a histogram representation

par(mfcol=c(1,1))
mu.samples <- sapply(codaSamples, function(x) return(x[,'mu'] ))
hist(mu.samples, breaks='FD', prob=T)
abline(v=650000, col='red',lwd=2)

f. (2 pts)

Calculate the probability of an individual house, say \(y^{*}\) costing more than \(\$650,000.\)

This calculation uses a posterior predictive distribution. For this distribution we compute the following interval by sampling \[p(y^*|y) = \int p(y^*|\theta) p(\theta|y) d\theta\] where \(p(\theta|y)\) is the posterior distribution, and \(y^*\) is a new data point that follows the same sampling model as the previously observed data, \(y\).

In practice, this is done by:

  • taking posterior samples from \(p(\theta|y)\), using JAGS in this case
  • including those posterior samples in the sampling model to simulate a distribution for a new data point.
mu.samples <- sapply(codaSamples, function(x) return(x[,'mu'] ))
sigma.samples <- sapply(codaSamples, function(x) return(x[,'sigma'] ))
y.star <- rnorm(length(mu.samples), mu.samples, sigma.samples)
hist(y.star, breaks='FD', prob=T)
abline(v=650000, col='red',lwd=2)

prob.house <- mean(y.star > 650000)

The probability of a single house costing more than $650,000 is calculated as 0.4881.

g. (4 pts)

Summarize the results from part d so that your aunt who is a realtor in Billings can understand.

We have constructed a model to estimate housing prices in Seattle from 2014-2015 based off of 869 homes. During this period the average housing price was likely between about $600,000 and $650,000. While this is the average cost of a home, there is a lot of variability in the housing prices. Generally, most of the houses would be within two standard deviations of the mean, so in this case a majority of the homes would be within $1,200,000 of the mean.

NOTE: not for your aunt, but notice that we see negative housing prices in part f. This is a result of a prior and sampling model that permit negative values. A log-normal sampling model would be more appropriate.