Please use D2L to turn in both the PDF or Word output and your R Markdown file in.
For this lab we will again use the Seattle Housing dataset http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv. Assume your goal is to model housing prices in Seattle using a normal distribution.
Write out the sampling model for this dataset, define all variables.
The housing prices are assumed to be normally distributed, so,
\[p(\{y_1, \dots, y_n\}|\mu, \sigma^2) = \prod_i p(y_i|\mu, \sigma^2)\] where \(p(y_i|\mu, \sigma^2) \sim N(\mu, \sigma^2)\), \(y_i\) is the sales price of house \(i\), \(\mu\) is the mean sales price of houses in Seattle, and \(\sigma^2\) is the variance of housing prices in Seattle.
State and justify prior distributions for \(\mu\) and \(\sigma^2\). Plot both of these prior distributions.
Let \(\mu \sim N(500,000; 10,000,000,000)\). This distribution is centered at $500,000, the standard deviation is 100,000. This places most of the prior mass within \(\pm \$200,000\) from the prior mean.
set.seed(03012018)
M <- 500000
S.sq <- 1e10
S <- sqrt(S.sq)
x <- seq(0, 1000000, by = 10)
plot(x, dnorm(x, M, S), type='l', xlab = expression(mu), ylab = expression(paste('p(', mu,')')), main= expression(paste('Prior for ', mu)))
text(2e5,3e-6, 'Note scale on y-axis will greatly influence plot \n but all prior values are relatively close together', cex=.7)
Let \(\sigma \sim Unif(0,1e8)\) NOTE THIS IS A PRIOR ON \(\sigma\) not \(\sigma^2\). A standard deviation of 1,000,000 would imply that most of the homes are \(\pm\) $2,000,000 from the mean. So setting the upper threshold to be 10,000,000 is conservative.
C <- 1e8
x <- seq(0, C, length.out = 10000)
plot(x, dunif(x, min = 0, max = C), type='l', xlab = expression(sigma), ylab = expression(paste('p(', sigma,')')), main= expression(paste('Prior for ', sigma)))
Using the priors above fit the model in JAGS. Include trace plots and summaries of your output.
Seattle <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv', stringsAsFactors = F)
## set up data
dataList = list(y = Seattle$price, Ntotal = nrow(Seattle), M = M, S = S, C = C)
## specify model
modelString = "model {
for ( i in 1:Ntotal ) {
y[i] ~ dnorm(mu, 1/sigma^2) # sampling model
}
mu ~ dnorm(M,1/S^2)
sigma ~ dunif(0,C)
} "
writeLines( modelString, con='NORMmodel.txt')
## Starting points
initsList <- function(){
# function for initializing starting place of theta
# RETURNS: list with random start point for theta
return(list(mu = rnorm(1, mean = M, sd = S), sigma = runif(1,0,C)))
}
## Run JAGS
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
library(runjags)
jagsModel <- jags.model( file = "NORMmodel.txt", data = dataList,
inits =initsList, n.chains = 3, n.adapt = 10000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 869
## Unobserved stochastic nodes: 2
## Total graph size: 882
##
## Initializing model
update(jagsModel, n.iter = 10000)
num.mcmc <- 10000
codaSamples <- coda.samples( jagsModel, variable.names = c('mu', 'sigma'), n.iter = num.mcmc)
par(mfcol=c(1,2))
traceplot(codaSamples)
densplot(codaSamples)
Print and interpret an HDI for \(\mu\).
HPDinterval(codaSamples)
## [[1]]
## lower upper
## mu 585450.1 668952.9
## sigma 605489.1 665372.6
## attr(,"Probability")
## [1] 0.95
##
## [[2]]
## lower upper
## mu 585795.1 668747.6
## sigma 606339.8 666007.7
## attr(,"Probability")
## [1] 0.95
##
## [[3]]
## lower upper
## mu 585987.5 668793.1
## sigma 605819.3 666189.9
## attr(,"Probability")
## [1] 0.95
The highest density intervals contain the region that are most plausible based on the prior and sampling model (data). Thus the mean housing price in Seattle is likely between about $580,000 and $660,000, with 95% probability. Similarly the standard deviation of housing prices is likely between about 600,000 and 660,000. #### e. (2 pts) Calculate the \(Pr[\mu > 650,000]\).
prob <- mean(sapply(codaSamples, function(x) return(mean(x[,'mu'] > 650000))))
Across the three chains, the probability that \(\mu > 650,000\) is 0.1333667. This can be seen in a histogram representation
par(mfcol=c(1,1))
mu.samples <- sapply(codaSamples, function(x) return(x[,'mu'] ))
hist(mu.samples, breaks='FD', prob=T)
abline(v=650000, col='red',lwd=2)
Calculate the probability of an individual house, say \(y^{*}\) costing more than \(\$650,000.\)
This calculation uses a posterior predictive distribution. For this distribution we compute the following interval by sampling \[p(y^*|y) = \int p(y^*|\theta) p(\theta|y) d\theta\] where \(p(\theta|y)\) is the posterior distribution, and \(y^*\) is a new data point that follows the same sampling model as the previously observed data, \(y\).
In practice, this is done by:
mu.samples <- sapply(codaSamples, function(x) return(x[,'mu'] ))
sigma.samples <- sapply(codaSamples, function(x) return(x[,'sigma'] ))
y.star <- rnorm(length(mu.samples), mu.samples, sigma.samples)
hist(y.star, breaks='FD', prob=T)
abline(v=650000, col='red',lwd=2)
prob.house <- mean(y.star > 650000)
The probability of a single house costing more than $650,000 is calculated as 0.4881.
Summarize the results from part d so that your aunt who is a realtor in Billings can understand.
We have constructed a model to estimate housing prices in Seattle from 2014-2015 based off of 869 homes. During this period the average housing price was likely between about $600,000 and $650,000. While this is the average cost of a home, there is a lot of variability in the housing prices. Generally, most of the houses would be within two standard deviations of the mean, so in this case a majority of the homes would be within $1,200,000 of the mean.
NOTE: not for your aunt, but notice that we see negative housing prices in part f. This is a result of a prior and sampling model that permit negative values. A log-normal sampling model would be more appropriate.