---
title: "Outline of HW 9 Solutions"
output: pdf_document
---
# Introduction (3 pts)

Exercise 24 on page 232 of your text where the respiratory rates (number of breaths per minute) for $n = 618$ children of different ages (in months).

* Description of the data collected (1 pt)
* Description of the research question(s) of interest (1 pt) 
* Description of the problem (1 pt)

# Statistical Procedures Used (10 pts)

* Descriptive plots of the data  

* Description of __all__ of the linear models that were be considered *and* why (2 pts)
* Description of how the "best" (final) model that was chosen (2 pts)
* Evaluation of model assumptions for the final model that was chosen (4 pts)

* Description of the hypotheses that will be tested *and* how they address the research question of interest 
    + Confidence intervals that will be constructed *and* why 


* Description of ESS F-test for testing model fit against a null model (1 pt)
* Description of ESS F-test for testing model fit between the linear and quadratic models (1 pt)
    + Hypotheses that will be tested *and* why  
    
* Description of how $R^2$ will be used *and* what it indicates about model fit      
    
* Description of prediction intervals *and* how they address the research question of interest     

# Summary of Statistical Findings (7 pts)  

* Statistical model used (1 pt)
* Plot of model fit with original data (1 pt)

* Results from ESS F-test for testing model fit against a null model (1 pt)
* Results from ESS F-test for testing model fit between the linear and quadratic models (1 pt) 
* Results from hypothesis test(s) for association between `Respiratory Rate` and `Age` (1 pt)  
* Confidence intervals for `$Age$` and `$Age^2$` and their associated interpretations

* Results from the model's $R^2$ *and* what it indicates about model fit      
 
* Discussion of prediction intervals and what they indicate to a physician (1 pt) 
    + Prediction interval plot (1 pt)

# Scope of Inference (4 pts)  

* How was the sample collected? (1 pt) 
    + What population can the statistical findings be inferred to? (1 pt)
    
* How was the explanatory variable assigned? (1 pt)
    + What type of relationship can you infer between the response and the explanatory variable? (1 pt)
    
* What is the conclusion you can reach about the linear model with log(respiratory rate) and age

\pagebreak 

# Appendix

## 1. Read in the data and plot it

\vspace{1cm}

```{r,fig.height = 3, tidy = TRUE}
library(Sleuth3)
d <- ex0824
summary(d)

plot(Rate ~ Age, ylab = "respiratory rate", xlab = "Age (months)", data = d)

# Curvature!   Is it modelled by log transforming the response?   No!
plot(log(Rate) ~ Age, data = d)
```

## 2. Regression models

\vspace{1cm}

```{r, tidy = TRUE}
source("http://www.math.montana.edu/parker/courses/STAT411/diagANOVA.r")

# SLR fits poorly as suggetsed by scatterplot
m.orig <- lm(Rate ~ Age, data = d)
diagANOVA(m.orig)

# Non-constant variance! Long right tail!
m.orig.quad <- lm(Rate ~ Age + I(Age^2), data = d)
diagANOVA(m.orig.quad)

# log-transform of response appears to deal with non-constant variance, but still curvature
m.logRate <- lm(log(Rate) ~ Age, data = d)
diagANOVA(m.logRate)


m.logRate.quad <- lm(log(Rate) ~ Age + I(Age^2), data = d)
```

## 3. Final model that log transforms rate and includes a quadratic for Age appears to not violate the model assumptions.

\vspace{1cm}

```{r, tidy = TRUE}
diagANOVA(m.logRate.quad)
# constant variance and Normality are no longer violated 
```

## 4. Extra sum of squares test comparing regression model to null model  

\vspace{1cm}

```{r}
summary(m.logRate.quad)
## F-statistic at the bottom compares the fitted model to the null (equal means) model 
## This test suggests the quadratic regression model is better than equal means
```

## 5. Extra sum of squares test suggests including the quadratic term

\vspace{1cm}

```{r, tidy = TRUE}
anova(m.logRate, m.logRate.quad)
## The smaller model (without quadratic term) goes first, then larger model
## This tests if the larger model is a better fit of the data, compared to the smaller model
```

## 7. Adding model fits to scatterplot

\vspace{1cm}

```{r, tidy = TRUE}
plot(Rate ~ Age, ylab = "Respiratory rate", xlab = "Age (months)", data = d)
abline(coef(m.orig), lwd = 2, col = "blue")

coefs <- coef(m.logRate.quad)

curve(exp(coefs[1] + coefs[2]*x + coefs[3]*x^2), col = 2, lty = 2, lwd = 3, add = TRUE)

legend("topright", legend = c("SLR", "exp(Quadratic Model)"), 
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)
```

## 9. Hypothesis tests for coefficients on `Age`

There are two confidence intervals (for $\beta_1$ (see "Age" below) and $\beta_2$ (see "Age^2" below)) that suggest that there is a strong association between Rate and Age, as neither interval contains 0. 

The summary of these coefficients also tests if the slopes are _different_ from 0, conditional on the other term (linear or quadratic) being in the model. As both of these tests have very small p-values, we can conclude that there is an association between `Age` and `Rate`.   

\vspace{1cm}

```{r}
confint(m.logRate.quad)

summary(m.logRate.quad)$coefficients
```

\pagebreak 

## 11. "normal"" range via individual Prediction Interval plots
```{r, tidy = TRUE}
# Set up a new data.frame of X0 values that we want to construct PIs for
new <- data.frame(Age = seq(0, 36, by = 1))  

# Get the individual PIs for the individual predicted responses Y=Births at each X0 Year 
pred.ses <- predict(m.logRate.quad, newdata = new, se.fit = TRUE, interval = "prediction")

head(data.frame(new, pred.ses$fit))
# Inspect the predictions 

plot(Rate ~ Age, data = d)
lines(new$Age, exp(pred.ses$fit[ ,1]), lty = 1, lwd = 3)        # add the fitted line
lines(new$Age, exp(pred.ses$fit[ ,2]), lty = 2, lwd = 3, col = "blue") # lower ind. PL 
lines(new$Age, exp(pred.ses$fit[ ,3]), lty = 2, lwd = 3, col = "blue") # upper ind. PL
legend("topright", legend = c("Estimated Median \n(quadratic regression)", "95% Prediction Interval"), 
       col = c("black", "blue"), lty = c(1, 2), lwd = 3)
```

\pagebreak

# EXTRA CREDIT: Is median rate of breaths for 36 month-olds larger than adults (25 breaths per minute)?  

Using the previously determined "best" model, we can use the "shift trick" to estimate the mean number of log-breaths of a 36 month old child. 

The estimated number of log-breaths for a 36 month old are 3.245, with an associated standard error of 0.02819. We can compute the test statistic as follows, 

$$t=\frac{3.245 - \log(25)}{0.02819} = 0.9430924,$$ 

which follows a t-distribution with 615 df. To find the associated p-value, we find the probability of observing a test statistic __greater__ than 0.9430924, under a $t_{615}.$ The associated p-value is 0.177. 

Therefore, no, the median rate of breaths per minute for 36 month olds is not greater than 25 breaths/minute. 
\vspace{1cm}

```{r, tidy = TRUE}
m.logRate.quad.shift <- lm(log(Rate) ~ I(Age - 36) + I((Age - 36)^2), data = d)
## Intercept is the estimated log mean number of breaths for 36 month olds

exp(confint(m.logRate.quad.shift))
## Find back transformed CI (exponentiate) and see if 25 is included in the interval

test.stat <- (coef(m.logRate.quad.shift)[1] - log(25))/0.02819445
pt(test.stat, df = 615, lower.tail = FALSE)


## You would get a different answer if you did not use the above model, and instead used a one sample t-test comparison of the log(Rate) for observations that are 36 months with log(25)  

t.test(log(d$Rate[d$Age == 36]), mu = log(25), conf.level = 0.95, alternative = "greater")
## For this procedure, the test statistic is 1.1324 and it follows a t-distribution with 2 degrees of freedom (NOT 615!)

```