--- title: "Outline of HW 9 Solutions" output: pdf_document --- # Introduction (3 pts) Exercise 24 on page 232 of your text where the respiratory rates (number of breaths per minute) for $n = 618$ children of different ages (in months). * Description of the data collected (1 pt) * Description of the research question(s) of interest (1 pt) * Description of the problem (1 pt) # Statistical Procedures Used (10 pts) * Descriptive plots of the data * Description of __all__ of the linear models that were be considered *and* why (2 pts) * Description of how the "best" (final) model that was chosen (2 pts) * Evaluation of model assumptions for the final model that was chosen (4 pts) * Description of the hypotheses that will be tested *and* how they address the research question of interest + Confidence intervals that will be constructed *and* why * Description of ESS F-test for testing model fit against a null model (1 pt) * Description of ESS F-test for testing model fit between the linear and quadratic models (1 pt) + Hypotheses that will be tested *and* why * Description of how $R^2$ will be used *and* what it indicates about model fit * Description of prediction intervals *and* how they address the research question of interest # Summary of Statistical Findings (7 pts) * Statistical model used (1 pt) * Plot of model fit with original data (1 pt) * Results from ESS F-test for testing model fit against a null model (1 pt) * Results from ESS F-test for testing model fit between the linear and quadratic models (1 pt) * Results from hypothesis test(s) for association between `Respiratory Rate` and `Age` (1 pt) * Confidence intervals for `$Age$` and `$Age^2$` and their associated interpretations * Results from the model's $R^2$ *and* what it indicates about model fit * Discussion of prediction intervals and what they indicate to a physician (1 pt) + Prediction interval plot (1 pt) # Scope of Inference (4 pts) * How was the sample collected? (1 pt) + What population can the statistical findings be inferred to? (1 pt) * How was the explanatory variable assigned? (1 pt) + What type of relationship can you infer between the response and the explanatory variable? (1 pt) * What is the conclusion you can reach about the linear model with log(respiratory rate) and age \pagebreak # Appendix ## 1. Read in the data and plot it \vspace{1cm} ```{r,fig.height = 3, tidy = TRUE} library(Sleuth3) d <- ex0824 summary(d) plot(Rate ~ Age, ylab = "respiratory rate", xlab = "Age (months)", data = d) # Curvature! Is it modelled by log transforming the response? No! plot(log(Rate) ~ Age, data = d) ``` ## 2. Regression models \vspace{1cm} ```{r, tidy = TRUE} source("http://www.math.montana.edu/parker/courses/STAT411/diagANOVA.r") # SLR fits poorly as suggetsed by scatterplot m.orig <- lm(Rate ~ Age, data = d) diagANOVA(m.orig) # Non-constant variance! Long right tail! m.orig.quad <- lm(Rate ~ Age + I(Age^2), data = d) diagANOVA(m.orig.quad) # log-transform of response appears to deal with non-constant variance, but still curvature m.logRate <- lm(log(Rate) ~ Age, data = d) diagANOVA(m.logRate) m.logRate.quad <- lm(log(Rate) ~ Age + I(Age^2), data = d) ``` ## 3. Final model that log transforms rate and includes a quadratic for Age appears to not violate the model assumptions. \vspace{1cm} ```{r, tidy = TRUE} diagANOVA(m.logRate.quad) # constant variance and Normality are no longer violated ``` ## 4. Extra sum of squares test comparing regression model to null model \vspace{1cm} ```{r} summary(m.logRate.quad) ## F-statistic at the bottom compares the fitted model to the null (equal means) model ## This test suggests the quadratic regression model is better than equal means ``` ## 5. Extra sum of squares test suggests including the quadratic term \vspace{1cm} ```{r, tidy = TRUE} anova(m.logRate, m.logRate.quad) ## The smaller model (without quadratic term) goes first, then larger model ## This tests if the larger model is a better fit of the data, compared to the smaller model ``` ## 7. Adding model fits to scatterplot \vspace{1cm} ```{r, tidy = TRUE} plot(Rate ~ Age, ylab = "Respiratory rate", xlab = "Age (months)", data = d) abline(coef(m.orig), lwd = 2, col = "blue") coefs <- coef(m.logRate.quad) curve(exp(coefs[1] + coefs[2]*x + coefs[3]*x^2), col = 2, lty = 2, lwd = 3, add = TRUE) legend("topright", legend = c("SLR", "exp(Quadratic Model)"), col = c("blue", "red"), lty = c(1, 2), lwd = 2) ``` ## 9. Hypothesis tests for coefficients on `Age` There are two confidence intervals (for $\beta_1$ (see "Age" below) and $\beta_2$ (see "Age^2" below)) that suggest that there is a strong association between Rate and Age, as neither interval contains 0. The summary of these coefficients also tests if the slopes are _different_ from 0, conditional on the other term (linear or quadratic) being in the model. As both of these tests have very small p-values, we can conclude that there is an association between `Age` and `Rate`. \vspace{1cm} ```{r} confint(m.logRate.quad) summary(m.logRate.quad)$coefficients ``` \pagebreak ## 11. "normal"" range via individual Prediction Interval plots ```{r, tidy = TRUE} # Set up a new data.frame of X0 values that we want to construct PIs for new <- data.frame(Age = seq(0, 36, by = 1)) # Get the individual PIs for the individual predicted responses Y=Births at each X0 Year pred.ses <- predict(m.logRate.quad, newdata = new, se.fit = TRUE, interval = "prediction") head(data.frame(new, pred.ses$fit)) # Inspect the predictions plot(Rate ~ Age, data = d) lines(new$Age, exp(pred.ses$fit[ ,1]), lty = 1, lwd = 3) # add the fitted line lines(new$Age, exp(pred.ses$fit[ ,2]), lty = 2, lwd = 3, col = "blue") # lower ind. PL lines(new$Age, exp(pred.ses$fit[ ,3]), lty = 2, lwd = 3, col = "blue") # upper ind. PL legend("topright", legend = c("Estimated Median \n(quadratic regression)", "95% Prediction Interval"), col = c("black", "blue"), lty = c(1, 2), lwd = 3) ``` \pagebreak # EXTRA CREDIT: Is median rate of breaths for 36 month-olds larger than adults (25 breaths per minute)? Using the previously determined "best" model, we can use the "shift trick" to estimate the mean number of log-breaths of a 36 month old child. The estimated number of log-breaths for a 36 month old are 3.245, with an associated standard error of 0.02819. We can compute the test statistic as follows, $$t=\frac{3.245 - \log(25)}{0.02819} = 0.9430924,$$ which follows a t-distribution with 615 df. To find the associated p-value, we find the probability of observing a test statistic __greater__ than 0.9430924, under a $t_{615}.$ The associated p-value is 0.177. Therefore, no, the median rate of breaths per minute for 36 month olds is not greater than 25 breaths/minute. \vspace{1cm} ```{r, tidy = TRUE} m.logRate.quad.shift <- lm(log(Rate) ~ I(Age - 36) + I((Age - 36)^2), data = d) ## Intercept is the estimated log mean number of breaths for 36 month olds exp(confint(m.logRate.quad.shift)) ## Find back transformed CI (exponentiate) and see if 25 is included in the interval test.stat <- (coef(m.logRate.quad.shift)[1] - log(25))/0.02819445 pt(test.stat, df = 615, lower.tail = FALSE) ## You would get a different answer if you did not use the above model, and instead used a one sample t-test comparison of the log(Rate) for observations that are 36 months with log(25) t.test(log(d$Rate[d$Age == 36]), mu = log(25), conf.level = 0.95, alternative = "greater") ## For this procedure, the test statistic is 1.1324 and it follows a t-distribution with 2 degrees of freedom (NOT 615!) ```