R Functions for Inference

In Lab 1, we learned that the form of an R function is:

function.name( arg.name=value, ... )

R has many built-in functions, including ones that conduct statistical inference (hypothesis tests and confidence intervals) on a given data set. In this lab, we'll explore two R functions for inference:

binom.test
prop.test

The first function conducts exact inference using a binomial distribution, whereas the second function uses the normal approximation to conduct inference. To see the arguments to these functions, look at their help files.

?binom.test
?prop.test

Binomial Inference: Hypothesis Testing

Charles Darwin did an experiment comparing the heights of cross- and self-pollinated plants (Zea mays, or corn, to be precise; see "The Effects of Cross and Self Fertilisation in the Vegetable Kingdom", 1876). Enter the data into your R session by copying and pasting the following:

Darwin <- structure(
      list(Pot = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4),
       Crossed = c(23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 
                   18.25, 21.625, 23.25, 21, 22.125, 23, 12), 
          Self = c(17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 
                   15.25, 16.5, 18, 16.25, 18, 12.75, 15.5, 18)), 
         .Names = c("Pot", "Crossed", "Self"), 
          class = "data.frame", 
       row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
                      "11", "12", "13", "14", "15"))

Remember - to look at an object just enter its name: Darwin. The resulting data table should look like this:

       Pot Crossed   Self 
     1   1  23.500 17.375
     2   1  12.000 20.375
     3   1  21.000 20.000
     4   2  22.000 20.000
     5   2  19.125 18.375
     6   2  21.500 18.625
     7   3  22.125 18.625
     8   3  20.375 15.250
     9   3  18.250 16.500
    10   3  21.625 18.000
    11   3  23.250 16.250
    12   4  21.000 18.000
    13   4  22.125 12.750
    14   4  23.000 15.500
    15   4  12.000 18.000

Darwin was interested in demonstrating that cross-pollinated plants were healthier than self-pollinated ones. The standard approach to questions like this one is to frame it as a hypothetical: if there were really no difference between the two groups, what would we expect to see?

To translate this into a question about the binomial distribution, start by converting the pairs of values above to Bernoulli trials. Call it a success (1) if the crossed plant is taller than the self-pollinated plant with which it is paired, and a failure (0) otherwise.

attach(Darwin)
X <- sum(Crossed > Self)

You should have 13 successes out of the 15 trials. Our hypothetical scenario, known as the null hypothesis, corresponds to the claim that the population proportion of 1's is 1/2. In other words, the cross-pollinated and self-pollinated plants are equally likely to be the taller of the two.

Assuming independence, we have a set of 15 Bernoulli trials, of which 13 were successes.
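
If you would like to see the individual trials rather than just the count, one way to do so (not needed for anything below) is:

as.numeric(Crossed > Self)   # the 15 Bernoulli trials, written as 1's and 0's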

p-values

The p-value is an attempt to directly measure the improbability of an observed result, under the model of the null hypothesis. With Darwin's data, we could compute the probability of seeing exactly 13 successes in 15 trials if self-pollinated and cross-pollinated plants are equally likely to be taller:

dbinom(13,15,1/2)

However, for a large number of trials, even the most likely outcome can be quite improbable. For example, try computing the probability of tossing exactly 5000 heads in 10000 tosses of a fair coin.
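
One way to do this (the exact number is unimportant; notice only how small it is):

dbinom(5000, 10000, 1/2)   # roughly 0.008, even though 5000 heads is the single most likely outcome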

What we really want here is the probability of a result at least as unlikely as the one we observed. For a symmetric distribution, this is the set of outcomes at least as far from the expected count as the one we got. The expected value for n=15 and p=1/2 is 7.5, so we want the probability of seeing a result at least as far from 7.5 as 13 is: that includes 13, 14, and 15 in one direction, and 2, 1, and 0 in the other.

Note: In general, a result "at least as unlikely as the one we observed" should be defined by its probability under H0, not by its distance from the expected value. In a symmetric distribution these are the same, but in a skewed distribution they may differ! Can you see how to compute the p-value directly from the definition? For our example with n=15, p=1/2, and X=13, try

p <- dbinom(0:15,15,1/2)
sum(p[p <= dbinom(13,15,1/2)])

Here is a plot of the distribution with a line showing the probability of our observed result:

barplot(dbinom(0:15,15,1/2), names.arg=c(0:15), space=0)
abline(h = dbinom(13,15,1/2))  

Note: abline(h=y) adds a horizontal line at height y; more generally, abline(b,m) adds a line with intercept b and slope m.

The desired calculation is easy to perform in R: just add up the binomial probabilities corresponding to those outcomes:

sum(dbinom(c(0:2,13:15),15,1/2))

Alternatively, we could use the pbinom function for this computation:

pbinom(2,15,1/2) + (1 - pbinom(12,15,1/2))

Either way, you should get the same answer: 0.00738..., which is less than a 1 percent chance. This quantity is called a p-value, which is shorthand for `the probability of seeing a result at least as unlikely as the one observed, if the null hypothesis were really true'.

WARNING: the p-value is NOT `the probability that the null hypothesis is correct'. That is not, in general, computable without more information: specifically, an initial probability (prior) for the truth of the null hypothesis, to be updated by Bayes' rule.
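
To make the distinction concrete, here is a small sketch. The 50/50 prior and the single alternative value p=2/3 are invented purely for illustration; the point is only that a posterior probability for the null hypothesis comes from Bayes' rule applied to a prior, and is a different quantity from the p-value.

prior <- c(H0 = 0.5, H1 = 0.5)            # a hypothetical prior on the two hypotheses
lik   <- c(H0 = dbinom(13, 15, 1/2),      # probability of the data under H0: p = 1/2
           H1 = dbinom(13, 15, 2/3))      # ... and under one particular alternative, p = 2/3
post  <- prior * lik / sum(prior * lik)   # Bayes' rule
post["H0"]                                # posterior probability of H0 - not the p-value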

Rejection regions and significance level

Rather than calculating a p-value, the classical hypothesis test starts with construction of a rejection region - a set of possible values of the statistic for which we would reject the null hypothesis. The size of the rejection region is determined by the significance level, the probability of getting an outcome in the rejection region, computed under the null hypothesis.

Suppose we wish to choose a rejection region of size no bigger than alpha=0.05. To see which values to include, take a look at the binomial probabilities, rounded to 5 decimal places for ease of reading:

round(dbinom(0:15,15,1/2),5)

Let's try using 0:3 and 12:15 for our rejection region: R={0,1,2,3,12,13,14,15}. To compute the effective significance level for this rejection region, create a vector containing the values in the rejection region, and then sum the dbinom values for those outcomes:

R <- c(0:3,12:15)
sum(dbinom(R,15,1/2))

The significance level for this rejection region is 0.035. You should verify that if we make the rejection region any larger, then alpha will be bigger than 0.05.
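
One quick check: adding even one more outcome (say 4, the next most likely on the lower side) already pushes the level past .05.

sum(dbinom(c(0:4, 12:15), 15, 1/2))   # about 0.077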

Now, we look at the observed value, 13, and note that it is in the rejection region. The conclusion: we reject the null hypothesis p=1/2 at the 0.035 significance level. Either p is not 1/2 or we were unlucky.

Power

It is easy to compute the power (probability of rejecting the null hypothesis, given an alternative value for p). We just need to sum the binomial probabilities for the rejection region, using the alternative value of p. Suppose that we were interested in the alternative "p=2/3":

sum(dbinom(R,15,2/3))

It is often useful to compute the power for many values of "p" at once, and plot against "p". Here is a function that will do that for a given rejection region RR or a given significance level alpha:

bin.pwr <- function(n, p0=1/2, RR=NULL, alpha=.05, COL="red")
{
  # plot power vs p for H0: p=p0, default p0=1/2
  # RR, if present, is the chosen rejection region
  # alpha is the target significance level

  if(is.null(RR)){
    # no rejection region supplied: build one from the least likely
    # outcomes under H0, stopping before the total probability exceeds alpha
    X <- 0:n
    db <- dbinom(X, n, p0)
    sdb <- sort(db)
    cb <- cumsum(sdb)
    m <- max((1:(n+1))[cb <= alpha])
    R <- X[db <= sdb[m]]
  } else {
    R <- RR
  }
  Alpha <- sum(dbinom(R, n, p0))      # the attained significance level
  p <- seq(.01, .99, .01)             # grid of alternative values of p
  P <- p
  for(i in 1:length(p)){
    P[i] <- sum(dbinom(R, n, p[i]))   # power at the alternative p[i]
  }
  plot(p, P, ylab="Power", type="l", col=COL)
  title(paste("Power curve for n=", n, "   alpha =", round(Alpha, 4)))
}

The type="l" in the plot command is an "ell", not a "one". It tells the plot function to make a line plot. To use the function with a specified rejection region, create a dataset R with the values defining the rejection region, and enter the sample size n, for example:

R <- c(0:3,12:15)
bin.pwr(15,p0=1/2,RR=R)

The minimum point on this curve is the value at p=1/2, the null hypothesis. The power increases for alternatives farther away from 1/2. Does that make sense to you?

You may also use the function with a specified target significance level without specifying a rejection region:

bin.pwr(15,p0=1/2,alpha=.05)

Try it with some different sample sizes and rejection regions:

par(mfrow=c(2,2))
R10 <- c(0:1,9:10)
bin.pwr(10,RR=R10)
R20 <- c(0:5,15:20)
bin.pwr(20,RR=R20)
R40 <- c(0:13,27:40)
bin.pwr(40,RR=R40)
R80 <- c(0:30,50:80)
bin.pwr(80,RR=R80)

How does the power curve change as the sample size gets larger? (We haven't held alpha exactly fixed, but the attained levels, shown in each plot title, are all between about .02 and .04.)

Testing other null hypotheses

Finally, we could test some other null hypothesis here, such as p=3/4 instead of p=1/2. The procedure is the same; the only change is that the calculations are now based on the assumption that p=3/4. What should we choose for the rejection region?

> # Binomial(15,3/4) probabilities
> cbind(0:15,round(dbinom(0:15,15,3/4),5))
      [,1]    [,2]
 [1,]    0 0.00000
 [2,]    1 0.00000
 [3,]    2 0.00000
 [4,]    3 0.00001
 [5,]    4 0.00010
 [6,]    5 0.00068
 [7,]    6 0.00340
 [8,]    7 0.01311
 [9,]    8 0.03932
[10,]    9 0.09175
[11,]   10 0.16515
[12,]   11 0.22520
[13,]   12 0.22520
[14,]   13 0.15591
[15,]   14 0.06682
[16,]   15 0.01336

barplot(dbinom(0:15,15,3/4),names.arg=c(0:15))
abline(dbinom(7,15,3/4),0)   # horizontal line at the height of dbinom(7,15,3/4)

The tails of the distribution are no longer the same: we need to choose the rejection region asymmetrically. For example, to achieve a significance level no bigger than .05, we could include in the rejection region {0:7, 15}. Verify that the significance level (the sum of the probabilities of outcomes in the rejection region, computed under the null hypothesis) is about .03. If we included 8 in the rejection region, the significance level would exceed .05; and if we let {0:8} be the rejection region, then we would be including an outcome (8) with greater probability than the one we left out (15), which would not make sense. Outcomes with smaller probability must be stronger evidence against the null hypothesis!
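
One way to carry out that check, summing the probabilities of the outcomes in {0:7, 15} under the null hypothesis p=3/4:

sum(dbinom(c(0:7, 15), 15, 3/4))   # about 0.031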

Question: Back to Darwin's data: can we reject the null hypothesis that p=3/4? Verify that the p-value for testing the null hypothesis that p=3/4 is about .38.  
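
If you want to check your answer, the definition-based calculation used earlier for p=1/2 carries over directly; binom.test(13, 15, p=3/4) should give essentially the same number.

p <- dbinom(0:15, 15, 3/4)
sum(p[p <= dbinom(13, 15, 3/4)])   # about 0.38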

The binom.test() function

The R binom.test() function does the computations discussed above. The only disadvantage of using it is that you don't see explicitly what the rejection region is. Read the help file:

?binom.test

Try it with Darwin's data:

          binom.test(13,15,p=1/2)

If you don't understand the arguments or the printed output for the binom.test() function, go back and reread the help file now!

Binomial Inference: Confidence Intervals

A confidence interval is simply the set of plausible values for a parameter, based on the observed data. Once you understand the idea of hypothesis testing, this is straightforward, at least in principle: you just have to find the set of values of p for which you would not reject the corresponding null hypothesis. Let's see how it goes with Darwin's data...

Direct search

We need to find two values for p: the largest plausible value and the smallest plausible value. Again, "implausible" is defined in terms of tail events with probability smaller than .025 on each side (.05 in total). To simplify the search, note that for p=1/2, the probability of observing 13 or more successes was already well below .025. So we need to find two p's: the smallest one for which the probability of at least 13 successes is no smaller than .025, and the largest one for which the probability of 13 or fewer successes is no smaller than .025.

Finding the lower bound for p

A simple-minded iterative search will do the job: we already know that p=1/2 is too small.

> 1 - pbinom(12,15,1/2)
[1] 0.003692627

Try p=0.6:

> 1 - pbinom(12,15,.6)
[1] 0.027114

It's a bit too big; we want .025.

> 1 - pbinom(12,15,.59)
[1] 0.02270254

Close! The answer is somewhere between .59 and .60:

> 1 - pbinom(12,15,.595)
[1] 0.02482434

I'll leave it as an exercise for the reader to refine the value beyond this precision.

Finding the upper bound for p

Now, for the upper end of the confidence interval, we need to find the largest plausible p, in other words, the largest p for which P(X <= 13) is at least .025.

Let's start with a value close to 1:

> pbinom(13,15,.98)
[1] 0.03533831

We are close to .025, but will need to search a bit.

> pbinom(13,15,.99)
[1] 0.009629773

Hence, the upper bound is somewhere between p=.98 and p=.99. Again, I'll leave it as an exercise for the reader. When you have found it to the desired accuracy, say 3 significant digits, you will have what is called a 95% confidence interval for the unknown population proportion p.

Again - the 95% CI is just the set of null hypothesis values of p that would not be rejected at the .05 significance level.
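
If you would rather not iterate by hand, a short numerical search will finish the job. This is just a sketch: uniroot() finds where a function crosses zero on a given interval, and the bracketing intervals below come from the searches we just did.

uniroot(function(p) 1 - pbinom(12, 15, p) - .025, c(.5, .7))$root    # lower bound, about 0.595
uniroot(function(p) pbinom(13, 15, p) - .025, c(.9, .999))$root      # upper bound, about 0.983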

The binom.test function

The binom.test function not only tests your favorite null hypothesis, it also constructs exactly the confidence interval we have been computing "by hand".

               binom.test(13,15)

Try it out, and compare the result to the values we found by hand above.
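
If you want to reuse the interval in later calculations, you can extract it from the object that binom.test() returns (a list of class "htest"):

binom.test(13, 15)$conf.int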

To produce a 90% confidence interval, use

               binom.test(13,15,conf.level=.90)

Approximate Confidence Intervals

The central limit theorem guarantees that as the sample size increases, sample means are approximately normally distributed. The standard 95% confidence interval for a sample mean that is exactly normally distributed is just the sample mean plus or minus 1.96 times the standard deviation of the sample mean (1.96 is the .975 quantile of the standard normal distribution, qnorm(.975)). For the binomial, the sample proportion is a sample mean: the average of the individual Bernoulli trials, each equal to 1 or 0.

The approximate 95% confidence interval for Darwin's data is:

       P <- 13/15
       se <- sqrt( P*(1-P)/15)
       P + 1.96*c(-1,1)*se 

The "c(-1,1)" causes R to compute both the lower and upper endpoints at the same time. You could do it in two steps:

       P - 1.96*se 
       P + 1.96*se

This is not very close to the correct interval (roughly [.59, .98]); in fact, the upper endpoint of the approximate interval is greater than 1. That should be no surprise; a sample size of 15 is not very large. For larger samples, and for values of p closer to 1/2, the approximate confidence interval will be close to the exact interval.
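
To see the approximation improve, compare the two intervals for a made-up larger sample - say 60 successes in 100 trials (these numbers are hypothetical, chosen only so that the sample proportion is nearer 1/2):

       binom.test(60, 100)$conf.int              # exact interval
       .6 + 1.96*c(-1,1)*sqrt(.6*.4/100)         # normal approximation - quite close this time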

There is another R function that produces an approximate CI for binomial proportions: prop.test(). To test the null hypothesis for a specific value of p, with observed data of x successes out of n trials, use:

      prop.test(x, n, p)

The prop.test() function also returns an approximate CI, level .95 by default. For example, with Darwin's data:

      prop.test(13,15)
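
As with binom.test(), you can ask for a different confidence level, and you can also turn off the continuity correction to see its effect; both arguments are described in ?prop.test.

      prop.test(13, 15, conf.level=.90)
      prop.test(13, 15, correct=FALSE)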

This lab was adapted from Albyn Jones' Math 141 Lab Notes, Reed College.