Stat 408 Regression and ANOVA

Statistics is all about assessing relationships and making predictions. We always need to include information about the quality of our conclusions, either with error estimates or confidence intervals.

Two Continuous Variables

The scatterplot is the standard method of comparing two continuous variables. We want to see how one variable changes with changes in the other. In some cases, there are reasons to think that one variable causes changes in the other. In other situations, we just think they are related without saying changes are causal.

Regression Basics
The model is yi = β0 + β1xi + εi, where y is the response, x is the predictor, and ε is the error. The subscript i means each belongs to the ith observation. The model shown is a straight line plus vertical error off the line. In addition to assuming a line fits well, we also assume:

  1. the errors have constant variance,
  2. the errors are independent, and
  3. the errors are normally distributed.

The first and third assumptions are often checked with plots, as we have been doing.
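
To make the model concrete, here is a minimal R sketch (not part of the course example; the parameter values 80, -3, and 2 are made up) that simulates data from the straight-line-plus-error model and then recovers the line:

  ## Simulate y = beta0 + beta1*x + error with normal errors, then fit the line
  set.seed(408)
  x   <- runif(50, 8, 14)                  # made-up predictor values
  y   <- 80 - 3 * x + rnorm(50, sd = 2)    # "true" beta0 = 80, beta1 = -3, sigma = 2
  fit <- lm(y ~ x)
  coef(fit)                                # estimates should be near 80 and -3
  plot(x, y); abline(fit)                  # scatterplot with the fitted line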

Using an example from SAS PROC REG help pages, we first look at a scatterplot of y versus x.

  data fitness;
     input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
     label Age      ='age(years)'
            Weight   ='weight(kg)'
            Oxygen   ='oxygen uptake(ml/kg/min)'
            RunTime  ='1.5 mile time(min)'
            RestPulse='rest pulse'
            RunPulse ='running pulse'
            MaxPulse ='maximum running pulse';
   datalines;
   44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
   44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
   38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
   40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
   44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
   44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
   45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
   54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
   51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
   48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
   57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
   52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
   51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
   51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
   49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
   52 82.78 47.467 10.50 53 170 172
   ;
 PROC SGPLOT data = fitness;
      scatter  y = Oxygen  x = RunTime;
 run;
 PROC REG data = Fitness;
      model Oxygen = RunTime;
 run;

The PROC REG code does a simple linear regression. The model statement says that Oxygen is the response of interest, and we want to use RunTime as the predictor.

Output:


                                  The REG Procedure
                                    Model: MODEL1
                              Dependent Variable: Oxygen
                                Analysis of Variance
                                       Sum of           Mean
   Source                   DF        Squares         Square    F Value    Pr > F
   Model                     1      632.90010      632.90010      84.01    <.0001
   Error                    29      218.48144        7.53384
   Corrected Total          30      851.38154

                Root MSE              2.74478    R-Square     0.7434
                Dependent Mean       47.37581    Adj R-Sq     0.7345
                Coeff Var             5.79364

                                Parameter Estimates
                             Parameter       Standard
        Variable     DF       Estimate          Error    t Value    Pr > |t|

        Intercept     1       82.42177        3.85530      21.38      <.0001
        RunTime       1       -3.31056        0.36119      -9.17      <.0001
Interpretation:
In the first table, the F test has null hypothesis β1 = 0 versus the alternative that β1 is nonzero. Do we reject the null?
The parameter estimates table gives our estimated coefficients, their standard errors, the t-ratio (estimate / SE), and a p-value for testing that the parameter is zero given that all other terms are in the model.
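
As a quick check, the t-ratio and the F statistic can be reproduced from the other numbers in the output; a small R sketch using the values printed above (note that in simple linear regression F = t squared, up to rounding):

  -3.31056 / 0.36119                                  # t = estimate / SE, about -9.17
  2 * pt(-9.17, df = 29)                              # two-sided p-value, < .0001
  632.90010 / 7.53384                                 # F = MS(Model) / MS(Error), about 84.01
  pf(84.01, df1 = 1, df2 = 29, lower.tail = FALSE)    # Pr > F, < .0001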

Multiple Regression

Regress O2 uptake on all predictors.
 PROC SGscatter data = fitness;
      matrix  Oxygen Age Weight RunTime RunPulse RestPulse MaxPulse;
 run;
 PROC REG data = fitness;
      model Oxygen = Age Weight RunTime RunPulse RestPulse MaxPulse;
 run;
                         The REG Procedure
                         Model: MODEL1
         Dependent Variable: Oxygen oxygen uptake(ml/kg/min)

                        Analysis of Variance
                                     Sum of           Mean
 Source                   DF        Squares         Square    F Value    Pr > F

 Model                     6      722.54361      120.42393      22.43    <.0001
 Error                    24      128.83794        5.36825
 Corrected Total          30      851.38154


                 Root MSE              2.31695    R-Square     0.8487
                 Dependent Mean       47.37581    Adj R-Sq     0.8108
                 Coeff Var             4.89057


                         Parameter Estimates

                                           Parameter  Standard
 Variable     Label                  DF    Estimate      Error  t Value  Pr > |t|

 Intercept    Intercept               1   102.93448   12.40326     8.30    <.0001
 Age          age(years)              1    -0.22697    0.09984    -2.27    0.0322
 Weight       weight(kg)              1    -0.07418    0.05459    -1.36    0.1869
 RunTime      1.5 mile time(min)      1    -2.62865    0.38456    -6.84    <.0001
 RunPulse     running pulse           1    -0.36963    0.11985    -3.08    0.0051
 RestPulse    rest pulse              1    -0.02153    0.06605    -0.33    0.7473
 MaxPulse     maximum running pulse   1     0.30322    0.13650     2.22    0.0360

Now the first F test is used to see whether any of the predictors are needed in the model. The t-tests in the second table are each conditional on the other terms being in the model, and it seems that not all the terms are needed. The diagnostic plots were discussed earlier. Much of statistical computing is fitting models: plot the raw data to be sure that the models we fit are appropriate for the data, and use plots to check the assumptions.
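
Because the t-tests are conditional on the other terms, dropping predictors is usually done one at a time or with a nested-model F test. A minimal R sketch of that idea, using the fitness data frame defined in the next section (dropping Weight and RestPulse here is only for illustration, based on their large p-values above):

  ## Compare the full model to a reduced model with a partial (nested) F test
  full    <- lm(Oxygen ~ Age + Weight + RunTime + RunPulse + RestPulse + MaxPulse,
                data = fitness)
  reduced <- lm(Oxygen ~ Age + RunTime + RunPulse + MaxPulse, data = fitness)
  anova(reduced, full)   # a small p-value would say the dropped terms are needed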

Regression in R

  fitness <- read.table("fitness.dat", header = TRUE)
  ## SLR: regress Oxygen on RunTime
  fitness.fit1 <- lm(Oxygen ~ RunTime, data = fitness)
  par(mfrow = c(2, 2))            # 2 x 2 grid of diagnostic plots
  plot(fitness.fit1)
  dev.copy(png, "../plots/O2Rdiagnostics1.png", height = 400, width = 500); dev.off()
  ## MLR: regress Oxygen on all other columns
  fitness.fullfit <- lm(Oxygen ~ ., data = fitness)
  plot(fitness.fullfit)
  dev.copy(png, "../plots/O2Rdiagnostics2.png", height = 400, width = 500); dev.off()
Use these to check for non-constant variance and lack of normality.
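
To see the estimates themselves, which should agree with the PROC REG output above (assuming fitness.dat holds the same data as the SAS data step):

  summary(fitness.fit1)    # intercept near 82.42, RunTime slope near -3.31
  confint(fitness.fit1)    # 95% confidence intervals for the coefficients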

One continuous, one categorical variable with ANOVA

What plot do I use to compare GPAs of students in the 7 colleges of MSU? Is Engineering harder than Letters & Sciences? Do Nursing students get better grades than Business students?

PROC ANOVA is used to see if there are significant differences between means of the response based on some categorical predictor. The model is written as
yij = μ + τi + εij     or     yij = μi + εij
    where i indexes the group and j indexes the jth individual within the ith group. Example:

 DM "log;clear;out;clear;";
 OPTIONS   ls = 78 ps = 66;
DATA gloves;               ** use columns because of missing values;
     input ID 3-4 Preobs 7-8 obs1 11-12 obs2 15-16 obs5 19-20 PreCount 23-24
               Count1 27-28 Count2 31-32 Count5 35-36 Years 39-40;
   datalines;
   1   2   7   1       1   6   1      15
   2   2   6  11   9   1   5  10   9   2
   3   5  13   8  15   5  13   7  14   3
   4   2   2   2   5   0   2   2   4  10
   5  12   2   3   3   0   2   3   0  20
   6   3   8   3   4   0   8   2   2   8
   7   4   4           4   4           9
   8   4   4   2       0   4   1       9
   9   2   3   1   2   0   2   1   1  15
  10   6   1   2       1   1   2       8
  11   3   4   8   2   0   3   6   0   8
  12   2   3   8   5   0   3   8   5   2
  13   1               0               5
  14   1   3           0   3          15
  15   1   2   1   1   1   2   1   1   3
  16   1           1   0           1  14
  17       2   3   1       2   3   1  14
  18       1   1   1       1   1   1   8
  19       2               1           3
  20       1               1           6
  21       1               1           3
  22           2               2       1
  23               1               0   6
;
run;

DATA glove2;
     set gloves;
     time = 0; observed = PreObs; Used = PreCount; Output;
     time = 1; observed = Obs1;   Used = Count1;   Output;
     time = 2; observed = Obs2;   Used = Count2;   Output;
     time = 5; observed = Obs5;   Used = Count5;   Output; 
     keep ID time observed used  years;
run;

DATA glove2; set glove2;
     ratio = used/observed;
run;
ODS graphics on / IMAGEFMT = png  IMAGENAME = "plots/glovesBoxPlot"
                  height = 4in width=5in;
PROC ANOVA data = glove2;     
  class time;              **  a classification (categorical) predictor;
  model ratio = time;
  means time ;
run; 
ODS graphics off;

Output:
                         The ANOVA Procedure

                           Class Level Information
                       Class         Levels    Values
                       time               4    0 1 2 5
                         Number of observations    92
NOTE: Due to missing values, only 63 observations can be used in this analysis.

                            The ANOVA Procedure (page 2)
Dependent Variable: ratio
                                      Sum of
Source                     DF        Squares    Mean Square   F Value   Pr > F
Model                       3     4.76127679     1.58709226     17.70   <.0001
Error                      59     5.29146518     0.08968585
Corrected Total            62    10.05274197

              R-Square     Coeff Var      Root MSE    ratio Mean
              0.473630      42.67786      0.299476      0.701713

Source                     DF       Anova SS    Mean Square   F Value   Pr > F
time                        3     4.76127679     1.58709226     17.70   <.0001

                            The ANOVA Procedure (page 3)

              Level of            ------------ratio------------
              time          N             Mean          Std Dev
              0            16       0.26041667       0.40354333
              1            19       0.92669173       0.14331184
              2            15       0.91338384       0.15468059
              5            13       0.67179487       0.42269991

The output says that there is a significant difference between the means of the ratio variable over time, which is expected after looking at the plot. The important part of the output is the line labeled "time", just under the "Source" label. The large F value and small p-value (Pr > F) tell us that the differences in means are significant. How would you describe the effect of the training session? Is there some evidence that nurses forget their training?
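
For comparison, here is a minimal sketch of the same reshape-and-ANOVA in R. It assumes the wide gloves data have been saved to a plain whitespace-delimited file (the name gloves.dat is hypothetical) with the column names from the SAS INPUT statement and explicit NA codes for the blanks, since the fixed-column datalines above would not read directly with read.table.

  ## Read the wide data, stack it into long format, and fit the one-way ANOVA
  gloves <- read.table("gloves.dat", header = TRUE)   # hypothetical file
  glove2 <- data.frame(
      ID       = rep(gloves$ID, 4),
      time     = rep(c(0, 1, 2, 5), each = nrow(gloves)),
      observed = c(gloves$Preobs,   gloves$obs1,   gloves$obs2,   gloves$obs5),
      used     = c(gloves$PreCount, gloves$Count1, gloves$Count2, gloves$Count5))
  glove2$ratio <- glove2$used / glove2$observed
  boxplot(ratio ~ factor(time), data = glove2)        # like the SAS box plot
  glove.fit <- aov(ratio ~ factor(time), data = glove2)
  summary(glove.fit)                                  # F test should match PROC ANOVA (F = 17.70)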

PROC GLM

The advantage of PROC GLM is that it fits both regression models (like PROC REG) and ANOVA models (like PROC ANOVA). It knows which is which through the CLASS statement. If a predictor is listed in the CLASS statement, it is treated as categorical and a separate mean is estimated for each of its levels. Otherwise the predictor is assumed to take continuous values, and a regression slope is fit. Default plots depend on the type of model used. Here is a PROC GLM example from The Little SAS Book.
DATA BBall;
   INPUT Team $ Height @@;
datalines;
red  55 red  48 red  53 red  47 red  51 red  43
red  45 red  46 red  55 red  54 red  45 red  52
blue 46 blue 56 blue 48 blue 47 blue 54 blue 52
blue 49 blue 51 blue 45 blue 48 blue 55 blue 47
gray 55 gray 45 gray 47 gray 56 gray 49 gray 53
gray 48 gray 53 gray 51 gray 52 gray 48 gray 47
pink 53 pink 53 pink 58 pink 56 pink 50 pink 55
pink 59 pink 57 pink 49 pink 55 pink 56 pink 57
gold 53 gold 55 gold 48 gold 45 gold 47 gold 56
gold 55 gold 46 gold 47 gold 53 gold 51 gold 50
;

ODS graphics on / IMAGEFMT = png  IMAGENAME = "BBallHtsGLM"
                  height =4in width=5in;
PROC GLM data = BBall      PLOTS = diagnostics; 
     CLASS Team;
     MODEL Height = Team;
     MEANS Team / SCHEFFE; ** compare each pair of means;
     TITLE "Girls' Heights on BBall Teams";
     OUTPUT out = BBllResd  
            PREDICTED = yhats  rstudent = rresid r = resid ;
run;
ODS graphics off;
Again, the question of interest is about differences in means. Is there a significant difference in mean height between the five teams? What part of the output below gives you that conclusion?

Output
                        Girls' Heights on BBall Teams                       
                              The GLM Procedure
                           Class Level Information
               Class         Levels    Values
               Team               5    blue gold gray pink red

                   Number of Observations Read          60
                   Number of Observations Used          60

                              The GLM Procedure
Dependent Variable: Height

                                      Sum of
Source                     DF        Squares    Mean Square   F Value   Pr > F
Model                       4    228.0000000     57.0000000      4.14   0.0053
Error                      55    758.0000000     13.7818182
Corrected Total            59    986.0000000

             R-Square     Coeff Var      Root MSE    Height Mean
             0.231237      7.279190      3.712387       51.00000

Source                     DF      Type I SS    Mean Square   F Value   Pr > F
Team                        4    228.0000000     57.0000000      4.14   0.0053

Source                     DF    Type III SS    Mean Square   F Value   Pr > F
Team                        4    228.0000000     57.0000000      4.14   0.0053


                              The GLM Procedure
                          Scheffe's Test for Height
        NOTE: This test controls the Type I experimentwise error rate.
                   Alpha                              0.05
                   Error Degrees of Freedom             55
                   Error Mean Square              13.78182
                   Critical Value of F             2.53969
                   Minimum Significant Difference   4.8306

         Means with the same letter are not significantly different.

           Scheffe Grouping          Mean      N    Team
                          A        54.833     12    pink
                          A
                     B    A        50.500     12    gold
                     B    A
                     B    A        50.333     12    gray
                     B
                     B             49.833     12    blue
                     B
                     B             49.500     12    red      
Scheffe's test compares every pair of "treatments" to see whether their means differ. Teams that do not share a letter (A or B) are significantly different. Blue is significantly shorter than pink, and red is significantly different from __________________ (which others?) at the 0.05 significance level. We are 95% confident that we are not making an error in either of those statements, since the error rate is controlled simultaneously over all comparisons.
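
As a check on the output, the "Minimum Significant Difference" can be reproduced in R from the quantities SAS reports (a = 5 groups, n = 12 per group, 55 error df, MSE = 13.78182); pairs of means farther apart than this value are declared significantly different:

  ## Scheffe minimum significant difference for equal group sizes
  a <- 5; n <- 12; dfE <- 55; MSE <- 13.78182
  qf(0.95, a - 1, dfE)                                   # critical F, about 2.54
  sqrt((a - 1) * qf(0.95, a - 1, dfE) * MSE * (2 / n))   # MSD, about 4.83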
Something similar in R (glht from the multcomp package does all pairwise comparisons, here with Tukey's method rather than Scheffe's):
 BBall <- read.table("data/basketBallHts.dat", header = TRUE)
 BBall$team <- factor(BBall$team)   ## make sure team is treated as categorical
 plot(height ~ team, BBall)         ## side-by-side box plots of height by team
 BBfit <- lm(height ~ team, BBall)
 plot(BBfit)                        ## diagnostic plots
 require(multcomp)
 summary(glht(BBfit, linfct = mcp(team = "Tukey")))   ## all pairwise comparisons
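
Like PROC GLM's CLASS statement, lm() decides between ANOVA-style and regression-style terms from the type of the predictor: a factor gets a separate mean for each level, while a numeric column gets a single slope. The ANOVA table from the factor fit should match the PROC GLM output above:

 anova(BBfit)    ## Team F test: F about 4.14, Pr > F about 0.0053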


Author: Jim Robison-Cox
Last Updated: