Regression Basics
The model is yi = β0 +
β1xi + εi where y
is the response, x is the predictor, and ε is
the error. The i subscript means they each belong to the ith
observation. The model shown is that of a straight line plus
vertical error off the line. In addition to assuming a line fits
well, we also assume:
Using an example from SAS PROC REG help pages, we first look at a
scatterplot of y versus x.
data fitness;
input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
label Age ='age(years)'
Weight ='weight(kg)'
Oxygen ='oxygen uptake(ml/kg/min)'
RunTime ='1.5 mile time(min)'
RestPulse='rest pulse'
RunPulse ='running pulse'
MaxPulse ='maximum running pulse';
datalines;
44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185
44 85.84 54.297 8.65 45 156 168 42 68.15 59.571 8.17 40 166 172
38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176 38 81.87 60.055 8.63 48 170 186
44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170 49 81.42 49.156 8.95 44 180 185
51 69.63 40.836 10.95 57 168 172 51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165
52 76.32 45.441 9.63 48 164 166 50 70.87 54.625 8.92 48 146 155
51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155
49 76.32 48.673 9.40 56 186 188 48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
PROC SGPLOT;
scatter y= Oxygen x= RunTime ;
run;
PROC REG data = Fitness;
model Oxygen = RunTime;
run;

Output:
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 632.90010 632.90010 84.01 <.0001
Error 29 218.48144 7.53384
Corrected Total 30 851.38154
Root MSE 2.74478 R-Square 0.7434
Dependent Mean 47.37581 Adj R-Sq 0.7345
Coeff Var 5.79364
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 82.42177 3.85530 21.38 <.0001
RunTime 1 -3.31056 0.36119 -9.17 <.0001
Interpretation:
PROC SGscatter data = fitness;
matrix Oxygen Age Weight RunTime RunPulse RestPulse MaxPulse;
run;
PROC REG data = fitness;
model Oxygen = Age Weight RunTime RunPulse RestPulse MaxPulse;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen oxygen uptake(ml/kg/min)
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 6 722.54361 120.42393 22.43 <.0001
Error 24 128.83794 5.36825
Corrected Total 30 851.38154
Root MSE 2.31695 R-Square 0.8487
Dependent Mean 47.37581 Adj R-Sq 0.8108
Coeff Var 4.89057
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 102.93448 12.40326 8.30 <.0001
Age age(years) 1 -0.22697 0.09984 -2.27 0.0322
Weight weight(kg) 1 -0.07418 0.05459 -1.36 0.1869
RunTime 1.5 mile time(min) 1 -2.62865 0.38456 -6.84 <.0001
RunPulse running pulse 1 -0.36963 0.11985 -3.08 0.0051
RestPulse rest pulse 1 -0.02153 0.06605 -0.33 0.7473
MaxPulse maximum running pulse 1 0.30322 0.13650 2.22 0.0360

fitness <- read.table("fitness.dat", head=T)
## SLR
fitness.fit1 <- lm( Oxygen ~ RunTime, fitness )
par(mfrow=c(2,2))
plot(fitness.fit1)
dev.copy(png,"../plots/O2Rdiagnostics1.png",height=400,width=500); dev.off()
### MLR
fitness.fullfit <- lm( Oxygen ~ . , fitness )
plot(fitness.fullfit)
dev.copy(png,"../plots/O2Rdiagnostics2.png",height=400,width=500); dev.off()
Use these to check for non-constant variance and lack of normality.
PROC ANOVA is used to see if there are significant differences
between means of the response based on some categorical predictor.
The model is written as
yij = μ + τi + εij
or yij =
μi + εij
where i indexes the group number and ij the jth individual within the ith
group.
Example:
DM "log;clear;out;clear;";
OPTIONS ls = 78 ps = 66;
DATA gloves; ** use columns because of missing values;
input ID 3-4 Preobs 7-8 obs1 11-12 obs2 15-16 obs5 19-20 PreCount 23-24
Count1 27-28 Count2 31-32 Count5 35-36 Years 39-40;
datalines;
1 2 7 1 1 6 1 15
2 2 6 11 9 1 5 10 9 2
3 5 13 8 15 5 13 7 14 3
4 2 2 2 5 0 2 2 4 10
5 12 2 3 3 0 2 3 0 20
6 3 8 3 4 0 8 2 2 8
7 4 4 4 4 9
8 4 4 2 0 4 1 9
9 2 3 1 2 0 2 1 1 15
10 6 1 2 1 1 2 8
11 3 4 8 2 0 3 6 0 8
12 2 3 8 5 0 3 8 5 2
13 1 0 5
14 1 3 0 3 15
15 1 2 1 1 1 2 1 1 3
16 1 1 0 1 14
17 2 3 1 2 3 1 14
18 1 1 1 1 1 1 8
19 2 1 3
20 1 1 6
21 1 1 3
22 2 2 1
23 1 0 6
;
run;
DATA glove2;
set gloves;
time = 0; observed = PreObs; Used = PreCount; Output;
time = 1; observed = Obs1; Used = Count1; Output;
time = 2; observed = Obs2; Used = Count2; Output;
time = 5; observed = Obs5; Used = Count5; Output;
keep ID time observed used years;
run;
DATA glove2; set glove2;
ratio = used/observed;
run;
ODS graphics on / IMAGEFMT = png IMAGENAME = "plots/glovesBoxPlot"
height = 4in width=5in;
PROC ANOVA data = glove2;
class time; ** a classification (categorical) predictor;
model ratio = time;
means time ;
run;
ODS graphics off;
The ANOVA Procedure
Class Level Information
Class Levels Values
time 4 0 1 2 5
Number of observations 92
NOTE: Due to missing values, only 63 observations can be used in this analysis.
The ANOVA Procedure (page 2)
Dependent Variable: ratio
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 4.76127679 1.58709226 17.70 <.0001
Error 59 5.29146518 0.08968585
Corrected Total 62 10.05274197
R-Square Coeff Var Root MSE ratio Mean
0.473630 42.67786 0.299476 0.701713
Source DF Anova SS Mean Square F Value Pr > F
time 3 4.76127679 1.58709226 17.70 <.0001
The ANOVA Procedure (page 3)
Level of ------------ratio------------
time N Mean Std Dev
0 16 0.26041667 0.40354333
1 19 0.92669173 0.14331184
2 15 0.91338384 0.15468059
5 13 0.67179487 0.42269991
CLASS statement. If a predictor is listed in
the CLASS statement, then a model is fit to each level of the variable.
Otherwise the assumption is that the predictor takes continuous
values, and a regression model is fit. Default plots depend on the
type of model used.
A PROC GLM example from The Little SAS Book.
DATA BBall;
INPUT Team $ Height @@;
datalines;
red 55 red 48 red 53 red 47 red 51 red 43
red 45 red 46 red 55 red 54 red 45 red 52
blue 46 blue 56 blue 48 blue 47 blue 54 blue 52
blue 49 blue 51 blue 45 blue 48 blue 55 blue 47
gray 55 gray 45 gray 47 gray 56 gray 49 gray 53
gray 48 gray 53 gray 51 gray 52 gray 48 gray 47
pink 53 pink 53 pink 58 pink 56 pink 50 pink 55
pink 59 pink 57 pink 49 pink 55 pink 56 pink 57
gold 53 gold 55 gold 48 gold 45 gold 47 gold 56
gold 55 gold 46 gold 47 gold 53 gold 51 gold 50
;
ODS graphics on / IMAGEFMT = png IMAGENAME = "BBallHtsGLM"
height =4in width=5in;
PROC GLM data = BBall PLOTS = diagnostics;
CLASS Team;
MODEL Height = Team;
MEANS Team / SCHEFFE; ** compare each pair of means;
TITLE "Girls' Heights on BBall Teams";
OUTPUT out = BBllResd
PREDICTED = yhats rstudent = rresid r = resid ;
run;
ODS graphics off;
Again, the question of interest is about differences
in means. Is there a significant difference in mean
height between the six teams? What part of the output
below gives you that conclusion?
Girls' Heights on BBall Teams
The GLM Procedure
Class Level Information
Class Levels Values
Team 5 blue gold gray pink red
Number of Observations Read 60
Number of Observations Used 60
The GLM Procedure
Dependent Variable: Height
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 4 228.0000000 57.0000000 4.14 0.0053
Error 55 758.0000000 13.7818182
Corrected Total 59 986.0000000
R-Square Coeff Var Root MSE Height Mean
0.231237 7.279190 3.712387 51.00000
Source DF Type I SS Mean Square F Value Pr > F
Team 4 228.0000000 57.0000000 4.14 0.0053
Source DF Type III SS Mean Square F Value Pr > F
Team 4 228.0000000 57.0000000 4.14 0.0053
The GLM Procedure
Scheffe's Test for Height
NOTE: This test controls the Type I experimentwise error rate.
Alpha 0.05
Error Degrees of Freedom 55
Error Mean Square 13.78182
Critical Value of F 2.53969
Minimum Significant Difference 4.8306
Means with the same letter are not significantly different.
Scheffe Grouping Mean N Team
A 54.833 12 pink
A
B A 50.500 12 gold
B A
B A 50.333 12 gray
B
B 49.833 12 blue
B
B 49.500 12 red
Scheffe's test compares every pair of "treatments" to see if their
means are "different". Look for Teams which do not share either
letter, A or B. Blue is significantly shorter than pink, and red is
significantly different from __________________(which others?) at the 0.05
significance level. We are 95% confident that we are not making an
error in both of those statements simultaneously.
BBall <- read.table("data/basketBallHts.dat", head=T)
plot( height ~ team, BBall)
BBfit <- lm( height ~ team, BBall)
## team is a factor because it's character data
plot(BBfit)
require(multcomp)
glht(BBfit, linfct = mcp(team = "Tukey"))