Stat 408 Combining SAS Data Sets
Data may be collected in stages.
Need to combine small datasets into one big one for plots and
analysis.
Add more columns -- Merging datasets
In the Tumor Necrosis Factor data, the amount of TNF produced by cells
was recorded under four different treatments:
- MTB = mycobacterium tuberculosis
- FAT = fixed activated T-cells
- Both MTB and FAT
- Neither treatment
Each of 11 donors was used 3 times, and we want to keep
the donors on separate rows of the combined data set. Here are the
three datasets:
1 -0.01 1 -0.01 1 -0.13
2 16.13 2 -9.62 2 -14.88
4 3.63 3 -0.3 3 -0.95
5 -3.21 4 47.5 4 55.2
6 16.26 5 -5.64 5 -5.32
7 -12.74 6 52.21 6 17.93
8 -4.67 7 -5.23 7 -4.06
9 -5.4 8 20.1 8 110
10-10.94 9 20 9 10.3
11-4.19 10-5.26 11-11.83
11-6.29 10-2.73
Read each into it's own data set (all are in
my stat406/data
folder).
Donor number and TNF value run together in the last two lines, so use
column formatting.
data tnf1;
infile "data/tnfdata11.dat";
input donor 1-2 rep1 ;
data tnf2;
infile "data/tnfdata12.dat";
input donor 1-2 rep2 ;
data tnf3;
infile "data/tnfdata13.dat";
input donor 1-2 rep3 ;
Combine them with a merge command.
Use "by DONOR" to insure
that the same donor is credited with the right TNF values.
proc sort data = tnf3;
by donor; ** each dataset must be sorted;
run;
data tnfCombnd1;
merge tnf1 tnf2 tnf3 ;
by donor;
mtb=0; fat = 0; ** other variables we will need;
proc print;
run;
Which gives the output:
Obs donor rep1 rep2 rep3 mtb fat
1 1 -0.01 -0.01 -0.13 0 0
2 2 16.13 -9.62 -14.88 0 0
3 3 . -0.30 -0.95 0 0
4 4 3.63 47.50 55.20 0 0
5 5 -3.21 -5.64 -5.32 0 0
6 6 16.26 52.21 17.93 0 0
7 7 -12.74 -5.23 -4.06 0 0
8 8 -4.67 20.10 110.00 0 0
9 9 -5.40 20.00 10.30 0 0
10 10 -10.94 -5.26 -2.73 0 0
11 11 -4.19 -11.83 -6.29 0 0
Note the period denoting a missing value. If we had not used by
donor, the first column would have moved up, and the last row would
be missing. Often you need to sort the data before merging, like
this:
PROC SORT; By donor;
In our case, two were already sorted.
The are 3 other treatment combinations need the same merging.
data tnf21;
infile "data/tnfdata21.dat";
input donor 1-2 rep1 ;
data tnf22;
infile "data/tnfdata22.dat";
input donor 1-2 rep2 ;
data tnf23;
infile "data/tnfdata23.dat";
input donor 1-2 rep3 ;
data tnfCombnd2;
merge tnf21 tnf22 tnf23 ;
by donor;
mtb=1; fat = 0; ** new trt variables;
data tnf31;
infile "data/tnfdata31.dat";
input donor 1-2 rep1 ;
data tnf32;
infile "data/tnfdata32.dat";
input donor 1-2 rep2 ;
data tnf33;
infile "data/tnfdata33.dat";
input donor 1-2 rep3 ;
data tnfCombnd3;
merge tnf31 tnf32 tnf33 ;
mtb=0; fat = 1; ** new trt variables;
by donor;
data tnf41;
infile "data/tnfdata41.dat";
input donor 1-2 rep1 ;
data tnf42;
infile "data/tnfdata42.dat";
input donor 1-2 rep2 ;
data tnf43;
infile "data/tnfdata43.dat";
input donor 1-2 rep3 ;
data tnfCombnd4;
merge tnf41 tnf42 tnf43 ;
mtb=1; fat = 1; ** new trt variables;
by donor;
Combining rows
Each of the datasets tnfCombnd1, tnfCombnd2, tnfCombnd3,
tnfCombnd4 contain 11 rows and the same columns, MTB, FAT,
rep1, rep2, rep3. We now want to make them into one dataset with
44 rows, and the same columns.
data tnfCombind;
set tnfCombnd1 tnfCombnd2 tnfCombnd3 tnfCombnd4;
proc print;
run;
The new dataset has 44 rows and six variables. You can see that
in the log window:
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND1.
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND2.
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND3.
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND4.
NOTE: The data set WORK.TNFCOMBIND has 44 observations and 6 variables.
Stretching out the data
In the data of interest, there is no difference between rep1, rep2,
and rep3, so we want to combine them into a single variable. Use the
output statement and a do loop for this.
data tnfStretch;
set tnfCombind;
array reps{3} rep1-rep3;
do i=1 to 3;
tnf = reps{i};
repl = i;
trt = fat + 2 * mtb;
output;
end;
drop i rep1-rep3;
run;
proc print;
run;
Now the log says we have:
NOTE: The data set WORK.TNFSTRETCH has 132 observations and 6 variables.
which look like:
Obs donor mtb fat tnf repl trt
1 1 0 0 -0.01 1 0
2 1 0 0 -0.01 2 0
3 1 0 0 -0.13 3 0
4 2 0 0 16.13 1 0
5 2 0 0 -9.62 2 0
6 2 0 0 -14.88 3 0
7 3 0 0 . 1 0
...
126 9 1 1 1945.0 3 3
127 10 1 1 . 1 3
128 10 1 1 739.9 2 3
129 10 1 1 4379.0 3 3
130 11 1 1 6637.0 1 3
131 11 1 1 6909.0 2 3
132 11 1 1 6453.0 3 3
Now we can plot them.
proc sort data = tnfStretch;
by trt;
run;
ods html;
proc boxplot ;
plot tnf * trt;
run;
ods html off;
Which gives the following plot:
Author:
Jim Robison-Cox
Last Updated: