Stat 408 Combining SAS Data Sets

Data may be collected in stages.
Need to combine small datasets into one big one for plots and analysis.

Add more columns -- Merging datasets

In the Tumor Necrosis Factor data, the amount of TNF produced by cells was recorded under four different treatments: Each of 11 donors was used 3 times, and we want to keep the donors on separate rows of the combined data set. Here are the three datasets:
1 -0.01        1 -0.01        1 -0.13
2 16.13        2 -9.62        2 -14.88
4  3.63        3  -0.3        3 -0.95
5 -3.21        4  47.5        4  55.2
6 16.26        5 -5.64        5 -5.32
7 -12.74       6 52.21        6 17.93
8 -4.67        7 -5.23        7 -4.06
9  -5.4        8  20.1        8  110
10-10.94       9   20         9  10.3
11-4.19        10-5.26        11-11.83
               11-6.29        10-2.73
Read each into it's own data set (all are in my stat406/data folder).
Donor number and TNF value run together in the last two lines, so use column formatting.
data tnf1;
  infile "data/tnfdata11.dat";
  input donor 1-2 rep1 ;
data tnf2;
  infile "data/tnfdata12.dat";
  input donor 1-2 rep2 ;
data tnf3;
  infile "data/tnfdata13.dat";
  input donor 1-2 rep3 ;
Combine them with a merge command.
Use "by DONOR" to insure that the same donor is credited with the right TNF values.
proc sort data =  tnf3;
  by donor;               ** each dataset must be sorted;
run;

data tnfCombnd1;
   merge tnf1 tnf2 tnf3 ;
  by donor;
   mtb=0; fat = 0;        ** other variables we will need;
proc print;
run;
Which gives the output:
    Obs    donor      rep1      rep2       rep3    mtb    fat
      1       1      -0.01     -0.01      -0.13     0      0
      2       2      16.13     -9.62     -14.88     0      0
      3       3        .       -0.30      -0.95     0      0
      4       4       3.63     47.50      55.20     0      0
      5       5      -3.21     -5.64      -5.32     0      0
      6       6      16.26     52.21      17.93     0      0
      7       7     -12.74     -5.23      -4.06     0      0
      8       8      -4.67     20.10     110.00     0      0
      9       9      -5.40     20.00      10.30     0      0
     10      10     -10.94     -5.26      -2.73     0      0
     11      11      -4.19    -11.83      -6.29     0      0
Note the period denoting a missing value. If we had not used by donor, the first column would have moved up, and the last row would be missing. Often you need to sort the data before merging, like this:
PROC SORT; By donor;
In our case, two were already sorted.
The are 3 other treatment combinations need the same merging.
data tnf21;
  infile "data/tnfdata21.dat";
  input donor 1-2 rep1 ;
data tnf22;
  infile "data/tnfdata22.dat";
  input donor 1-2 rep2 ;
data tnf23;
  infile "data/tnfdata23.dat";
  input donor 1-2 rep3 ;
data tnfCombnd2;
   merge tnf21 tnf22 tnf23 ;
by donor;  
    mtb=1; fat = 0;                   ** new trt variables;
data tnf31;
  infile "data/tnfdata31.dat";
  input donor 1-2 rep1 ;
data tnf32;
  infile "data/tnfdata32.dat";
  input donor 1-2 rep2 ;
data tnf33;
  infile "data/tnfdata33.dat";
  input donor 1-2 rep3 ;
data tnfCombnd3;
   merge tnf31 tnf32 tnf33 ;
   mtb=0; fat = 1;                    ** new trt variables;
by donor;
data tnf41;
  infile "data/tnfdata41.dat";
  input donor 1-2 rep1 ;
data tnf42;
  infile "data/tnfdata42.dat";
  input donor 1-2 rep2 ;
data tnf43;
  infile "data/tnfdata43.dat";
  input donor 1-2 rep3 ;
data tnfCombnd4;
   merge tnf41 tnf42 tnf43 ;
   mtb=1; fat = 1;                   ** new trt variables;
by donor;

Combining rows

Each of the datasets tnfCombnd1, tnfCombnd2, tnfCombnd3, tnfCombnd4 contain 11 rows and the same columns, MTB, FAT, rep1, rep2, rep3. We now want to make them into one dataset with 44 rows, and the same columns.
data tnfCombind;
  set tnfCombnd1 tnfCombnd2 tnfCombnd3 tnfCombnd4;
proc print;
run;
The new dataset has 44 rows and six variables. You can see that in the log window:
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND1.
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND2.
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND3.
NOTE: There were 11 observations read from the data set WORK.TNFCOMBND4.
NOTE: The data set WORK.TNFCOMBIND has 44 observations and 6 variables.

Stretching out the data

In the data of interest, there is no difference between rep1, rep2, and rep3, so we want to combine them into a single variable. Use the output statement and a do loop for this.
data tnfStretch;
  set tnfCombind;
  array reps{3} rep1-rep3;
  do i=1 to 3;
     tnf = reps{i};
     repl = i;
     trt = fat + 2 * mtb;
     output;
  end;
  drop i rep1-rep3;
run;
proc print;
run;
Now the log says we have:
NOTE: The data set WORK.TNFSTRETCH has 132 observations and 6 variables.
which look like:
         Obs    donor    mtb    fat      tnf    repl    trt
           1       1      0      0      -0.01      1      0
           2       1      0      0      -0.01      2      0
           3       1      0      0      -0.13      3      0
           4       2      0      0      16.13      1      0
           5       2      0      0      -9.62      2      0
           6       2      0      0     -14.88      3      0
           7       3      0      0        .        1      0
                          ...
         126       9      1      1     1945.0      3      3
         127      10      1      1         .       1      3
         128      10      1      1      739.9      2      3
         129      10      1      1     4379.0      3      3
         130      11      1      1     6637.0      1      3
         131      11      1      1     6909.0      2      3
         132      11      1      1     6453.0      3      3
Now we can plot them.
proc sort data = tnfStretch;
  by trt;
run;
ods html;
proc boxplot ;
  plot tnf * trt;
run;
ods html off;
Which gives the following plot:

Author: Jim Robison-Cox
Last Updated: