Stat 505 Assignment 1 Solutions

We start with a file containing records on 7,439 Rainbow and 4,399 Brown trout caught by FWP personnel in the Ruby River at four locations from 1994 to 2007. We need to tabulate counts of the fish by species, year, site, and capture status: captured in first pass, captured in second pass and unmarked, or captured in second pass and marked.
options ps=66 ls=78 nodate nonumber;
dm "log;clear;out;clear;";

DATA rubyFish;
  infile "~/Projects/505HW1/Ruby-AllFish.csv" firstobs=2 delimiter = "," DSD;
  input trip mark length weight species$ year site $ river$;
run;

PROC MEANS data = rubyFish mean stddev;
  var trip mark length weight year;
run;

PROC IMPORT 
  datafile="~/Projects/505HW1/Ruby-AllFish.csv" 
  out=mydata   dbms=dlm    replace;
  delimiter=',' ;
  getnames=yes;
run;
PROC MEANS data = rubyFish mean std;
  var trip mark length weight year;
run;

  1. Show plots to describe the data:
    1. A mosaicplot or stacked barchart to show the relative proportion of RBT and Brown at each site. Explain what you see.
      ods graphics on;
      PROC FREQ data = rubyFish;
        table site * species / plots = mosaicplot;
      run;
      ods graphics off;
      
      Only the Greenhorn site has more browns than rainbows. ThreeForks has only one brown, Vigilante only a few percent, and Canyon has maybe 10 percent browns.
    2. Distribution of length separated by site and species. (Use lattice or ggplot so this is one plot with multiple panels). Do the same for weight, or if you prefer, log(weight). Explain what you see.
      ods graphics on;
      PROC SGPANEL data = rubyFish;
        panelby species site;
        histogram length;
        density length;   /* normal density overlaid on the histogram */
      run;
      ods graphics off;
      
      Lengths are typically bimodal with peaks between 200 and 400 mm and small spikes near zero for 3Fk and Ghorn sites. Log weights are almost symmetrically distributed with peaks for browns somewhat larger than for rainbows.
    3. Relationship between log(weight) and length separated by year and species. Explain what you see and show your code that fixes the problem for two of the years.
      data rubyFish;
        set rubyFish;
        /* log() returns missing (with a log note) for zero or missing values */
        logweight = log(weight);
        loglength = log(length);
      run;
      ods graphics on;
      PROC SGPANEL data = rubyFish;
        panelby species year;
        reg x=length y=logweight / cli clm;
      run;
      ods graphics off;
      
      Something odd happened in 1994 and 1995: weights range from 0.03 to 3 and lengths from 2 to 20. It appears that these were recorded in pounds and inches instead of grams and millimeters like the others, so we need to convert.
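      data rubyFish;
        set rubyFish;
        if (year < 1996) then do;
           length = 25.4 * length;   /* inches to mm */
           weight = 454 * weight;    /* pounds to grams */
           /* recompute the logs so they match the converted units */
           logweight = log(weight);
           loglength = log(length);
        end;
      run;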
      
  2. Make a new column which tells capture status: first pass, 2nd marked, or 2nd unmarked. We know which of the second pass fish had also been captured on first pass by their mark, so the mark identifies fish caught in both passes.
    data rubyFish;
      set rubyFish;
      length status $ 8;   /* long enough for "two only"; otherwise truncated to 5 */
      if trip = 1 then status = "first";
      else if mark = 1 then status = "both";
      else status = "two only";
    run;
    
  3. Make a table of capture status by species and by site.
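    PROC SORT data = rubyFish; by species; run;   /* BY processing needs sorted data */
    PROC TABULATE data = rubyFish;
       by species;
       class site status;
       table site, status;   /* rows = site, columns = status, cells = counts (N) */
    run;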
    
  4. Give a biological or a geometry-based argument suggesting that the effect of length on weight is best viewed by taking logs on each. Plot log(weight) as a function of log(length). Comment on the relationship. Discuss: does it make sense to remove any outliers? If so, remove up to 1\% of the data, and give a justification for removing those points.
    ods graphics on;
    PROC SGPANEL data = rubyFish;
      panelby species site;
      reg x=loglength y=logweight / cli clm;
    run;
    
    There is a strong linear relationship between log(length) and log(weight).
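    DATA ruby2;
      set rubyFish;
      if _N_ ne 8527 & _N_ ne 8772;   /* drop two rows by observation number */
      if weight > 0 & length > 10;    /* also drops missing values, which compare as smallest */
    run;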
    ods graphics off;
    
    Assume that fish bodies have roughly constant density ($d$). Then mass will increase with volume, which is roughly the product of cross-sectional area ($A$) and length ($L$): mass = $dAL \times $ error. If fish keep the same shape as they grow, so that $A$ is proportional to $L^2$ (say $A = cL^2$), then volume is proportional to $L^3$, and on the log scale we have log(mass) = $\log(dc) + 3 \log(L) + $ error.
    I removed fish that had missing length or weight and the one Brown caught at 3Forks, because we can't fit a line to one point. I fear that fish weighing less than 10 g are mistakes, so I removed these 24 fish as well. I am fairly sure that the 55 mm Brown at Greenhorn weighing 300 g is a mistake, so I removed it, too.
  5. Fit a linear model for log(weight) on log(length). Does the intercept depend on site and/or species? Does the slope? Fit a model with main effects and appropriate interactions (let's leave out the 3-way interaction). Interpret each coefficient estimate. Explain exactly what effect each is measuring.
    ods graphics on;
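    PROC GLM data = ruby2 plots=diagnostics;
      class site species;
      /* terms are listed in the order of the sequential (Type I) tests discussed below; */
      /* the solution option prints the coefficient estimates                            */
      model logweight = loglength site species loglength*site loglength*species / solution;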
    run;
    ods graphics off;
    
    The anova output says that we have very strong evidence of a linear association between log(length) and log(weight) ($F_{1,10114}$ = __, p-value < .0001). Given that the linear effect is in the model, there is strong evidence that intercepts (when log(length) is 0, or length = 1) differ by site ($F_{3,10114}$ = __, p-value < .0001). Given that log(length) and site are in the model, there is strong evidence of a difference in mean log(weight) between species ($F_{1,10114}$ = __, p-value < .0001). After accounting for site, species and log(length), we have strong evidence that the slope for log(length) varies by site ($F_{3,10114}$ = __, p-value < .0001), and after all the terms above are entered, there is strong evidence that the slope on log(length) also depends on species ($F_{1,10114}$ = __, p-value < .0001).
    From the summary of the linear model (putting SE's in parentheses): when log(length) is 0, the estimate of log(weight) of Brown trout at 3Fks (where there are none) is (). The estimated slope for Browns at 3Fks is (). Moving from 3Fks Browns to Canyon Browns increases estimated log(weight) by (), whereas going from 3Fks to Ghorn gives an increase of (), and to Vigilante a decrease of (). The estimated intercept for Rainbows at 3Fks is () larger than for Browns. The slope for Browns is estimated to be () units smaller at Canyon, 0.07 smaller at Ghorn, and () larger at Vigilante relative to 3Fks. The slope for Rainbows is () units smaller than for Browns. It's interesting that all slopes on log(length) are close to 3, so we have strong support for the assumption that the volume of a trout's body is proportional to length cubed.
  6. Plot the usual four diagnostic plots and comment on what you see.
    We see a few wild outliers, which may be recording errors or sick fish. The distribution of residuals is much longer tailed than a normal distribution. With such a large sample size, I don't worry about long tails. There is a version of the central limit theorem which would suggest that coefficient estimates are close to normally distributed. I see no problem with non-constant variance. We could have troubles with lack of independence because fish were collected in batches, but I have no way to check that.
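
For reference, the model fit in item 5 can be written out roughly as follows. This is only a sketch in my own notation: the symbols $\beta_0$, $\beta_1$, $\alpha$, $\gamma$, $\delta$, $\theta$ and the index functions $s(i)$ (site of fish $i$) and $p(i)$ (species of fish $i$) are assumptions for illustration and do not appear in the SAS output.

$$\log(\text{weight}_i) = \beta_0 + \alpha_{s(i)} + \gamma_{p(i)} + \left(\beta_1 + \delta_{s(i)} + \theta_{p(i)}\right)\log(\text{length}_i) + \varepsilon_i$$

Here the effects for the baseline site and species are fixed at zero, so $\beta_0$ and $\beta_1$ are the baseline intercept and slope, $\alpha$ and $\gamma$ shift the intercept by site and species, and $\delta$ and $\theta$ shift the slope. Back on the original scale this says weight is proportional to length raised to the power $\beta_1 + \delta_{s(i)} + \theta_{p(i)}$, which is why slopes near 3 agree with the cube-law argument in item 4.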