Stat 505 Assignment 1 Solutions
We start with a file containing records on 7,439 Rainbow and 4,399 Brown
trout caught by FWP personnel at four locations on the Ruby River
from 1994 to 2007. We need to tabulate counts of the fish by
species, year, site, and capture status: captured in first pass,
captured in second pass and unmarked, or captured in second pass
and marked.
options ps=66 ls=78 nodate nonumber;
dm "log;clear;out;clear;";
DATA rubyFish;
infile "~/Projects/505HW1/Ruby-AllFish.csv" firstobs=2 delimiter = "," DSD;
input trip mark length weight species$ year site $ river$;
run;
PROC MEANS data = rubyFish mean stddev;
var trip mark length weight year;
run;
PROC IMPORT
datafile="~/Projects/505HW1/Ruby-AllFish.csv"
out=mydata dbms=dlm replace;
delimiter=',' ;
getnames=yes;
run;
PROC MEANS data = rubyFish mean stddev;
var trip mark length weight year;
run;
- Show plots to describe the data:
- A mosaicplot or stacked barchart to show the relative
proportion of RBT and Brown at each site. Explain what you see.
ods graphics on;
PROC FREQ data = rubyFish;
tables site * species / plots = mosaicplot;
run;
ods graphics off;
Only the Greenhorn site has more browns than rainbows.
ThreeForks has only one brown, Vigilante only a few percent, and
Canyon has maybe 10 percent browns.
- Distribution of length separated by site and species. (Use
lattice or ggplot so this is one plot with multiple panels). Do
the same for weight, or if you prefer, log(weight). Explain
what you see.
ods graphics on;
PROC SGPANEL data = rubyFish;
panelby species site;
density length;
histogram length;
run;
ods graphics off;
Lengths are typically bimodal with peaks between 200 and 400
mm and small spikes near zero for 3Fk and Ghorn sites. Log weights
are almost symmetrically distributed with peaks for browns
somewhat larger than for rainbows.
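The log weight panels described above could be drawn with a step like the following. This is a minimal sketch, not necessarily the code behind the original plot: the data set name rubyLog is hypothetical, and the sketch assumes weights are recorded in consistent units.

```sas
/* rubyLog is a hypothetical working copy; logweight is left missing
   for zero or missing weights so that log() is never given a bad argument */
data rubyLog;
  set rubyFish;
  if weight > 0 then logweight = log(weight);
run;

ods graphics on;
proc sgpanel data = rubyLog;
  panelby species site;
  histogram logweight;
  density logweight;
run;
ods graphics off;
```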
- Relationship between log(weight) and length separated by
year and species. Explain what you see and show your
code that fixes the problem for two of the years.
data rubyFish;
set rubyFish;
logweight = log(weight);
loglength = log(length);
run;
ods graphics on;
PROC SGPANEL data = rubyFish;
panelby species year;
reg x=length y=logweight / cli clm;
run;
ods graphics off;
Something odd happened in 1994 and 1995: weights range from 0.03
to 3, and lengths from 2 to 20. It appears that these are in pounds
and inches instead of grams and millimeters like the others, so we
need to convert.
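data rubyFish;
set rubyFish;
if (year < 1996) then do;
length = 25.4 * length; /* inches to millimeters */
weight = 454 * weight; /* pounds to grams */
end;
/* recompute the logs so they reflect the converted units */
logweight = log(weight);
loglength = log(length);
run;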
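One way to confirm that the conversion did what we wanted is to compare the range of length and weight across years. This is a sketch of such a check, assuming the converted values are stored back in rubyFish as above; after conversion the 1994 and 1995 rows should fall in the same general range as the later years.

```sas
/* Ranges by year: years still in inches and pounds would stand out
   with maxima near 20 and 3 instead of hundreds of mm and g */
proc means data = rubyFish min median max;
  class year;
  var length weight;
run;
```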
- Make a new column which tells capture status: first pass, 2nd
marked, or 2nd unmarked. We know which of the second pass fish
had also been captured on first pass by their mark, so the mark
identifies fish caught in both passes.
data rubyFish;
set rubyFish;
length status $ 8; /* long enough to hold "two only" without truncation */
if trip = 1
then status = "first";
else if mark = 1 then status = "both";
else status = "two only";
run;
- Make a table of capture status by species and by site.
PROC TABULATE data = rubyFish;
class species site status;
table species * site, status * n;
run;
- Give a biological or a geometry-based argument suggesting that
the effect of length on weight is best viewed by taking logs on each.
Plot log(weight) as a function of log(length). Comment on the
relationship. Discuss: does it make sense to remove any outliers?
If so, remove up to 1\% of the data, and give a justification for
removing those points.
ods graphics on;
PROC SGPANEL data = rubyFish;
panelby species site;
reg x=loglength y=logweight / cli clm;
run;
There is a strong linear relationship between log(length) and
log(weight).
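DATA ruby2;
set rubyFish;
/* drop the two individually excluded records, identified by row number */
if _N_ ne 8527 & _N_ ne 8772;
/* missing values compare as less than any number, so this also drops them */
if weight > 0 & length > 10;
run;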
ods graphics off;
Assume that fish bodies have roughly constant density ($d$). Then
mass will increase with volume, which is roughly the product of
cross-sectional area ($A$) and length ($L$):
mass = $dAL \times $ error. If
fish keep the same shape as they grow, so that $A = cL^2$ for some
constant $c$, then volume is proportional to $L^3$, and in
log scale we have log(mass) = $\log(dc) + 3 \log(L) + $ error,
where the error is now additive.
I removed fish that had missing length or weight and the one
Brown caught at 3Forks because we can't fit a
line to one point. I fear that fish weighing less than 10 g are
mistakes, so I removed these 24 fish as well. I am also fairly sure
that the 55mm Brown at Greenhorn weighing 300g is a mistake, so I
removed it, too.
- Fit a linear model for log(weight) on log(length). Does the
intercept depend on site and or species? Does the slope? Fit a
model with main effects and appropriate interactions (let's leave
out the 3-way interaction). Interpret each coefficient estimate.
Explain exactly what effect each is measuring.
ods graphics on;
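PROC GLM data = ruby2 plots=diagnostics;
class site species;
/* terms are entered in the order discussed below (sequential, Type I tests); */
/* SOLUTION prints the coefficient estimates and their standard errors */
model logweight = loglength site species loglength*site loglength*species / solution;
run;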
ods graphics off;
The anova output says that we have very strong evidence of a
linear association between log(length) and log(weight)
($F_{1,10114}$ = __, p-value < .0001). Given the
linear effect is in the model, there is strong evidence that
intercepts (when log(length) is 0 or length = 1) differ by site
($F_{3,10114}$ = __, p-value < .0001). Given log(length) and
site are in the model, there is strong evidence of a difference in
mean species weight ($F_{1,10114}$ = __, p-value
< .0001). After accounting for site, species and log(length) we
have strong evidence that the slope for log(length) varies by site
($F_{3,10114}$ = __, p-value < .0001), and after all those
terms above are entered, there is strong evidence that the slope
over log(length) also depends on species ($F_{1,10114}$ = __, p-value < .0001).
From the summary of the linear model (putting SE's in parentheses):
When log(length) is 0 the estimate of log(weight) of Brown
trout at 3Fks (where there are none) is
__ (__). Estimated slope
for Browns at 3Fks is __ (__).
Moving from 3Fks Browns to Canyon
Browns increases estimated log(weight) by __ (__), whereas going
from 3Fks to Ghorn gives an increase of __ (__), and to Vigilante a
decrease of __ (__). The estimated intercept for
Rainbows at 3Fks is __ (__) larger than for Browns. The slope for
Browns is estimated to be __ (__) units smaller at Canyon,
0.07 smaller at Ghorn, and __ (__) larger at Vigilante relative to
3Fks. The slope for rainbows is __ (__) units smaller than for
Browns.
It's interesting that all slopes on log(length) are close to 3, so we
have strong support for the assumption that the volume of a
trout's body is proportional to length cubed.
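A rough way to put a number on "close to 3" is to test the slope against 3 directly. The sketch below is only a pooled check, assuming the cleaned data set ruby2 with logweight and loglength as above; it ignores site and species, so it is not the model fitted in this section.

```sas
/* Pooled regression of log(weight) on log(length);
   the TEST statement tests the hypothesis that the slope equals 3 */
proc reg data = ruby2;
  model logweight = loglength;
  test loglength = 3;
run;
quit;
```

With more than ten thousand fish, even a tiny departure from 3 can be statistically significant, so the size of the estimated slope matters more here than the p-value.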
- Plot the usual four diagnostic plots and comment on what you see.
We see a few wild outliers, which may be recording errors or
sick fish. The distribution of residuals is much longer tailed
than a normal distribution. With such a large sample size, I
don't worry about long tails. There is a version of the central
limit theorem which would suggest that coefficient estimates are
close to normally distributed. I see no problem with non-constant
variance. We could have troubles with lack of independence
because fish were collected in batches, but I
have no way to check that.