This website is still in DRAFT mode; I still have a few sections to add! Let me know if I've terribly confused you with this presentation, as it's still under construction.
This website is intended as a starting point for implementing the methods discussed in Greenwood and Humphrey (2003), my dissertation (2004), and Greenwood and Humphrey (to be submitted). Initially, it discusses the use of Matlab and the Statistics Toolbox to fit the nonlinear regression models and calculate some model selection criteria. Eventually, the methods will also be presented in R/S-Plus. Additional analysis methods for this application that exploit Functional Data Analysis will be discussed on a parallel site.
Data Format:
1) These methods are for the analysis of valley elevation cross profiles, with each observation containing a relative location measurement (x) and an elevation measurement (y). It is assumed (although not absolutely necessary) that the x observations are distances from the first location in the profile. The elevation measurements can be relative to sea level or to the minimum observation. If problems are encountered fitting the models, this arbitrary datum choice (the location of 0,0 in the coordinate system) should be reconsidered.
2) The profiles are assumed to be perpendicular to the run of the valley, and the location of interest is randomly selected from some population of locations of interest. The following table contains a summary of the data collection process using a DEM: [table to be added]
3) A variable x contains the horizontal distance information and y contains the vertical displacement or elevation. Some models care whether the first observation has an x of 0 and the minimum y observation is 0; others are invariant to the 'datum' choice (a sketch of this shift follows this list).
4) Matlab version 6.1 or better with the Statistics Toolbox is being used.
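As a minimal sketch of the datum shift described in 3), where xraw and yraw are hypothetical names for the raw profile vectors as read in:

% Shift the profile so it starts at x = 0 and its lowest point has y = 0;
% xraw and yraw are hypothetical names for the raw vectors.
x = xraw - xraw(1);
y = yraw - min(yraw);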
Example profiles:
Nonlinear regression in Matlab:
Fitting nonlinear regression models in Matlab involves two steps. First, a specific program (.m file) must be written that returns the model's fitted values from a set of parameter estimates. The following example shows the contents of the .m file used to fit Pattyn and van Huele's GPL model. Second, an initial set of parameter estimates must be chosen. As you use these models, you will get a sense for reasonable initializations; they depend on the size and shape of your profile and any scaling choices made above. If you do not provide estimates moderately close to the optimal estimates, the parameter estimation process may diverge. A quick check of the parameter estimates and/or a plot of the estimated model will reveal this. Try alternate initial estimates of the parameters until a reasonable result is found. Not every model will fit reasonably for every profile that you might be interested in.
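One quick way to make this check, once a fit is in hand (yhat is produced by the fitting code below), is to overlay the fitted curve on the data:

% Overlay the data and the fitted curve; a wildly misplaced curve
% signals divergence or a poor choice of starting values.
plot(x,y,'.',x,yhat,'-');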
The GPL model is fit in Matlab with nlinfit, which uses a Gauss-Newton type algorithm, and requires the following .m file:
% The @gpl_1 call refers to the following program, which must be saved as gpl_1.m:
function yhat = gpl_1(beta,x)
% GPL model: yhat = a*|x - c|^b + d, where beta(1)=a (scale),
% beta(2)=b (power), beta(3)=c (horizontal location of the minimum),
% and beta(4)=d (vertical shift).
yhat = beta(1)*abs(x-beta(3)).^beta(2)+beta(4);
% The following Matlab code will fit the model, where x are the horizontal
% distances and y are the elevations. You will need the Statistics Toolbox
% to use the nlinfit routine.
beta=[1;1.5;2000;2000]
[beta,r,J]=nlinfit(x,y,@gpl_1,beta)
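The residuals r and Jacobian J returned by nlinfit can also be passed to the toolbox's nlparci routine for approximate 95% confidence intervals on the parameters; this is a standard Statistics Toolbox call rather than anything specific to these methods:

% Approximate 95% confidence intervals for the GPL parameters:
ci = nlparci(beta,r,J)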
If you have convergence problems, I recommend using the following modified version of nlinfit, which allows more iterations before giving up. Copy this m-file into your working directory:
nlinfit_mod1.m
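Another tactic for convergence problems, a sketch of my own rather than a procedure from the papers, is to try several starting vectors and keep the fit with the smallest residual sum of squares:

% Try several starting vectors (the values here are made up for
% illustration) and keep the best-fitting result.
starts = {[1;1.5;2000;2000], [.0001;1.8;2500;0], [.01;2;1500;100]};
best = Inf;
for i = 1:length(starts)
    try
        [b,r,J1] = nlinfit_mod1(x,y,@gpl_1,starts{i});
        if sum(r.^2) < best
            best = sum(r.^2); beta = b;
        end
    catch
        % skip starting vectors for which the fit fails outright
    end
end
beta % estimates from the best of the attempted starts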
Model Selection Criteria:
The GPL model is only one of many models that might work well for describing the shape of a given profile, and using only one model assumes that it captures all the possible shapes of profiles that may be encountered. Asymmetric profiles are a quick counter-example to assuming that the GPL model should be ubiquitously applied.
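As an illustrative aside (anticipating the parabola mentioned in the diagnostics notes below), a V-shaped profile might be better served by a simple quadratic, which base Matlab fits by ordinary least squares with no starting values needed:

% A parabolic candidate model fit by ordinary least squares:
p = polyfit(x,y,2);       % quadratic coefficients
yhat_par = polyval(p,x);  % fitted parabola
plot(x,y,'.',x,yhat_par,'-');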
Research is ongoing regarding the best model selection criterion to use for this set of nonlinear regression models, and updates on results will be posted on this website. For now, the PDC of Neath, Davies and Cavanaugh is favored for its performance in linear regression simulations and the minimal assumptions required in its derivation. The following code will produce the PDC values as well as some other related criteria.
% A small section of the code used to estimate the model selection criteria
% is provided; the model selection continues in a similar manner for the
% other eight models considered. Someday I'll make this look a bit better,
% but this gets at the basics.
minloc=input('Enter value of minimum location:')
%GPL MODEL
%initializations:
beta=[]; yhat=[]; resid=[]; xsave=[]; ysave=[];
betasave=[]; hat1=[]; sig2hat=[]; xnew1=[]; ynew1=[]; xpred=[]; ypred=[];
dresid1=[]; fresid1=[]; d=[]; c=[]; dsigsqhat=[]; fsigsqhat=[]; ypredd=[];
ypredf=[];
% Step 1 is to find the full-data estimate.
% x, y contain the profile.
% Estimating the model, checking the results to make sure it converged:
beta=[.0001;1.8;minloc;0];
[beta,r,J]=nlinfit_mod1(x,y,@gpl_1,beta); % gpl_1.m as defined above
beta
yhat = beta(1)*abs(x-beta(3)).^beta(2)+beta(4);
plot(x,yhat);
reply = input('Need new beta0?'); %This is a way to check that the model fit was reasonable...
while ~isempty(reply)
plot(x,yhat,'g');
newbeta=input('Enter new beta0 for GPL:');
[beta,r,J]=nlinfit_mod1(x,y,@gpl_1,newbeta);
yhat = beta(1)*abs(x-beta(3)).^beta(2)+beta(4);
plot(x,yhat,'r');
reply=input('Better?');
end
beta
yhat = beta(1)*abs(x-beta(3)).^beta(2)+beta(4); resid=y-yhat;
n=length(x); k=length(beta)+1; sig2hat=sum(resid.^2)/n; xsave=x; ysave=y;
a=min(xsave);
b=max(xsave); betasave=beta;
% Computing the approximate hat matrix:
hat1=(J*inv(J'*J)*J'); hii=diag(hat1);
% These will not be reasonable for highly nonlinear regression models,
% but are computed for comparison:
AIC=n*log(sig2hat)+n+2*k;
AICC=AIC+(2*k*(k+1))/(n-k-1);
% Finding the CV and FCV predictions for the removal of each observation:
for j=0:n-1; %deleting each observation as I loop through...
x=xsave; y=ysave; beta=betasave;
%cross validation variables
xnew1=[x(1:j);x(j+2:end)]; ynew1=[y(1:j);y(j+2:end)];
% full cross validation variables
xnew2=xsave; ynew2=ysave; ynew2(j+1,1)=yhat(j+1,1);
xpred(j+1,1)=x(j+1,1); %saving the x value for doing prediction
% fit the model on the reduced dataset, CV
[beta1,r,J1]=nlinfit_mod1(xnew1,ynew1,@gpl_1,beta);
ypredd(j+1,1)=beta1(1)*abs(xpred(j+1)-beta1(3)).^beta1(2)+beta1(4);
yhat1 = beta1(1)*abs(xnew1-beta1(3)).^beta1(2)+beta1(4);
% fit the model on the modified full dataset, FCV
[beta2,r,J1]=nlinfit_mod1(xnew2,ynew2,@gpl_1,beta);
yhat2 = beta2(1)*abs(x-beta2(3)).^beta2(2)+beta2(4);
ypredf(j+1,1)=beta2(1)*abs(xpred(j+1,1)-beta2(3)).^beta2(2)+beta2(4);
dresid1(j+1,1)=y(j+1,1)-ypredd(j+1,1);
fresid1(j+1,1)=y(j+1,1)-yhat2(j+1,1);
% CV variance estimate at each loop based on n-1 (this is the MLE)
dsigsqhat(j+1,1)=(sum((ynew1-yhat1).^2))/(n-1);
% FCV variance estimate at each loop based on the approximate likelihood and n (this is the MLE)
fsigsqhat(j+1,1)=(sum((y-yhat2).^2)-(y(j+1)-yhat2(j+1))^2+(yhat(j+1)-yhat2(j+1))^2)/n;
% computing c_i's for BDP CV and FCV:
if j==0
d(j+1,1)=2*(xsave(j+2)-xsave(j+1));
c(j+1,1)=d(j+1,1)/(2*(b-a));
elseif (j>0)&(j<(n-1))
d(j+1,1)=(xsave(j+2)-xsave(j));
c(j+1,1)=d(j+1,1)/(2*(b-a));
else %j=n-1, the last observation
d(j+1,1)=2*(xsave(j+1)-xsave(j));
c(j+1,1)=d(j+1,1)/(2*(b-a));
end
end
% Using the previous values to compute the model selection criteria for this model:
CV=sum(c.*(dresid1.^2));
FCV=sum(c.*(fresid1.^2));
CVM=sum(dresid1.^2)/n;
FCVM=sum(fresid1.^2)/n;
PDC=sum(log(dsigsqhat))+sum(((y-ypredd).^2)./dsigsqhat);
PDCa=n*log(sig2hat)+((n-1)/(n-k-2))*sum(1./(1.-hii));
PDCF=sum(log(fsigsqhat))+sum(((y-ypredf).^2)./fsigsqhat);
PDCaF=n*log(sig2hat)+((n)/(n-k-1))*sum((1+hii).^2.*(1-hii));
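Once the analogous loop has been run for each candidate model, selection amounts to taking the smallest criterion value. A sketch, assuming each model's PDC has been collected into a vector PDCall (a hypothetical name):

% PDCall(i) holds the PDC from the loop above run for candidate model i:
[minPDC,best] = min(PDCall);
best % index of the candidate with the smallest (preferred) PDC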
Some model diagnostics:
Even though a model may have been chosen, it may still have significant problems. Before interpreting any model, the assumptions made in it should be checked as much as possible. The following plots contain an example of some diagnostics that can be explored for the previous model. Clockwise from the upper left: (i) the residuals versus the explanatory variable, where any pattern other than random scatter is a cause for concern; (ii) the residuals versus the fitted values, where again any pattern indicates that either the functional form used or the error distribution assumed is not correct; (iii) a normal probability plot, which assesses the assumption of normally distributed errors, with residuals that do not follow the dashed line indicating a violation of this assumption; (iv) a histogram of the residuals, which also assesses the normality of the residuals and aids in detecting outliers. These residuals are not ideal but also do not provide a strong indication of any problems with the assumed model.
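A sketch of producing the four panels just described, using base plotting commands plus the toolbox's normplot, with resid and yhat as computed above:

% Four diagnostic panels for the fitted model (resid = y - yhat):
subplot(2,2,1); plot(x,resid,'.');     % (i) residuals vs. explanatory variable
xlabel('x'); ylabel('residual');
subplot(2,2,2); plot(yhat,resid,'.');  % (ii) residuals vs. fitted values
xlabel('fitted value'); ylabel('residual');
subplot(2,2,4); normplot(resid);       % (iii) normal probability plot
subplot(2,2,3); hist(resid);           % (iv) histogram of residuals
xlabel('residual');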
[I'll add some additional diagnostics for the favored model soon]
[I'll also add diagnostics for a parabola for a V-shaped profile soon]
A final note:
This is meant to provide a starting point for working with the nonlinear
regression models and model selection criterion discussed in Greenwood
and Humphrey (to be submitted). Hopefully it will help if you are
trying to describe valley profiles or if you are looking to select amongst
a different set of candidate nonlinear regression models in a different
application.
Mark Greenwood
Department of Math Sciences
Montana State University - Bozeman
Contact info: greenwood at math dot montana dot edu
Copyright 2005