Nonlinear Regression and Model Selection for U- and V-Shaped Profile Analysis

This website is still in DRAFT mode; I still have a few sections to add!
Let me know if I've terribly confused you with this presentation; it's still under construction.
[Figure: Simulated U- and V-shaped profiles]

This website is intended as a starting point for implementing the methods discussed in Greenwood and Humphrey (2003), my dissertation (2004), and Greenwood and Humphrey (to be submitted).  Initially, it discusses the use of Matlab and the Statistics Toolbox to fit the nonlinear regression models and calculate some model selection criteria.  Eventually, the methods will also be presented in R/S-Plus.  Additional analysis methods for this application that exploit Functional Data Analysis will be discussed on a parallel site.

Data Format:

1)  These methods are for the analysis of valley elevation cross profiles, with each observation containing a relative location measurement (x) and an elevation measurement (y).  It is assumed (although not strictly necessary) that the x observations are distances from the first location in the profile.  The elevation measurements can be relative to sea level or to the minimum observation.  If problems are encountered fitting the models, the choice of this arbitrary datum (the location of (0,0) in the coordinate system) should be reconsidered.

2)  The profiles are assumed to be perpendicular to the run of the valley and the location of interest is randomly selected from some population of locations of interest.  

The following table contains a summary of the data collection process using a DEM:

[Table: summary of the DEM-based data collection process]

3)  A variable x contains the horizontal distance information and y contains the vertical displacement or elevation.  Some models care whether the first observation has an x of 0 and the minimum y observation is 0; others are invariant to this 'datum' choice (see the sketch after this list).

4)  Matlab version 6.1 or later with the Statistics Toolbox is assumed.
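
As a small illustration of the datum choices in 1) and 3), the following sketch (mine, not from the papers) shifts a profile so that the first x is 0 and the minimum elevation is 0:

% shift the horizontal datum so distances are measured from the first location
x = x - x(1);

% shift the vertical datum so the minimum elevation is 0
y = y - min(y);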

Example profiles:

[Figure: example valley elevation cross profiles]

Nonlinear regression in Matlab:

Fitting nonlinear regression models in Matlab involves two steps.  First, a short program must be written that returns the fitted values for the model from a set of parameter estimates.  The following example shows the contents of the .m file used to fit Pattyn and van Huele's GPL model.  Second, an initial set of parameter estimates must be chosen.  As you use these models, you will get a sense for reasonable initializations; they will depend on the size and shape of your profile and on any scaling choices made above.  If you do not provide starting values moderately close to the optimal estimates, the parameter estimation process may diverge.  A quick check of the parameter estimates and/or a plot of the estimated model will reveal this.  Try alternative starting values until a reasonable result is found.  Not every model will fit reasonably for every profile you might be interested in.

The GPL model requires the following .m file before it can be estimated in Matlab (nlinfit uses a Gauss-Newton type algorithm):

% The @gpl_1 argument used below calls the following function, which must be saved as gpl_1.m:

function yhat = gpl_1(beta,x)

% GPL model: yhat = a*|x - c|^b + d, with
% beta(1) = a (scale), beta(2) = b (power),
% beta(3) = c (horizontal location of the minimum), beta(4) = d (vertical shift)
yhat = beta(1)*abs(x-beta(3)).^beta(2)+beta(4);


% The following Matlab code will fit the model, where x contains the horizontal distances and y contains the elevations.  You will need the Statistics Toolbox to use the nlinfit routine.


beta=[1; 1.5; 2000; 2000]   % starting values for [scale; power; minimum location; vertical shift]

[beta,r,J]=nlinfit(x,y,@gpl_1,beta)   % returns parameter estimates, residuals, and the Jacobian
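
As a quick check of the fit, plot the data against the estimated curve (a minimal sketch, assuming x, y, and the fitted beta are in the workspace):

plot(x, y, 'o', x, gpl_1(beta,x), '-')   % data and fitted GPL curve
xlabel('distance'); ylabel('elevation');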


If you have convergence problems, I recommend the following modified version of nlinfit, which allows more iterations before giving up.  Copy the following m-file into a working directory:

nlinfit_mod1.m
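
The contents of nlinfit_mod1.m are linked above rather than reproduced here.  If you would rather not keep a modified copy of nlinfit, one rough alternative (my sketch, not the method from the papers) is to restart nlinfit from its own output, which effectively allows more iterations:

beta = [1; 1.5; 2000; 2000];   % starting values, as above
for restart = 1:5
    % each call runs up to nlinfit's internal iteration limit,
    % warm-starting from the previous estimates
    [beta,r,J] = nlinfit(x, y, @gpl_1, beta);
end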

Model Selection Criteria:

The GPL model is only one of many models that might describe the shape of a given profile well, and using only one model assumes that it covers all the profile shapes that may be encountered.  Asymmetric profiles are a quick counter-example to the assumption that the GPL model can be applied everywhere.



Research is ongoing regarding the best model selection criterion to use for this set of nonlinear regression models.  Updates on results will be posted on this website.  For now, the PDC of Neath, Davies and Cavanaugh is favored for its performance in linear regression simulations and for the minimal assumptions required in its derivation.



The following code will produce the PDC values as well as some other related criteria.

% A small section of the code used to estimate the model selection criteria is provided; the model selection continues in a similar manner for the other eight models considered.  Someday I'll make this look a bit better, but it gets at the basics.

minloc=input('Enter value of minimum location:')   %horizontal location of the profile minimum; used below as a starting value for beta(3)

%GPL MODEL

%initializations:

beta=[]; yhat=[]; resid=[]; xsave=[]; ysave=[]; betasave=[]; hat1=[]; sig2hat=[];
xnew1=[]; ynew1=[]; xpred=[]; ypred=[]; dresid1=[]; fresid1=[];
d=[]; c=[]; dsigsqhat=[]; fsigsqhat=[]; ypredd=[]; ypredf=[];

%step 1 is to find the full data estimate

% x, y contain the profile

% estimating the model, checking the results to make sure it converged

beta=[.0001; 1.8; minloc; 0];   %starting values: small scale, power near 2, minimum at minloc

[beta,r,J]=nlinfit_mod1(x,y,@gpl_1,beta); beta

yhat = gpl_1(beta,x); plot(x,yhat);   %plot the fitted curve to check convergence

reply = input('Need new beta0?'); %press Enter if the fit looks reasonable; enter anything (e.g. 1) to try new starting values

while ~isempty(reply)

plot(x,yhat,'g');

newbeta=input('Enter new beta0 for GPL:');

[beta,r,J]=nlinfit_mod1(x,y,@gpl_1,newbeta);

yhat = gpl_1(beta,x);

plot(x,yhat,'r');

reply=input('Better?');

end

beta   %final estimates

n=length(x); k=length(beta)+1;   %k counts the four regression parameters plus sigma^2
yhat = gpl_1(beta,x); resid=y-yhat;
sig2hat=sum(resid.^2)/n;   %MLE of the error variance
xsave=x; ysave=y; betasave=beta;
a=min(xsave); b=max(xsave);

%Computing the approximate hat matrix:

   hat1=(J*inv(J'*J)*J'); hii=diag(hat1);

%These will not be reasonable for highly nonlinear regression models, but are computed for comparison:

AIC=n*log(sig2hat)+n+2*k;   %AIC from the Gaussian log-likelihood (up to an additive constant)

AICC=AIC+(2*k*(k+1))/(n-k-1);   %small-sample corrected AIC

%Finding the CV and FCV predictions for the removal of each observation:

for j=0:n-1; %deleting each observation as I loop through...

x=xsave; y=ysave; beta=betasave;

%cross-validation (CV) variables: drop observation j+1

xnew1=[x(1:j);x(j+2:end)]; ynew1=[y(1:j);y(j+2:end)];

%full cross-validation (FCV) variables: keep all x's but replace the held-out response with its full-data fitted value

xnew2=xsave; ynew2=ysave; ynew2(j+1,1)=yhat(j+1,1);

xpred(j+1,1)=x(j+1,1); %saving the x value for doing prediction

%fit the model on the reduced dataset, CV

[beta1,r,J1]=nlinfit_mod1(xnew1,ynew1,@gpl_1,beta);

ypredd(j+1,1)=gpl_1(beta1,xpred(j+1,1));   %CV prediction for the deleted observation

yhat1 = gpl_1(beta1,xnew1);   %fitted values on the reduced data


%fit the model on the full dataset with the substituted response, FCV

[beta2,r,J1]=nlinfit_mod1(xnew2,ynew2,@gpl_1,beta);

yhat2 = gpl_1(beta2,x);   %fitted values from the FCV fit

ypredf(j+1,1)=gpl_1(beta2,xpred(j+1,1));   %FCV prediction for observation j+1

dresid1(j+1,1)=y(j+1,1)-ypredd(j+1,1);

fresid1(j+1,1)=y(j+1,1)-yhat2(j+1,1);

%CV variance estimate at each loop based on n-1 (this is the MLE)

dsigsqhat(j+1,1)=(sum((ynew1-yhat1).^2))/(n-1);

%FCV variance estimate at each loop based on approx likelihood and n (this is the MLE)

fsigsqhat(j+1,1)=(sum((y-yhat2).^2)-(y(j+1)-yhat2(j+1))^2+(yhat(j+1)-yhat2(j+1))^2)/n;

%computing the c_i weights for the BDP CV and FCV criteria:
%d is the spacing around the deleted observation and c=d/(2*(b-a))

if j==0

d(j+1,1)=2*(xsave(j+2)-xsave(j+1)); c(j+1,1)=d(j+1,1)/(2*(b-a));

elseif (j>0)&(j<(n-1))

d(j+1,1)=(xsave(j+2)-xsave(j)); c(j+1,1)=d(j+1,1)/(2*(b-a));


else %j==n-1 (the last observation)

d(j+1,1)=2*(xsave(j+1)-xsave(j)); c(j+1,1)=d(j+1,1)/(2*(b-a));

end

end

% Using the previous values to compute the model selection criteria for this model:

CV=sum(c.*(dresid1.^2));   %spacing-weighted cross-validation criterion

FCV=sum(c.*(fresid1.^2));   %spacing-weighted full cross-validation criterion

CVM=sum(dresid1.^2)/n;   %mean squared CV prediction error

FCVM=sum(fresid1.^2)/n;   %mean squared FCV prediction error

PDC=sum(log(dsigsqhat))+sum(((y-ypredd).^2)./dsigsqhat);   %PDC from the case-deletion (CV) quantities

PDCa=n*log(sig2hat)+((n-1)/(n-k-2))*sum(1./(1.-hii));   %approximation to PDC based on the leverages

PDCF=sum(log(fsigsqhat))+sum(((y-ypredf).^2)./fsigsqhat);   %PDC from the FCV quantities

PDCaF=n*log(sig2hat)+((n)/(n-k-1))*sum((1+hii).^2.*(1-hii));   %approximation to the FCV-based PDC
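
Once the same quantities have been computed for every candidate model, selection amounts to choosing the model with the smallest value of the favored criterion.  A minimal sketch, where PDC_gpl, PDC_parabola, etc. are hypothetical names for the PDC values saved from each model's run:

PDCall = [PDC_gpl PDC_parabola PDC_model3];   % hypothetical vector of PDC values, one per candidate model
[bestPDC, bestmodel] = min(PDCall)            % bestmodel indexes the selected model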




Some model diagnostics:

Even though a model may have been chosen, it may still have significant problems.  Before interpreting any model, the assumptions made in it should be checked as much as is possible.  The following plots contain an example of some diagnostics that can be explored for the previous model.

[Figure: 2x2 panel of diagnostic plots for the GPL model fit]

Clockwise from the upper left:

(i) Residuals versus the explanatory variable: any pattern in this plot other than random scatter is a cause for concern.

(ii) Residuals versus the fitted values: again, any pattern in this plot indicates that either the functional form used or the error distribution assumed is not correct.

(iii) A normal probability plot assesses the assumption of normally distributed errors: residuals that do not follow the dashed line indicate a violation of this assumption.

(iv) The histogram of the residuals also assesses the normality of the residuals and aids in detecting outliers.

These residuals are not ideal, but they also do not provide a strong indication of any problems with the assumed model.
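
A minimal sketch of code to produce these four plots (assuming x, y, yhat, and resid from the GPL fit are in the workspace; normplot requires the Statistics Toolbox):

subplot(2,2,1); plot(x,resid,'o'); xlabel('x'); ylabel('residual');           % (i) residuals vs. x
subplot(2,2,2); plot(yhat,resid,'o'); xlabel('fitted'); ylabel('residual');   % (ii) residuals vs. fitted
subplot(2,2,4); normplot(resid);                                              % (iii) normal probability plot
subplot(2,2,3); hist(resid); xlabel('residual');                              % (iv) histogram of residuals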


[I'll add some additional diagnostics for the favored model soon]


[I'll also add diagnostics for a parabola for a V-shaped profile soon]


A final note:

This is meant to provide a starting point for working with the nonlinear regression models and model selection criteria discussed in Greenwood and Humphrey (to be submitted).  Hopefully it will help if you are trying to describe valley profiles, or if you are looking to select among a different set of candidate nonlinear regression models in another application.

Mark Greenwood
Department of Math Sciences
Montana State University - Bozeman
Contact info: greenwood at math dot montana dot edu
Copyright 2005



Last updated:  September 1, 2005
