Defining statistical models; formulæ

The template for a statistical model is a linear regression model with independent, homoscedastic errors:
\begin{displaymath}
y_i = \sum_{j=0}^p\beta_jx_{ij} + e_i,\qquad e_i\sim{\rm NID}(0,\sigma^2),\qquad
i = 1,2,\ldots,n
\end{displaymath}
In matrix terms this would be written
\begin{displaymath}
{\bmit y} = {\bf X}{\bmit\beta} + {\bmit e}
\end{displaymath}
where ${\bmit y}$ is the response vector and ${\bf X}$ is the model matrix (or design matrix), whose columns ${\bmit x}_0$, ${\bmit x}_1$, $\ldots$, ${\bmit x}_p$ are the determining variables. Very often ${\bmit x}_0$ will be a column of 1s defining an intercept term.
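
The matrix form can be made concrete with a small sketch in R (an implementation of the S language). The data values and the names X, beta_hat and fit are invented for illustration only. The code builds a model matrix whose first column is a column of 1s, solves the normal equations for the least-squares estimate of the coefficients, and fits the same model through the formula interface for comparison.

```r
# Hypothetical data, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Model matrix: x0 is a column of 1s (the intercept), x1 is the regressor
X <- cbind(x0 = 1, x1 = x)

# Least-squares estimate of beta from the normal equations X'X beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same model fitted through the formula interface
fit <- lm(y ~ x)
```

The two sets of estimates should agree to within rounding error, since lm() constructs the same two-column model matrix from the formula.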

Examples.

Before giving a formal specification, a few examples may usefully set the picture.

Suppose y, x, x0, x1, x2, ... are numeric variables, X is a matrix and A, B, C, ... are factors. The formulæ on the left below specify the statistical models described on the right.


\begin{session}
\separate
y \~ x
y \~ 1 + x
&
Both imply the same simple linear ...
 ... (and hence also subplots), determined by factor $C$.
\cr
\separate\end{session}
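
The claim made for the first pair of formulæ above can be checked directly. The following is a sketch in R with made-up data; the names X_a, X_b, same_columns and same_values are illustrative only. It compares the model matrices generated by the two forms.

```r
# Hypothetical data, invented purely for illustration
x <- c(1, 2, 3, 4)
y <- c(1.2, 1.9, 3.2, 3.8)

# Model matrices for the implicit-intercept and explicit-intercept forms
X_a <- model.matrix(y ~ x)
X_b <- model.matrix(y ~ 1 + x)

# Compare column names and numerical contents
same_columns <- identical(colnames(X_a), colnames(X_b))
same_values <- isTRUE(all.equal(X_a, X_b, check.attributes = FALSE))
```

Both comparisons should come out TRUE, because the intercept is included by default whether or not the 1 is written.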

The operator \~ is used to define a model formula in S. The form, for an ordinary linear model, is

response \~ term1 $\pm$ term2 $\pm$ term3 $\pm$ $\cdots$

response
is a vector or matrix (or an expression evaluating to a vector or matrix) defining the response variable(s).
$\pm$
is an operator, either + or -, implying the inclusion or exclusion of a term in the model (the operator before the first term is optional).
term
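is either a vector or matrix expression (or 1), a factor, or a formula expression consisting of factors, vectors or matrices connected by formula operators. In all cases each term defines a collection of columns either to be added to or removed from the model matrix. A 1 stands for an intercept column and is by default included in the model matrix unless explicitly removed.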
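
The effect of + and - on the columns of the model matrix can be seen in a short sketch in R. The data, the factor A and the result names are hypothetical. The sketch lists the columns produced with the default intercept, with the intercept explicitly removed, and with a three-level factor as the only term.

```r
# Hypothetical data, invented purely for illustration
x <- c(1, 2, 3, 4)
y <- c(1.2, 1.9, 3.2, 3.8)
A <- factor(c("a", "b", "c", "a"))

# The intercept column is present by default
with_intercept <- colnames(model.matrix(y ~ x))

# "- 1" removes the intercept column from the model matrix
without_intercept <- colnames(model.matrix(y ~ x - 1))

# A factor term contributes a collection of columns, not just one
n_factor_columns <- ncol(model.matrix(y ~ A))
```

Here the factor contributes two columns alongside the intercept. The names and coding of those columns depend on the contrasts in force, so only their number is relied on.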

The formula operators are similar in effect to the Wilkinson and Rogers notation used by such programs as Glim and Genstat. One inevitable change is that the operator ``.'' becomes ``:'', since the period is a valid name character in S. The notation is summarised in the table below (based on Chambers & Hastie, p. 29).
 
Table: Summary of model operator semantics
\begin{displaymath}
\begin{tabular}
{@{\protect\strutt}\vert l\vert p{4.13in}\ve...
 ...nd that term appears in the model matrix.\\ \hline\end{tabular}\end{displaymath}



Note that inside the parentheses that usually enclose function arguments all operators have their normal arithmetic meaning. The function I() is an identity function used only to allow terms in model formulæ to be defined using arithmetic operators.
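
A quadratic term shows why I() is needed. In the sketch below, written in R with made-up data and illustrative names, the first formula uses ^ as a formula operator, so x^2 collapses to the single term x. The second wraps the expression in I(), so ^ is ordinary exponentiation and a genuine squared column enters the model matrix.

```r
# Hypothetical data, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(1.1, 3.9, 9.2, 15.8, 25.3)

# Without I(): ^ is a formula operator, and x^2 reduces to the term x
plain <- colnames(model.matrix(y ~ x + x^2))

# With I(): ^ is arithmetic, so a column holding x squared is added
insulated <- model.matrix(y ~ x + I(x^2))
squared_column <- insulated[, "I(x^2)"]
```

The column is labelled I(x^2) in the model matrix, which is why it is extracted by that name.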

Note particularly that model formulæ specify the columns of the model matrix; the specification of the parameters is implicit. This is not the case in other contexts, for example in fitting nonlinear models.
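
The contrast can be sketched in R as follows. The data are invented, the exponential model and its parameter names a and b are assumptions chosen for illustration, and the starting values are guesses close to the values used to generate the data. In the linear fit the coefficients are named after the columns of the model matrix; in the nonlinear fit, done here with nls(), the parameters appear by name in the formula itself.

```r
# Hypothetical data: roughly 2 * exp(0.3 * x), perturbed so the fit is not exact
x <- c(1, 2, 3, 4, 5, 6)
y <- 2 * exp(0.3 * x) + c(0.05, -0.04, 0.03, -0.05, 0.04, -0.03)

# Linear model: the formula names columns; one coefficient per column is implied
lin <- lm(y ~ x)
lin_names <- names(coef(lin))

# Nonlinear model: the parameters a and b are written out in the formula
nonlin <- nls(y ~ a * exp(b * x), start = list(a = 1.5, b = 0.25))
nonlin_names <- names(coef(nonlin))
```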



Jeff Banfield
2/13/1998