\documentclass[11pt,titlepage]{article}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{verbatim}
\allowdisplaybreaks

\jot=.2in \pagestyle{plain} \setlength{\topmargin}{-.5in}
\setlength{\textheight}{9 in} \setlength{\oddsidemargin}{-0.2in}
\setlength{\evensidemargin}{-0.2in} \setlength{\textwidth}{6.75in}
\font\heada=cmbx10 scaled\magstep3 \font\headb=cmsl10
scaled\magstep1 \font\headc=cmr8 \pretolerance=10000 \raggedright
\setlength{\parindent}{2 em}
%\input macros
\newdimen\digitwidth
\newdimen\minuswidth
\setbox0=\hbox{\rm0}
\digitwidth=\wd0
\setbox1=\hbox{$-$}
\minuswidth=\wd1
\newdimen\starr
\setbox2=\hbox{${}^*$}
\starr=\wd2

{\catcode`?=\active
\def?{\kern\digitwidth}
\catcode`@=\active
\def@{\kern\minuswidth}
\catcode`|=\active
\def|{\kern\starr}}


\begin{document}
\noindent {\heada Chapter 1 Notes}

\section{Data}

\noindent \underline{\bf Statistics} consists of three major areas:
\begin{itemize}
\item Data Collection (sampling plans and experimental designs) \vspace{-0.1in}
\item Descriptive Statistics (numerical and graphical summaries) \vspace{-0.1in}
\item Inferential Statistics (confidence intervals and hypothesis testing)
\end{itemize}
\vspace{0.05in}

\noindent Statistical procedures are part (steps 2--5 below) of the
\underline{\bf Scientific Method} first espoused by Sir Francis
Bacon (1561--1626), who wrote ``to learn the secrets of nature
involves collecting data and carrying out experiments.''  The
modern methodology:\vspace{-0.1in}
\begin{enumerate}
\item Observe some phenomenon \vspace{-0.1in}

\item State a hypothesis explaining the phenomenon\vspace{-0.1in}

\item Collect data\vspace{-0.1in}

\item Test: Does the data support the hypothesis?\vspace{-0.1in}

\item Conclusion.  If the test fails, go back to step 2.
\end{enumerate}

\noindent If you encounter a ``scientific claim'' that you disagree
with, scrutinize the steps of the scientific method used.
``Statistics don't lie, but liars do statistics.'' - Mark Twain.

\vspace{0.1in}

\noindent \underline{\bf Individuals}: The objects from which data is collected.
Individuals may be people, places, animals, things, {\it even} time periods.
\vspace{0.1in}

\noindent \underline{\bf Variable}: Any characteristic of an
individual which can be measured. \vspace{0.1in}

\noindent \underline{\bf Two Types of Variables}:
\begin{itemize}
\item {\bf Categorical} (or Qualitative) - The possible values are
{\it categories}.  Beware: some categories are labeled with
numbers (e.g. zip codes and dates). \vspace{-.1in}

\item {\bf Numerical} (or Quantitative) - The possible values are
{\it numbers} so that mathematical operations, such as averaging,
make sense!
\end{itemize}
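
\noindent The warning about numbers used as category labels can be
illustrated with a minimal sketch in R (the software introduced later
in these notes).  The zip codes below are made up, and the names {\bf
zip} and {\bf zip.cat} are hypothetical.

\begin{verbatim}
zip = c(16801, 16802, 16801, 16803)  # made-up zip codes, stored as numbers
mean(zip)               # R computes an average, but it is meaningless
zip.cat = factor(zip)   # tell R to treat the values as categories
levels(zip.cat)         # the distinct categories
table(zip.cat)          # counts per category, a sensible summary
\end{verbatim}

\noindent Counting how often each category occurs makes sense for a
categorical variable; averaging its labels does not.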


\begin{verse}
\noindent {\bf \underline{QUESTION}: Categorical or Numerical?}
\begin{enumerate}
\item Lifetime of a battery:
\item Type of battery:
\item Distance to school:
\item UPC:
\end{enumerate}
\end{verse}
\vspace{0.05in}

\noindent \underline{\bf Two Types of Numerical Variables}:
\begin{itemize}
\item {\bf Discrete} - The possible values are isolated points on
the number line.  Discrete variables can be either:
\vspace{-0.1in}
\begin{itemize}
\item  {\bf finite} (e.g. the number of beers left in a six pack:
0, 1, 2, 3, 4, 5 or 6)\vspace{-0.1in}

\item {\bf infinite} (e.g. the number of (full) minutes until the
next terrorist attack: 0, 1, 2, 3, $\hdots$; the list never ends, but
$\infty$ itself is not a possible value).
\end{itemize}
\item {\bf Continuous} - The possible values are an interval on
the number line (e.g. the distance between any two students in
this classroom (in feet) is in the interval [0,50) - all real
numbers between 0 and 50, including 0 and excluding 50).
\end{itemize}

\begin{verse}
\noindent {\bf \underline{QUESTION}: Discrete or Continuous?}
\begin{enumerate}
\item Amount of money on you:
\item Your height:
\item Reaction time:
\item Number of children you have:
\end{enumerate}
\end{verse}
\vspace{0.05in}

\noindent \underline{\bf Population}: The entire group of
individuals that we want information about. For example: all
grizzly bears in Yellowstone National Park; all G.E. light bulbs
(made now and in the future); all tosses with a weighted die.

\vspace{0.05in}

\noindent \underline{\bf Sample}: A part of the population from
which data is collected.  For example: 22 tagged grizzly bears in
Yellowstone National Park; 1 box of G.E. light bulbs; 100 tosses with
a weighted die. \vspace{0.05in}

\noindent Typically, it is unrealistic to obtain data from the
entire population of interest.  So one collects data from a sample
and uses the sample results to draw conclusions about the
population. This process is called \underline{\bf Inference}.
\vspace{0.05in}
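
\noindent The weighted-die example can be imitated in R as a rough
sketch.  The weights in {\bf p} are an assumption made up for
illustration (in practice the population values are unknown), and the
seed is arbitrary.

\begin{verbatim}
set.seed(1)                          # arbitrary seed, for reproducibility
p = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5)  # hypothetical weights for faces 1 to 6
tosses = sample(1:6, size = 100, replace = TRUE, prob = p)  # the sample
table(tosses)/100                    # sample proportions estimate p
mean(tosses == 6)                    # sample estimate of the chance of a six
\end{verbatim}

\noindent The population is every toss the die could ever produce; the
100 simulated tosses are the sample, and the sample proportions are
used to infer the unknown weights.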

\noindent \underline{\bf Explanatory Variable vs. Response Variable}: One or more variables
({\bf explanatory variables}) are used to predict or explain the values of another variable
({\bf response variable}).


\section{Obtaining and Installing R}

The following is a revised version of what appears in Chapter 7.1
of your \underline{Course Notes: Statistics for Researchers
STAT401 FALL 2006}:

\begin{enumerate}
\item Get on the Internet and go to the web address
\underline{http://cran.r-project.org}. This is the ``official''
site of the Comprehensive R Archive Network (CRAN). Bookmark
this address. Lots of information (manuals, answers to frequently
asked questions, etc.) can be downloaded from this site.

\item The first box on this page is labeled {\em Download and Install
R}.  In that box, click on the appropriate link.  For example, Mac
users will click on {\em MAC OS X} and Microsoft Windows users will
click on the link {\em Windows (95 and later)}.  The rest of the
instructions are specific to Windows users.

\item On the new page, click on the link named {\em base}.

\item On the new page, the link {\em README.R-2.4.1} provides a brief synopsis on
installation and other instructions for R version 2.4.1 for Windows.
You shouldn't need to look at this file, but take a look if you get
into trouble.

\item Click on the link {\em R-2.4.1-win32.exe}. Download this setup
program to the hard drive on your computer.

\item Exit from your Internet Browser and open Windows Explorer. Go to the
folder in which you saved R-2.4.1-win32.exe and run the program.

\item You will be guided through the installation by a Setup
Wizard.

\item There are many excellent resources for using R. One
interactive site is at
\underline{http://www.math.csi.cuny.edu/Statistics/R/simpleR/},
called ``simpleR'' by John Verzani.

\item Special-purpose software routines are bundled as separate
``packages.'' Some packages are automatically downloaded when base R
is downloaded. To download additional packages, start R on your PC
and then click on {\em Packages} in the menu bar at the
top of the screen.  From the drop down menu, click on {\em Install
package(s) ...} and then choose the package(s) that you want to
download. The packages that you will need to download for this
course are the following:

\begin{itemize}
\item lattice

\item pastecs

\end{itemize}
MASS is another package that we will be using, but you do NOT need
to download it because it is installed along with base R.
\end{enumerate}
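
\noindent Packages can also be installed by typing a command at the R
prompt instead of using the menus.  The following is a minimal
sketch, assuming a working Internet connection; R may ask you to
choose a download site (a ``mirror'') the first time.

\begin{verbatim}
install.packages(c("lattice", "pastecs"))  # download and install (once)
library(lattice)   # load a package for the current R session
library(pastecs)
library(MASS)      # installed along with R, so it only needs loading
\end{verbatim}

\noindent Installing is done once per computer, while {\bf library()}
must be run in every new R session in which the package is used.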

\section{Entering Data into R}

\noindent A researcher is interested in determining whether adding
a certain type of bacteria, called PC, helps increase the firmness
of cottage cheese.  Seven dairies make two identical batches of
cottage cheese, one with and one without the bacteria PC. The
results of the experiment are in a text file called ``dairy.txt''
which is shown below:

\verbatiminput{Chapter1.data.dairy.txt}

\noindent Text data files that are tab or space delimited can be
imported into R.  This means that the names of the variables in
the file cannot have spaces in them (e.g. don't use ``Cheese
Firmness''). To get dairy.txt into R, execute the following
command:

\begin{verbatim}
> D = read.table("dairy.txt",header=TRUE)
\end{verbatim}

\noindent {\bf read.table} is a {\em function}, and the {\em
parameter} {\bf header=TRUE} tells R that the first line of the
file contains the variable names of each of the columns of data.
You could end up with an error like:

\begin{verbatim}
Error in file(file, "r") : unable to open connection
In addition:
Warning message: cannot open file `dairy.txt'
\end{verbatim}

\noindent The above error occurred because dairy.txt was not in
the {\bf working directory}.  To change the working directory to
the one where dairy.txt resides, in R, click on {\bf File
$\to$ (Change dir ...)} and a {\bf Choose Directory}
window will appear. In this window, you can directly enter the
directory that contains dairy.txt on your computer, or you can hit
the Browse button to find the directory (and then click OK in the
Browser window). Once you have found the directory that contains
dairy.txt, click OK in the {\bf Choose Directory} window. Now we
can try to read the data into R again.

\begin{verbatim}
> D = read.table("dairy.txt",header=TRUE)
\end{verbatim}
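
\noindent The working directory can also be checked and changed from
the R prompt with {\bf getwd()} and {\bf setwd()}.  The sketch below
is only an illustration: it writes a small made-up file called
demo.txt (NOT the real dairy.txt) into a temporary folder and reads it
back.  The file name, the values and the names {\bf demo.dir}, {\bf
old.dir} and {\bf Demo} are all hypothetical.

\begin{verbatim}
demo.dir = tempdir()                  # a temporary folder R provides
writeLines(c("Farm Treatment Firmness",
             "A withPC 50",
             "A withoutPC 40"),
           file.path(demo.dir, "demo.txt"))  # made-up stand-in file

old.dir = getwd()           # remember the current working directory
setwd(demo.dir)             # change the working directory
file.exists("demo.txt")     # can R see the file from here?
Demo = read.table("demo.txt", header=TRUE)
setwd(old.dir)              # go back to where we started
\end{verbatim}

\noindent Checking {\bf file.exists()} before calling {\bf
read.table} is a quick way to tell whether the ``unable to open
connection'' error is about to occur.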

\noindent The R-variable {\bf D} which contains the data is called
a {\bf data frame}.  We could have used any variable name, like
``DairyData'' or ``CCheese'', but I don't like to type much, so I used
``D''.  Note that you cannot have spaces in your R-variable names!
Type the variable name at the R prompt to see what the data looks
like:

\verbatiminput{Rout1.txt}

\noindent To access the individual columns of the data in D, type

\verbatiminput{Rout2.txt}

\noindent Or you can execute

\verbatiminput{Rout3.txt}

\noindent R is case-sensitive!  The upper and lower-case letters
in the variable name must be EXACTLY as given in the data file or
R will not find it.  For example,

\begin{verbatim}
> FIRMNESS
Error: object "FIRMNESS" not found
> D$FirmneSS
NULL
\end{verbatim}

\noindent Notice that R recognizes that {\bf Farm} and {\bf
Treatment} are categorical variables and gives the {\em levels} or
categories associated with each.  The variable {\bf Firmness} is
recognized as a quantitative variable.
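
\noindent One caution for newer software: starting with R version
4.0.0, {\bf read.table} keeps text columns as plain character strings
unless it is asked to convert them to categorical variables
(factors).  The sketch below uses a made-up two-row data set typed
directly into R (not the real dairy.txt); the names {\bf txt} and
{\bf Demo} are hypothetical.

\begin{verbatim}
txt = "Farm Treatment Firmness
A withPC 50
A withoutPC 40"
Demo = read.table(text=txt, header=TRUE, stringsAsFactors=TRUE)
is.factor(Demo$Treatment)    # is the column stored as categorical?
levels(Demo$Treatment)       # its categories
is.numeric(Demo$Firmness)    # is the column stored as quantitative?
\end{verbatim}

\noindent With {\bf stringsAsFactors=TRUE}, the text columns are
stored as factors with levels, as in the output shown in these notes.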

\vspace{.1in}

\noindent In addition to {\bf read.table}, we will be using many
other functions that R has available.  For example, {\bf mean()}
calculates the mean and {\bf median()} calculates the median.  The
functions {\bf sd()} and {\bf var()} calculate the standard
deviation and variance, respectively.  For example:

\begin{verbatim}
> mean(Firmness)
[1] 63.5
> median(Firmness)
[1] 66
> sd(Firmness)
[1] 12.91243
> sd(Farm)
Error in var(as.vector(x), na.rm = na.rm) :
        missing observations in cov/cor
In addition: Warning message: NAs introduced by coercion
\end{verbatim}

\noindent The command {\bf sd(Farm)} yields an error because {\bf
Farm} is a categorical variable, so a standard deviation makes no
sense for it.
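
\noindent To see what {\bf mean()}, {\bf var()} and {\bf sd()}
compute, the sketch below redoes the arithmetic by hand on a short
made-up vector; the values in {\bf x} and the names {\bf xbar}, {\bf
s2} and {\bf s} are hypothetical.  Note that R divides by $n-1$, not
$n$, when computing the variance.

\begin{verbatim}
x = c(2, 4, 4, 4, 5, 5, 7, 9)    # made-up measurements
n = length(x)
xbar = sum(x)/n                  # the mean
s2 = sum((x - xbar)^2)/(n - 1)   # the variance, with divisor n - 1
s = sqrt(s2)                     # the standard deviation
c(xbar, mean(x))                 # each pair of numbers should agree
c(s2, var(x))
c(s, sd(x))
\end{verbatim}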

\vspace{.1in}

\noindent Oftentimes, it is a good idea to store a result in an
R-variable so that you can refer to it later. Then you can type
the new variable name to see what is stored in it.  For example,

\begin{verbatim}
> firm.mean = mean(Firmness)
> firm.mean
[1] 63.5
> firm.mean/10 +100
[1] 106.35
\end{verbatim}

\noindent The last command shows that R-variables can be used with
the mathematical operators +, -, * and /.  To compute the mean and
standard deviation of the firmness of cottage cheese without PC
and of the firmness of cottage cheese with PC, execute

\begin{verbatim}
> tapply(Firmness,Treatment,mean)
withoutPC    withPC
 61.14286  65.85714
> tapply(Firmness,Treatment,sd)
withoutPC    withPC
 12.62839  13.74080
\end{verbatim}

\noindent Does this suggest that adding PC might increase the
firmness of cottage cheese?
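
\noindent The function {\bf tapply} splits the first variable into
groups according to the second variable and applies the given
function to each group.  The sketch below spells out those two steps
on made-up numbers (not the dairy data); the names {\bf y}, {\bf g}
and {\bf groups} are hypothetical.

\begin{verbatim}
y = c(50, 40, 70, 60)     # made-up firmness values
g = c("withPC", "withoutPC", "withPC", "withoutPC")  # made-up labels
groups = split(y, g)      # a list: one vector of y-values per group
sapply(groups, mean)      # apply mean() to each group
tapply(y, g, mean)        # the same group means in one step
\end{verbatim}

\noindent Replacing {\bf mean} by {\bf sd} or {\bf median} in the last
line gives the corresponding summary for each group.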




\section{Exercises}

1.3 on p10: 1, 2, 3, 5, 7

\noindent 1.4 on p18: 9, 11, 15, 19

\section{Reading} All sections of Chapter 1

\end{document}

