Lab 1: Introduction to R and RStudio
Getting Started in R and RStudio
R is a high-level, object-oriented programming language and a free implementation of
the programming language S. By object-oriented, we mean that everything in R is treated
as an "object". A data frame is a specific type of object in R; so is a numeric value,
a character value, or a matrix.
We will be using RStudio, which is an integrated development environment (IDE) for R. RStudio is more user-friendly
than using R directly since it keeps track of your R script file, console, plots,
and history, all in one place.
How to Learn R
There are many free resources online including:
 Coursera free R classes: R Programming by Johns Hopkins
 Quick-R website
 R-bloggers: a central hub of content collected from bloggers who write about R
How to Start RStudio
Once R and RStudio are installed on your computer, RStudio opens like any other application. In Windows, go to the Start menu, then find RStudio. On a Mac, open the Applications folder and click on the RStudio icon. It is also useful to create a desktop shortcut to the program.
The R Console
When you open RStudio, you will see the "Console" window and >, R's command prompt. This indicates R is ready to evaluate a command. For example,
> sample(1:6,1)
[1] 5
>
The command sample(1:6,1) tells R to take a random sample of size 1 from the numbers 1 through 6, so your result may differ. Here R responds with [1] 5. The [1] is the index of the first value printed on that line of output (you can ignore it for now). Then R gives
another >, showing that it is ready for another command.
R also has a continuation prompt, +, which appears when the command you entered is incomplete.
> sample(1:6,
+ 1)
[1] 5
R will not return to the command prompt > until you finish the command that started the continuation prompt.
R Syntax
R has many built-in functions. When we use an R function, the syntax is as follows:
function.name( arg.name=value, ... )
For example, the "rep" function creates a vector of repeated values:
rep(x=3, times=10)
The function has two arguments named "x" and "times". We want to repeat the value 3 ten times. If we keep the arguments in the same order, we do not need to type their names:
rep(3,10)
But if we don't use the argument names, order matters:
rep(10,3)
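As a small sketch of how argument matching plays out (the object names a1 and a2 are only illustrative), named arguments can be supplied in any order, while unnamed arguments are matched by position:

```r
# Named arguments can be given in any order
a1 <- rep(x = 3, times = 10)
a2 <- rep(times = 10, x = 3)
identical(a1, a2)   # TRUE: both are ten copies of 3

# Unnamed arguments are matched by position, so swapping them changes the result
rep(3, 10)          # ten copies of 3
rep(10, 3)          # three copies of 10
```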
The "#" sign is the comment character. Everything after a # is ignored by R:
3+5 # R ignores everything I say from now on...
Make ample use of the "#" sign to put comments in your R script files. This will help you remember what you were doing when you go back to look at your code at a later date, as well as help others understand your code.
Working Directory
Any objects we export from R will be saved in its working directory. We can use the commands setwd() or getwd() to set the working directory or to ask R what working directory it is using. Alternatively, we can click on the "Files" tab to view or change the Working Directory.
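A minimal sketch of these two commands follows. The temporary folder returned by tempdir() is used here only as a safe stand-in for a real project folder, and old_dir is an illustrative name:

```r
old_dir <- getwd()   # ask R which working directory it is using, and remember it
setwd(tempdir())     # change to R's per-session temporary folder (example target only)
getwd()              # confirm the change
setwd(old_dir)       # switch back to the original working directory
```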
Organizing R Code
You should get in the habit of writing your R commands in an R script file before evaluating them in the R console. The benefit will become apparent as you start using for loops and writing functions. From within RStudio, we can open a new "R Script" file by going to File > New File > R Script. When you save a script file, use the extension .R. Later, to run the code, you can highlight it in the script file and click "Run" or, with even less work, source the file into R without opening it. For example, if your code is saved as mycode.R, then in the R console you can type
source("mycode.R")
if the file is in your working directory. If the file is located elsewhere on the computer, enter the full file path instead. We can even source in code from the web!
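To try sourcing without creating a file by hand, the sketch below writes a two-line script to a temporary location and then sources it by its full path. The script contents and the names script_path, side, and roll are made up for illustration:

```r
# Write a small script to a temporary file (a stand-in for your own mycode.R)
script_path <- file.path(tempdir(), "mycode.R")
writeLines(c("side <- 6",
             "roll <- sample(1:side, 1)"),
           script_path)

# Run every command in the file; the full path is needed because
# the file is not in the working directory
source(script_path)
roll   # an object created by the sourced script
```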
How to Save and Quit R
There are three types of files you might want to save from your RStudio session:
 R script file (name.R)
 R workspace file (name.RData)
 R history file (name.Rhistory)
R does not save what prints in the R console. Your R script file is the R code; the workspace saves an R session with all of the created objects in the workspace; your R history file is a history of the commands that were entered into the console.
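As a rough sketch of what a workspace file holds, the code below saves one object to an .RData file, removes it, and restores it. The object scores and the file name lab1.RData are hypothetical, and the file is placed in a temporary folder:

```r
scores <- c(90, 85, 77)                            # an example object
rdata_path <- file.path(tempdir(), "lab1.RData")   # hypothetical file name
save(scores, file = rdata_path)   # save selected objects; save.image() saves all of them
rm(scores)                        # remove the object from the workspace
load(rdata_path)                  # restore it from the file
scores
```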
Basic R Commands
Here are some examples of basic R commands that you will find useful. Try typing them into the R console (each command followed by RETURN). If you get an error message of the form "Error: object 'x' not found", it may be because you skipped an earlier example which created the object, or because you misspelled the name of the object. Remember that R is case-sensitive!
Object Assignment
y <- 4 # assignment: creates an object in your workspace named y
y # enter the name of an object to see
# its contents
x = 3 # an equals sign also works for assignment, though not recommended
x
Basic Mathematics
1+1
3-2
5*8
3^2
8/2+6^2
sqrt(81) # square root function
cos(pi) # cosine function; pi is a built-in R object
log(100) # natural log function
log(exp(3.2)) # exp(a) raises the constant e to the power a
log(100,10) # log(a,b) is the log of a base b
Getting Help
help(log) # the help function
?cos # shortcut for the help function
Creating Data Sets
n <- 2:5 # integer sequences
n
x <- seq(1,2,.1) # general sequences using the "seq" function;
x # writes over the previous "x" object
# data entry; the "c" function stands for "combine"
z <- c(2.3, 1.2, 4.4, 4.7, 1.2, 6.3)
z
Z <- runif(6) # generate 6 random numbers in (0,1)
Z
y <- Z+z # add two data sets to create a third
y
a <- c("red", "orange", "yellow", "green","blue", "violet")
a
b <- rep(1:2,3) # repeated values
b
B <- rep(1:2,c(3,3)) # repeated values, take two
B
Built-in Data Sets
help(data)
data() # shows you a list of all the built-in data sets
data(co2) # loads the data set "co2" into your workspace
co2 # print data set
help(co2)
plot(co2) # plot the time series
plot(co2,col="red")
Data Types
x <- seq(1,2,.2)
x # numeric
w <- c("a","a","b","b","c","c")
w # character data
b <- factor(w)
b # factor: a coded categorical data set
X <- data.frame(x,b)
X # data frame: a table of data
# rows are cases, columns are variables
check <- w=="a"
check # logical vector of T/F
sum(check) # R treats TRUE as a 1 and FALSE as a 0,
# so summing a logical vector gives the number of TRUE's.
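To see what "coded" means for a factor, the following sketch inspects the same kind of objects with class(), levels(), and as.integer(). It redefines w and b so that it can run on its own:

```r
w <- c("a", "a", "b", "b", "c", "c")
b <- factor(w)

class(w)        # "character"
class(b)        # "factor"
levels(b)       # the distinct categories: "a" "b" "c"
as.integer(b)   # the integer codes stored behind the labels: 1 1 2 2 3 3
table(b)        # number of cases in each category
```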
Selecting subsets of data sets and logical operators
z[1] # case selection by subscript; square brackets always indicate subsets of an object
z[2]
z[2:5]
z[5:2]
z[n]
# case selection by logical comparison
z > 2
z[z > 2]
a == "blue"
a != "blue"
a != "blue" & a != "orange"
a != "blue" | a != "orange"
# case selection using another data set of the same size
z[a == "blue"]
z[a != "blue"]
X[1,] # select first row
X[,1] # select first column
X[1,2] # select item in the first row and second column
X$b # select column with the name "b"
X[X[,2] == "a",]
# select the rows for which column 2 is "a"
Operations on data sets
2+z
3*z
z^2
2+3*z
sqrt(z) # Note: the square root of a negative number is
# undefined (NaN: Not a Number)
sort(z)
sort(2+3*z)
2*a # we can't do arithmetic with all data!
Probability Distributions in R
pnorm(2) # cdf of standard normal distribution evaluated at z = 2, i.e., P(Z <= 2)
pnorm(300, 515, 100) # cdf of normal distribution with mean 515 and standard deviation 100
# evaluated at 300
?pnorm # Explore other R functions used with the normal distribution.
?pt # t-distribution
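The help page for pnorm also documents its companion functions. As a brief sketch, the prefixes d, p, q, and r give the density, the cdf, the quantile function (the inverse of the cdf), and random draws. The seed value below is arbitrary:

```r
pnorm(1.96)    # P(Z <= 1.96), approximately 0.975
qnorm(0.975)   # inverse of pnorm: the 97.5th percentile, approximately 1.96
dnorm(0)       # height of the standard normal density at 0, equal to 1/sqrt(2*pi)

set.seed(1)                      # arbitrary seed so the draws can be reproduced
rnorm(3, mean = 515, sd = 100)   # three random draws from a normal distribution
```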
More Useful R functions
mean(z)
median(z)
max(z)
min(z)
range(z)
sum(z)
length(z)
ls() # lists all the objects saved in your workspace
options(digits=20) # set the digits option to display 20 digits
options(digits=7) # (default is 7)
Data Input: Birth Weights
Taken from Stat Labs by Nolan and Speed, originally from the Child Health and Development Studies conducted at the Oakland, CA, Kaiser Foundation Hospital. The variables are
 bwt: baby's weight in ounces at birth
 gestation: duration of pregnancy in days
 parity: parity indicator (first born = 1, later birth = 0)
 age: mother's age in years
 height: mother's height in inches
 weight: mother's weight in pounds (during pregnancy)
 smoke: indicator for whether mother smokes (1=yes, 0=no)
The data will be read into R as a "data frame", a table of data in which
each case (here, an individual) corresponds to a row and each variable corresponds
to a column. The row labels for this data frame are just row numbers; the column
labels are the names of the variables. Only complete cases are included here.
Use the following command to load the data into your R session:
babies <- read.table("http://www.math.montana.edu/shancock/courses/stat401/data/Bwt.dat", header=TRUE, sep=",")
Check that the data were read in correctly:
head(babies)
tail(babies)
dim(babies) # For the dimension of a vector, use the function "length"
names(babies)
is.data.frame(babies)
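If the download is unavailable, the same checks can be practiced on a tiny stand-in. The sketch below passes made-up, comma-separated text to read.table through its text argument; the values are illustrative and are not taken from the birth weight data:

```r
# A tiny made-up "file" in the same comma-separated layout (illustrative values only)
raw <- "bwt,gestation,smoke
100,270,0
110,280,1
120,290,0"

# header=TRUE uses the first line as column names; sep="," splits fields on commas
toy_babies <- read.table(text = raw, header = TRUE, sep = ",")
dim(toy_babies)     # 3 rows and 3 columns
names(toy_babies)
str(toy_babies)     # compact display of each column's type and first values
```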
We will take bwt to be the response variable. For now, consider gestation as the only
predictor. (We will explore this data set in more detail, using more predictor variables,
in the future.) The first step in any data analysis should be to explore the data
with plots and summary statistics.
Basic Plotting in R
Histograms
Let's take a look at the distribution of gestation periods using a histogram:
hist(babies$gestation)
A histogram places each observation into predetermined "bins", where the height of each bar is the number of observations in that bin. Our histogram doesn't look too good, so let's try a different number of bins:
hist(babies$gestation, breaks=40)
The option breaks=40 asks R to break up the x-axis into about 40 bins (R treats the number as a suggestion). Notice that the labels on the vertical axis are counts (frequencies). To look at the "density" histogram instead, use
hist(babies$gestation, breaks=40, freq=FALSE)
When you use the freq=FALSE argument in hist(), you are asking for the density histogram, which has total area 1. The area of each bar equals its relative frequency (the count divided by the total number of observations). For example, in the interval
from 280 to 300, the frequency histogram shows a count of 61. The height of that interval
in the density histogram is about 0.00247, and the width is 20. Thus the area for that
interval is about 0.00247*20 = 0.049 (4.9% of the sample), and 0.049*1236 is about 61.
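The relationship between bar height, bar width, and count can be checked numerically. The sketch below uses simulated values as a stand-in for the gestation data (the seed, sample size, mean, and standard deviation are arbitrary assumptions) and asks hist() to compute the bins without drawing them:

```r
set.seed(401)                             # arbitrary seed
g <- rnorm(500, mean = 280, sd = 16)      # simulated stand-in, not the babies data
h <- hist(g, breaks = 40, plot = FALSE)   # compute the bins without drawing a plot

widths <- diff(h$breaks)                  # width of each bin
areas  <- h$density * widths              # area of each bar = height * width

sum(areas)                                           # total area is 1
all.equal(areas * length(g), as.numeric(h$counts))   # area * n recovers each count
```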
We can also add a smoothed density line to a histogram:
lines(density(babies$gestation))
Note that the histogram must be on the density scale in order to add a smoothed density
line.
In every plot function, there are various arguments that add to the figure, such as
adding titles, axis labels, or color:
hist(babies$gestation,breaks=40,col="seagreen",main="Gestation Distribution",xlab="Days")
For a list of all the colors in R, type:
colors()
Box Plots
A boxplot is really not much more than a graphical display of a five-number summary (min, 1st quartile, median, 3rd quartile, max). The body of the box represents the location of the quartiles, with a line added at the median. The "whiskers", or lines extending out from the box, reach to the furthest observations that are no more than 1.5 times the interquartile range (Q3 - Q1) from the quartiles. Outliers are displayed as points beyond the whiskers.
boxplot(babies$gestation)
# or horizontal
boxplot(babies$gestation, horizontal = TRUE)
Note that the above plots are missing axis labels. How would you add them?
R will also do side-by-side boxplots, which we can use to compare distributions of
quantitative variables across categories:
boxplot(gestation ~ smoke, data=babies, xlab="Smoker (0 = no; 1 = yes)", ylab="Gestation (days)")
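To see the numbers behind a box plot without drawing one, the sketch below applies fivenum() and boxplot.stats() to a small made-up vector with one extreme value. Note that these functions use "hinges", which can differ slightly from the quartiles returned by quantile():

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)   # made-up values with one extreme point

fivenum(v)                 # min, lower hinge, median, upper hinge, max: 1 3 5 7 30

stats <- boxplot.stats(v)  # the quantities boxplot() uses, with whiskers at 1.5 * IQR
stats$stats                # whisker and box positions: 1 3 5 7 8
stats$out                  # points beyond the whiskers, drawn as outliers: 30
```

Here the box spans 3 to 7, so the upper whisker can reach at most 7 + 1.5*4 = 13. The largest observation within that limit is 8, and 30 is flagged as an outlier.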
Comparing Two Variables
For the most part, we will be interested not in just one variable, but the relationship
between two or more variables. Depending on whether the variables are quantitative or categorical,
this can be done in a variety of ways.
In order to refer to variables directly by name (rather than preceding the variable
name with babies$), let's attach the data set:
attach(babies)
(Note: Many R coders do not recommend the use of the attach function since it can clutter your R workspace. Make sure you detach the data set after you are finished!)
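One commonly used alternative to attach is the base R function with(), which makes a data frame's columns visible only for the duration of a single expression. The sketch below uses a small made-up data frame named toy, not the babies data:

```r
# Made-up data frame for illustration only
toy <- data.frame(gestation = c(270, 280, 290),
                  bwt       = c(110, 120, 130))

with(toy, mean(bwt))             # columns can be named directly inside with(): 120
with(toy, cor(bwt, gestation))   # nothing is left attached afterwards
```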
The "plot" function is a generic plotting function. We can feed it one variable:
plot(bwt) # Plots birthweight vs. observation number
or two variables (scatterplot):
plot(bwt ~ gestation) # The formula is y ~ x, where x is the x-axis variable and y is the y-axis variable.
# (Could instead use arguments x,y, e.g., plot(gestation,bwt).)
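or side-by-side boxplots:
plot(bwt ~ factor(smoke)) # a factor on the right-hand side gives boxplots; a numeric 0/1 variable would give a scatterplot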
or an entire data set:
plot(babies)
What if we want to compare more than two variables? If they are all quantitative, we would need a 3D scatterplot. However, if there are two quantitative variables and one categorical variable, we can use a scatterplot of the two quantitative variables with plot symbols denoting the levels of the categorical variable. Let's try this with gestation, bwt, and smoke.
plot(bwt~gestation,type="n",main="Bwt vs. Gestation by Smoke",xlab="Gestation (Days)",ylab="Bwt (Ounces)")
# type="n" tells R just to set up the plot window
# and don't plot the points (think of "n" standing for "nothing")
points(bwt[smoke==0]~gestation[smoke==0],pch=15,col="hotpink") # Plot nonsmokers
points(bwt[smoke==1]~gestation[smoke==1],pch=18,col="rosybrown") # Plot smokers
legend(locator(1),c("Nonsmoker","Smoker"),pch=c(15,18),col=c("hotpink","rosybrown")) # Add legend
abline(lm(bwt[smoke==0]~gestation[smoke==0]),col="hotpink") # Add least squares regression lines
abline(lm(bwt[smoke==1]~gestation[smoke==1]),col="rosybrown")
Now that we are finished referring to the variables by name (without the babies$ prefix), let's detach the data set:
detach(babies)
Summary Statistics
R has functions built in for most of the standard quantitative measures that we are likely to use. Those that aren't built in are easy to add.
Basic numerical measures for a data set X1, X2, X3, ..., Xn, stored in an R variable named "x":
Statistic | Definition | R Function
mean (average) | (X1+X2+...+Xn)/n | mean(x)
median (middle value) | 50th percentile, i.e., a value M such that 50% of the data are less than M and 50% are greater than M | median(x)
minimum | value of the smallest data point | min(x)
maximum | value of the largest data point | max(x)
p-th quantile | a value Q such that p*100% of the data are less than Q and (1-p)*100% are greater than Q. Special cases: Q1 (p = 0.25), median (p = 0.5), Q3 (p = 0.75) | quantile(x,p)
5-number summary | min, Q1, median, Q3, max | quantile(x)
variance | average squared deviation from the mean (R divides by n-1) | var(x)
standard deviation | square root of the variance; a "typical" deviation from the mean | sd(x)
order statistics | data in ascending order | sort(x)
We can easily add functions, for example to compute the interquartile range (IQR)
= Q3 - Q1:
iqr <- function(x){
  # computes the interquartile range of a data set
  r <- quantile(x, c(0.25,0.75))
  r[2] - r[1]
}
In the iqr function defined above, the variable r will be a vector with 2 elements, the first
and third quartiles. The final expression of the function is the value returned, in
this case the difference between the two quartiles.
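As a quick check of this function, the sketch below redefines it so the block runs on its own and compares it with R's built-in IQR() on a made-up vector. The result of iqr() carries the name "75%" inherited from quantile(), which unname() removes:

```r
iqr <- function(x) {
  # computes the interquartile range of a data set
  r <- quantile(x, c(0.25, 0.75))
  r[2] - r[1]
}

x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # made-up values

iqr(x)           # the value is labelled "75%" because of the names from quantile()
unname(iqr(x))   # the same value without the label
IQR(x)           # R's built-in interquartile range gives the same number
```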
There are also functions that are meant for two or more variables, such as correlation:
cor(babies$bwt, babies$gestation)
cor(babies)