Getting Started in R and RStudio

R is a high-level, object-oriented programming language and a free implementation of the programming language S (see a history of R). By object-oriented, we mean that everything in R is treated as an "object". A data frame is a specific type of object in R; so is a numeric value, a character value, or a matrix.

We will be using RStudio, which is an integrated development environment (IDE) for R. RStudio is more user-friendly than using R directly since it keeps track of your R script file, console, plots, and history, all in one place. RStudio uses what it calls "Projects" to organize your workflow. Each project (with its .Rproj file) is a self-contained unit that contains the R code, R objects, data files, etc. for a given project. You may want to create a separate project for each analysis in this course.

Recently, RStudio created an online workspace called RStudio Cloud. If you prefer not to download RStudio to your computer, or if you'd like to access your work from other computers, you may choose to use RStudio Cloud rather than RStudio on your computer. After you create an account and log in to RStudio Cloud, you can click "New Project" to start working in RStudio online. You will need to upload any data sets or code to the cloud using the "Upload" button in the "Files" pane (bottom right).

How to Learn R

There are many free resources online for learning R.

How to Start RStudio

Once R and RStudio are installed on your computer, RStudio opens like any other application. In Windows, go to the Start menu, then find RStudio. On a Mac, open the Applications folder and click on the RStudio icon. It is also useful to create a desktop shortcut to the program.

The R Console

When you open RStudio, you will see the "Console" window and >, R's command prompt. This indicates R is ready to evaluate a command. For example,

> sample(1:6,1)
[1] 5
>

The command sample(1:6,1) tells R to take a random sample of size 1 from the numbers 1 through 6. R responds with [1] 5. The [1] is the index of the first value printed on that line of output, which helps you keep your place when a long result wraps across several lines (you can ignore it here). Then R gives another >, showing that it's ready for another command.
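
As a small sketch of the same idea, the call below rolls the die ten times. The replace = TRUE argument and the set.seed() function are additions not used above: replace = TRUE lets the same face come up more than once, and set.seed() fixes the random number generator so the same draws can be reproduced. The seed value 1 is arbitrary.

```r
# Roll a six-sided die ten times (sampling with replacement).
set.seed(1)                                   # arbitrary seed, for illustration only
rolls <- sample(1:6, size = 10, replace = TRUE)
rolls

# Resetting the same seed reproduces exactly the same rolls.
set.seed(1)
rolls_again <- sample(1:6, size = 10, replace = TRUE)
identical(rolls, rolls_again)                 # TRUE
```

Without the set.seed() call, each run would generally give different rolls.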

R also has a continuation prompt, +, which appears when the command you entered is incomplete, for example when a closing parenthesis is missing:

> sample(1:6,
+ 1)
[1] 5

R will not return to the command prompt > until you finish the command that started the continuation prompt or you hit Esc.

R Syntax

R has many built-in functions. When we use an R function, the syntax is as follows:

function.name( arg.name=value, ... )


For example, the "rep" function creates a vector of repeated values:

rep(x=3, times=10)

The function has two arguments named "x" and "times". We want to repeat the value 3 ten times. If we keep the arguments in the same order, we do not need to type their names:

rep(3,10)

But if we don't use the argument names, order matters:

rep(10,3)
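
The sketch below compares the calls directly. The object names (by_name, reordered, by_position, swapped) are made up for this illustration.

```r
by_name     <- rep(x = 3, times = 10)   # named arguments, in the usual order
reordered   <- rep(times = 10, x = 3)   # named arguments, in a different order
by_position <- rep(3, 10)               # unnamed: the value 3 repeated 10 times
swapped     <- rep(10, 3)               # unnamed and swapped: the value 10 repeated 3 times

identical(by_name, reordered)     # TRUE: names make the order irrelevant
identical(by_name, by_position)   # TRUE
swapped                           # 10 10 10
```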

The "#" sign is the comment character. Everything after a # is ignored by R:

3+5 # R ignores everything I say from now on...

Make ample use of the "#" sign to put comments in your R script files. This will help you remember what you were doing when you go back to look at your code at a later date, as well as help others understand your code.

Working Directory

By default, any objects we export from R are saved in its working directory, and R looks there first for files we ask it to read. We can use setwd() to set the working directory and getwd() to ask R which working directory it is using. Alternatively, we can click on the "Files" tab to view or change the working directory.
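
A minimal sketch of the two commands follows. It switches to R's temporary directory (given by tempdir()) only so that the example works on any computer; in practice you would pass setwd() the path of your own project folder. The name old_dir is made up for this illustration.

```r
old_dir <- getwd()      # remember the current working directory
old_dir

setwd(tempdir())        # switch to R's per-session temporary directory
getwd()                 # now reports the temporary directory

setwd(old_dir)          # switch back to where we started
```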

Organizing R Code

You should get in the habit of writing your R commands in an R script file before evaluating the commands in the R console. The value of this habit will become apparent as you start using for loops and writing functions. From within RStudio we can open a new "R Script" file by going to File -> New File -> R Script. When you save a script file, you should use the extension .R. Later, if you would like to run the code, you can highlight the code in the script file and click "Run". With even less work, you can source the file into R without opening it. For example, if your code is saved as mycode.R, then in the R console, you can type

source("mycode.R")

if the file is in your working directory. If the file is located elsewhere on the computer, you can enter the full file path instead. We can even source in code off the web!
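
To try source() without creating a file by hand, the sketch below writes a two-line script into R's temporary directory and then sources it by its full path. The file contents and the names script_path, total, and doubled are made up for this illustration.

```r
# Write a two-line script to a temporary file (a stand-in for mycode.R).
script_path <- file.path(tempdir(), "mycode.R")
writeLines(c("total <- 1 + 2",
             "doubled <- total * 2"),
           script_path)

source(script_path)   # runs every line of the file, in order
total                 # 3
doubled               # 6
```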

How to Save and Quit R

There are four types of files you might want to save from your RStudio session:

  1. R script file (name.R) - your R code
  2. R workspace file (name.RData) - an R session containing all of the created objects in the "Environment" pane (top right).
  3. R history file (name.Rhistory) - a history of the commands that were entered into the console
  4. R project file (name.Rproj) - your RStudio session: script files that are open, R objects and data sets, etc.

When you re-open a workspace or project, you will need to load any necessary libraries again (using the "library" command); thus, it is a good habit to include the code for any required libraries at the top of your script file. Note: R does not save what prints in the R console!
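
RStudio offers to save the workspace when you quit, but the same thing can be done from the console with save() and load(). The sketch below saves a single made-up object (scores) to a hypothetical file named demo.RData in R's temporary directory, removes it, and restores it.

```r
scores <- c(90, 85, 77)                          # hypothetical object to save
rdata_path <- file.path(tempdir(), "demo.RData") # hypothetical file name

save(scores, file = rdata_path)   # write the object to disk
rm(scores)                        # remove it from the workspace
exists("scores")                  # FALSE: the object is gone

load(rdata_path)                  # restore it from the file
scores                            # 90 85 77
```

To save every object in the workspace at once, save.image() can be used in place of save().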


Basic R Commands

Here are some examples of basic R commands that you will find useful. Try typing them into the R console (each command followed by RETURN). If you get an error message of the form "Error: object 'x' not found", it may be because you skipped an earlier example which created the object, or because you misspelled the name of the object. As you work through the code and familiarize yourself with the syntax, try to understand exactly what each line of code is telling R to do. Remember that R is case-sensitive!

Object Assignment

y <- 4		 # assignment: creates an object in your workspace named y

y # enter the name of an object to see
# its contents

x = 3 # an equals sign also works for assignment, though not recommended
x

Basic Mathematics

1+1              
3-2
5*8
3^2
8/2+6^2
sqrt(81) # square root function
cos(pi) # cosine function; pi is a built-in R object
log(100) # natural log function
log(exp(3.2)) # exp(a) raises the constant e to the ath power
log(100,10) # log(a,b) is the log of a base b
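
The two-argument form follows the change-of-base rule, log(a, b) = log(a)/log(b). A short check, which also uses R's built-in shortcuts log10() and log2() (not used above):

```r
log(100, 10)          # 2
log(100) / log(10)    # same value, via the change-of-base rule
log10(100)            # built-in shortcut for base 10
log2(8)               # built-in shortcut for base 2: 3
```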

Getting Help

help(log)	# the help function
?cos # shortcut for the help function

Creating Data Sets

n <- 2:5         # integer sequences
n

x <- seq(1,2,.1) # general sequences using the "seq" function;
x # writes over the previous "x" object

# data entry; the "c" function stands for "combine"
z <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)
z

Z <- runif(6) # generate 6 random numbers in (0,1)
Z

y <- Z+z # add two data sets to create a third
y

a <- c("red", "orange", "yellow", "green","blue", "violet")
a

b <- rep(1:2,3) # repeated values
b

B <- rep(1:2,c(3,3)) # repeated values, take two
B
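
The two rep calls above differ in what gets repeated: the first repeats the whole pattern, the second repeats each value. The sketch below shows both, along with the "each" argument of rep, which is another way to get the second result. The name B_each is made up for this illustration.

```r
b <- rep(1:2, 3)               # whole pattern repeated 3 times: 1 2 1 2 1 2
B <- rep(1:2, c(3, 3))         # each value repeated 3 times:    1 1 1 2 2 2
B_each <- rep(1:2, each = 3)   # same result as B, using the "each" argument

identical(B, B_each)           # TRUE
```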
 

Built-in Data Sets

help(data)
data() # shows you a list of all the built-in data sets
data(co2) # loads the data set "co2" into your workspace
co2 # print data set
help(co2)
plot(co2)         # plot the time-series
plot(co2,col="red")

Data Types

x <- seq(1,2,.2)            
x      # numeric

w <- c("a","a","b","b","c","c")
w      # character data

b <- factor(w)
b      # factor: a coded categorical data set

X <- data.frame(x,b)
X      # data frame: a table of data
       # rows are cases, columns are variables

check <- w=="a"
check # logical vector of T/F
sum(check) # R treats TRUE as a 1 and FALSE as a 0,
# so summing a logical vector gives the number of TRUE's.
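
Because TRUE counts as 1 and FALSE as 0, taking the mean of a logical vector gives the proportion of TRUE values. A short sketch using the same w as above; the table function, not used above, counts every distinct value at once.

```r
w <- c("a", "a", "b", "b", "c", "c")
check <- w == "a"

sum(check)    # 2: number of elements equal to "a"
mean(check)   # 0.3333333: proportion of elements equal to "a"
table(w)      # counts for every distinct value at once
```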

Selecting subsets of data sets and logical operators

z[1] 	# case selection by subscript; square brackets always indicate subsets of an object 
z[2]
z[2:5]
z[5:2]
z[n]

# case selection by logical comparison
z > 2
z[z > 2]
a == "blue"
a != "blue"
a != "blue" & a != "orange"
a != "blue" | a != "orange"
 
# case selection using another data set of the same size
z[a == "blue"]
z[a != "blue"]

X[1,] # select first row
X[,1] # select first column
X[1,2] # select item in the first row and second column
X$b # select column with the name "b"
X[X[,2] == "a",]
       # select the rows for which column 2 is "a"
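
The & (and) and | (or) comparisons above behave quite differently, which the sketch below makes explicit using the same a and z. The %in% operator at the end is not used above; it is a compact way to test membership in a set of values. The name keep is made up for this illustration.

```r
a <- c("red", "orange", "yellow", "green", "blue", "violet")
z <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)

# AND: TRUE only where the color is neither "blue" nor "orange"
keep <- a != "blue" & a != "orange"
a[keep]                              # "red" "yellow" "green" "violet"
z[keep]                              # 2.3 4.4 4.7 6.3

# OR: every color differs from at least one of the two, so all are TRUE
all(a != "blue" | a != "orange")     # TRUE

# %in% tests membership in a set of values
z[!(a %in% c("blue", "orange"))]     # 2.3 4.4 4.7 6.3
```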
 

Operations on data sets

2+z 
3*z
z^2
2+3*z
sqrt(z) # Note: the square root of a negative number is
# undefined (NaN: Not a Number)

sort(z)
sort(2+3*z)

2*a # we can't do arithmetic with all data!

Probability Distributions in R

pnorm(2)		# cdf of standard normal distribution evaluated at z=2 = P(Z <= 2)
pnorm(300, 515, 100) # cdf of normal distribution with mean 515 and standard deviation 100
# evaluated at 300

?pnorm # Explore other R functions used with the normal distribution.
?pt # t-distribution
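
Since pnorm gives P(Z <= z), other probabilities follow by subtraction. The sketch below shows an upper-tail and an interval probability, checks that standardizing by hand matches supplying the mean and standard deviation, and uses qnorm (the inverse of pnorm, described on the same help page).

```r
pnorm(2)                # P(Z <= 2), about 0.977
1 - pnorm(2)            # P(Z > 2), about 0.023
pnorm(2) - pnorm(-2)    # P(-2 <= Z <= 2), about 0.954

# Standardizing by hand gives the same answer as supplying the mean and sd.
pnorm(300, mean = 515, sd = 100)
pnorm((300 - 515) / 100)

qnorm(0.975)            # quantile function (inverse cdf), about 1.96
```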

More Useful R functions

mean(z) 
median(z)
max(z)
min(z)
range(z)
sum(z)
length(z)

ls() # lists all the objects saved in your workspace
options(digits=20) # set the digits option to display 20 digits
options(digits=7) # (default is 7)


Data Input: Birth Weights

Taken from Stat Labs by Nolan and Speed, originally from the Child Health and Development Studies conducted at the Oakland, CA, Kaiser Foundation Hospital. The variables are

  1. bwt: baby's weight in ounces at birth
  2. gestation: duration of pregnancy in days
  3. parity: parity indicator (first born = 1, later birth = 0)
  4. age: mother's age in years
  5. height: mother's height in inches
  6. weight: mother's weight in pounds (during pregnancy)
  7. smoke: indicator for whether mother smokes (1=yes, 0=no)

The data will be read into R as a "data frame", which is a table of data in which each case (here an individual) corresponds to a row, and each variable corresponds to a column. The row labels for this data frame are just row numbers; the column labels are the names of the variables. Only complete cases are included here.

Use the following command to load the data into your R session:

babies <- read.table("http://www.math.montana.edu/shancock/data/Bwt.dat", header=TRUE, sep=",")

Check that the data were read in correctly:

head(babies)
tail(babies)
dim(babies) # For the dimension of a vector, use the function "length"
names(babies)
is.data.frame(babies)
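
To see what header=TRUE and sep="," do without needing an internet connection, the sketch below reads a few made-up rows from a text string using the text argument of read.table. The values and the names raw_text and toy are invented for this illustration; they are not the real birth weight data.

```r
# A few made-up rows in a comma-separated layout (not the real data).
raw_text <- "bwt,gestation,smoke
120,284,0
113,282,0
128,279,1"

# header = TRUE: the first line holds the variable names
# sep = ",":     values on each line are separated by commas
toy <- read.table(text = raw_text, header = TRUE, sep = ",")

dim(toy)      # 3 rows, 3 columns
names(toy)    # "bwt" "gestation" "smoke"
str(toy)      # shows the type of each column
```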

We will take bwt to be the response variable. For now, consider gestation as the only predictor. (We will explore this data set in more detail, using more predictor variables, in the future.) The first step in any data analysis should be to explore the data with plots and summary statistics.


Basic Plotting in R

Histograms

Let's take a look at the distribution of gestation periods using a histogram:

hist(babies$gestation)

A histogram places each observation into pre-determined "bins" where the height of the bin is the number of observations in that bin. Our histogram doesn't look too good - let's try a different bin size:

hist(babies$gestation, breaks=40)

The option breaks=40 asks R to break up the x-axis into about 40 bins; R treats a single number as a suggestion and chooses convenient break points close to it. Notice that the labels on the vertical axis are counts (frequencies). To look at the "density" histogram instead, use

hist(babies$gestation, breaks=40, freq=F)

When you use the freq=F argument in hist(), you are asking for the density histogram, which has total area 1. Area is proportional to relative frequency (the count divided by the total number of observations). For example, in the interval from 280 to 300, the frequency histogram shows a count of 61. The height of that interval in the density histogram is about .00247, and the width is 20. Thus the area for that interval is about .00247*20 = 0.049 (4.9% of the sample), and 0.049*1236 is about 61.
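
The relationship between counts, heights, and areas can be checked numerically. The sketch below uses simulated values as a stand-in for the gestation data (the mean, standard deviation, sample size, and seed are arbitrary assumptions) and calls hist with plot = FALSE so that it returns the bin information without drawing anything.

```r
set.seed(42)                              # arbitrary seed for reproducibility
g <- rnorm(500, mean = 280, sd = 16)      # simulated stand-in for gestation

h <- hist(g, plot = FALSE)                # compute the bins without drawing
widths <- diff(h$breaks)                  # width of each bin

# height = count / (n * width), so area = height * width = count / n
all.equal(h$density, h$counts / (length(g) * widths))   # TRUE
sum(h$density * widths)                                 # 1
```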

We can also add a smoothed density line to a histogram:

lines(density(babies$gestation))

Note that the histogram must be on the density scale in order to add a smoothed density line.

Every plot function has arguments that add to the figure, such as titles, axis labels, or color:

hist(babies$gestation,breaks=40,col="seagreen",main="Gestation Distribution",xlab="Days")

For a list of all the colors in R, type:

colors()

Box Plots

A boxplot is really not much more than a graphical display of a 5-number summary (min, 1st quartile, median, 3rd quartile, max). The body of the box represents the location of the quartiles, with a line added at the median. The "whiskers", or lines extending out from the box, reach to the furthest observations which are no more than 1.5 times the interquartile range (Q3-Q1) from the quartiles. Outliers are displayed as points beyond the whiskers.
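
The numbers behind a boxplot can be inspected with boxplot.stats, which returns the values that boxplot draws. The sketch below uses a made-up vector v with one extreme point. Note that boxplot.stats uses "hinges", which are close to but not always identical to the quartiles returned by quantile.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one extreme point

s <- boxplot.stats(v)   # uses the default multiplier of 1.5
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # points beyond the whiskers: 30
```

Here the hinges are 3 and 8, so the upper whisker may extend at most to 8 + 1.5*5 = 15.5. The largest observation within that limit is 9, and 30 is flagged as an outlier.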

boxplot(babies$gestation)
# or horizontal--
boxplot(babies$gestation, horizontal = TRUE)

Note that the above plots are missing axis labels. How would you add them?

R will also do side-by-side boxplots which we can use to compare distributions of quantitative variables across categories:

boxplot(gestation ~ smoke, data=babies, xlab="Smoker (0 = no; 1 = yes)", ylab="Gestation (days)")

Comparing Two Variables

For the most part, we will be interested not in just one variable, but in the relationship between two or more variables. Depending on whether the variables are quantitative or categorical, this can be done in a variety of ways.

In order to refer to variables directly by name (rather than preceding the variable name with babies$), let's attach the data set:

attach(babies)

(Note: Many R coders do not recommend the use of the attach function, since attached variables can hide, or be hidden by, other objects with the same name. Make sure you detach the data set after you are finished!)
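
One commonly used alternative, shown here only as a sketch, is the with function, which lets an expression refer to a data frame's variables by name for that one command without attaching anything. The data frame toy and its values are made up for this illustration.

```r
toy <- data.frame(bwt = c(120, 113, 128),
                  gestation = c(284, 282, 279))   # made-up values

# with() evaluates an expression using the data frame's columns as variables.
with(toy, mean(bwt))              # same as mean(toy$bwt)
with(toy, cor(bwt, gestation))    # same as cor(toy$bwt, toy$gestation)
```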

The "plot" function is a generic plotting function. We can feed it one variable:

plot(bwt)    # Plots birthweight vs. observation number

or two variables (scatterplot):

plot(bwt ~ gestation)   # The formula is y ~ x: y goes on the vertical axis and x on the horizontal axis.
# (Could instead give the arguments as x, y, e.g., plot(gestation, bwt).)

or side-by-side boxplots:

plot(bwt ~ factor(smoke))   # smoke is stored as the numbers 0 and 1, so convert it to a factor to get boxplots

or an entire data set:

plot(babies)

What if we want to compare more than two variables? If they are all quantitative, we would need a 3-D scatterplot. However, if there are two quantitative variables and one categorical variable, we can use a scatterplot of the two quantitative variables with plot symbols denoting the levels of the categorical variable. Let's try this with gestation, bwt, and smoke. 

plot(bwt ~ gestation, type="n", main="Bwt vs. Gestation by Smoke",
     xlab="Gestation (Days)", ylab="Bwt (Ounces)")
# type="n" tells R to set up the plot window without plotting
# the points (think of "n" as standing for "nothing")
points(bwt[smoke==0] ~ gestation[smoke==0], pch=15, col="hotpink")   # Plot nonsmokers
points(bwt[smoke==1] ~ gestation[smoke==1], pch=18, col="rosybrown") # Plot smokers
# Add legend: locator(1) waits for you to click on the plot
# at the spot where the legend's top-left corner should go
legend(locator(1), c("Nonsmoker","Smoker"), pch=c(15,18), col=c("hotpink","rosybrown"))
# Add least squares regression lines
abline(lm(bwt[smoke==0] ~ gestation[smoke==0]), col="hotpink")
abline(lm(bwt[smoke==1] ~ gestation[smoke==1]), col="rosybrown")
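
Each abline(lm(...)) call above fits a separate least squares line to one group. As a sketch of what those fits contain, the code below fits the two lines to made-up data using the data and subset arguments of lm (an alternative to attaching the data set) and extracts the intercepts and slopes with coef. The data frame toy is constructed so that both groups have slope 0.4 and intercepts 10 and 2; none of these numbers come from the real data.

```r
# Made-up data: two groups with the same slope but different intercepts.
toy <- data.frame(
  gestation = rep(c(270, 280, 290), times = 2),
  smoke     = rep(c(0, 1), each = 3)
)
toy$bwt <- ifelse(toy$smoke == 0, 10, 2) + 0.4 * toy$gestation

fit0 <- lm(bwt ~ gestation, data = toy, subset = smoke == 0)   # nonsmokers
fit1 <- lm(bwt ~ gestation, data = toy, subset = smoke == 1)   # smokers

coef(fit0)   # intercept 10, slope 0.4 (up to rounding error)
coef(fit1)   # intercept 2, slope 0.4 (up to rounding error)
```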

Now that we are finished referring to the variables by name (without the babies$ prefix), let's detach the data set:

detach(babies)

Summary Statistics

R has functions built in for most of the standard quantitative measures that we are likely to use. Those that aren't built in are easy to add.

Basic numerical measures for a data set X1, X2, X3, ..., Xn, stored in an R variable named "x":

Statistic | Definition | R Function
mean (average) | (X1+X2+...+Xn)/n | mean(x)
median (middle value) | 50th percentile, i.e., a value M such that 50% of the data are less than M and 50% are greater than M | median(x)
minimum | value of the smallest data point | min(x)
maximum | value of the largest data point | max(x)
p-th quantile | a value Q such that p*100% of the data are less than Q and (1-p)*100% are greater than Q; special cases: median = .50 quantile, 1st quartile (Q1) = .25 quantile, 3rd quartile (Q3) = .75 quantile | quantile(x,p)
5-number summary | min, Q1, median, Q3, max | quantile(x)
variance | mean squared deviation from the mean (R divides by n-1 rather than n) | var(x)
standard deviation | square root of the variance; a "typical" deviation from the mean | sd(x)
order statistics | data in ascending order | sort(x)
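
As a quick check of these definitions, the sketch below computes the mean, variance, and standard deviation by hand and compares them with the built-in functions. It reuses the six values entered earlier as z; note that var divides by n-1, not n.

```r
x <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)   # the values entered earlier as z
n <- length(x)

sum(x) / n                        # same as mean(x)
sum((x - mean(x))^2) / (n - 1)    # same as var(x): note the n - 1 divisor
sqrt(var(x))                      # same as sd(x)
quantile(x, 0.25)                 # 1st quartile (Q1)
quantile(x)                       # 5-number summary
```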


We can easily add functions, for example to compute the inter-quartile range (IQR) = Q3-Q1:

iqr <- function(x) {
  # computes the inter-quartile range of a data set
  r <- quantile(x, c(0.25, 0.75))
  r[2] - r[1]
}

In the iqr function defined above, the variable r will be a vector with 2 elements, the first and third quartiles. The final expression of the function is the value returned, in this case the difference between the two quartiles.
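
A short usage sketch follows, applying the function to the six values entered earlier as z. R also happens to have a built-in IQR function (upper case), which gives a convenient check on our own version.

```r
iqr <- function(x) {
  # computes the inter-quartile range of a data set
  r <- quantile(x, c(0.25, 0.75))
  r[2] - r[1]
}

z <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)
iqr(z)            # the result carries the name "75%" from the quantile vector
unname(iqr(z))    # the same number with the name dropped
IQR(z)            # R's built-in function gives the same value
```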

There are also functions that are meant for two or more variables, such as correlation:

cor(babies$bwt, babies$gestation)
cor(babies)

Additional Practice with Data in R

Work through the OpenIntro Introduction to Data lab to better familiarize yourself with how to work with data in base R.