Getting Started in R and RStudio

R is a high-level, object-oriented programming language and a free implementation of the programming language S. By object-oriented, we mean that everything in R is treated as an "object". A data frame is a specific type of object in R; so is a numeric value, or a character value, or a matrix.

We will be using RStudio, which is an integrated development environment (IDE) for R. RStudio is more user-friendly than using R directly since it keeps track of your R script file, console, plots, and history, all in one place.

How to Learn R

There are many free resources online including:

How to Start RStudio

Once R and RStudio are installed on your computer, RStudio opens like any other application. In Windows, go to the Start menu, then find RStudio. On a Mac, open the Applications folder and click on the RStudio icon. It is also useful to create a desktop shortcut to the program.

The R Console

When you open RStudio, you will see the "Console" window and >, R's command prompt. This indicates R is ready to evaluate a command. For example,

```
> sample(1:6,1)
[1] 5
>
```

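The command sample(1:6,1) tells R to take a sample of size 1 from the numbers 1 through 6. R responds with [1] 5. The [1] is an index: it says that the first value printed on that line is element number 1 of the result (for a single value you can ignore it). Then R gives another >, showing that it's ready for another command. Because the sample is random, you will probably see a different number.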

R also has a continuation prompt, +, which appears when the command you entered is incomplete, for example when a closing parenthesis is missing.

```
> sample(1:6,
+ 1)
[1] 5
```

R will not return to the command prompt > until you finish the command that started the continuation prompt.

R Syntax

R has many built-in functions. When we use an R function, the syntax is as follows:

function.name( arg.name=value, ... )

For example, the "rep" function creates a vector of repeated values:

`rep(x=3, times=10)`

The function has two arguments named "x" and "times". We want to repeat the value 3 ten times. If we keep the arguments in the same order, we do not need to type their names:

`rep(3,10)`

But if we don't use the argument names, order matters:

`rep(10,3)`
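A small sketch of how argument matching plays out in these calls. The comparison uses the base function identical(), which is not part of the original examples.

```r
# Named arguments can be given in any order.
rep(times = 10, x = 3)

# Unnamed arguments are matched by position, so these two calls differ:
rep(3, 10)   # the value 3, repeated ten times
rep(10, 3)   # the value 10, repeated three times

# Naming the arguments makes their order irrelevant.
identical(rep(x = 3, times = 10), rep(times = 10, x = 3))   # TRUE
```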

The "#" sign is the comment character. Everything after a # is ignored by R:

`3+5 # R ignores everything I say from now on...`

Make ample use of the "#" sign to put comments in your R script files. This will help you remember what you were doing when you go back to look at your code at a later date, as well as help others understand your code.

Working Directory

Any objects we export from R will be saved in its working directory. We can use the commands setwd() or getwd() to set the working directory or to ask R what working directory it is using. Alternatively, we can click on the "Files" tab to view or change the Working Directory.
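As a rough illustration, the two commands can be combined as follows. The variable name old_dir is made up for this sketch, and tempdir() (the scratch folder R creates for each session) stands in for whatever folder you actually want to use.

```r
old_dir <- getwd()   # remember the current working directory
old_dir              # print it

setwd(tempdir())     # switch to the session's temporary folder
getwd()              # now reports the temporary folder

setwd(old_dir)       # switch back to where we started
```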

Organizing R Code

You should get in the habit of writing your R commands in an R script file before evaluating them in the R console. The benefit will become apparent as you start using for loops and writing functions. From within RStudio, we can open a new "R Script" file by going to File -> New File -> R Script. When you save a script file, you should use the extension .R. Later, if you would like to run the code, you can highlight the code in the script file and click "Run" or, with even less work, you can source the file into R without opening it. For example, if your code is saved as mycode.R, then in the R console, you can type

`source("mycode.R")`

if the file is in your working directory. If the file is located elsewhere on the computer, you can enter the full file path instead. We can even source in code off the web!
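To try source() without creating a file by hand, the sketch below writes a two-line script into R's temporary folder and then sources it by its full path. The names script_path, a_value, and b_value are invented for the illustration.

```r
# Write a two-line script to a temporary file (a stand-in for mycode.R).
script_path <- file.path(tempdir(), "mycode.R")
writeLines(c("a_value <- 2 + 3",
             "b_value <- a_value * 10"),
           script_path)

# source() accepts a full path, so the file need not be in the working directory.
source(script_path)
b_value   # 50
```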

How to Save and Quit R

There are three types of files you might want to save from your RStudio session:

1. R script file (name.R)
2. R workspace file (name.RData)
3. R history file (name.Rhistory)

R does not save what prints in the R console. Your R script file is the R code; the workspace saves an R session with all of the created objects in the workspace; your R history file is a history of the commands that were entered into the console.
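Workspace files can also be written and read from the console. The sketch below saves a single made-up object (scores) to a .RData file in the temporary folder, removes it, and loads it back; save.image() does the same for every object in the workspace at once.

```r
scores <- c(90, 85, 77)                          # an example object (values are made up)
rdata_path <- file.path(tempdir(), "demo.RData")

save(scores, file = rdata_path)   # write the object to a .RData file
rm(scores)                        # remove it from the workspace

load(rdata_path)                  # restore it from the file
scores
```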

Basic R Commands

Here are some examples of basic R commands that you will find useful. Try typing them into the R console (each command followed by RETURN). If you get an error message of the form "Error: object 'y' not found", it may be because you skipped an earlier example which created the object, or because you misspelled the name of the object. Remember that R is case-sensitive!

Object Assignment

```
y <- 4           # assignment: creates an object in your workspace named y
y                # enter the name of an object to see
                 #  its contents
x = 3            # an equals sign also works for assignment, though not recommended
x
```

Basic Mathematics

```
1+1
3-2
5*8
3^2
8/2+6^2
sqrt(81)         # square root function
cos(pi)          # cosine function; pi is a built-in R object
log(100)         # natural log function
log(exp(3.2))    # exp(a) raises the constant e to the ath power
log(100,10)      # log(a,b) is the log of a base b
```

Getting Help

```
help(log)        # the help function
?cos             # shortcut for the help function
```

Creating Data Sets

```
n <- 2:5         # integer sequences
n
x <- seq(1,2,.1) # general sequences using the "seq" function;
x                # writes over the previous "x" object
                 # data entry; the "c" function stands for "combine"
z <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)
z
Z <- runif(6)    # generate 6 random numbers in (0,1)
Z
y <- Z+z         # add two data sets to create a third
y
a <- c("red", "orange", "yellow", "green","blue", "violet")
a
b <- rep(1:2,3)  # repeated values
b
B <- rep(1:2,c(3,3))  # repeated values, take two
B
```

Built-in Data Sets

```
help(data)
data()            # shows you a list of all the built-in data sets
data(co2)         # loads the data set "co2" into your workspace
co2               # print data set
help(co2)
plot(co2)         # plot the time-series
plot(co2,col="red")
```

Data Types

```
x <- seq(1,2,.2)
x      # numeric
w <- c("a","a","b","b","c","c")
w      # character data
b <- factor(w)
b      # factor: a coded categorical data set
X <- data.frame(x,b)
X      # data frame: a table of data
       # rows are cases, columns are variables
check <- w=="a"
check           # logical vector of T/F
sum(check)      # R treats TRUE as a 1 and FALSE as a 0,
                # so summing a logical vector gives the number of TRUE's.
```
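If you are unsure what type an object has, the base functions class(), levels(), and str() will tell you. The sketch below rebuilds the same objects under different, made-up names (x_num, w_chr, b_fac, X_df) so that it can be run on its own.

```r
x_num <- seq(1, 2, .2)
w_chr <- c("a", "a", "b", "b", "c", "c")
b_fac <- factor(w_chr)
X_df  <- data.frame(x_num, b_fac)

class(x_num)          # "numeric"
class(w_chr)          # "character"
class(b_fac)          # "factor"
class(X_df)           # "data.frame"
class(w_chr == "a")   # "logical"
levels(b_fac)         # the distinct categories: "a" "b" "c"
str(X_df)             # compact summary of each column's type
```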

Selecting subsets of data sets and logical operators

```
        # case selection by subscript; square brackets always indicate subsets of an object
z[1]
z[2]
z[2:5]
z[5:2]
z[n]
        # case selection by logical comparison
z > 2
z[z > 2]
a == "blue"
a != "blue"
a != "blue" & a != "orange"
a != "blue" | a != "orange"
        # case selection using another data set of the same size
z[a == "blue"]
z[a != "blue"]
X[1,]   # select first row
X[,1]   # select first column
X[1,2]  # select item in the first row and second column
X$b     # select column with the name "b"
X[X[,2] == "a",]       # select the rows for which column 2 is "a"
```

Operations on data sets

```
2+z
3*z
z^2
2+3*z
sqrt(z) # Note: the square root of a negative number is
        # undefined (NaN: Not a Number)
sort(z)
sort(2+3*z)
2*a     # we can't do arithmetic with all data! (this line gives an error)
```

Probability Distributions in R

```
pnorm(2)                # cdf of standard normal distribution evaluated at z=2 = P(Z <= 2)
pnorm(300, 515, 100)    # cdf of normal distribution with mean 515 and standard deviation 100
                        #       evaluated at 300
?pnorm                  # Explore other R functions used with the normal distribution.
?pt                     # t-distribution
```
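The second call can be checked by standardizing by hand: subtracting the mean and dividing by the standard deviation should give the same probability from the standard normal cdf. A minimal sketch, with p_direct and p_std as made-up names:

```r
p_direct <- pnorm(300, mean = 515, sd = 100)   # P(X <= 300) when X has mean 515, sd 100
p_std    <- pnorm((300 - 515) / 100)           # same probability via the z-score
all.equal(p_direct, p_std)                     # TRUE

qnorm(p_direct, mean = 515, sd = 100)          # qnorm inverts pnorm, so this returns 300
pnorm(2) - pnorm(-2)                           # P(-2 <= Z <= 2), roughly 0.954
```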

More Useful R functions

```
mean(z)
median(z)
max(z)
min(z)
range(z)
sum(z)
length(z)
ls()                    # lists all the objects saved in your workspace
options(digits=20)      # set the digits option to display 20 digits
options(digits=7)       # (default is 7)
```

Data Input: Birth Weights

These data are taken from Stat Labs by Nolan and Speed, and come originally from the Child Health and Development Studies conducted at the Oakland, CA, Kaiser Foundation Hospital. The variables are

1. bwt: baby's weight in ounces at birth
2. gestation: duration of pregnancy in days
3. parity: parity indicator (first born = 1, later birth = 0)
4. age: mother's age in years
5. height: mother's height in inches
6. weight: mother's weight in pounds (during pregnancy)
7. smoke: indicator for whether mother smokes (1=yes, 0=no)

The data will be read into R in "data frame" format: an array of data in which each case (here an individual) corresponds to a row, and each variable corresponds to a column. The row labels for this data frame are just row numbers; the column labels are the names of the variables. Only complete cases are included here.

Use the following command to load the data into your R session:

`babies <- read.table("http://www.math.montana.edu/shancock/courses/stat401/data/Bwt.dat", header=TRUE, sep=",")`

Check that the data were read in correctly:

```
head(babies)
tail(babies)
dim(babies)     # For the dimension of a vector, use the function "length"
names(babies)
is.data.frame(babies)
```
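To see what header=TRUE and sep="," do without downloading anything, read.table can also parse a character string through its text argument. The three rows below are invented for illustration and are not taken from the real file; toy_text and toy are made-up names.

```r
# Three made-up rows in a comma-separated layout (illustrative values only).
toy_text <- "bwt,gestation,smoke
120,284,0
113,282,0
128,279,1"

toy <- read.table(text = toy_text, header = TRUE, sep = ",")
toy          # the first line of the text became the column names
dim(toy)     # 3 rows, 3 columns
names(toy)
```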

We will take bwt to be the response variable. For now, consider gestation as the only predictor. (We will explore this data set in more detail, using more predictor variables, in the future.) The first step in any data analysis should be to explore the data with plots and summary statistics.

Basic Plotting in R

Histograms

Let's take a look at the distribution of gestation periods using a histogram:

`hist(babies$gestation)`

A histogram places each observation into pre-determined "bins" where the height of the bin is the number of observations in that bin. Our histogram doesn't look too good - let's try a different bin size:

`hist(babies$gestation, breaks=40)`

The option breaks=40 asks R to break up the x-axis into about 40 bins (R treats the number as a suggestion and chooses convenient break points near it). Notice that the labels on the vertical axis are counts (frequencies). To look at the "density" histogram instead, use

`hist(babies$gestation, breaks=40, freq=F)`

When you use the freq=F argument in hist(), you are asking for the density histogram, which has total area 1. The area of each bar equals the relative frequency of its interval (the count divided by the total number of observations). For example, in the interval from 280 to 300, the frequency histogram shows a count of 61. The height of that interval in the density histogram is about .00247, and the width is 20. Thus the area for that interval is about .00247*20 = 0.049 (4.9% of the sample), and 0.049*1236 is about 61.
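The relationship between bar height, bar width, and count can be checked in R. The first part of this sketch multiplies the height and width quoted in the text; the second part verifies the "total area 1" property on simulated data (not the babies data), using hist() with plot=FALSE so that nothing is drawn. All variable names are made up.

```r
# Area of one bar = height * width; count = area * number of observations.
height <- 0.00247   # density-histogram height of the 280-300 interval (from the text)
width  <- 20        # width of the interval in days
n_obs  <- 1236      # number of observations (from the text)
area   <- height * width
area                # about 0.049
area * n_obs        # about 61

# Check the total-area-1 property on simulated data.
set.seed(1)
sim <- rnorm(500, mean = 280, sd = 16)
h <- hist(sim, plot = FALSE)          # compute the bins without drawing
sum(h$density * diff(h$breaks))       # total area of all bars: 1
```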

We can also add a smoothed density line to a histogram:

`lines(density(babies$gestation))`

Note that the histogram must be on the density scale in order to add a smoothed density line.

Every plot function accepts various arguments that customize the figure, such as titles, axis labels, or color:

`hist(babies$gestation,breaks=40,col="seagreen",main="Gestation Distribution",xlab="Days")`

For a list of all the colors in R, type:

`colors()`

Box Plots

A boxplot is really not much more than a graphical display of a 5-number summary (min, 1st quartile, median, 3rd quartile, max). The body of the box spans the first and third quartiles, with a line added at the median. The "whiskers", or lines extending out from the box, reach to the furthest observations which are no more than 1.5 times the interquartile range (Q3-Q1) from the quartiles. Outliers are displayed as points or lines beyond the whiskers.
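The numbers behind a boxplot can be inspected with the base functions fivenum() and boxplot.stats(). The data below are made up, with one deliberately large value; x_demo and bs are invented names. Note that R builds the box from "hinges", which can differ slightly from the quartiles returned by quantile().

```r
x_demo <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up data with one large value

fivenum(x_demo)                # min, lower hinge, median, upper hinge, max
bs <- boxplot.stats(x_demo)    # the numbers boxplot() uses (default multiplier 1.5)
bs$stats                       # lower whisker, box edges and median, upper whisker
bs$out                         # values plotted individually as outliers: 30
```

Here the box runs from 3 to 8, so the upper cutoff is 8 + 1.5*5 = 15.5. The whisker stops at 9, the largest observation below that cutoff, and 30 is flagged as an outlier.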

```
boxplot(babies$gestation)
# or horizontal--
boxplot(babies$gestation, horizontal = TRUE)
```

Note that the above plots are missing axis labels. How would you add them?

R will also do side-by-side boxplots which we can use to compare distributions of quantitative variables across categories:

`boxplot(gestation ~ smoke, data=babies, xlab="Smoker (0 = no; 1 = yes)", ylab="Gestation (days)")`

Comparing Two Variables

For the most part, we will be interested not in just one variable, but in the relationship between two or more variables. Depending on whether the variables are quantitative or categorical, this can be done in a variety of ways.

In order to refer to variables directly by name (rather than preceding the variable name with babies$), let's attach the data set:

`attach(babies)`

(Note: Many R coders do not recommend the use of the attach function, since attached variables can mask, or be masked by, other objects with the same name, which leads to confusing results. Make sure you detach the data set after you are finished!)
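One commonly used alternative, sketched here on a small made-up data frame named toy, is the base function with(), which evaluates a single expression using the columns of a data frame. Many modeling and plotting functions also accept a data= argument for the same purpose.

```r
# A small made-up data frame standing in for a real data set.
toy <- data.frame(height = c(62, 64, 65, 70),
                  weight = c(110, 125, 130, 160))

m <- with(toy, mean(weight))     # columns are visible only inside this call
m                                # 131.25
with(toy, cor(height, weight))
fit <- lm(weight ~ height, data = toy)   # data= tells lm where to find the variables
coef(fit)
```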

The "plot" function is a generic plotting function. We can feed it one variable:

`plot(bwt)    # Plots birthweight vs. observation number`

or two variables (scatterplot):

```
plot(bwt ~ gestation)   # The formula is y~x, where x is the x-axis variable and y is the y-axis variable.
  # (Could instead use arguments x,y, e.g., plot(gestation,bwt).)
```

or side-by-side boxplots:

`plot(bwt ~ factor(smoke))   # smoke is coded 0/1, so convert it to a factor to get boxplots`

or an entire data set:

`plot(babies)`

What if we want to compare more than two variables? If they are all quantitative, we would need a 3-D scatterplot. However, if there are two quantitative variables and one categorical variable, we can use a scatterplot of the two quantitative variables with plot symbols denoting the levels of the categorical variable. Let's try this with gestation, bwt, and smoke.

```
plot(bwt~gestation,type="n",main="Bwt vs. Gestation by Smoke",xlab="Gestation (Days)",ylab="Bwt (Ounces)")
     # type="n" tells R just to set up the plot window
     # without plotting the points (think of "n" standing for "nothing")
points(bwt[smoke==0]~gestation[smoke==0],pch=15,col="hotpink")          # Plot nonsmokers
points(bwt[smoke==1]~gestation[smoke==1],pch=18,col="rosybrown")        # Plot smokers
     # Add legend; locator(1) waits for you to click the spot on the plot where the legend should go
legend(locator(1),c("Nonsmoker","Smoker"),pch=c(15,18),col=c("hotpink","rosybrown"))
abline(lm(bwt[smoke==0]~gestation[smoke==0]),col="hotpink")     # Add least squares regression lines
abline(lm(bwt[smoke==1]~gestation[smoke==1]),col="rosybrown")
```

Now that we are finished referring to the variables by name (without the babies$ prefix), let's detach the data set:

`detach(babies)`

Summary Statistics

R has functions built in for most of the standard quantitative measures that we are likely to use. Those that aren't built in are easy to add.

Basic numerical measures for a data set X1, X2, X3,..., Xn, stored in an R variable named "x"

| Statistic | Definition | R Function |
|---|---|---|
| mean (average) | (X1+X2+...+Xn)/n | mean(x) |
| median (middle value) | 50th percentile, i.e., a value M such that 50% of the data are less than M and 50% are greater than M | median(x) |
| minimum | value of the smallest data point | min(x) |
| maximum | value of the largest data point | max(x) |
| p-th quantile | a value Q such that p*100% of the data are less than Q and (1-p)*100% are greater than Q. Special cases: median = 0.50 quantile; 1st quartile (Q1) = 0.25 quantile; 3rd quartile (Q3) = 0.75 quantile | quantile(x,p) |
| 5 number summary | min, Q1, median, Q3, max | quantile(x) |
| variance | mean squared deviation from the mean (R divides by n-1 rather than n) | var(x) |
| standard deviation | square root of the variance; a "typical" deviation from the mean | sd(x) |
| order statistics | data in ascending order | sort(x) |
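Several of the functions listed above can be tried on a small vector. The sketch below reuses the six values entered as z earlier and also checks by hand that var() divides by n-1; manual_var is a made-up name.

```r
x <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)   # same values as z in the earlier examples

quantile(x)          # five-number summary: min, Q1, median, Q3, max
quantile(x, 0.25)    # first quartile only
all.equal(median(x), quantile(x, 0.5, names = FALSE))   # median is the 0.50 quantile

# var() divides the sum of squared deviations by n - 1, not n.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_var, var(x))     # TRUE
all.equal(sd(x), sqrt(var(x)))    # TRUE
```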

We can easily add functions, for example to compute the inter-quartile range (IQR) = Q3-Q1:

```
iqr <- function(x){	# computes the inter-quartile range of a data set
	r <- quantile(x, c(0.25,0.75))
	r[2]-r[1]
}
```

In the iqr function defined above, the variable r will be a vector with 2 elements, the first and third quartiles. The final expression of the function is the value returned, in this case the difference between the two quartiles.
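As a check on the hand-written function, R's stats package already provides IQR(), which should give the same number. The sketch repeats the definition so it can be run on its own; x_demo is a made-up name holding the z values from earlier.

```r
iqr <- function(x){
	r <- quantile(x, c(0.25, 0.75))
	r[2] - r[1]
}

x_demo <- c(2.3, 1.2, 4.4, 4.7, -1.2, 6.3)

iqr(x_demo)            # the result keeps the label "75%" from quantile()
unname(iqr(x_demo))    # same number with the label dropped
IQR(x_demo)            # built-in version; same value
```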

There are also functions that are meant for two or more variables, such as correlation:

```
cor(babies$bwt, babies$gestation)
cor(babies)
```
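To see what cor() computes, the sketch below compares it with the covariance divided by the product of the two standard deviations. The paired values u and v are made up for illustration.

```r
# Made-up paired measurements, for illustration only.
u <- c(1, 2, 3, 4, 5)
v <- c(2.1, 3.9, 6.2, 7.8, 10.1)

r_builtin <- cor(u, v)
r_manual  <- cov(u, v) / (sd(u) * sd(v))   # correlation = covariance / product of SDs
all.equal(r_builtin, r_manual)             # TRUE

cor(data.frame(u, v))   # given a data frame, cor() returns the full correlation matrix
```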