---
title: "STAT 408 - R Overview"
date: "January 11, 2018"
output:
  beamer_presentation:
    theme: "PaloAlto"
    fonttheme: "structuresmallcapsserif"
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
# Chunk hook: when a chunk sets mysize=TRUE, emit its `size` option
# (e.g. '\\tiny') before the chunk to shrink the font on the slide.
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before)
    return(options$size)
})
```

# R Intro

## Why use R?

R is:

- a free, open-source implementation of the S language,
- the standard among (academic) professional statisticians,
- available for Windows, Mac, and Linux,
- a language that supports both object-oriented and functional programming, and
- designed to connect to high-performance programming languages like C and Fortran.

## Why use R?

R has:

- an open-source environment with a large community that makes getting help easy,
- a massive set of packages for statistical modeling, data science, visualization, and importing and manipulating data,
- powerful tools for replicating and communicating your results, and
- an interactive development environment (RStudio) tailored for interactive data analysis and statistical programming.

## Reading Data files

The ability to read datasets into R is an essential skill. For this class, most of the files will be on the course webpage and can be read directly from their URL using `read.csv`.

Consider a dataset available at: [http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv](http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv)

```{r read.data, mysize=TRUE, size='\\tiny'}
Seattle <- read.csv(
  'http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv',
  stringsAsFactors = FALSE)
```

## Viewing Data files

A common function that we will use is `head`, which shows the first few rows of a data frame.
```{r, mysize=TRUE, size='\\scriptsize'}
head(Seattle)
```

# R Data Structures

## Data structure Overview

R has four common types of data structures:

- Vectors
- Matrices (and Arrays)
- Lists
- Data Frames

## Data structure Overview

The base data structures in R can be organized by dimensionality and by whether they are homogeneous (all elements share one type) or heterogeneous.

Dimension | Homogeneous | Heterogeneous
------------- | ------------- | -------------
1d | Vector | List
2d | Matrix | Data Frame
nd | Array |

## Vector Types

There are four common types of vectors: logical, integer, double (or numeric), and character. The `c()` function is used for combining elements into a vector.

```{r vectors}
dbl <- c(1,2.5,pi)
int <- c(1L,4L,10L)
log <- c(TRUE,FALSE,F,T)
char <- c('this is','a character string')
```

## Vector Types

The type of a vector can be identified using the `typeof()` function. Note that a vector holds only a single data type; combining elements of different types coerces them all to the most flexible type.

```{r vector.type}
typeof(dbl)
comb <- c(char,dbl)
typeof(comb)
comb
```

## Exercise: Vectors

Create a vector with your first, middle, and last names.

## Solution: Vectors

1. Create a vector with your first, middle, and last names.

```{r}
andy.names <- c("Andrew","Blake","Hoegh")
andy.names
```

## Data Frame Overview

A data frame:

- is the most common way of storing data in R,
- is like a matrix with a row-and-column structure; however, unlike a matrix, each column may have a different mode, and
- is, in a technical sense, a list of equal-length vectors.

```{r df}
df <- data.frame(x = 1:3, y = c('a','b','c'))
kable(df)
```

# Subsetting

## Vector Subsetting: I

Subsetting allows you to extract certain elements from a data frame or vector (or matrix, array, list).

```{r subset, mysize=TRUE, size='\\small'}
num.vec <- seq(from = 1, to = 9, by = 1); num.vec
num.vec[1:3]
num.vec[c(1,5,8)]
num.vec[-5]
```

## Vector Subsetting: II

Subsetting also works with logical values or expressions.
```{r subset2}
num.vec[num.vec > 5]
num.vec[num.vec != 6]
num.vec[rep(c(TRUE,FALSE,TRUE),each=3)]
```

## Data Frame Subsetting: I

The same ideas apply to data frames, but the indices now refer to the rows and columns of the data frame.

```{r subset.df}
df <- data.frame(x=1:3, y=3:1, z=c('a','b','c'))
df[,1]
df[-1,c(2:3)]
```

## Data Frame Subsetting: II

There are also a couple of built-in functions in R for subsetting data frames.

```{r subset.df2}
df$x
new.df <- subset(df, x > 1); new.df
```

## Exercise: Subsetting

1. Create a new data frame that only includes houses worth more than \$1,000,000.
2. (bonus) From this new data frame, what is the average living square footage of the houses? Hint: columns in a data frame can be accessed by name, e.g. `Seattle$sqft_living`.

## Exercise: Subsetting - Solutions

1. Create a new data frame that only includes houses worth more than \$1,000,000.

```{r, mysize=TRUE, size='\\tiny'}
expensive.houses <- subset(Seattle, price > 1000000)
```

2. (bonus) From this new data frame, what is the average living square footage of the houses? Hint: columns in a data frame can be accessed by name, e.g. `Seattle$sqft_living`.

```{r}
mean(expensive.houses$sqft_living)
```

# Graphics

## Basic Plotting in R: Scatterplot

Later in the course, we will spend considerable time on graphics. For now, let's consider some of the basic functionality in R.
```{r plot, fig.align='center', fig.width=4, fig.height=3}
plot(Seattle$price~Seattle$sqft_living)
```

## Basic Plotting in R: labels

```{r plot2, fig.align='center', fig.width=4, fig.height=3.25, echo=TRUE}
plot(Seattle$price~Seattle$sqft_living, ylab='Price',xlab='Living Sqft')
```

## Basic Plotting in R: pch

```{r plot3, fig.align='center', fig.width=4, fig.height=3.25, echo=TRUE, mysize=TRUE, size='\\small'}
plot(Seattle$price~Seattle$sqft_living, ylab='Price',xlab='Living Sqft', pch=16)
```

## Basic Plotting in R: color

```{r plot3b, fig.align='center', fig.width=4, fig.height=3.25, echo=TRUE, mysize=TRUE, size='\\small'}
plot(Seattle$price~Seattle$sqft_living, pch=16,
     col=rgb(0,0,.3,.3),ylab='Price',xlab='Living Sqft')
```

## Basic Plotting in R: title

```{r plot4, fig.align='center', fig.width=4, fig.height=3.25, echo=TRUE, mysize=TRUE, size='\\footnotesize'}
plot(Seattle$price~Seattle$sqft_living, pch=16, ylab='Price',
     xlab='Living Sqft',main='Price vs. Living Sqft')
```

## Basic Plotting in R: histogram

```{r plot5, fig.align='center', fig.width=4, fig.height=3.25, mysize=TRUE, size='\\footnotesize'}
hist(Seattle$price,xlab='Price', breaks='FD')
```

## Basic Plotting in R: boxplot

```{r plot6, fig.align='center', fig.width=4, fig.height=3.25, mysize=TRUE, size='\\footnotesize'}
boxplot(Seattle$price~Seattle$bedrooms,ylab='Price', col='red',
        xlab='bedrooms',main='Price by Bedrooms for Seattle')
```

## Exercise: Basic Plot

- Using only the subset of homes worth more than a million dollars, create a graphic.
## Solution: Basic Plot

```{r, fig.align='center', fig.width=4, fig.height=3.25, echo=FALSE}
boxplot(expensive.houses$price~expensive.houses$bedrooms,ylab='Price',
        col='red', xlab='bedrooms',main='Price by Bedrooms for Seattle',
        sub='For homes worth more than $1,000,000')
```

## Solution: Basic Plot - with Code

```{r, fig.align='center', fig.width=4, fig.height=3.25, echo=TRUE, eval=FALSE}
boxplot(expensive.houses$price ~ expensive.houses$bedrooms,
        ylab='Price', col='red', xlab='bedrooms',
        main='Price by Bedrooms for Seattle',
        sub='For homes worth more than $1,000,000')
```
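
## Sketch: Matrices, Arrays, and Lists

The data structure overview earlier in these slides lists matrices, arrays, and lists, but only vectors and data frames are shown in code. The chunk below is a minimal sketch of the other three. The object names (`m`, `a`, `l`) and all values are made up for illustration and are not part of the course data.

```{r structures.sketch}
# Hypothetical values, chosen only to illustrate each structure
m <- matrix(1:6, nrow = 2, ncol = 3)       # 2d, homogeneous
a <- array(1:24, dim = c(2, 3, 4))         # nd, homogeneous
l <- list(id = 1L, name = 'a', ok = TRUE)  # 1d, heterogeneous

dim(m)             # number of rows and columns
dim(a)             # extent of each of the three dimensions
typeof(m)          # one type shared by every element
sapply(l, typeof)  # each list element keeps its own type
```

Comparing `typeof(m)` with `sapply(l, typeof)` shows the homogeneous versus heterogeneous split from the overview table.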