---
title: "STAT 408 - R Overview"
date: "January 11, 2018"
output:
  beamer_presentation:
    theme: "PaloAlto"
    fonttheme: "structuresmallcapsserif"
---


```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before) 
    return(options$size)
})
```
# R Intro

## Why use R?
R is:

- a free, open-source implementation of the S language,
- the standard among (academic) professional statisticians,
- available for Windows, Mac, and Linux,
- an object-oriented and functional programming language, and
- designed to connect to high-performance programming languages like C and Fortran.


## Why use R?
R has:

- an open-source environment with a large community that makes getting help easy,
- a massive set of packages for statistical modeling, data science, visualization, and importing and manipulating data,
- powerful tools for replicating and communicating your results,
- an integrated development environment (RStudio) tailored for interactive data analysis and statistical programming,
- versions available for Windows, Mac, and Linux, and
- an object-oriented and functional programming structure.


## Reading Data files
The ability to read datasets into R is an essential skill. For this class, most of the files will be on the course webpage and can be read directly from their URL using `read.csv`. Consider a dataset available at: [http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv](http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv)

```{r read.data, mysize=TRUE,size = '\\tiny'}
Seattle <- read.csv(
  'http://math.montana.edu/ahoegh/teaching/stat408/datasets/SeattleHousing.csv', 
  stringsAsFactors = FALSE)
```
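
## Reading Data files: a local sketch

The same `read.csv` call also works on local files and text input. The sketch below is a minimal illustration that assumes a small made-up CSV (the column names and values are hypothetical, not taken from the Seattle file), so it runs without a network connection.

```{r read.sketch, mysize=TRUE, size = '\\footnotesize'}
# Hypothetical CSV contents, written inline for illustration only
csv.text <- "price,bedrooms,city
350000,3,Seattle
525000,4,Bellevue"

# read.csv accepts a file path, a URL, or (via text =) a character string
toy.housing <- read.csv(text = csv.text, stringsAsFactors = FALSE)

str(toy.housing)   # column types are inferred from the data
nrow(toy.housing)  # number of rows read
```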

## Viewing Data files
A common function that we will use is `head`, which shows the first few rows of a data frame.

```{r, mysize=TRUE, size = '\\scriptsize'}
head(Seattle)
```
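
## Viewing Data files: related functions

A few other base R functions complement `head`. The sketch below uses the built-in `mtcars` data set as a stand-in (an assumption made so the example runs on its own; the same calls should apply to `Seattle`).

```{r view.sketch, mysize=TRUE, size = '\\scriptsize'}
head(mtcars, n = 3)  # first 3 rows instead of the default 6
tail(mtcars, n = 2)  # last 2 rows
dim(mtcars)          # number of rows, then number of columns
str(mtcars)          # compact summary of each column's type
```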

# R Data Structures
## Data structure Overview
R has four common types of data structures:

- Vectors
- Matrices (and Arrays) 
- Lists
- Data Frames

## Data structure Overview
The base data structures in R can be organized by dimensionality and by whether they are homogeneous (all elements of the same type) or heterogeneous.

Dimension     | Homogeneous    | Heterogeneous
------------- | -------------  | -------------
1d            | Vector         | List
2d            | Matrix         | Data Frame
nd            | Array          | 
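
## Data structure Overview: a sketch

One way to make the table concrete is to build a small object of each kind. This is a minimal sketch; all object names (`ex.vec`, `ex.list`, and so on) are hypothetical and chosen only for illustration.

```{r structures.sketch, mysize=TRUE, size = '\\footnotesize'}
ex.vec  <- c(1, 2, 3)                            # 1d, homogeneous
ex.list <- list(1, "a", TRUE)                    # 1d, heterogeneous
ex.mat  <- matrix(1:6, nrow = 2)                 # 2d, homogeneous
ex.arr  <- array(1:24, dim = c(2, 3, 4))         # n-dimensional, homogeneous
ex.df   <- data.frame(x = 1:2, y = c("a", "b"))  # 2d, heterogeneous

dim(ex.mat)              # 2 rows and 3 columns
dim(ex.arr)              # three dimensions: 2 x 3 x 4
sapply(ex.list, typeof)  # each list element keeps its own type
sapply(ex.df, class)     # each column has its own class
```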


## Vector Types
There are four common types of vectors: logical, integer, double (or numeric), and character. The `c()` function combines elements into a vector.

```{r vectors}
dbl <- c(1,2.5,pi)
int <- c(1L,4L,10L)
log <- c(TRUE,FALSE,F,T)
char <- c('this is','a character string')
```

## Vector Types
The type of a vector can be identified using the `typeof()` function. Note that a vector can hold only a single data type, so mixed elements are coerced to a common type.
```{r vector.type}
  typeof(dbl)
  comb <- c(char,dbl)
  typeof(comb)
  comb
```
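
## Vector Types: coercion sketch

When types are mixed inside `c()`, R coerces the elements to the most flexible type present (logical, then integer, then double, then character). A minimal sketch of that ordering:

```{r coercion.sketch}
typeof(c(TRUE, 1L))    # logical + integer  -> "integer"
typeof(c(1L, 2.5))     # integer + double   -> "double"
typeof(c(2.5, "a"))    # double + character -> "character"

# Logical values become 1 and 0 when treated as numbers
as.integer(c(TRUE, FALSE, TRUE))
sum(c(TRUE, FALSE, TRUE))   # counts the TRUE values: 2
```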


## Exercise: Vectors

Create a vector with your first, middle, and last names.


## Solution: Vectors

1. Create a vector with your first, middle, and last names.
```{r}
andy.names <- c("Andrew","Blake","Hoegh")
andy.names
```


## Data Frame Overview
A data frame:

- is the most common way of storing data in R,
- has a row-and-column structure like a matrix; however, unlike a matrix, each column may have a different type, and
- is, in a technical sense, a list of equal-length vectors.

```{r df}
df <- data.frame(x = 1:3, y = c('a','b','c'))
kable(df)
```
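
## Data Frame Overview: list sketch

The "list of equal-length vectors" description can be checked directly. The sketch below uses a hypothetical data frame named `ex.df2`.

```{r df.list.sketch}
ex.df2 <- data.frame(id = 1:3, label = c("a", "b", "c"),
                     stringsAsFactors = FALSE)
is.list(ex.df2)         # TRUE: a data frame is built on a list
length(ex.df2)          # 2: one list element per column
sapply(ex.df2, typeof)  # each column keeps its own type
ex.df2[["label"]]       # list-style extraction of one column
```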

# Subsetting
 
## Vector Subsetting: I
Subsetting allows you to extract certain elements from a data frame or vector (or matrix, array, list).

```{r subset,mysize=TRUE,size = '\\small'}
num.vec <- seq(from = 1, to = 9, by = 1); num.vec
num.vec[1:3]
num.vec[c(1,5,8)]
num.vec[-5]
```
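
## Vector Subsetting: names and order

Positions are not the only way to index a vector. A minimal sketch, assuming a hypothetical named vector `ex.scores`:

```{r subset.names.sketch}
ex.scores <- c(a = 10, b = 20, c = 30, d = 40)
ex.scores[c("a", "d")]   # select by name
ex.scores[-c(1, 2)]      # drop the first two elements
ex.scores[c(4, 1)]       # indices also control the order
ex.scores[6]             # an out-of-range index returns NA
```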

## Vector Subsetting: II
Subsetting also works with logical values or expressions.
```{r subset2}
num.vec[num.vec > 5]
num.vec[num.vec != 6]
num.vec[rep(c(TRUE,FALSE,TRUE),each=3)]
```
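
## Vector Subsetting: logical sketch

A comparison such as `x > 4` first produces a logical vector, and that vector then selects the elements. The sketch below separates the two steps, using a hypothetical vector `ex.vals`.

```{r subset.logical.sketch, mysize=TRUE, size = '\\footnotesize'}
ex.vals <- c(3, 8, 1, 9, 5)
ex.keep <- ex.vals > 4      # logical vector, one element per value
ex.keep
ex.vals[ex.keep]            # 8 9 5
which(ex.keep)              # positions of the TRUE values: 2 4 5
ex.vals[ex.vals > 4 & ex.vals < 9]   # combine conditions with &: 8 5
ex.vals[ex.vals %in% c(1, 3)]        # match against a set of values: 3 1
```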

## Data Frame Subsetting: I
The same ideas apply to data frames, but two indices are now used: the first selects rows and the second selects columns.
```{r subset.df}
df <- data.frame(x=1:3, y=3:1, z=c('a','b','c'))
df[,1]
df[-1,c(2:3)]
```
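
## Data Frame Subsetting: rows by condition

Logical expressions can be placed in the row position of the brackets. A minimal sketch with a hypothetical data frame `ex.tbl`:

```{r subset.df.sketch, mysize=TRUE, size = '\\footnotesize'}
ex.tbl <- data.frame(x = 1:4, y = c(2.5, 1.0, 4.2, 3.3))
ex.tbl[ex.tbl$y > 2, ]         # rows where y exceeds 2, all columns
ex.tbl[ex.tbl$y > 2, "x"]      # a single column comes back as a vector
ex.tbl[, "x", drop = FALSE]    # drop = FALSE keeps the data frame structure
ex.tbl[c(1, 3), c("y", "x")]   # select and reorder by position and name
```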

## Data Frame Subsetting: II
There are also a couple of built-in tools in R for subsetting data frames: the `$` operator and the `subset()` function.
```{r subset.df2}
df$x
new.df <- subset(df, x > 1); new.df
```
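
## Data Frame Subsetting: more on subset()

`subset()` can filter rows and choose columns in one call through its `select` argument. The sketch below assumes a hypothetical data frame `ex.tbl2`.

```{r subset.select.sketch, mysize=TRUE, size = '\\footnotesize'}
ex.tbl2 <- data.frame(x = 1:4, y = 4:1, z = c("a", "b", "c", "d"))
ex.tbl2[["y"]]                            # same result as ex.tbl2$y
subset(ex.tbl2, x > 1 & y > 1)            # combine row conditions
subset(ex.tbl2, x > 2, select = c(x, z))  # filter rows and pick columns
```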

## Exercise: Subsetting
1. Create a new data frame that only includes houses worth more than $1,000,000.

2. (bonus) From this new data frame, what is the average living square footage of the houses? Hint: columns in a data frame can be accessed with `$`, for example `Seattle$sqft_living`.

## Exercise: Subsetting - Solutions
1. Create a new data frame that only includes houses worth more than $1,000,000.
```{r, mysize=TRUE, size = '\\tiny'}
expensive.houses <- subset(Seattle, price > 1000000)
```

2. (bonus) From this new data frame, what is the average living square footage of the houses? Hint: columns in a data frame can be accessed with `$`, for example `Seattle$sqft_living`.
```{r}
mean(expensive.houses$sqft_living)
```
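
## Exercise: Subsetting - bracket alternative

The same task can also be done with bracket indexing. The sketch below uses a toy data frame with made-up values (hypothetical, not from the Seattle data) so the arithmetic can be checked by hand.

```{r subset.toy.sketch, mysize=TRUE, size = '\\footnotesize'}
# Hypothetical values for illustration only
toy.houses <- data.frame(price = c(450000, 1200000, 2500000, 800000),
                         sqft_living = c(1500, 3200, 4800, 2100))
toy.expensive <- toy.houses[toy.houses$price > 1000000, ]
nrow(toy.expensive)               # 2 houses exceed $1,000,000
mean(toy.expensive$sqft_living)   # (3200 + 4800) / 2 = 4000
```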


# Graphics

## Basic Plotting in R: Scatterplot
Later in the course, we will spend considerable time on graphics. For now, let's consider some of the basic functionality in R.

```{r plot,fig.align='center',fig.width=4, fig.height=3}
plot(Seattle$price~Seattle$sqft_living)
```

## Basic Plotting in R: labels
```{r plot2,fig.align='center',fig.width=4, fig.height=3.25,echo=T}
plot(Seattle$price~Seattle$sqft_living, 
     ylab='Price',xlab='Living Sqft')
```

## Basic Plotting in R: pch
```{r plot3,fig.align='center',fig.width=4, fig.height=3.25, echo=T, mysize=TRUE, size = '\\small'}
plot(Seattle$price~Seattle$sqft_living, 
     ylab='Price',xlab='Living Sqft', pch=16)
```

## Basic Plotting in R: color
```{r plot3b,fig.align='center',fig.width=4, fig.height=3.25, echo=T, mysize=TRUE, size = '\\small'}
plot(Seattle$price~Seattle$sqft_living, pch=16,
     col=rgb(0,0,.3,.3),ylab='Price',xlab='Living Sqft')
```
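
## Basic Plotting in R: rgb sketch

The `rgb()` function takes red, green, and blue values between 0 and 1, plus an optional fourth value for opacity, and returns a color string. Semi-transparent points help when many observations overlap. The sketch below uses simulated data (hypothetical, not the Seattle data).

```{r rgb.sketch,fig.align='center',fig.width=4, fig.height=2.5, mysize=TRUE, size = '\\footnotesize'}
rgb(1, 0, 0)                   # opaque red: "#FF0000"
rgb(1, 0, 0, 0.5)              # half-transparent red: "#FF000080"
ex.col <- rgb(0, 0, 1, 0.25)   # mostly transparent blue

# Simulated points to show the effect on overplotting
set.seed(408)
ex.x <- rnorm(2000)
ex.y <- ex.x + rnorm(2000)
plot(ex.y ~ ex.x, pch = 16, col = ex.col)
```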


## Basic Plotting in R: title
```{r plot4,fig.align='center',fig.width=4, fig.height=3.25,echo=T, mysize=TRUE, size = '\\footnotesize'}
plot(Seattle$price~Seattle$sqft_living, pch=16, ylab='Price',
     xlab='Living Sqft',main='Price vs. Living Sqft')
```

## Basic Plotting in R: histogram
```{r plot5,fig.align='center',fig.width=4, fig.height=3.25, mysize=TRUE, size = '\\footnotesize'}
hist(Seattle$price,xlab='Price', breaks='FD')
```
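
## Basic Plotting in R: histogram breaks

The `breaks='FD'` argument asks `hist` to choose bins with the Freedman-Diaconis rule instead of the default Sturges rule. The sketch below compares the two bin counts on simulated right-skewed data (hypothetical values, used here as a stand-in for prices).

```{r breaks.sketch,fig.align='center',fig.width=4, fig.height=2.5, mysize=TRUE, size = '\\footnotesize'}
set.seed(408)
ex.price <- rexp(500, rate = 1 / 500000)  # simulated, right-skewed

nclass.Sturges(ex.price)   # bin count suggested by the default rule
nclass.FD(ex.price)        # bin count suggested by Freedman-Diaconis

hist(ex.price, breaks = "FD", xlab = "Simulated price",
     main = "Freedman-Diaconis bins")
```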

## Basic Plotting in R: boxplot
```{r plot6,fig.align='center',fig.width=4, fig.height=3.25, mysize=TRUE, size = '\\footnotesize'}
boxplot(Seattle$price~Seattle$bedrooms,ylab='Price', col='red',
        xlab='bedrooms',main='Price by Bedrooms for Seattle')
```

## Exercise: Basic Plot

- Using only the subset of homes worth more than a million dollars, create a graphic.

## Solution: Basic Plot 

```{r fig.align='center',fig.width=4, fig.height=3.25,echo=F}
boxplot(expensive.houses$price~expensive.houses$bedrooms,ylab='Price', col='red',
        xlab='bedrooms',main='Price by Bedrooms for Seattle',
        sub='For homes worth more than $1,000,000')
```

## Solution: Basic Plot - with Code

```{r fig.align='center',fig.width=4, fig.height=3.25,echo=T, eval=F}
boxplot(expensive.houses$price ~ 
          expensive.houses$bedrooms,
        ylab='Price', col='red', xlab='bedrooms',
        main='Price by Bedrooms for Seattle',
        sub='For homes worth more than $1,000,000')
```