---
title: |
    | STAT 408 - Week 4:
    | Tidy Data, Data Manipulation, and Processing
date: "February 2, 2018"
output:
  beamer_presentation:
    theme: "PaloAlto"
    fonttheme: "structuresmallcapsserif"
---


```{r setup, include=FALSE}
library(knitr)
library(formatR)
library(XML)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before) 
    return(options$size)
})
```
# Tidy Data

## Rules for Tidy Data
The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data.

1. Each variable must have its own column,
2. Each observation must have its own row, and
3. Each value must have its own cell.

## Rules for Tidy Data

![Visual Representation of Tidy Data. Source: R4DS ](images/tidy.png)


## Why use Tidy Data

*Tidy datasets are all alike, but every messy dataset is messy in itw own way. - Hadley Wickham*

- Storing data in a consistent way gives familiarity with methods for manipulating data.
- Tidy data structure takes advantage of vectorised operations in R
- Many useful packages `dplyr` `ggplot2` require tidy data.


# Baltimore Towing Data

## An Overview
The first example focuses on a subset of a dataset that contains information on vehicles towed in Baltimore, MD:

- A larger version of this dataset along with additional descriptions can be found at:  [https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn](https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn). 

- The full version of the dataset contains 61,000 rows and 36 columns, where each row corresponds to a vehicle and the columns are information pertaining to the vehicle.

- We will be working with a smaller dataset with approximately 30,000 rows and 5 columns.

## The dataset
First read in the data set which is available at: [http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv](http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv).
```{r, mysize=TRUE, size='\\scriptsize'}
baltimore.tow <- 
  read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv', 
           stringsAsFactors = F)
str(baltimore.tow)
```

## Information for a few vehicles

```{r, mysize=TRUE, size='\\tiny'}
kable(head(baltimore.tow, 20))
```

# dplyr package

## dplyr package
We have a set of tools to complete the analysis, but we are going to use a packaged called `dplyr` to do the data transformation and aggregation for this problem.

There is a nice cheat sheet on the functionality of `dplyr` which a link can be accessed on the course webpage.

## piping %>%

The pipe operator is an important part of the dplyr toolkit.

- the pipe operator can be read as "THEN"
- the pipe operator allows linking of other dplyr functionality

## filter()

The `filter()` command is another way to carry out subsetting.

\scriptsize
```{r}
baltimore.tow %>% filter(vehicleType == 'Van') %>% head
```


## summarize() or summarise()

Summarize is a way to apply functions in a vectorized manner. Below we calculate the average cost of towing vans.

```{r, mysize=TRUE, size='\\scriptsize',error=T}
baltimore.tow$totalNumeric <- 
  as.numeric(substr(baltimore.tow$totalPaid, 
                    start = 2, stop=nchar(baltimore.tow)))
baltimore.tow %>% filter(vehicleType == 'Van') %>% 
  summarize(mean.cost = mean(totalNumeric))
```

## Exercise: group_by()

Now also use the group by procedure to compute the average towing cost for all vehicle types.

## Solution: group_by()

Now also use the group by procedure to compute the average towing cost for all vehicle types.

```{r, mysize=TRUE, size='\\scriptsize'}
baltimore.tow %>% 
  group_by(vehicleType) %>% 
  summarize(mean.cost = mean(totalNumeric))
```

## tibbles 

- tibbles are a modern rethinking of data.frame() in R.
- tibbles can be coerced into a data.frame with `as.data.frame()`
- tibbles only print first 10 rows when printing and also includes column report type

# Data Wrangling

## Data Wrangling
As a statistician or more generally a data scientist the ability to manipulate, process, clean, and merge datasets is an essential skill. 

- These skills are generally referred to as data wrangling or munging.
- In a data analysis or visualization setting, they will undoubtedly require a majority of your time.
- Wrangling data can be a painful process.
- This lecture will provide some tools and example of organizing data.

## Goal 1: Vehicles Towed by Year
The first goal is to determine how many vehicles were towed for each year in the data set. 

- Given that the we don't have a column for year and the first observation for receiving date is "`r baltimore.tow$receivingDateTime[1]`".
- Describe the process for obtaining this information.
- What R functions are you familiar with that might be useful here?


##  `substr()` function
Consider adding a column for year to the data set. This can be done using `substr().`

*Usage*: substr(x, start, stop)

*Arguments*: 

- x, text	a character vector.
- start, first integer. The first element to be extracted
- stop, last	 integer. The last element to be extracted

## Exercise: Using the substr() function
Use the substr() function to extract year and create a new variable in R. 
```{r, mysize=TRUE, size='\\scriptsize'}
# baltimore.tow$Year <- 
  
```


## Solution: Using the substr() function

```{r, mysize=TRUE, size='\\scriptsize'}
baltimore.tow$Year <- substr(baltimore.tow$receivingDateTime,7,10)
head(baltimore.tow$Year)
baltimore.tow$YearNumeric <- as.numeric(baltimore.tow$Year)
head(baltimore.tow$YearNumeric)
```

## strsplit() function
In many situations, the year could be in a different position so the `substr()` might not work. For example month the date could be coded `4/1/2015` rather than `04/01/2015` So consider, using `strsplit()` instead.

*Usage*: strsplit(x, split)

*Arguments*:

- x: character vector, each element of which is to be split. Other inputs, including a factor, will give an error.
- split: character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. 

## strsplit() function
So, what do we need to split on to isolate year, given that the value follows the form "`r baltimore.tow$receivingDateTime[1]`"?

It is also useful to know that:

- `strsplit` requires a character vector and a factor will return an error and 
- the output of this function is a list, where each string in the vector is an element in the list and each element contains a vector of the pieces that have been split.


## strsplit() function
First split on '/': 
```{r, mysize=TRUE, size='\\footnotesize'}
pieces <- strsplit(as.character(
  baltimore.tow$receivingDateTime[1]), '/')
pieces
```
Notice the brackets, that means this returns a list

## strsplit function
```{r, mysize=TRUE, size='\\footnotesize'}
split <- function(string.in, split.char, position){
  # function to split and retain a portion of output
  # ARGS: string - string so split
  #       split - value to split on
  #       position - position in string to retain
  # RETURNS: character string
  return(strsplit(string.in, split = split.char)[[1]][[position]])
}

temp.date <- 
  sapply(baltimore.tow$receivingDateTime, 
         split, split.char = '/', position = 3)
as.character(temp.date[1])
```

## Exercise: strsplit function 
Now we can extract year from this chunk of code contained in pieces.mat.
```{r, size='tiny'}
#baltimore.tow$Year <- 
```

## Solution: strsplit function 
Now we can extract year from this chunk of code contained in pieces.mat.
```{r, size='tiny'}
baltimore.tow$Year <- sapply(temp.date, split, 
               split.char = ' ', position = 1)
```

## Goal 2. Type of Vehicles Towed by Month
Next we wish to compute how many vehicles were towed in the AM and PM for each type of vehicle.

However, we want to take a close look at the vehicle types in the data set and perhaps create more useful groups.

## Unique function - how to group vehicles
First examine the unique types of vehicles in this data set.

```{r, mysize=TRUE, size='\\footnotesize'}
unique(baltimore.tow$vehicleType)
```

## Grouping
First consider reasonable groups for vehicle types.

1. Cars - (Car, convertible)
2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
3. Trucks - (Pick-up Truck, Pickup Truck)
4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)
6. Misc (delete) - (Boat, Golf Cart, Trailer)

## Messy Data: Grouping
Next examine values in some of these groups, we will just look at the vehicle type of 'Truck'.

```{r, mysize=TRUE, size='\\scriptsize', tidy=T}
unique(baltimore.tow$vehicleMake[baltimore.tow$vehicleType == 'Truck'])
```
\normalsize
Note that there are several spelling errors in this data set. How do we combine them?


## Messy Data: Data Cleaning
Spelling errors can be addressed, by reassigning vehicles to the correct spelling.
```{r, mysize=TRUE, size='\\scriptsize'}
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Peterbelt'] <- 'Peterbilt'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Izuzu'] <- 'Isuzu'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Frightliner'] <- 'Freightliner'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Internantional'] <- 'International'
```
\normalsize
Also note that many of the groupings have mis-classified vehicles, but we will not focus on that yet.

## Exercise: Delete Misc. Type Vehicles

First we will delete golf carts, boats, and trailers. There are several ways to do this, consider making a new data frame called balt.tow.small that does not include golf carts, boats, and trailers.

```{r, eval=F}
balt.tow.small <-
  
  
```

## Solution: Delete Misc. Type Vehicles

First we will delete golf carts, boats, and trailers.

```{r, mysize=TRUE, size='\\small'}
balt.tow.small <- subset(baltimore.tow,!vehicleType %in%
    c('Golf Cart','Trailer','Boat'))
```

## Exercise: Create Additional Groups

Now we need to create a variable for the additional groups below.

1. Cars - (Car, convertible)
2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
3. Trucks - (Pick-up Truck, Pickup Truck)
4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)


## Solution: Create Additional Groups

One way to create groups is by creating a new variable

```{r, mysize=TRUE, size='\\scriptsize'}
balt.tow.small$Group <- ''

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
                       c('Car','Convertible')] <- 'Cars'

balt.tow.small$Group[balt.tow.small$vehicleType %in% c('SUV',
 'Station Wagon','Sport Utility Vehicle','Van','Taxi')] <- 'Large Cars'

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
                       c('Pick-up Truck','Pickup Truck')] <- 'Trucks'

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
    c('Truck','Tractor Trailer','Tow Truck','Tractor',
      'Construction Equipment','Commercial Truck')] <- 'Large Trucks'

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
    c('Motor Cycle (Street Bike)','Dirt Bike','Mini-Bike',
      'All terrain - 4 wheel bike')] <- 'Bikes'
```


## Ready for Calculations?
First we need to extract the AM/PM tag from the time-date character string. Recall the general structure of this variable is `r balt.tow.small$receivingDateTime[1]`.

As the tag that we are looking for falls at the end of the string, we can use `nchar()` to find the length of the string.
```{r, mysize=TRUE, size='\\footnotesize'}
unique(nchar(as.character(balt.tow.small$receivingDateTime)))
balt.tow.small$Time <- 
  substr(balt.tow.small$receivingDateTime,21,22)
```

## Solution: Aggregate
We could use aggregate, as such:
```{r, mysize=TRUE, size='\\footnotesize'}
towed.counts <- aggregate(rep(1,nrow(balt.tow.small)),
      by=list(balt.tow.small$Group,balt.tow.small$Time),sum)
colnames(towed.counts)
colnames(towed.counts) <- c('VehicleGroup','Time','VehicleCount')
```

## Solution: Aggregate Results
In this case, `sort()` is not the function we are looking for, rather we use order. Order returns integers for the 'ordering' based on the variables.
```{r, mysize=TRUE, size='\\scriptsize'}
towed.counts[order(towed.counts$VehicleGroup,towed.counts$Time),]
```

## Solution: dplyr
We could use aggregate, as such:
```{r, mysize=TRUE, size='\\footnotesize'}
balt.tow.small %>% group_by(Time, Group) %>% tally()
```