--- title: | | STAT 408 - Week 4: | Tidy Data, Data Manipulation, and Processing date: "February 2, 2018" output: beamer_presentation: theme: "PaloAlto" fonttheme: "structuresmallcapsserif" --- ```{r setup, include=FALSE} library(knitr) library(formatR) library(XML) library(dplyr) knitr::opts_chunk$set(echo = TRUE) knitr::knit_hooks$set(mysize = function(before, options, envir) { if (before) return(options$size) }) ``` # Tidy Data ## Rules for Tidy Data The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data. 1. Each variable must have its own column, 2. Each observation must have its own row, and 3. Each value must have its own cell. ## Rules for Tidy Data ![Visual Representation of Tidy Data. Source: R4DS ](images/tidy.png) ## Why use Tidy Data *Tidy datasets are all alike, but every messy dataset is messy in itw own way. - Hadley Wickham* - Storing data in a consistent way gives familiarity with methods for manipulating data. - Tidy data structure takes advantage of vectorised operations in R - Many useful packages `dplyr` `ggplot2` require tidy data. # Baltimore Towing Data ## An Overview The first example focuses on a subset of a dataset that contains information on vehicles towed in Baltimore, MD: - A larger version of this dataset along with additional descriptions can be found at: [https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn](https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn). - The full version of the dataset contains 61,000 rows and 36 columns, where each row corresponds to a vehicle and the columns are information pertaining to the vehicle. - We will be working with a smaller dataset with approximately 30,000 rows and 5 columns. ## The dataset First read in the data set which is available at: [http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv](http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv). ```{r, mysize=TRUE, size='\\scriptsize'} baltimore.tow <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv', stringsAsFactors = F) str(baltimore.tow) ``` ## Information for a few vehicles ```{r, mysize=TRUE, size='\\tiny'} kable(head(baltimore.tow, 20)) ``` # dplyr package ## dplyr package We have a set of tools to complete the analysis, but we are going to use a packaged called `dplyr` to do the data transformation and aggregation for this problem. There is a nice cheat sheet on the functionality of `dplyr` which a link can be accessed on the course webpage. ## piping %>% The pipe operator is an important part of the dplyr toolkit. - the pipe operator can be read as "THEN" - the pipe operator allows linking of other dplyr functionality ## filter() The `filter()` command is another way to carry out subsetting. \scriptsize ```{r} baltimore.tow %>% filter(vehicleType == 'Van') %>% head ``` ## summarize() or summarise() Summarize is a way to apply functions in a vectorized manner. Below we calculate the average cost of towing vans. ```{r, mysize=TRUE, size='\\scriptsize',error=T} baltimore.tow$totalNumeric <- as.numeric(substr(baltimore.tow$totalPaid, start = 2, stop=nchar(baltimore.tow))) baltimore.tow %>% filter(vehicleType == 'Van') %>% summarize(mean.cost = mean(totalNumeric)) ``` ## Exercise: group_by() Now also use the group by procedure to compute the average towing cost for all vehicle types. ## Solution: group_by() Now also use the group by procedure to compute the average towing cost for all vehicle types. ```{r, mysize=TRUE, size='\\scriptsize'} baltimore.tow %>% group_by(vehicleType) %>% summarize(mean.cost = mean(totalNumeric)) ``` ## tibbles - tibbles are a modern rethinking of data.frame() in R. - tibbles can be coerced into a data.frame with `as.data.frame()` - tibbles only print first 10 rows when printing and also includes column report type # Data Wrangling ## Data Wrangling As a statistician or more generally a data scientist the ability to manipulate, process, clean, and merge datasets is an essential skill. - These skills are generally referred to as data wrangling or munging. - In a data analysis or visualization setting, they will undoubtedly require a majority of your time. - Wrangling data can be a painful process. - This lecture will provide some tools and example of organizing data. ## Goal 1: Vehicles Towed by Year The first goal is to determine how many vehicles were towed for each year in the data set. - Given that the we don't have a column for year and the first observation for receiving date is "`r baltimore.tow$receivingDateTime[1]`". - Describe the process for obtaining this information. - What R functions are you familiar with that might be useful here? ## `substr()` function Consider adding a column for year to the data set. This can be done using `substr().` *Usage*: substr(x, start, stop) *Arguments*: - x, text a character vector. - start, first integer. The first element to be extracted - stop, last integer. The last element to be extracted ## Exercise: Using the substr() function Use the substr() function to extract year and create a new variable in R. ```{r, mysize=TRUE, size='\\scriptsize'} # baltimore.tow$Year <- ``` ## Solution: Using the substr() function ```{r, mysize=TRUE, size='\\scriptsize'} baltimore.tow$Year <- substr(baltimore.tow$receivingDateTime,7,10) head(baltimore.tow$Year) baltimore.tow$YearNumeric <- as.numeric(baltimore.tow$Year) head(baltimore.tow$YearNumeric) ``` ## strsplit() function In many situations, the year could be in a different position so the `substr()` might not work. For example month the date could be coded `4/1/2015` rather than `04/01/2015` So consider, using `strsplit()` instead. *Usage*: strsplit(x, split) *Arguments*: - x: character vector, each element of which is to be split. Other inputs, including a factor, will give an error. - split: character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. ## strsplit() function So, what do we need to split on to isolate year, given that the value follows the form "`r baltimore.tow$receivingDateTime[1]`"? It is also useful to know that: - `strsplit` requires a character vector and a factor will return an error and - the output of this function is a list, where each string in the vector is an element in the list and each element contains a vector of the pieces that have been split. ## strsplit() function First split on '/': ```{r, mysize=TRUE, size='\\footnotesize'} pieces <- strsplit(as.character( baltimore.tow$receivingDateTime[1]), '/') pieces ``` Notice the brackets, that means this returns a list ## strsplit function ```{r, mysize=TRUE, size='\\footnotesize'} split <- function(string.in, split.char, position){ # function to split and retain a portion of output # ARGS: string - string so split # split - value to split on # position - position in string to retain # RETURNS: character string return(strsplit(string.in, split = split.char)[[1]][[position]]) } temp.date <- sapply(baltimore.tow$receivingDateTime, split, split.char = '/', position = 3) as.character(temp.date[1]) ``` ## Exercise: strsplit function Now we can extract year from this chunk of code contained in pieces.mat. ```{r, size='tiny'} #baltimore.tow$Year <- ``` ## Solution: strsplit function Now we can extract year from this chunk of code contained in pieces.mat. ```{r, size='tiny'} baltimore.tow$Year <- sapply(temp.date, split, split.char = ' ', position = 1) ``` ## Goal 2. Type of Vehicles Towed by Month Next we wish to compute how many vehicles were towed in the AM and PM for each type of vehicle. However, we want to take a close look at the vehicle types in the data set and perhaps create more useful groups. ## Unique function - how to group vehicles First examine the unique types of vehicles in this data set. ```{r, mysize=TRUE, size='\\footnotesize'} unique(baltimore.tow$vehicleType) ``` ## Grouping First consider reasonable groups for vehicle types. 1. Cars - (Car, convertible) 2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi) 3. Trucks - (Pick-up Truck, Pickup Truck) 4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck) 5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike) 6. Misc (delete) - (Boat, Golf Cart, Trailer) ## Messy Data: Grouping Next examine values in some of these groups, we will just look at the vehicle type of 'Truck'. ```{r, mysize=TRUE, size='\\scriptsize', tidy=T} unique(baltimore.tow$vehicleMake[baltimore.tow$vehicleType == 'Truck']) ``` \normalsize Note that there are several spelling errors in this data set. How do we combine them? ## Messy Data: Data Cleaning Spelling errors can be addressed, by reassigning vehicles to the correct spelling. ```{r, mysize=TRUE, size='\\scriptsize'} baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Peterbelt'] <- 'Peterbilt' baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Izuzu'] <- 'Isuzu' baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Frightliner'] <- 'Freightliner' baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Internantional'] <- 'International' ``` \normalsize Also note that many of the groupings have mis-classified vehicles, but we will not focus on that yet. ## Exercise: Delete Misc. Type Vehicles First we will delete golf carts, boats, and trailers. There are several ways to do this, consider making a new data frame called balt.tow.small that does not include golf carts, boats, and trailers. ```{r, eval=F} balt.tow.small <- ``` ## Solution: Delete Misc. Type Vehicles First we will delete golf carts, boats, and trailers. ```{r, mysize=TRUE, size='\\small'} balt.tow.small <- subset(baltimore.tow,!vehicleType %in% c('Golf Cart','Trailer','Boat')) ``` ## Exercise: Create Additional Groups Now we need to create a variable for the additional groups below. 1. Cars - (Car, convertible) 2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi) 3. Trucks - (Pick-up Truck, Pickup Truck) 4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck) 5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike) ## Solution: Create Additional Groups One way to create groups is by creating a new variable ```{r, mysize=TRUE, size='\\scriptsize'} balt.tow.small$Group <- '' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Car','Convertible')] <- 'Cars' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('SUV', 'Station Wagon','Sport Utility Vehicle','Van','Taxi')] <- 'Large Cars' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Pick-up Truck','Pickup Truck')] <- 'Trucks' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Truck','Tractor Trailer','Tow Truck','Tractor', 'Construction Equipment','Commercial Truck')] <- 'Large Trucks' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Motor Cycle (Street Bike)','Dirt Bike','Mini-Bike', 'All terrain - 4 wheel bike')] <- 'Bikes' ``` ## Ready for Calculations? First we need to extract the AM/PM tag from the time-date character string. Recall the general structure of this variable is `r balt.tow.small$receivingDateTime[1]`. As the tag that we are looking for falls at the end of the string, we can use `nchar()` to find the length of the string. ```{r, mysize=TRUE, size='\\footnotesize'} unique(nchar(as.character(balt.tow.small$receivingDateTime))) balt.tow.small$Time <- substr(balt.tow.small$receivingDateTime,21,22) ``` ## Solution: Aggregate We could use aggregate, as such: ```{r, mysize=TRUE, size='\\footnotesize'} towed.counts <- aggregate(rep(1,nrow(balt.tow.small)), by=list(balt.tow.small$Group,balt.tow.small$Time),sum) colnames(towed.counts) colnames(towed.counts) <- c('VehicleGroup','Time','VehicleCount') ``` ## Solution: Aggregate Results In this case, `sort()` is not the function we are looking for, rather we use order. Order returns integers for the 'ordering' based on the variables. ```{r, mysize=TRUE, size='\\scriptsize'} towed.counts[order(towed.counts$VehicleGroup,towed.counts$Time),] ``` ## Solution: dplyr We could use aggregate, as such: ```{r, mysize=TRUE, size='\\footnotesize'} balt.tow.small %>% group_by(Time, Group) %>% tally() ```