---
title: |
  | STAT 408 - Week 4:
  | Tidy Data, Data Manipulation, and Processing
date: "February 2, 2018"
output: html_document
---

```{r setup, include=FALSE}
library(knitr)
library(formatR)
library(XML)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before)
    return(options$size)
})
```

## The dataset

First read in the data set, which is available at: [http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv](http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv).

```{r, mysize=TRUE, size='\\scriptsize'}
baltimore.tow <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv', stringsAsFactors = FALSE)
# Drop the leading currency symbol from totalPaid and convert the rest to numeric
baltimore.tow$totalNumeric <- as.numeric(substr(baltimore.tow$totalPaid, start = 2, stop = nchar(baltimore.tow$totalPaid)))
str(baltimore.tow)
```

## Exercise: group_by()

Now use `group_by()` to compute the average towing cost for each vehicle type.

## Goal 1: Vehicles Towed by Year

The first goal is to determine how many vehicles were towed for each year in the data set.

- We don't have a column for year, and the first observation for receiving date is "`r baltimore.tow$receivingDateTime[1]`".
- Describe the process for obtaining this information.
- What R functions are you familiar with that might be useful here?

## Exercise: Using the substr() function

Use the `substr()` function to extract the year and create a new variable in R.

```{r, mysize=TRUE, size='\\scriptsize'}
# baltimore.tow$Year <-
```

## Exercise: strsplit function

Now we can extract the year from the pieces contained in `pieces.mat`.

```{r, size='tiny'}
#baltimore.tow$Year <-
```

## Goal 2: Type of Vehicles Towed by Month

Next we wish to compute how many vehicles were towed in the AM and PM for each type of vehicle.
However, we want to take a close look at the vehicle types in the data set and perhaps create more useful groups.

## Messy Data: Data Cleaning

Spelling errors can be addressed by reassigning vehicles to the correct spelling.

```{r, mysize=TRUE, size='\\scriptsize'}
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Peterbelt'] <- 'Peterbilt'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Izuzu'] <- 'Isuzu'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Frightliner'] <- 'Freightliner'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Internantional'] <- 'International'
```
\normalsize

Also note that many of the groupings contain misclassified vehicles, but we will not focus on that yet.

## Exercise: Delete Misc. Type Vehicles

First we will delete golf carts, boats, and trailers. There are several ways to do this; consider making a new data frame called `balt.tow.small` that does not include golf carts, boats, and trailers.

```{r, eval=F}
balt.tow.small <- 
```

## Exercise: Create Additional Groups

Now we need to create a variable for the additional groups below.

1. Cars - (Car, convertible)
2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
3. Trucks - (Pick-up Truck, Pickup Truck)
4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)

## Solution: Create Additional Groups

One way to create groups is by creating a new variable.

```{r, mysize=TRUE, size='\\scriptsize'}
```

## Ready for Calculations?

First we need to extract the AM/PM tag from the date-time character string. Because the tag we are looking for falls at the end of the string, we can use `nchar()` to find the length of the string.
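
As a minimal sketch of the `nchar()` idea, the block below pulls the last two characters from a pair of made-up timestamps. The timestamp layout and the names `stamps` and `am.pm` are assumptions for illustration only; the actual `receivingDateTime` values may be formatted differently. The block is shown for reference and is not evaluated when the document is knit.

```r
# Hypothetical timestamps in an assumed "MM/DD/YYYY HH:MM:SS AM" layout;
# the real receivingDateTime strings may look different.
stamps <- c("10/24/2010 12:41:00 PM", "04/03/2011 09:05:00 AM")

# nchar() returns the length of each string, so the AM/PM tag occupies
# positions (length - 1) through length.
n <- nchar(stamps)
am.pm <- substr(stamps, start = n - 1, stop = n)
am.pm
```

Because `substr()` is vectorized over `start` and `stop`, the same two lines would work on a whole column even if the strings have different lengths.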
```{r, mysize=TRUE, size='\\footnotesize'}
```

## Solution: Aggregate

We could use `aggregate()`, as such:

```{r, mysize=TRUE, size='\\footnotesize'}
```

## Solution: dplyr

We could use dplyr, as such:

```{r, mysize=TRUE, size='\\footnotesize'}
```
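
For comparison, here is a hedged sketch of both approaches on a small toy data frame rather than the towing data. The data frame `toy.tow` and its columns `Group`, `AM.PM`, and `count` are hypothetical names chosen for illustration; they are not the course solution. The block is shown for reference and is not evaluated when the document is knit.

```r
library(dplyr)

# Toy stand-in for the cleaned towing data; all names here are hypothetical.
toy.tow <- data.frame(
  Group = c("Cars", "Cars", "Trucks", "Cars", "Trucks"),
  AM.PM = c("AM", "PM", "PM", "PM", "PM"),
  stringsAsFactors = FALSE
)

# Base R: add a column of ones and sum it within each Group x AM.PM cell.
toy.tow$count <- 1
agg.counts <- aggregate(count ~ Group + AM.PM, data = toy.tow, FUN = sum)

# dplyr: the same tally with group_by() and summarize().
dplyr.counts <- toy.tow %>%
  group_by(Group, AM.PM) %>%
  summarize(n = n()) %>%
  ungroup()

agg.counts
dplyr.counts
```

Both versions report only the combinations that occur in the data, so a vehicle group that was never towed in the AM would have no AM row instead of a zero.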