Turn in one copy per group, as both an HTML file and the R Markdown script. Group members who are not present in class will be required to complete their own lab to receive credit. This is due Monday, October 2 at 10AM.

Lab Overview

The entire lab is worth 100 points. Of these, 10 points are for clarity of code (comments and interpretable variable names) and for thoughtful writing that emphasizes concise interpretations.

Questions

Answer the following questions in this R Markdown document. Please include code where necessary.

1. Capital BikeShare Data

This data set contains individual bike trips from January through March 2017 for the Capital BikeShare system in Washington, D.C.

a. (5 points)

Download the file http://math.montana.edu/ahoegh/teaching/stat408/datasets/biketrips2017.csv.

bikes.in <- read.csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/biketrips2017.csv',
                     stringsAsFactors = FALSE)

b. (15 points)

Summarize the data set. What does each column represent? What about each row?

str(bikes.in)
## 'data.frame':    244687 obs. of  5 variables:
##  $ Start.date   : chr  "3/31/2017 23:55" "3/31/2017 23:52" "3/31/2017 23:51" "3/31/2017 23:50" ...
##  $ End.date     : chr  "3/31/2017 23:58" "3/31/2017 23:54" "3/31/2017 23:59" "3/31/2017 23:53" ...
##  $ Start.station: chr  "14th & Irving St NW" "15th & P St NW" "5th St & Massachusetts Ave NW" "Columbus Circle / Union Station" ...
##  $ End.station  : chr  "15th & Euclid St  NW" "17th St & Massachusetts Ave NW" "17th St & Massachusetts Ave NW" "8th & F St NE" ...
##  $ Member.Type  : chr  "Registered" "Registered" "Registered" "Registered" ...

The data set contains 244687 rows, each of which corresponds to a single bike trip. The five columns record the start and end of the trip as date-and-time stamps, the start and end stations, and the rider's member type.

c. (25 points)

Using commands from the dplyr package, compute how many trips were made each day of the month. (I have filtered out cases that started and ended on different days.) Note this will require an intermediate step to extract the day of the month.

unique(nchar(bikes.in$Start.date)) # different lengths, need to use strsplit
## [1] 15 14 13
# Split "m/d/yyyy hh:mm" on '/' and keep the second piece, the day of the month
bikes.in$day <- as.numeric(sapply(strsplit(bikes.in$Start.date, '/'), `[`, 2))
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
kable(count(bikes.in,day))
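| day |     n |
|----:|------:|
|   1 |  8958 |
|   2 |  8125 |
|   3 |  6732 |
|   4 |  5242 |
|   5 |  4992 |
|   6 |  8565 |
|   7 |  8368 |
|   8 | 10941 |
|   9 | 12370 |
|  10 |  5982 |
|  11 |  5097 |
|  12 |  4614 |
|  13 |  6467 |
|  14 |  1582 |
|  15 |  3356 |
|  16 |  5174 |
|  17 |  7181 |
|  18 |  8229 |
|  19 |  5666 |
|  20 |  9283 |
|  21 | 10592 |
|  22 |  8306 |
|  23 |  8606 |
|  24 |  9042 |
|  25 | 16067 |
|  26 |  6595 |
|  27 | 11324 |
|  28 | 10408 |
|  29 | 12761 |
|  30 | 10296 |
|  31 |  3766 |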
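
The day of the month can also be extracted by parsing the date rather than splitting strings. The following is a minimal sketch, not the graded solution. It assumes the `m/d/yyyy hh:mm` format shown in the `str()` output, and the toy data frame `trips` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data in the same "m/d/yyyy hh:mm" format as Start.date
trips <- data.frame(
  Start.date = c("3/31/2017 23:55", "3/1/2017 8:05",
                 "1/31/2017 12:30", "2/1/2017 7:45"),
  stringsAsFactors = FALSE
)

# as.Date() parses the leading date and ignores the trailing time text;
# format(..., "%d") then returns the day of the month as a string
day.counts <- trips %>%
  mutate(day = as.numeric(format(as.Date(Start.date, format = "%m/%d/%Y"), "%d"))) %>%
  count(day)

day.counts
```

Parsing avoids having to reason about the varying string lengths, at the cost of relying on the date format being consistent across all rows.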

d. (40 points)

Use the piping structure from dplyr to compute the average trip length by member type. As a hint, the end date and start date have time stamps that you’ll need to extract to compute the trip time for each bike rental.

# Keep the "hh:mm" part of each timestamp, then split it into hours and minutes
start.clock <- sapply(strsplit(bikes.in$Start.date, ' '), `[`, 2)
start.time <- matrix(as.numeric(unlist(strsplit(start.clock, ':'))), ncol = 2, byrow = TRUE)
bikes.in$start.time <- start.time[,1] + start.time[,2] / 60 # start time in decimal hours

end.clock <- sapply(strsplit(bikes.in$End.date, ' '), `[`, 2)
end.time <- matrix(as.numeric(unlist(strsplit(end.clock, ':'))), ncol = 2, byrow = TRUE)
bikes.in$end.time <- end.time[,1] + end.time[,2] / 60 # end time in decimal hours
head(bikes.in)
##        Start.date        End.date                   Start.station
## 1 3/31/2017 23:55 3/31/2017 23:58             14th & Irving St NW
## 2 3/31/2017 23:52 3/31/2017 23:54                  15th & P St NW
## 3 3/31/2017 23:51 3/31/2017 23:59   5th St & Massachusetts Ave NW
## 4 3/31/2017 23:50 3/31/2017 23:53 Columbus Circle / Union Station
## 5 3/31/2017 23:50 3/31/2017 23:54     Columbia Rd & Belmont St NW
## 6 3/31/2017 23:49 3/31/2017 23:55    Columbia Rd & Georgia Ave NW
##                      End.station Member.Type day start.time end.time
## 1           15th & Euclid St  NW  Registered  31   23.91667 23.96667
## 2 17th St & Massachusetts Ave NW  Registered  31   23.86667 23.90000
## 3 17th St & Massachusetts Ave NW  Registered  31   23.85000 23.98333
## 4                  8th & F St NE  Registered  31   23.83333 23.88333
## 5           16th & Harvard St NW  Registered  31   23.83333 23.90000
## 6             9th & Upshur St NW  Registered  31   23.81667 23.91667
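# Trip length in hours; a simple difference works because every trip starts and ends on the same day
bikes.out <- mutate(bikes.in, trip.time = end.time - start.time)

kable(bikes.out %>% group_by(Member.Type) %>% summarize(ave.trip.hr = mean(trip.time)))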
| Member.Type | ave.trip.hr |
|:------------|------------:|
| Casual      |   0.6378864 |
| Registered  |   0.1888411 |
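
An alternative to the string splitting above is to parse the full timestamps with `as.POSIXct()` and subtract them with `difftime()`. This is a sketch rather than the lab's official solution. The toy data and the names `trips`, `start.at`, `end.at`, and `trip.min` are hypothetical, and `tz = "UTC"` is an assumption made only so the example is reproducible.

```r
library(dplyr)

# Hypothetical toy trips; the third one crosses midnight
trips <- data.frame(
  Start.date  = c("3/31/2017 23:50", "3/31/2017 8:00", "3/30/2017 23:55"),
  End.date    = c("3/31/2017 23:56", "3/31/2017 8:30", "3/31/2017 0:05"),
  Member.Type = c("Registered", "Casual", "Casual"),
  stringsAsFactors = FALSE
)

stamp.format <- "%m/%d/%Y %H:%M"

avg.trip <- trips %>%
  mutate(
    start.at = as.POSIXct(Start.date, format = stamp.format, tz = "UTC"),
    end.at   = as.POSIXct(End.date,   format = stamp.format, tz = "UTC"),
    # difftime() returns the elapsed time; convert it to plain minutes
    trip.min = as.numeric(difftime(end.at, start.at, units = "mins"))
  ) %>%
  group_by(Member.Type) %>%
  summarize(ave.trip.min = mean(trip.min))

avg.trip
```

Because the date is part of the parsed value, this approach would still give the right duration for a trip that crosses midnight, which subtracting clock times alone cannot do.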

e. (5 points)

Describe one pro or con about using dplyr for data wrangling.