Turn in one copy for each group. If group members are not present in class they will be required to complete their own lab to receive credit. Then turn in an HTML and R Markdown output. This is due Tuesday March 27th at 12:15 PM.

Q1. (40 points)

Visit the website: http://www.montana.edu/marketing/about-msu/, your goal is to write code to extract the table that contains the Top 10 Student Home States. In this case you could likely enter this by hand quicker, but if the table was much larger scraping would be more efficient.

msu <- read_html('http://www.montana.edu/marketing/about-msu/')
msu.students <- msu %>% html_node('table') %>% html_table()
colnames(msu.students) <- c('Home State','Enrollment')
kable(msu.students, caption = 'TOP 10 STUDENT HOME STATES')
TOP 10 STUDENT HOME STATES
Home State Enrollment
Montana 9,569
Washington 1,212
California 1,033
Colorado 950
Minnesota 495
Idaho 390
Oregon 359
Alaska 274
Illinois 214
Wyoming 208

Q2. (60 points)

For this question, a subset of the tables contained in the History of Baseball database are available. The sqlite database can be downloaded at http://math.montana.edu/ahoegh/teaching/stat408/datasets/example.sqlite. You will need to store this file locally on your computer. Additional details are available here: https://www.kaggle.com/seanlahman/the-history-of-baseball.

SQL

Select players born in the State of Montana and compute:

  • the total number of players
  • the total number of homeruns by these players. (This will require joining the batting and player tables).

First dplyr solution

# read in the data files
batting <- read.csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/batting.csv', stringsAsFactors = F)
player <- read.csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/player.csv', stringsAsFactors = F)
# answer first question
Mt.players <- player %>% filter(birth_state == 'MT') 
num.Mt <- Mt.players %>% tally() %>% as.numeric()

There have been 24 MLB players born in the state of Montana.

# answer second question
Mt.homeruns <- Mt.players %>% inner_join(batting, by = 'player_id') %>% summarise(hr.total = sum(hr, na.rm=T)) %>% as.numeric()

The MT born players have hit a total of 206 home runs in MLB.

Now SQL solution

# Create SQL database in memory and put the .CSV files into the database.
library(RSQLite)
mlb.db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(mlb.db, "player", player)
dbWriteTable(mlb.db, "batting", batting)
dbListTables(mlb.db)
## [1] "batting" "player"
dbListFields(mlb.db, 'batting')
##  [1] "player_id" "year"      "stint"     "team_id"   "league_id"
##  [6] "g"         "ab"        "r"         "h"         "double"   
## [11] "triple"    "hr"        "rbi"       "sb"        "cs"       
## [16] "bb"        "so"        "ibb"       "hbp"       "sh"       
## [21] "sf"        "g_idp"
dbListFields(mlb.db, 'player')
##  [1] "player_id"     "birth_year"    "birth_month"   "birth_day"    
##  [5] "birth_country" "birth_state"   "birth_city"    "death_year"   
##  [9] "death_month"   "death_day"     "death_country" "death_state"  
## [13] "death_city"    "name_first"    "name_last"     "name_given"   
## [17] "weight"        "height"        "bats"          "throws"       
## [21] "debut"         "final_game"    "retro_id"      "bbref_id"
# Okay everything is here, now start the SQL commands
## first extract the number of rows corresponding to players born in MT
mt.player.sql <- dbGetQuery(mlb.db,"SELECT count(*) FROM player WHERE birth_state = 'MT'") %>% as.numeric()

# create new table with MT.players
dbSendQuery(mlb.db, 'CREATE TABLE mt_players AS
    SELECT player_id, birth_state
    FROM player
    WHERE birth_state = "MT";')
## <SQLiteResult>
##   SQL  CREATE TABLE mt_players AS
##     SELECT player_id, birth_state
##     FROM player
##     WHERE birth_state = "MT";
##   ROWS Fetched: 0 [complete]
##        Changed: 1
dbListTables(mlb.db)
## Warning: Closing open result set, pending rows
## [1] "batting"    "mt_players" "player"
dbListFields(mlb.db, "mt_players")
## [1] "player_id"   "birth_state"
## Create new table for 
mt.hr.sql <- dbGetQuery(mlb.db,"SELECT sum(hr) 
FROM batting INNER JOIN mt_players 
ON batting.player_id = mt_players.player_id; ") %>% as.numeric()

dbDisconnect(mlb.db)

Here we get the same results using SQL with 24 Montana born players hitting a total of 206 home runs.