--- title: "Lab 9: SQL and Data Scraping" author: 'Solution Key' date: "March 22, 2018" output: html_document --- Turn in one copy for each group. If group members are not present in class they will be required to complete their own lab to receive credit. Then turn in an HTML and R Markdown output. This is due Tuesday March 27th at 12:15 PM. ## Q1. (40 points) Visit the website: [http://www.montana.edu/marketing/about-msu/](http://www.montana.edu/marketing/about-msu/), your goal is to write code to extract the table that contains the Top 10 Student Home States. In this case you could likely enter this by hand quicker, but if the table was much larger scraping would be more efficient. ```{r, include = F} library(rvest) library(knitr) library(dplyr) ``` ```{r} msu <- read_html('http://www.montana.edu/marketing/about-msu/') msu.students <- msu %>% html_node('table') %>% html_table() colnames(msu.students) <- c('Home State','Enrollment') kable(msu.students, caption = 'TOP 10 STUDENT HOME STATES') ``` ## Q2. (60 points) For this question, a subset of the tables contained in the History of Baseball database are available. The sqlite database can be downloaded at [http://math.montana.edu/ahoegh/teaching/stat408/datasets/example.sqlite](http://math.montana.edu/ahoegh/teaching/stat408/datasets/example.sqlite). You will need to store this file locally on your computer. Additional details are available here: [https://www.kaggle.com/seanlahman/the-history-of-baseball](https://www.kaggle.com/seanlahman/the-history-of-baseball). #### SQL Select players born in the State of Montana and compute: - the total number of players - the total number of homeruns by these players. (This will require joining the batting and player tables). ### First dplyr solution ```{r} # read in the data files batting <- read.csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/batting.csv', stringsAsFactors = F) player <- read.csv('http://math.montana.edu/ahoegh/teaching/stat408/datasets/player.csv', stringsAsFactors = F) ``` ```{r} # answer first question Mt.players <- player %>% filter(birth_state == 'MT') num.Mt <- Mt.players %>% tally() %>% as.numeric() ``` There have been `r num.Mt` MLB players born in the state of Montana. ```{r} # answer second question Mt.homeruns <- Mt.players %>% inner_join(batting, by = 'player_id') %>% summarise(hr.total = sum(hr, na.rm=T)) %>% as.numeric() ``` The MT born players have hit a total of `r Mt.homeruns` home runs in MLB. ### Now SQL solution ```{r} # Create SQL database in memory and put the .CSV files into the database. library(RSQLite) mlb.db <- dbConnect(RSQLite::SQLite(), ":memory:") dbWriteTable(mlb.db, "player", player) dbWriteTable(mlb.db, "batting", batting) dbListTables(mlb.db) dbListFields(mlb.db, 'batting') dbListFields(mlb.db, 'player') # Okay everything is here, now start the SQL commands ## first extract the number of rows corresponding to players born in MT mt.player.sql <- dbGetQuery(mlb.db,"SELECT count(*) FROM player WHERE birth_state = 'MT'") %>% as.numeric() # create new table with MT.players dbSendQuery(mlb.db, 'CREATE TABLE mt_players AS SELECT player_id, birth_state FROM player WHERE birth_state = "MT";') dbListTables(mlb.db) dbListFields(mlb.db, "mt_players") ## Create new table for mt.hr.sql <- dbGetQuery(mlb.db,"SELECT sum(hr) FROM batting INNER JOIN mt_players ON batting.player_id = mt_players.player_id; ") %>% as.numeric() dbDisconnect(mlb.db) ``` Here we get the same results using SQL with `r mt.player.sql` Montana born players hitting a total of `r mt.hr.sql` home runs.