---
title: |
| STAT 408
| Data Scraping and SQL
date: "March 8, 2018"
output:
beamer_presentation:
theme: "PaloAlto"
fonttheme: "structuresmallcapsserif"
---
```{r setup, include=FALSE}
library(dplyr)
library(knitr)
library(rvest)
library(stringr)
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(mysize = function(before, options, envir) {
if (before)
return(options$size)
})
```
# Data Scraping
## Data Scraping
Data scraping is defined as using a computer to extract information, typically from human readable websites. We could spend multiple weeks on this, so this will be a basic introduction that will allow you to:
- extract text and numbers from webpages and
- extract tables from webpages.
## A bit about HTML
HTML elements are written with a start tag, an end tag, and with the content in between: content. The tags which typically contain the textual content we wish to scrape. Some tags include:
- $
$, $$,…,: for headings
- $
$: Paragraph elements
- $
$: Unordered bulleted list
- $$: Ordered list
- $- $: Individual List item
- $
$: Division or section
- $
$: Table
## HTML Example
![MSU website](images/MSU.png)
## Scraping with rvest
```{r}
library(rvest)
library(stringr)
msu.math <- read_html("http://math.montana.edu/")
msu.math
```
## Scraping with rvest
```{r, mysize=TRUE, size='\\footnotesize'}
msu.math %>% html_nodes('h1')
msu.math %>% html_nodes('h1') %>% html_text()
```
## Tidying Up
```{r, mysize=TRUE, size='\\footnotesize'}
msu.math %>% html_nodes('h1') %>% html_text() %>%
str_replace_all("\\s+", " ") %>%
str_replace_all(pattern = "\n", replacement = "") %>%
str_replace_all(pattern = "\t", replacement = "")
```
## Scraping h3
```{r, mysize=TRUE, size='\\footnotesize'}
msu.math %>% html_nodes('h3') %>% html_text() %>%
str_replace_all("\\s+", " ") %>%
str_replace_all(pattern = "\n", replacement = "") %>%
str_replace_all(pattern = "\t", replacement = "")
```
## A River Runs Through It
![IMDB: A River Runs Through It](images/river.png)
## Get Story line
```{r, mysize=TRUE, size='\\footnotesize' }
river <- read_html("http://www.imdb.com/title/tt0105265/")
story.line <- river %>%
html_nodes('#titleStoryLine') %>%
html_nodes('p') %>% html_text() %>%
str_replace_all(pattern = "\n", replacement = "")
```
The storyline is : *`r story.line`*.
## SelectorGadget for accessing actors
![Using SelectorGadget](images/river2.png)
## Actors
```{r, mysize=TRUE, size='\\footnotesize'}
# Get Actors
river %>% html_nodes('#titleCast') %>%
html_nodes(".itemprop span") %>% html_text()
```
## Selecting Tables: baseball data
![HTML Table](images/baseball.png)
## Scraping Tables
```{r, mysize=TRUE, size='\\tiny'}
batting <- read_html("https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml")
batting.list <- batting %>% html_nodes('table') %>% html_table()
batting.df <- tbl_df(batting.list[[1]])
kable(batting.df)
```
## Scraping Exercise: Get Team Info
Visit the baseball reference website for the Colorado Rockies [https://www.baseball-reference.com/teams/COL/2017.shtml](https://www.baseball-reference.com/teams/COL/2017.shtml) and scrape a table or text.
## Scraping Solution: Get Team Info
```{r, mysize=TRUE, size='\\tiny'}
batting.CO <- read_html("https://www.baseball-reference.com/teams/COL/2017.shtml")
tables.CO <- batting.CO %>% html_nodes('table') %>% html_table()
tbl_df(tables.CO[[1]])
```
# SQL
## SQLite
For this class we will use SQLite which enables users to store database files locally, but the principles are the same for querying a server-based database.
We will use a European soccer database available at [https://www.kaggle.com/hugomathien/soccer/](https://www.kaggle.com/hugomathien/soccer/) which can be downloaded with the following link: [https://www.kaggle.com/hugomathien/soccer/downloads/database.sqlite](https://www.kaggle.com/hugomathien/soccer/downloads/database.sqlite)
## Accessing Database
```{r, mysize=TRUE, size='\\footnotesize'}
library(DBI)
library(RSQLite)
## connect to a database WHICH IS STORED LOCALLY
my.database <- dbConnect(SQLite(),
dbname="~/Google Drive/teaching/STAT408/data/database.sqlite")
dbListTables(my.database)
dbDisconnect(my.database)
```
## Identifying fields in table
```{r, mysize=TRUE, size='\\footnotesize'}
my.database <- dbConnect(SQLite(),
dbname="~/Google Drive/teaching/STAT408/data/database.sqlite")
dbListFields(my.database, "Player")
```
## SQL commands
The most basic SQL queries have the following structure:
- **SELECT** var1name, var2name (filter columns)
- **FROM** tablename (identify table)
- **WHERE** condition1 (filter rows)
- **GROUP_BY** var3name (aggregate data)
- **HAVING** condition2 (filter aggregated data)
- **ORDER_BY** var (arrange ordering)
## SQL Query 1
Select all columns for player and view first 5 rows.
```{r, mysize=TRUE, size='\\tiny'}
kable((dbGetQuery(my.database,"SELECT * FROM Player"))[1:5,])
```
## SQL Query 2
Retain player name, weight, height for players over 200 cm
```{r, mysize=TRUE, size='\\tiny'}
kable(dbGetQuery(my.database,"SELECT player_name, height, weight FROM Player WHERE height > 200"))
```
## SQL Query 3
Compute average weight for players of 200 cm
```{r, mysize=TRUE, size='\\tiny'}
dbGetQuery(my.database,"SELECT AVG(weight) as mean_weight FROM Player WHERE height > 200")
```
## Create a database
```{r, mysize=TRUE, size='\\tiny'}
new.db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbListTables(new.db)
player <- tbl_df(dbGetQuery(my.database,"SELECT * FROM Player"))
dbWriteTable(new.db, "player", player)
dbListTables(new.db)
dbDisconnect(new.db)
dbDisconnect(my.database)
```
## Additional SQL
SQL also has functionality for merging and updating tables. See the cheat sheet for more details.
## SQL Exercise
Select the average goals scored in matches in different countries from the match table
## SQL Solution
Select the average goals scored in matches in different countries from the match table
```{r, mysize=TRUE, size='\\tiny'}
my.database <- dbConnect(SQLite(),
dbname="~/Google Drive/teaching/STAT408/data/database.sqlite")
st <- "SELECT AVG(home_team_goal + away_team_goal) as total_goals,
country_id FROM match GROUP by country_id"
dbGetQuery(my.database,st)
dbListTables(my.database)
```
## SQL Solution
Select the average goals scored in matches in different countries from the match table
```{r, mysize=TRUE, size='\\tiny'}
st2 <- "Create Table goals as SELECT AVG(home_team_goal + away_team_goal) as
total_goals, country_id FROM match GROUP by country_id "
dbGetQuery(my.database,st2)
dbGetQuery(my.database, "SELECT * from goals INNER JOIN country on goals.country_id = country.id")
dbSendQuery(my.database, "Drop Table goals")
dbDisconnect(my.database)
```