---
title: |
    | STAT 408 - Statistical Learning
    | Predictive Modeling
output: html_document
---

```{r setup, include=FALSE}
library(ggplot2)
library(dplyr)
library(knitr)
library(randomForest)
library(maps)
library(plotrix)
library(mnormt)
library(rpart)
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before)
    return(options$size)
})
options(scipen=999)
```

## Exercise - Prediction for Capital Bike Share

```{r, mysize=TRUE, size='\\tiny'}
bikes <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/Bike.csv')
set.seed(11142017)
num.obs <- nrow(bikes)
test.ids <- base::sample(1:num.obs, size=round(num.obs*.3))
test.bikes <- bikes[test.ids,]
train.bikes <- bikes[(1:num.obs)[!(1:num.obs) %in% test.ids],]
dim(bikes)
dim(test.bikes)
dim(train.bikes)
```

## Exercise - Prediction for Capital Bike Share

```{r, mysize=TRUE, size='\\normalsize'}
lm.bikes <- lm(count ~ holiday + atemp, data=train.bikes)
lm.mad <- mean(abs(test.bikes$count - predict(lm.bikes,test.bikes)))
```

Create another predictive model and compare the results to the MAD of the linear model above ($`r round(lm.mad)`$). However, don't use casual and registered in your model as those two will sum to the total count.

## Exercise: Predict Titanic Survival

```{r, mysize=TRUE, size='\\scriptsize'}
titanic <- read.csv(
  'http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/titanic.csv')
set.seed(11142017)
titanic <- titanic %>% filter(!is.na(Age))
num.pass <- nrow(titanic)
test.ids <- base::sample(1:num.pass, size=round(num.pass*.3))
test.titanic <- titanic[test.ids,]
train.titanic <- titanic[(1:num.pass)[!(1:num.pass) %in% test.ids],]
dim(titanic)
dim(test.titanic)
dim(train.titanic)
```

## Exercise: Predict Titanic Survival

See if you can improve the classification error from the model below.

```{r}
glm.titanic <- glm(Survived ~ Age, data=train.titanic, family = binomial)
Class.Error <- mean(test.titanic$Survived != round(predict(glm.titanic, test.titanic, type='response')))
```

The logistic regression model only using age is wrong $`r round(Class.Error,2)* 100`$% of the time.
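One way to approach the exercise is to fit a second model with more predictors and compute its test classification error the same way. The sketch below is not a solution on the course data. It uses a small simulated data frame so that it runs without a download, and it assumes the real file has columns named `Sex` and `Pclass` in addition to `Survived` and `Age`. The names `sim`, `train.sim`, `test.sim`, and `class.error` are made up for illustration.

```r
library(rpart)

# Simulated stand-in for the Titanic data (NOT the real data set).
# Column names Sex and Pclass are assumptions about the course file.
set.seed(11142017)
n <- 500
sim <- data.frame(
  Age    = round(runif(n, 1, 70)),
  Sex    = factor(sample(c('female', 'male'), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
# In this simulation, survival depends mostly on sex and class.
p.surv <- plogis(1 - 2.5 * (sim$Sex == 'male') -
                   0.8 * (sim$Pclass - 1) - 0.01 * sim$Age)
sim$Survived <- rbinom(n, 1, p.surv)

# Same 70/30 split pattern as in the exercise.
test.ids  <- sample(1:n, size = round(n * .3))
test.sim  <- sim[test.ids, ]
train.sim <- sim[-test.ids, ]

# Hypothetical helper: proportion of misclassified test observations.
class.error <- function(truth, predicted) mean(truth != predicted)

# Baseline: logistic regression on age only, as in the exercise.
glm.age <- glm(Survived ~ Age, data = train.sim, family = binomial)
err.age <- class.error(
  test.sim$Survived,
  round(predict(glm.age, test.sim, type = 'response'))
)

# Candidate: classification tree with additional predictors.
tree.sim  <- rpart(Survived ~ Age + Sex + Pclass,
                   data = train.sim, method = 'class')
tree.pred <- predict(tree.sim, test.sim, type = 'class')
err.tree  <- class.error(
  test.sim$Survived,
  as.numeric(as.character(tree.pred))
)

# Test classification error for each model.
c(glm.age = err.age, tree = err.tree)
```

To apply the same pattern to the exercise, replace `train.sim` and `test.sim` with `train.titanic` and `test.titanic`, and check `names(titanic)` first to confirm which predictors are actually available.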