Lab 10: Predictive Modeling

Turn in one copy for each group. Group members who are not present in class must complete their own lab to receive credit. This lab is due Monday, April 2.

For this lab we will be using a subset of a dataset collected in Brazil relating to whether patients show up for a scheduled medical appointment. More information about the dataset is available at: https://www.kaggle.com/joniarroba/noshowappointments.

Q1. (10 points)

Look at the dataset and summarize the variables it includes.

Q2. (15 points)

Before fitting a model, discuss which variables you think might be related to missing a medical appointment. Specifically, what relationship do you expect between these variables and missing a medical appointment?

Q3. (15 points)

Note that in predictive modeling, a fair amount of data wrangling is necessary to create relevant variables to use for prediction. Examples in this setting include:

- appointment delay (time between scheduling and having the appointment)
- whether the user has missed an appointment before (I removed user ID from the database)
- the day of week of the appointment.
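As an illustration of this kind of wrangling, here is a minimal sketch of how an appointment delay could be computed. It uses made-up rows, and the column names ScheduledDay and AppointmentDay and their timestamp format are assumptions that may not match the lab subset; check the actual columns before adapting it.

```r
# Toy data: column names and timestamp format are assumptions, not taken from the lab files
appts <- data.frame(
  ScheduledDay   = c("2016-04-27T08:36:51Z", "2016-04-29T16:07:23Z"),
  AppointmentDay = c("2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z")
)

# Keep only the date part (first 10 characters) and convert to Date
sched.date <- as.Date(substr(appts$ScheduledDay, 1, 10))
appt.date  <- as.Date(substr(appts$AppointmentDay, 1, 10))

# Delay in whole days between scheduling and the appointment
appts$delay.days <- as.numeric(difftime(appt.date, sched.date, units = "days"))
```

In this sketch a delay of 0 means the appointment was scheduled on the same day it took place.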

You don’t need to create these variables, but explain how you would create a variable containing the day of the week of the appointment.

Q4. (40 points)

Now use the training dataset available at http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/MedApptsTrain.csv to fit a predictive model for determining whether a patient will miss the medical appointment. Then use your model to make predictions on the test dataset, which is available at http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/MedApptsTest.csv

We will evaluate models using classification error and a second criterion called log-loss. Log-loss is useful when a prediction is made as a probability (for example, probability of a missed appointment = .3) and the result is a binary outcome (missed appointment or not). Lower values of log-loss are better. Mathematically, log-loss can be written as

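\[ loss = - y \log(p) - (1-y) \log(1-p) \]

where \(y\) is 1 if the patient missed the appointment and 0 otherwise, and \(p\) is the predicted probability of a missed appointment. Your model should show an improved classification error over the model below.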
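To make the formula concrete, here is a small sketch that evaluates log-loss on made-up numbers. The helper avg.log.loss is hypothetical (it is not part of the lab code), and the outcomes and probabilities are invented for illustration only.

```r
# Hypothetical helper (not part of the lab code): average log-loss
# y = 1 for a missed appointment, 0 otherwise; p = predicted probability of a miss
avg.log.loss <- function(y, p) {
  mean(-y * log(p) - (1 - y) * log(1 - p))
}

# One patient with predicted probability .3 of missing
loss.missed <- avg.log.loss(y = 1, p = 0.3)  # patient missed: -log(.3), about 1.20
loss.showed <- avg.log.loss(y = 0, p = 0.3)  # patient showed: -log(.7), about 0.36

# Made-up outcomes for five patients, scored under two sets of predictions
y <- c(0, 0, 1, 0, 0)
loss.half <- avg.log.loss(y, rep(0.5, 5))  # uninformative predictions of .5
loss.rate <- avg.log.loss(y, rep(0.2, 5))  # predicting the observed miss rate of .2
```

The prediction of .3 is penalized more when the patient actually misses the appointment, and the penalty grows without bound as a wrong prediction approaches 0 or 1. In the five-patient example, predicting the observed miss rate gives a lower loss than predicting .5 for everyone.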

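# Read in Data
#    stringsAsFactors = TRUE reads No.show as a factor (not the default in R >= 4.0),
#    which the binomial glm() below accepts as a response
train.appts <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/MedApptsTrain.csv', stringsAsFactors = TRUE)
test.appts <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/MedApptsTest.csv', stringsAsFactors = TRUE)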

# Fit logistic regression with no intercept and no predictors -
#    this model assigns Prob=.5 to both outcomes
glm.simple <- glm(No.show ~ -1, family = binomial, data = train.appts)

# Predicted probability of a missed appointment and observed outcome (1 = missed) on the test set
pred.prob <- predict(glm.simple, test.appts, type = 'response')
missed <- as.numeric(test.appts$No.show == "Yes")

# Calculate Classification Error
#    round(.5) is 0 in R, so this model always predicts that the patient shows up
CE <- mean(missed != round(pred.prob))

# Calculate Log-Loss
log.loss <- mean(-missed * log(pred.prob) - (1 - missed) * log(1 - pred.prob))

The basic model has a classification error of \(0.201\) and a log-loss of \(0.6931472\), which is \(\log(2)\), the loss from predicting a probability of .5 for every patient. See if you can do better, but note that the classification error might not change, because your binary prediction might always be that a person will show up for the appointment.

Q5. (20 points)

Summarize your findings and discuss the accuracy of your predictive approach. Why did you choose this model over alternatives?