---
title: |
    | STAT 408 - Week 5:
    | R Miscellanea and Debugging R Code
date: "February 8, 2018"
output:
  beamer_presentation:
    theme: "PaloAlto"
    fonttheme: "structuresmallcapsserif"
---

```{r setup, include=FALSE}
library(knitr)
library(formatR)
library(XML)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before)
    return(options$size)
})
```

## Course Goals
With this class, we cannot cover every possible situation that you will encounter. The goals are to:

1. Give you a broad range of tools that can be employed to manipulate, visualize, and analyze data, and
2. Teach you to find help when you or your code "gets stuck".

# R Miscellanea - more tools

## Lists
We have used lists a little, but it is worth talking about them in more detail. Here are some questions to get started.

- Where have lists shown up in this course?
- How do we typically index elements in a list?
- What other functions have we used to manipulate lists?

## Exercise: Lists
Consider the two lists and write out what R prints.
```{r, eval=F, mysize=TRUE, size='\\tiny'}
msu.info <- list( name = c('Waded Cruzado','Andy Hoegh'),
          degree.from = c('University of Texas at Arlington','Virginia Tech'),
          job.title = c('President', 'Assistant Professor of Statistics'))
msu.info
msu.info2 <- list(c('Waded Cruzado','University of Texas at Arlington', 'President'),
      c('Andy Hoegh', 'Virginia Tech','Assistant Professor of Statistics'))
msu.info2
```
\normalsize
What do all of those brackets mean?
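
## Lists - reading the brackets
A minimal sketch of how to read printed list output, using a hypothetical list `toy.list` that is not part of the course data. A label such as `$letters` or `[[2]]` identifies a list element, while `[1]` marks the position of the first value printed on that line of the vector stored inside it.

```{r}
# toy.list is a made-up example: one named and one unnamed element
toy.list <- list(letters = c('a', 'b'), c(10, 20, 30))
names(toy.list)        # the unnamed element has an empty name
toy.list[[2]]          # double brackets return the numeric vector itself
class(toy.list[2])     # single brackets return a list of length one
length(toy.list[[2]])  # number of values in the extracted vector
```

Single brackets keep the list wrapper, which is why `toy.list[2]` prints with a `[[1]]` label and `toy.list[[2]]` does not.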

## Solution: Lists
```{r, mysize=TRUE, size='\\tiny'}
msu.info <- list( name = c('Waded Cruzado','Andy Hoegh'),
          degree.from = c('University of Texas at Arlington','Virginia Tech'),
          job.title = c('President', 'Assistant Professor of Statistics'))
msu.info
msu.info2 <- list(c('Waded Cruzado','University of Texas at Arlington', 'President'),
      c('Andy Hoegh', 'Virginia Tech','Assistant Professor of Statistics'))
msu.info2
```

## Lists - indexing
With the current lists we can index elements using the double bracket `[[ ]]` notation or, if names have been set, the `$` operator. For example, the first element of `msu.info` can be extracted either way:
```{r, mysize=TRUE, size='\\tiny'}
msu.info[[1]]
msu.info$name
```

## Exercise: Lists
Explore the indexing with these commands.
```{r, mysize=TRUE, size='\\tiny', eval=F}
msu.info <- list( name = c('Waded Cruzado','Andy Hoegh'),
          degree.from = c('University of Texas at Arlington','Virginia Tech'),
          job.title = c('President', 'Assistant Professor of Statistics'))
msu.info[1]
msu.info[[1]]
msu.info$name[2]
msu.info[1:2]
unlist(msu.info)
```

## Solution: Lists 1
```{r, mysize=TRUE, size='\\tiny'}
msu.info[1]
msu.info[[1]]
msu.info$name[2]
```

## Solution: Lists 2
```{r, mysize=TRUE, size='\\tiny'}
msu.info[1:2]
unlist(msu.info)
```

## Lists - nested lists
```{r, mysize=TRUE, size='\\footnotesize'}
list(list('a','b'),list('c','d'))
```

## Arrays
Arrays generalize matrices to more than two dimensions.
```{r, mysize=TRUE, size='\\tiny'}
array.1 <- array(1:8, dim=c(2,2,2)); array.1
array.1[2,2,1]
```

## Exercise: Arrays
Create an array of dimension 2 x 2 x 3, where each of the three 2 x 2 subarrays (matrices) is the identity matrix.

## Solution: Arrays
Create an array of dimension 2 x 2 x 3, where each of the three 2 x 2 subarrays (matrices) is the identity matrix.
```{r, mysize=TRUE, size='\\tiny'}
# the four values are recycled to fill all three 2 x 2 slices
array(c(1,0,0,1), dim = c(2,2,3))
```

## Merge
Another important skill is merging or combining data sets.
Consider the two data frames below. How can we merge them, and what should the dimensions of the merged data frame be?
```{r, mysize=TRUE, size='\\scriptsize'}
df1 <- data.frame(school = c('MSU','VT','Mines'),
          state= c('MT','VA','CO'), stringsAsFactors = FALSE)
df1
df2 <- data.frame(school = c('Mines','MSU','VT'),
          enrollment = c(5794,15688,30598), stringsAsFactors = FALSE)
df2
```

## sort() and order()
One possibility is to use `sort()` / `order()` as a first step, so that both data frames have the same row order.
```{r, mysize=TRUE, size='\\tiny'}
order(df1$school)
order(df2$school)
df1 <- df1[order(df1$school),]
df1
df2 <- df2[order(df2$school),]
df2
```

## rbind() and cbind()
Now, given that the data frames are both sorted the same way, we can bind the columns together with `cbind()`.
```{r}
comb.df <- cbind(df1,df2)
comb.df
# drop the duplicated school column
comb.df <- comb.df[,-3]
```

## rbind() and cbind()
Now assume we want to add another school to the data frame.
```{r, error=TRUE}
new.school <- c('Luther', 'IA',2337)
rbind(comb.df, new.school)
```
Note: if your strings are saved as factors, this chunk of code will produce a warning and `NA` values, because the new values are not existing factor levels.

## join()
We could have also used some of the more advanced merge (join) features from dplyr.
```{r}
library(dplyr)
new.df <- full_join(df1,df2, by='school')
new.df
```

## Exercise: merging
Combine the two data sets.
```{r, mysize=TRUE, size='\\tiny'}
df.cost <- data.frame(
  ski.resort = c('Bridger Bowl', 'Big Sky', 'Steamboat', 'Jackson'),
  ticket.cost = c(60, 'depends',145, 130))
df.acres <- data.frame(
  ski.hill = c('Bridger Bowl', 'Jackson', 'Steamboat', 'Big Sky'),
  skiable.acres = c(2000, "2500+",2965, 5800))
```

## Solution: merging
Combine the two data sets.
```{r, mysize=TRUE, size='\\tiny'}
df.cost <- data.frame(
  ski.resort = c('Bridger Bowl', 'Big Sky', 'Steamboat', 'Jackson'),
  ticket.cost = c(60, 'depends',145, 130))
df.acres <- data.frame(
  ski.hill = c('Bridger Bowl', 'Jackson', 'Steamboat', 'Big Sky'),
  skiable.acres = c(2000, "2500+",2965, 5800))
# the key columns have different names, so map one onto the other
kable(full_join(df.cost, df.acres, by = c('ski.resort' = 'ski.hill')))
```

# Debugging R code

## Process for writing code
When writing code (and conducting statistical analyses), an iterative approach is a good strategy.

1. Test each line of code as you write it and, if necessary, confirm that nested functions are giving the desired results.
2. Start simple and then add more complexity.

## Debugging Overview
> Finding your bug is a process of confirming the many things that you believe are true -- until you find one which is not true.
>
> \- Norm Matloff

## Debugging Guide
We will first focus on debugging when an error or warning is tripped.

1. Realize you have a bug (if there is an error or warning, read the message)
2. Make it repeatable
3. Identify the problematic line (using print statements can be helpful)
4. Fix it and test it (evaluate nested functions if necessary)

## Warnings vs. Errors
R flags a problem by printing a message in two cases: warnings and errors.

- What is the difference between the two?
- Is the R process treated differently for errors and warnings?

## Warnings vs. Errors
- Fatal errors are signaled with `stop()`, which raises an `error` and halts all execution of the code.
- Warnings are generated with `warning()` and display potential problems. Warnings **do not** stop code from executing.
- Messages can also be passed using `message()`, which passes along information without signaling a problem.

## Bugs without warning/error
In other cases, we will have bugs in our code that don't necessarily give a warning or an error.

- How do we identify these bugs?
- How can we exit a case where:
    - R is running and may be stuck?
    - the code won't execute because of mismatched parentheses, braces, or brackets?

Note: operations that produce `NA` values often, but not always, generate a warning message.

## Exercise: Debugging a Warning
Fix the script that determines if each item in a sequence is less than zero.
```{r, error=TRUE, mysize=TRUE, size='\\tiny'}
val.in <- seq(-1,1,by=.25)
if (val.in < 0){
  print(paste(val.in, 'less than 0'))
}
```
Note: in R 4.2.0 and later, an `if()` condition of length greater than one is an error rather than a warning.

## Solution: Debugging a Warning
```{r, mysize=TRUE, size='\\tiny'}
val.in <- seq(-1,1,by=.25)
# ifelse() is vectorized, so every element of val.in is tested
ifelse(val.in < 0,paste(val.in, 'less than 0'),paste(val.in, 'greater than or equal to 0'))
```

## Exercise: Debugging an Error
Identify the issue(s) with this function.
```{r, error=TRUE, eval=FALSE, mysize=TRUE, size='\\footnotesize'}
MergeData <- function(data1, data2, key1, key2){
  # function to merge two data sets
  # Args: data1 - first dataset
  #       data2 - second dataset
  #       key1 - key name in first dataset
  #       key2 - key name in second dataset
  # Returns: merged dataframe if key matches,
  #          otherwise print an error
  if (key1 = key2){
    data.out <- join(data1,data2, by = key1)
    return(dataout)
  } else {
    stop('keys are not the same')
  }
}
```

## Solution: Debugging an Error
### Step 1 - fix '='

```{r, error=TRUE, eval=FALSE, mysize=TRUE, size='\\footnotesize'}
MergeData <- function(data1, data2, key1, key2){
  # function to merge two data sets
  # Args: data1 - first dataset
  #       data2 - second dataset
  #       key1 - key name in first dataset
  #       key2 - key name in second dataset
  # Returns: merged dataframe if key matches,
  #          otherwise print an error
  if (key1 == key2){
    data.out <- join(data1,data2, by = key1)
    return(dataout)
  } else {
    stop('keys are not the same')
  }
}
```

## Solution: Debugging an Error
### Step 2 - load dplyr & use full_join()

```{r, error=TRUE, eval=FALSE, mysize=TRUE, size='\\footnotesize'}
MergeData <- function(data1, data2, key1, key2){
  # function to merge two data sets
  # Args: data1 - first dataset
  #       data2 - second dataset
  #       key1 - key name in first dataset
  #       key2 - key name in second dataset
  # Returns: merged dataframe
  library(dplyr)
  if (key1 == key2){
    data.out <- full_join(data1,data2, by = key1)
    return(dataout)
  } else {
    stop('keys are not the same')
  }
}
MergeData(df1,df2,"school","school")
```

## Solution: Debugging an Error
### Step 3 - correct dataout to data.out

```{r, error=TRUE, mysize=TRUE, size='\\footnotesize'}
MergeData <- function(data1, data2, key1, key2){
  # function to merge two data sets
  # Args: data1 - first dataset
  #       data2 - second dataset
  #       key1 - key name in first dataset
  #       key2 - key name in second dataset
  # Returns: merged dataframe
  library(dplyr)
  if (key1 == key2){
    data.out <- full_join(data1,data2, by = key1)
    return(data.out)
  } else {
    stop('keys are not the same')
  }
}
```

## Solution: Debugging an Error
### Step 3 - correct dataout to data.out

```{r, error=TRUE, mysize=TRUE, size='\\footnotesize'}
MergeData(df1,df2,"school","school")
MergeData(df.cost,df.acres, 'ski.resort','ski.hill')
```

# Advanced Debugging

## Overview
We can often fix bugs using the ideas sketched out previously, and this becomes *easier* with more experience coding in R. Trial and error can be very effective, and strategic use of `print()` helps identify where bugs are occurring.

However, R also has more advanced tools to help with debugging code.

- `traceback()`
- "Rerun with debug"
- `browser()`

## traceback()
Consider the following code:
```{r, error=TRUE, mysize=TRUE, size='\\small'}
f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) "a" + d
f(10)
```

## traceback()
Consider the `traceback()` function.
It identifies the sequence of function calls that led to the error (along with the line number within each function).

```{r,eval=FALSE}
> traceback()
4: i(c) at #1
3: h(b) at #1
2: g(a) at #1
1: f(10)
```

Note: due to the way that R Markdown is compiled, `traceback()` needs to be run directly in R, not R Markdown.

## Browsing on an error
Another option (in RStudio) is to browse on the error. This gives you an interactive way to move through the function calls to identify the location of the problem. The debugger can also be started explicitly by flagging a function with `debug()`.

![](images/debug.png)

## browser()
The `browser()` function can also be used to interactively step through a function.
```{r}
SS <- function(mu, x) {
  browser()
  d <- x - mu
  d2 <- d^2
  ss <- sum(d2)
  ss
}
```

## browser() step 1
![](images/browse1.png)

## browser() step 2
![](images/browse2.png)

## browser() step 3
![](images/browse3.png)

## browser() step 4
![](images/browse4.png)

## browser() step 5
![](images/browse5.png)
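
## browser() without editing the function
The stepping shown above can also be reached without adding `browser()` to the function body. This is a minimal sketch, assuming an interactive session: `debugonce()` flags a function so that the browser opens on its next call only. `SS.clean` is a hypothetical copy of `SS` with the `browser()` line removed; it is not part of the original slides.

```{r}
# SS.clean: hypothetical copy of SS without the browser() call
SS.clean <- function(mu, x) {
  d <- x - mu    # deviations from mu
  d2 <- d^2      # squared deviations
  sum(d2)        # sum of squares
}

# only flag the function when someone is there to step through it
if (interactive()) {
  debugonce(SS.clean)
}

SS.clean(mu = 2, x = 1:3)  # deviations -1, 0, 1 give a sum of squares of 2
```

When the document is knit, the session is not interactive, so the call simply returns its value.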