Warning: Programming is Frustrating

  • All statistical analysis requires the use of a computer
  • Computers do exactly what we tell them to do, not what we're thinking they should do
  • Lots of finicky little conventions must be memorized
  • The fun part is getting the computer to do new and beautiful things. There is a reward in the end: computers are fast and accurate.

Example: Data manipulation

  • Input a data file - often in a CSV format
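# Read the CSV into a data frame (adjust the path to where your copy lives)
housing.data <- 
  read.csv('~/Google Drive/Teaching/STAT408/Data/HousingSales.csv')
# Keep only the Nebraska sales, then show the first six rows
ne.housing <- subset(housing.data, State == 'NE')
head(ne.housing)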
##      City State Zip_Code Living_Sq_Ft Closing_Price
## 1 LINCOLN    NE    68524         1458        139000
## 2 LINCOLN    NE    68524         1526        106000
## 3 LINCOLN    NE    68524         2374        123000
## 4 LINCOLN    NE    68524          910        124000
## 5 LINCOLN    NE    68524         1369        110000
## 6 LINCOLN    NE    68524         1316        111000
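
The path above points at one particular machine, so the snippet will not run elsewhere as written. As a minimal sketch of the same subsetting step, the block below builds a small stand-in data frame by hand. The three Lincoln rows are copied from the output above; the Bozeman row is invented purely so that the filter has something to exclude.

```r
# Stand-in for the CSV: three rows from the output above plus one
# hypothetical out-of-state row (the Bozeman values are made up).
housing.data <- data.frame(
  City          = c("LINCOLN", "LINCOLN", "LINCOLN", "BOZEMAN"),
  State         = c("NE", "NE", "NE", "MT"),
  Zip_Code      = c(68524, 68524, 68524, 59715),
  Living_Sq_Ft  = c(1458, 1526, 2374, 1800),
  Closing_Price = c(139000, 106000, 123000, 250000)
)

# subset() keeps only the rows where the condition is TRUE
ne.housing <- subset(housing.data, State == "NE")

# The same filter written with bracket indexing: [rows, columns]
ne.housing2 <- housing.data[housing.data$State == "NE", ]

nrow(ne.housing)   # number of Nebraska rows kept
```

Both forms return a data frame; subset() is shorter to read, while bracket indexing is the more general tool.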

Example: Create Plots

plot(Closing_Price ~ Living_Sq_Ft, data=ne.housing)

Example: Better Plots

library(ggplot2)  # provides qplot(), geom_smooth(), and theme_bw()
qplot(x=Living_Sq_Ft, y=Closing_Price,data = ne.housing) + 
geom_smooth(method='loess') + theme_bw()

Computing Basics

  • Stay organized. Create a folder for STAT 408, subfolders as needed for notes, homework, …
  • We will work with .R, .sas, .Rmd code files
  • Data files: .csv, .txt
  • Know where files reside
  • Need to back up your work: Google Drive, Dropbox, montana.box.com

Programming in General

  • Plan ahead: "top-down" programming
  • Programming is an iterative process
  • Reproducibility: code should make sense a year from now. Avoid programming with graphical interfaces, or save the code they run in the background.
    • We use command line interface
    • Use comments in code to explain what you are doing
    • Include code and comments in one big file
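
A short sketch of what these habits might look like in one script: the plan is written first as comments, then each step is filled in underneath. The data values and the helper name `price_per_sq_ft` are hypothetical, chosen only for illustration.

```r
# Plan (top-down): 1) get data, 2) compute price per square foot, 3) summarize.

# 1) Get data. A real script would call read.csv(); these values are made up.
sales <- data.frame(
  Living_Sq_Ft  = c(1000, 1500, 2000),
  Closing_Price = c(100000, 180000, 220000)
)

# 2) Compute price per square foot with a small, named helper so the
#    intent is still obvious a year from now.
price_per_sq_ft <- function(price, sq_ft) {
  price / sq_ft
}
sales$Price_Per_Sq_Ft <- price_per_sq_ft(sales$Closing_Price, sales$Living_Sq_Ft)

# 3) Summarize.
mean_ppsf <- mean(sales$Price_Per_Sq_Ft)
print(mean_ppsf)
```

Because every step lives in the file, rerunning the script from top to bottom reproduces the result without any clicking.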

R is:

  • A programming environment
  • A way to run stat analyses
  • Built of functions and objects
  • Great at making complex plots (not necessarily easy)
  • A project involving work from hundreds of people
  • Rapidly expanding.
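
To make "functions and objects" concrete, here is a minimal sketch. The prices are the first three closing prices from the housing output shown earlier; the helper `to_thousands` is a hypothetical name introduced only for this example.

```r
# Objects: values stored under a name
prices <- c(139000, 106000, 123000)   # a numeric vector
class(prices)

# Functions: take objects as input and return new objects
avg_price <- mean(prices)

# Functions are objects too, so we can create and store our own
to_thousands <- function(x) x / 1000
to_thousands(avg_price)
```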

R is not:

  • A spreadsheet.
  • A database.
  • A place to enter data from the field directly.
  • A point-and-click environment.
  • A commercial product with professional support staff.

Why learn R?

  • For complex stat analyses.
    • Hierarchical Models
    • Spatial/Temporal correlation
    • Cutting edge stat techniques
  • Complex graphics showing important facets of the data.
  • Reproducibility. Save code to rerun an analysis with new data, or change things and rerun an analysis. The knitr package lets us combine the analysis, writeup, and output into a PDF, HTML, or even Word document.
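
As a rough sketch of the last point, the block below writes a tiny R Markdown file from R and renders it to HTML. It assumes the rmarkdown package (which calls knitr to run the code chunks) and pandoc are installed; if they are not, the render step is simply skipped. The file name and document contents are hypothetical.

```r
# Three backticks open and close a code chunk in R Markdown
fence <- strrep("`", 3)

# A minimal document: header, one line of writeup, one code chunk
rmd_lines <- c(
  "---",
  "title: \"Example analysis\"",
  "output: html_document",
  "---",
  "",
  "Text written here becomes the writeup.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)   # this code is run and its output is included",
  fence
)

# Write the file to a temporary folder (hypothetical file name)
rmd_file <- file.path(tempdir(), "example-analysis.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
}
```

In practice the .Rmd file is usually written by hand in an editor rather than generated from R; building it here just keeps the example in one runnable piece.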

More about this course

Course Objectives

At the completion of this course, students will:

  • become literate in statistical programming using R and SAS,
  • learn to effectively communicate through visual presentations of data, and
  • understand and imitate good programming practices.

Prereqs and Textbooks:

Prerequisite: One of STAT 217Q, STAT 332, STAT 401, or equivalent.

Textbooks (all free or optional):

  • ModernDive: An Introduction to Statistical and Data Sciences via R, by Chester Ismay and Albert Kim. Free at http://moderndive.com
  • R for Data Science, by Hadley Wickham and Garrett Grolemund. Free at http://r4ds.had.co.nz.
  • Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, by Nathan Yau, 2011.
  • Art of R Programming: A Tour of Statistical Software Design, by Norman Matloff, 2011.
  • The Little SAS Book: A Primer, by Lora Delwiche and Susan Slaughter.

Additional Resources

Course Outline

The course will be taught from a partially flipped perspective. Tuesdays will be group labs which focus on implementing the programming concepts covered during the week. Video lectures focused on computing techniques will be watched outside of class.

The course outline is as follows:

  • (5 weeks) R: Intro to R, RStudio, and R Markdown.
  • (6 weeks) Data visualization principles and advanced R: ggplot2 and R Shiny.
  • (4 weeks) SAS: data storage, manipulation, SAS procedures, and SAS macros.

Quizzes

Quizzes will be worth 15% of the final grade.

  • There is no formal attendance policy for this class, but there will be weekly quizzes on Thursdays.
  • There will be no makeups for missed quizzes, but the lowest quiz score will be excluded from final grades.

Homework

Homework will be worth 20% of the final grade.

  • Weekly homework will accompany course material. Some of the computational elements of the course will be presented as video lectures. Homework will typically be qualitative questions or short programming exercises.
  • Homework will be due prior to class on the assigned days. Homework will typically be collected and evaluated online through D2L.

Labs

Labs will be worth 25% of the final grade.

  • Labs will be in-class group assignments conducted every Thursday. The labs will have a large computational element.
  • The labs will be designed to be completed in 75 minutes; however, there may be times that groups need to finish labs outside of class time.

Midterm Exam

The midterm exam will be worth 20% of the final grade.

  • The midterm exam will have two parts: an in-class exam on October 19th and a take-home exam due on October 24th.

Final Exam

The final exam will be worth 20% of the final grade.

  • The final exam will also have two parts, with the take-home portion due no later than the day of the final exam period on December 11.
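
The grade weights above can be combined in a few lines of R. This is only a sketch of the arithmetic: the weights come from the syllabus, while every score below is hypothetical.

```r
# Weights from the syllabus (percent of final grade)
weights <- c(quizzes = 15, homework = 20, labs = 25, midterm = 20, final = 20)
sum(weights)   # check that the weights total 100

# Hypothetical quiz scores (percent); the lowest one is dropped
quiz_scores <- c(80, 90, 50, 100)
quiz_avg <- mean(quiz_scores[-which.min(quiz_scores)])

# Hypothetical averages for the other components (percent)
scores <- c(quizzes = quiz_avg, homework = 85, labs = 90,
            midterm = 80, final = 75)

# Weighted average; indexing by name keeps the two vectors aligned
final_grade <- sum(weights * scores[names(weights)]) / sum(weights)
final_grade
```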

Introductions

I'll expect you to know all of your classmates' names by the end of the year.

  • Name
  • Major/Minor
  • How much experience do you have with R and SAS?
  • Why are you taking this course?
  • What was the best thing about your summer?

Homework for Thursday

Homework #1 is available on the course webpage.

  • Create a folder on your primary computer to store STAT408 materials.

  • Install R and RStudio (videos available on course webpage).

  • Create an R Markdown document and answer the story problem specified on the course webpage. Turn in your HTML output to D2L prior to class on Tuesday.

Quiz 1

The final part of class today will be the first "quiz". It is designed to help me get to know you better and to provide data for later parts of the course.