36-350, Statistical Computing, Fall 2014

Instructors
    Prof. Cosma Shalizi
    Prof. Andrew Thomas
TAs
    Mr. Bryan Hooi
    Mr. Samuel Ventura
Lecture
    Section 1, Mondays and Wednesdays 10:30--11:20, Gates 4102
    Section 2, Mondays and Wednesdays 11:30--12:20, Gates 4102
Labs
    Sections A and B, Fridays, 10:30--11:20, Hunt Library computer labs
    Sections C and D, Fridays, 11:30--12:20, Baker Hall 332P
Office hours
    Monday 9:20--10:20, Wean Hall 8110 (Mr. Hooi)
    Monday 4:30--5:30, Baker Hall 132H (Prof. Thomas)
    Thursday 1:00--2:30, Baker Hall 229A (Prof. Shalizi)
    Friday 3:00--4:00, Wean Hall 8110 (Mr. Ventura)

Description

Computational data analysis is an essential part of modern statistics. Competent statisticians must be able not just to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.

Students will learn the core ideas of programming (functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction) through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.

The class will be taught in the R language.

Pre-requisites

This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are "Computing at Carnegie Mellon" (or consent of instructor), plus one of either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).

The class may be unbearably redundant for those who already know a lot about programming. The class will be utterly incomprehensible for those who do not know statistics.

Course Mechanics and Grading

There will be two lectures every week (with exceptions only for holidays), and a weekly in-class lab. There will also be homework nearly every week, a mid-term programming project, and a final group project. Grades will be calculated as follows:

Final grades are based on demonstrated mastery of the material, not relative standing in the class.

R and RStudio

R is a free, open-source programming language for statistical computing. Almost all of our work in this class will be done using R. You will need regular, reliable access to a computer running an up-to-date version of R. If this is a problem, let the professors know right away.

RStudio is a free, open-source R programming environment. It contains a built-in code editor and many features to make working with R easier, and it works the same way across different operating systems. Use of RStudio is required for the labs, and strongly recommended in general.

Assignment Formatting

All assignments must be turned in electronically, through Blackboard.

All assignments will involve writing a combination of code and actual prose. You must submit your assignment in a format which allows for the combination of the two, and the automatic execution of all your code. The easiest way to do this is to use R Markdown. Exceptions may be made, with prior permission, for those who want to use Sweave or (better) knitr. (If you don't know what those are, plan to use R Markdown.)

Work submitted as Word files, PDFs, unformatted plain text, etc., will receive an automatic grade of 0, without exceptions.

Every file you submit should have a name which includes your Andrew ID, and clearly indicates the type of assignment (homework, lab, etc.) and its number.
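
As a rough illustration of the two requirements above, the sketch below builds a file name of the requested kind and writes a tiny R Markdown file that mixes prose with executable code. The Andrew ID "jdoe", the "hw" label, and the exact naming pattern are hypothetical placeholders, not an official convention for the class. Rendering is attempted only if the rmarkdown package and pandoc happen to be installed.

```r
# Hypothetical values, for illustration only.
andrew_id  <- "jdoe"   # placeholder Andrew ID
assignment <- "hw"     # assumed label: "hw", "lab", etc.
number     <- 1L       # assignment number

# Build a file name carrying the Andrew ID, assignment type, and number.
rmd_name <- sprintf("%s-%s-%02d.Rmd", andrew_id, assignment, number)
rmd_path <- file.path(tempdir(), rmd_name)

# A minimal R Markdown document: header, one line of prose, one code chunk.
fence <- strrep("`", 3)  # three backticks open and close a code chunk
rmd_lines <- c(
  "---",
  "title: \"Homework 1\"",
  "author: \"jdoe\"",
  "output: html_document",
  "---",
  "",
  "The mean of the first ten integers is computed below.",
  "",
  paste0(fence, "{r}"),
  "mean(1:10)",
  fence
)
writeLines(rmd_lines, rmd_path)

# Render to HTML only if rmarkdown and pandoc are both available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

In RStudio, the same file can be built by opening it and using the Knit button instead of calling render by hand.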

Homework

There will be a homework assignment nearly every week. Each homework will be graded out of three points: one point for making a good-faith effort at every part of the assignment; one point for technically-correct, working solutions to each part; and one point for clean, well-formatted, easily readable code.

Due dates: unless otherwise noted in the calendar, all homework is due at 11:59 pm on Monday, the week after it is assigned.

Revision: You are free to revise your homework assignments, after they have been graded, and re-submit them to be re-graded.

Labs

There will be a 50-minute lab period every week on Friday morning. The labs will be short exercises, generally related to that week's homework. Attendance is mandatory.

Pair programming: An important part of programming is collaboration. To help you practice this, the labs will be done through "pair programming". You will be randomly paired with a different partner for each lab, and during the first half of the lab, one of you will do all the actual typing, while the other monitors and comments; during the second half, you will switch roles with your partner.

Midterm Project

In place of an in-class midterm exam, there will be a solo programming project. You will have two weeks to do this project, and will have to submit a write-up containing both your executable code and its results, and an explanation of how you approached the problem and why you chose that approach. As with the homework, grading will give equal weight to completeness, correctness, and comprehensibility.

Final Project

You will be assigned to small groups to work on a final project. You will select project topics from a list provided by the professors. (Multiple groups can take on the same project.) Each group will cooperate on writing code, documenting it, writing a report, and making a presentation on the project during the final exam period.

Peer assessment: One component of your final project grade will be based on your team-mates' assessment of your contribution to the project.

Textbooks

There are three required books:
  1. Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
  2. Phil Spector, Data Manipulation with R
  3. Paul Teetor, The R Cookbook
The first two will serve as our textbooks; the third is an extremely valuable reference work. You will need all three.

Four other books are optional but recommended:

All the books should be available at the university book store, and of course from online stores.

Some R Resources

There are many online resources for learning about R and working with it, in addition to the textbooks:

The website Software Carpentry is not specifically R related, but contains a lot of valuable advice and information on scientific programming.

Physically Disabled and Learning Disabled Students

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

Collaboration, Copying and Plagiarism

You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both writing and code. Copying from other students, from books, from websites, or from solutions for previous versions of the class, (1) does nothing to help you learn how to program, (2) is easy for us to detect, and (3) has serious negative consequences for you, as outlined in the university's policy on cheating and plagiarism. If, after reading the policy, you are unclear on what is acceptable, please ask an instructor.

The Old 36-350

If you came to this page by a search engine, you may be looking for the data-mining class which used to be numbered 36-350. It is now 36-462, and is taught in the spring semester.

You might also be looking for another year's iteration of this class.

Note to Instructors

If you would like to use these materials for your own class, you are welcome to do so, with attributions to the authors (see below) and links to this page. Asking for permission isn't necessary, though letting us know about it is appreciated.
Preferred citation: Shalizi, C. R. and Thomas, A. C. (2014), "Statistical Computing 36-350: Beginning to Advanced Techniques in R", http://www.stat.cmu.edu/~cshalizi/statcomp/14

Calendar and topics

Subject to revision.

  1. Data types and data structures
    Lecture 1 (25 August): Simple data types and structures
    Course mechanics; the R console; basic data types; vectors, our first data structures
    Rpres file for this lecture (the R Markdown file used to build the presentation)
    Printable PDF
    Lecture 2 (27 August): Bigger data structures
    Arrays; matrices and matrix operations; lists; data frames; structures of structures
    Rpres
    Printable PDF
    Lab 1 (29 August)
    R Markdown file for the lab
    Homework: HW 1 assigned; nothing due
    R Markdown file for the assignment
    Reading for the week: chapters 1 and 2 of Matloff
  2. Flow control and looping
    Lecture 3 (Sept. 3): Data Frames and Control
    Data frames for tabular data; conditioning the calculation on the data; iteration to repeat similar calculations; avoiding iteration with "vectorized" operations and functions.
    Rpres file for this lecture
    Printable PDF
    Lab 2 (Sept. 5)
    R Markdown file for the lab
    Homework: HW 1 due; HW 2 assigned
    R Markdown file for the assignment
    Reading for the week: Chapters 3--5 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
  3. Text
    Lecture 4 (Sept. 8): Text basics
    Characters, strings, text data. Extracting and replacing substrings; splitting strings; building strings; counting strings.
    Printable PDF
    Rpres file for the lecture
    Lecture 5 (Sept. 10): Regular expressions
    "Regular expressions" are patterns of strings. Rules for building regular expressions. R functions for finding matches, splitting strings, and substituting according to patterns.
    Handout for the lecture, with additional examples
    Rpres file for the lecture
    Printable PDF
    Lab 3 (Sept. 12)
    rich.html file; R Markdown file for the lab. (Do not include the text of the questions in your write-up.)
    Homework: HW 2 due; HW 3 assigned
    NHLHockeySchedule2.html file for the homework
    Reading for the week: Matloff, chapter 11; R Cookbook, chapter 7; Spector, chapter 7; handout for lecture 5
    Optional readings: Bradnam and Korf, sections 4.26--4.28, 5.3, 6.1
  4. Writing and calling functions
    Lecture 6 (Sept. 15): Writing functions
    Functions tie together related commands. Arguments (inputs) and return values (outputs). Named arguments and defaults. Interfaces.
    gmp.dat file for the example; Rpres file for presentation
    Printable PDF
    Lecture 7 (Sept. 17): Multiple functions
    Using multiple functions for related tasks; to re-use work; to break big problems down into smaller ones.
    Printable PDF
    R Markdown source for the slides
    Lab 4 (Sept. 19)
    R Markdown source for the lab
    Homework: HW 3 due; HW 4 assigned
    gmp-2013.dat data file for the last problem; R Markdown file for the assignment
    Reading for the week: sections 1.3, 7.3--7.5, 7.11, 7.13 of Matloff
  5. Data from elsewhere
    Lecture 8 (Sept. 22): Getting data
    Reading and writing non-R formats. Importing data from the Web. Scraping Web pages.
    Printable PDF version of slides; R Markdown source file for the slides
    Lecture 9 (Sept. 24): Dataframes with Regression Models
    Making dataframes readable. Plotting with dataframes. Basic statistics on dataframes. Fitting linear models with lm; formulas. Fitting generalized linear models with glm.
    R Markdown source file for the slides
    Printable PDF
    Lab 5 (Sept. 26)
    wtid-report.csv
    Homework: HW 4 due; HW 5 assigned
    R Markdown file for the assignment
    Reading for the week: Matloff, chapter 10; Spector, chapter 2 (skipping sections 2.8--2.10)
  6. Fitting and using statistical models
    Lecture 10 (Sept. 29): Random number generation
    Sources of actually (?) random numbers. Pseudo-random number generators; setting the seed. Basic R functions for parametric distributions.
    Printable PDF
    Lecture 11 (Oct. 1): Distributions as models
    Empirical-distribution-related R commands. R functions for parametric distributions: d*, p*, q*, r*. Fitting distributions: method of moments; generalized moments; maximum likelihood. Diagnostics for distributions.
    Printable PDF
    Lab 6 (Oct. 3)
    R Markdown source file
    Homework: HW 5 due; no new homework
    Mid-semester project assigned, due at 11:59 pm on Thursday, 16 October
    midterm.zip (61 MB), archive of IMDB pages for the midterm
    Reading for the week: R Cookbook, chapter 11
  7. Changing My Shape, I Feel Like an Accident
    Lecture 12 (Oct. 6): Transformations
    Selective access to data. Applying the same function to all parts of a data object. Transforming the data to suit the problem. Common numerical transformations. Summarizing subsets of the data. Sorting, and ordering dataframes. Transposition. Merging dataframes. Reshaping dataframes from wide to long or long to wide.
    Printable PDF of lecture; Rpres source file for slides
    Data sets used in examples: fha.csv, ua.txt, snoqualmie.csv
    Lecture 13 (Oct. 8): Debugging
    Debugging as differential diagnosis; characterizing and localizing bugs; common errors; programming now for debugging later. Tests and bugs.
    Printable PDF; Rpres source file for HTML slides
    Lab 7 (Oct. 10)
    Data files for lab: ckm_nodes.csv and ckm_network.dat
    Homework: no new homework due
    Reading for the week: chapters 8 and 9 in Spector (sections 9.3 and 9.7 optional); chapter 13 in Matloff; chapters 5 and 6 in The R Cookbook
    Optional reading: Hadley Wickham, "Reshaping Data with the reshape Package", Journal of Statistical Software 21 (2007): 12
  8. Leet Programming Skillz
    Lecture 14 (Oct. 13): Testing
    Why we test our code; tests of particular cases vs. cross-checking tests; cycling between testing and programming.
    Printable PDF; Rpres source file
    Lecture 15 (Oct. 15): Top-down design
    Recursively solving problems by writing functions to integrate the work of sub-functions that solve sub-problems. Advantages: demands less thought to write or to read; simpler to debug or extend. Re-factoring to make code which wasn't designed this way look like it was. Extended example with the jack-knife.
    Printable PDF; Rpres source file
    No lab (mid-semester break)
    Homework: HW 6 assigned
    hw-06-supplement.R (containing deliberately buggy code); R Markdown file for the assignment
    Mid-semester project due at 11:59 pm on Thursday, 16 October
    Reading: Sections 7.6, 7.9, and 14.1--14.3 in Matloff
    Optional reading: Chambers, TBD
  9. Functions of functions, and optimization
    Lecture 16 (Oct. 20): Functions as objects
    In R, functions are objects like everything else, so they can be arguments to other functions, and they can be returned by other functions. Examples with curve, grad, gradient descent, and writing surface, a 2D counterpart to curve
    Printable PDF, Rpres source
    Lecture 17 (Oct. 22): Simple optimization
    Basics from calculus about minima. Taylor series. Gradient descent and Newton's method. Scaling and big-O notation. Curve-fitting by optimization. Illustrations with optim and nls. Bonus: Nelder-Mead, a.k.a. the simplex method; coordinate descent.
    Printable PDF; Rpres source
    Lab 8 (Oct. 24)
    Homework: HW 6 due, HW 7 assigned
    R Markdown file for the homework
    Reading: Recipes 13.1--13.2 in The R Cookbook
    Optional reading: I.1, II.1 and II.2 in Red Plenty
  10. Optimization will continue while morale improves
    Lecture 18 (Oct. 27): Constrained optimization
    Optimization under constraints; using Lagrange multipliers to turn constrained problems into unconstrained ones. Lagrange multipliers as "shadow prices". Barrier methods for inequality constraints. The correspondence between constrained and penalized optimization ("a fine is a price"). Statistical uses of penalized optimization: ridge, lasso and spline regression as examples.
    Printable PDF, Rpres source
    Lecture 19 (Oct. 29): Stochastic optimization
    Optimization vs. "big data". Sampling as an alternative to using all the data at once: stochastic gradient descent et al. Peculiarities of optimizing statistical functionals: don't bother optimizing much within the margin of error; finding that margin.
    Lab 9 (Oct. 31)
    lab-09.RData
    Homework: HW 7 due
    Reading
    Optional reading: Red Plenty (cf.); Léon Bottou and Olivier Bousquet, "The Tradeoffs of Large Scale Learning"
  11. Split/apply/combine
    Lecture 20 (Nov. 3): The split, apply, combine pattern, using base R
    Design patterns in general. The split/apply/combine pattern: break up a large data set into smaller meaningful pieces; apply the same analysis to each piece; combine the answers. Iteration as painful, clumsy split/apply/combine. Tools for split/apply/combine in basic R: the apply function for arrays, lapply for lists, mapply, etc.; split; aggregate; subset.
    Lecture 21 (Nov. 5): Split/apply/combine, using plyr
    Abstracting the split/apply/combine pattern: using a single command to appropriately split up the input, apply the function, and combine the results, depending on the type of input and output data. Syntax details.
    Lab 10 (Nov. 7)
    debt.csv; R Markdown source file
    Homework: HW 8 assigned, none due
    hw-08.RData, R Markdown file
    Options for final project announced
    Preferences due by 12 November
    Reading for the week: Cookbook, chapter 6; Spector, chapter 8
    Optional: Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software 40 (2011): 1
  12. Databases
    Lecture 22 (Nov. 10): Split/Apply/Combine 3
    The high-level view of what split/apply/combine does. Thinking about how to split the data into pieces: concepts and R syntax. Thinking about the function to apply to each piece: concepts and R syntax. Illustrations.
    (No R Markdown source file for this lecture, because the lecturer couldn't figure out how to do the image manipulation in R Markdown; LaTeX source files available on request.)
    Homework: HW 8 due, HW 9 assigned
    Lecture 23 (Nov. 12): Databases
    Basic concepts of relational databases; how a database is like an R dataframe. The client/server model. The structured query language (SQL) and queries; SELECT and JOIN. R/SQL translations. Accessing databases through R.
    Handout: Databases, and Databases in R
    baseball.db database for examples from lecture and handout (30 MB)
    Printable PDF; RPres source
    Final project preferences due
    Lab 11 (Nov. 14)
    Final project teams announced
    Reading for the week: Spector, chapter 3 (for databases)
  13. Simulation
    Lecture 24 (Nov. 17): Simulation I: Random variable generation and Markov chains
    R Markdown source file
    Homework: HW 9 due, HW 10 assigned
    R Markdown source file
    Lecture 25 (Nov. 19): Simulation II: Monte Carlo and Markov Chain Monte Carlo
    R Markdown source file
    Lab 12 (Nov. 21)
    Reading: Matloff, chapter 8; R Cookbook, chapter 8; handouts
  14. Markov Chain Monte Carlo
    Lecture 26 (Nov. 24): Simulation III: Simulations as Models
    Using simulations to replace probability calculations. Using simulations as statistical models. Live coding demo of a simulation model.
    Simulation code written by section 1 and by section 2
    Printable PDF; Rpres source file
    Homework: HW 10 due; no new homework
    No lab (Thanksgiving break)
    Reading for the week: Handouts
    Optional reading: Charles Geyer, "Practical Markov Chain Monte Carlo", Statistical Science 7 (1992): 473--483; "One Long Run"; Burn-In is Unnecessary; On the Bogosity of MCMC Diagnostics; Andrew Gelman and Donald Rubin, "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science 7 (1992): 457--472
  15. Conclusion of the class
    Lecture 27 (Dec. 1): Beyond R
    Limitations of R. Connecting to other languages and specialized tools.
    Lecture 28 (Dec. 3): Computing for statistics
    No lab: work on your projects
    No homework: work on your projects
    Reading for the week: Matloff, sections 14.3--14.6, 15.1
    Optional readings: Spufford, Red Plenty, TBD; Bradnam and Korf, Unix and Perl to the Rescue, TBD; Chambers, TBD
  16. Final projects
    Final presentations will be held during our final exam period. Attendance for the whole period is mandatory. Submission instructions will be provided closer to the deadline.
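
For a small taste of the kind of code the calendar builds toward, here is a minimal sketch of the split/apply/combine pattern from week 11, using only base R. The data frame and its column names are made up for illustration and are not taken from any course data set. The loop at the end is included only to show the iterative version the lectures contrast with.

```r
# Toy data, invented for this sketch: points earned, by lab section.
scores <- data.frame(
  section = c("A", "A", "B", "B", "B", "C"),
  points  = c(3, 2, 1, 3, 2, 3),
  stringsAsFactors = FALSE
)

# Split: break the data into one piece per section.
pieces <- split(scores$points, scores$section)

# Apply: run the same analysis (here, the mean) on every piece.
piece_means <- lapply(pieces, mean)

# Combine: collapse the list of answers into one named vector.
combined <- unlist(piece_means)

# aggregate() does all three steps in a single call.
agg <- aggregate(points ~ section, data = scores, FUN = mean)

# The same result by explicit iteration, which is clumsier to write.
looped <- numeric(0)
for (s in unique(scores$section)) {
  looped[s] <- mean(scores$points[scores$section == s])
}

print(combined)
```

Swapping mean for another function (median, length, a model fit) changes the analysis without touching the splitting or combining steps, which is the main appeal of the pattern.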