36-350, Statistical Computing, Fall 2012

eniac.jennings_and_bilas_working_by_card_sorter_and_printer.c1940s.102649732
Instructors Prof. Cosma Shalizi
TAs Mr. Beau Dabbs
Mr. Samuel Ventura
Lecture Mondays and Wednesdays 10:30--11:20, Wean Hall 5415
Labs Fridays, 10:30--11:20, Baker Hall 140 CE
Office hours Wednesdays, 11:30--1:30, Baker Hall 229C
eniac.spence_and_goldstein_working_on_function_table.c1940s.102649731

Description

Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.

Students will learn the core of ideas of programming — functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction — through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.

The class will be taught in the R language.

Pre-requisites

This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential, as is some prior acquaintance with statistical software. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are "Computing at Carnegie Mellon" (or consent of instructor), plus one of either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).

The class may be unbearably redundant for those who already know a lot about programming. The class will be utterly incomprehensible for those who do not know statistics.

Calendar and topics

eniac.addition_operation.c1940s.102649737 Subject to revision.
  1. Data types and data structures (lectures on 8/27, 8/29, lab on 8/31)
    Lecture 1: Introduction to the class, basic data types, vector and array data structures
    Lecture 2: Matrices and matrix operations; lists; data frames; structures of structures
    Homework assignment 1, due at 11:59 pm on Thursday, 6 September
    Reading for the week (before lab): lecture slides; chapters 1 and 2 of Matloff
    Lab 1; solutions
  2. Flow control and looping (lecture on 9/5, lab)
    Lecture 3: Conditioning the calculation on the data; iteration to repeat similar calculations; avoiding iteration with "vectorized" operations and functions.
    Homework assignment 2, due at 11:59 pm on Thursday, 13 September. R for this problem set.
    Reading for the week (by Thursday): lecture slides; chapters 3--5 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
    Lab 2: gmp.dat, lab-02.R
  3. Writing and calling functions (9/10, 9/12, lab)
    Lecture 4: Declaring functions to tie together related commands. Arguments (inputs) and return values (outputs). Named arguments and defaults. Interfaces. R for examples, with comments
    Lecture 5: Using multiple functions for related tasks; to re-use work; to break big problems down into smaller ones. R for examples, with comments
    Homework 3, due at 11:59 pm on Thursday, 20 September
    Reading for the week (by Thursday): sections 1.3, 7.3--7.5, 7.11, 7.13 of Matloff
    Lab 3
  4. More function-writing: top-down design, testing (9/17, 9/19, lab)
    Lecture 6, Top-down design: recursively solving problems by writing functions to integrate the work of sub-functions that solve sub-problems. Example with linear regression.
    Lecture 7, Testing: purpose of testing; tests of particular cases vs. cross-checking tests; cycling between testing and programming.
    Notes on common problems: multiple return values; using arguments and defaults; \texttt{curve}
    Homework 4, due at 11:59 pm on Thursday, 27 September
    Reading for the week: sections 7.6 and 7.9 in Matloff
    Lab 4, lab-04.R
  5. Debugging and scope (9/24, 9/26, lab)
    Lecture 8, Debugging; characterizing and localizing bugs; common errors; programming for debugging.
    Lecture 9, Scope: Names, scoping rules and environments. Examples, including ones based on homework.
    Homework 5 (code, data), due at 11:59 pm on Thursday, 4 October
    Reading for the week: chapter 13 in Matloff
    Lab 5
  6. Functions as Objects (lecture on 10/1, lab)
    Lecture 10, Functions as Objects: in R, functions are objects like everything else, so they can be arguments to other functions (examples: gradient, gradient.descent), and they can be returned by other functions (examples: mathematical operators, predictors, making functions from expressions for plotting surfaces). R for examples
    NO LECTURE on Wednesday 3 October
    Office hours: 10--noon on Thursday, 4 October, Baker Hall 229C
    Homework 6, due at 11:59 pm on Tuesday, 9 October (note day!)
    Lab 6
  7. Split/apply/combine (10/8, 10/10); mid-term on 10/12 instead of lab
    Lecture 11, The split, apply, combine pattern, using base R. The idea of design patterns, and their benefits in clarity and re-use. The split/apply/combine pattern: break up a large data set into smaller meaningful pieces; apply the same analysis to each piece; combine the answers. Iteration as a painful and clumsy way of doing split/apply/combine. Tools for split/apply/combine in basic R: the apply function for arrays, lapply for lists, mapply, etc.; split. Detailed example with a complicated data set: the relation between strikes and parliamentary politics.
    Lecture 12, Split/apply/combine II: using plyr. Abstracting the split/apply/combine pattern: using a single function to appropriately split up the input, apply the function, and combine the results, depending on the type of input and output data. Syntax details. Examples: standardizing measurements for regularly-sampled spatial data; standardizing measurements for irregularly-sampled spatial data; more fun with strikes and left-wing politics.
    No homework
    No lab — go to mid-term! (Location Baker Hall 140C and 140E, by family name, as explained in e-mail)
    Exam 1 (Each student got a random combination of problem 1 or 2; problems 2--5 or 6--8; problem 9 or 11; problem 10 or 12)
  8. Refactoring, Graphics (10/15, 10/17, no lab)
    Lecture 13, Refactoring: Re-working your code to make it clearer, more meaningful, and more easily fixed and extended.
    Lecture 14, Graphics (canceled)
    Homework 7, due at 11:59 pm on Thursday, 25 October (canceled)
    No lab (mid-semester break)
  9. Simulation (10/22, 10/24, lab)
    Lecture 15, Simulation I: Random variable generation
    Lecture 16: Simulation II: Monte Carlo, Markov chains, Markov Chain Monte Carlo
    Reading: Matloff, chapter 8; R Cookbook, chapter 8; handouts on Markov chains and Monte Carlo and on Markov chain Monte Carlo
    Lab 7: moretti.csv data
    Homework 8, due at 11:59 on Thursday, 1 November
  10. Optimization (10/31, 11/2, lab)
    Lecture 17: Deterministic, Unconstrained Optimization. The trade-off of approximation versus time. Basics from calculus about minima. The gradient-descent method, its pros and cons. Taylor series as the motivation for Newton's method; Newton's method as gradient descent with adaptive step-size; pros and cons. Coordinate descent instead of multivariate optimization. Nelder-Mead/simplex method for derivative-free optimization. Peculiarities of optimizing statistical functionals: don't bother optimizing much within the margin of error; asymptotic calculations for said margin. Illustrations with optim.
    Lecture 18: Stochastic and Constrained Optimization. Difficulties of optimizing statistical functions when the data is large. Sampling as an alternative to averaging over the whole data. Stochastic gradient descent and stochastic Newton's method as an application of sampling. Simulated annealing to escape local minima. Constrained optimization: an example of why constraints matter. The method of Lagrange multipliers for equality constraints. Lagrange multipliers as shadow prices, indicating how much a marginal weakening of the constraint would improve the optimum. Inequality constraints and their Lagrange multipliers. Mathematical programming. Barrier methods for inequality constraints. The correspondence between constrained and penalized optimization.
    Optional reading: Francis Spufford, Red Plenty (cf.); Herbert Simon, The Sciences of the Artificial, especially chapters 5 and 8; Léon Bottou and Olivier Bosquet, "The Tradeoffs of Large Scale Learning"
    Lab 8; ckm.csv
    Homework 9, due at 11:59 pm on Thursday, 8 November
  11. Working with character data (11/7, 11/9, lab)
    Lecture 19: Basics of character manipulation. Characters and strings; length of a string vs. length of a character vector; strings as parts of larger data structures; extracting and replacing substrings; splitting strings into vectors; combining vectors into strings; tabulating counts of string tokens by string type; why we need flexible string patterns. Data: al2.txt
    Lecture 20: Regular Expressions I. Files used for examples: al2.txt, ANSS.csv.html, alice-bob.txt, ss.html.
    Lab 9; lab-09.R
    Homework 10, due at 11:59 pm on Thursday, 15 November
    Project Options. Your preferences are due by midnight on Wednesday, 14 November
  12. Regular expressions and web scraping (11/12, 11/14, lab)
    Lecture 21: Regular Expressions II (same notes as lecture 20)
    Lecture 22: Importing Data from Web Pages. Easy impoprts from structured data files. Considerations in scraping from unstructured HTML. Exercises, in building networks of political books.
    Lab 10. Data sources: rich-1.html, rich-2.html, rich-3.html, rich-4.html
    Final project selection
  13. Reshaping data (lecture 11/19, no lab)
    Lecture 23: Reshaping data and changes of representation
    No homework
    No lab
  14. Databases (11/26, 11/28, lab)
    Lecture 24: Guest lecture by Prof. Nugent
    Lecture 25: Databases
    Lab 11
    Homework 11, due at 11:59 pm on Thursday, 29 November
  15. Conclusion (12/3, 12/5, 12/7)
    Lecture 26: Programming for statistics
    No lecture 12/5: work on your projects
    No lab 12/7: work on your projects
    No homework: work on your projects
  16. Presentations: 12/14
    Final presentations will be held during our final exam period, 8:30--11:30 on Friday 14 December. Attendance for the whole period is mandatory.

Course Mechanics and Grading

There will be two lectures every week (with exceptions only for holidays), and a weekly in-class lab. There will also be homework nearly every week, an in-class mid-term in place of one of the labs, and a final group project. Grades will be calculated as follows:

Homework

There will be a homework assignment nearly every week. The lowest three homework grades will be dropped; no credit will be given for late work under any circumstances. No extensions will be given. If you cannot finish by the deadline, turn in what you have done for partial credit.

You will be required to do programming in R. You must have reliable access to a computer running a reasonably up to date version of R. If this is a problem, contact the instructors as soon as possible.

All homework must be turned in electronically as plain text files (no Word, no PDF, etc.), with file names clearly indicating your Andrew ID and the assignment number. No credit will be given for homework with the wrong format. For programming assignments, your code must be ready to run. Code which does not run may or may not be given partial credit, at our discretion; the more we have to work to figure out why your code does not run, the less credit you will get.

Final Project

Students will be assigned to small groups to work on a final project. You will select project topics from a list provided by the instructor. (Multiple groups can take on the same project.) Each group will cooperate on writing code, documenting it, writing a report, and making a presentation on the project in during the final exam period.

Textbooks

The two required books are Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design, and Paul Teetor, The R Cookbook. The first will serve as our textbook; the second is an extremely valuable reference work. You will need both.

Two other books are optional: John M. Chambers, Software for Data Analysis: Programming with R, and Phil Spector, Data Manipulation with R.

All four should be available at the university book store, and of course from online stores.

Some R Resources

R is a free, open-source software package/programming language for statistical computing. There are many online resources for learning about it and working with it, in addition to the textbooks:

The website Software Carpentry is not specifically R related, but contains a lot of valuable advice and information on scientific programming.

RStudio is an "integrated development environment" (IDE) for R. Using it is not required, but strongly recommended, because it gives everyone, on all computers, a common working environment.

Physically Disabled and Learning Disabled Students

Jean Jennings Bartik and the ENIAC The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

Collaboration, Copying and Plagiarism

You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both writing and code. Copying from other students, from books, from websites, or from solutions for previous versions of the class, (1) does nothing to help you learn how to program, (2) is easy for us to detect, and (3) has serious negative consequences for you, as outlined in the university's policy on cheating and plagiarism. If, after reading the policy, you are unclear on what is acceptable, please ask an instructor.

The Old 36-350

If you came to this page by a search engine, you may be looking for the data-mining class which used to be numbered 36-350. It is now 36-462, and is taught in the spring semester.

You might also be looking for another iteration of this class.