36-350, Statistical Computing, Fall 2013

eniac.jennings_and_bilas_working_by_card_sorter_and_printer.c1940s.102649732
Instructors Prof. Cosma Shalizi
TAs Mr. Michael Vespe
Mr. Jerzy Wieczorek
Lecture Mondays and Wednesdays 10:30--11:20, Scaife Hall 219
Labs Fridays, 10:30--11:20, Baker Hall 140 CF
Office hours Mondays, 11:30--12:30, Baker Hall 229C
Wednesdays, 2:00--3:00, FMS 320
Thursdays, 1:00--2:00, Baker Hall 229C
Thursdays, 2:00--3:00, FMS 320
eniac.spence_and_goldstein_working_on_function_table.c1940s.102649731

Description

Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.

Students will learn the core of ideas of programming — functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction — through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.

The class will be taught in the R language.

Pre-requisites

This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential, as is some prior acquaintance with statistical software. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are "Computing at Carnegie Mellon" (or consent of instructor), plus one of either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).

The class may be unbearably redundant for those who already know a lot about programming. The class will be utterly incomprehensible for those who do not know statistics.

Calendar and topics

eniac.addition_operation.c1940s.102649737 Subject to revision.
  1. Data types and data structures (first class meeting is lab on 8/30)
    Lectures 1 and 2 consolidated: Introduction to the class; basic data types; vector and array data structures; matrices and matrix operations; lists; data frames; structures of structures
    Homework assignment 1, due at 11:59 pm on Thursday, 5 September
    Reading for the week: lecture slides; chapters 1 and 2 of Matloff
    Lab 1
  2. Flow control and looping (lecture on 9/4, lab on 9/6)
    Lecture 3, Control: Conditioning the calculation on the data; iteration to repeat similar calculations; avoiding iteration with "vectorized" operations and functions.
    Reading for the week (before Friday lab): lecture slides; chapters 3--5 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
    Lab 2: hw-02.R
    Homework assignment 2, due at 11:59 pm on Thursday, 12 September
  3. Writing and calling functions (9/9, 9/11, lab 9/13)
    Lecture 4, Writing Functions: Functions tie together related commands. Arguments (inputs) and return values (outputs). Named arguments and defaults. Interfaces. R for examples; gmp.dat
    Lecture 5, Multiple functions: Using multiple functions for related tasks; to re-use work; to break big problems down into smaller ones. R for examples
    Homework 3, due at 11:59 pm on Thursday, 19 September
    Reading for the week: sections 1.3, 7.3--7.5, 7.11, 7.13 of Matloff
    Lab 3
  4. More function-writing: top-down design, testing (9/16, 9/18, lab 9/20)
    Lecture 6, Top-down design: Recursively solving problems by writing functions to integrate the work of sub-functions that solve sub-problems. Example with linear regression.
    Lecture 7, Testing: Why we test our code; tests of particular cases vs. cross-checking tests; cycling between testing and programming.
    Reading for the week: sections 7.6 and 7.9 in Matloff
    Homework 4, due at 11:59 pm on Thursday, 26 September
    Lab 4 and R file
  5. Debugging and functions as objects (9/23, 9/25, lab 9/27)
    Lecture 8, Debugging: Debugging as differential diagnosis; characterizing and localizing bugs; common errors; programming now for debugging later.
    Undelivered optional lecture, Scope: Names, scoping rules and environments. Examples, including ones based on homework. R for examples
    Lecture 9, Functions as Objects: In R, functions are objects like everything else, so they can be arguments to other functions (examples: gradient, gradient.descent), and they can be returned by other functions (examples: mathematical operators, predictors, making functions from expressions for plotting surfaces). R code for examples
    Homework 5, due at 11:59 pm on Thursday, 3 October: assignment, R code, hw-05.RData data file.
    Reading for the week: chapter 13 in Matloff
    Lab 5: assignment
  6. Simple optimization and refactoring (9/30, 10/2, lab 10/4)
    Lecture 10, Simple Optimization: Basics from calculus about minima. Taylor series. Gradient descent and Newton's method. Curve-fitting by optimization. Illustrations with optim and nls. R for examples
    Lecture 11, Refactoring: Re-working your code to make it clearer, more meaningful, and more easily fixed and extended.
    Reading: recipes 13.1 and 13.2 in The R Cookbook; sections 14.1--14.3 in Matloff
    Optional reading: I.1, II.1 and II.2 in Red Plenty
    Homework 6, due at 11:59 pm on Tuesday, 8 October (note day!)
    Lab 6
  7. Split/apply/combine (10/7, 10/9); mid-term on 10/11 instead of lab
    Lecture 12, The split, apply, combine pattern, using base R. Design patterns in general. The split/apply/combine pattern: break up a large data set into smaller meaningful pieces; apply the same analysis to each piece; combine the answers. Iteration as painful, clumsy split/apply/combine. Tools for split/apply/combine in basic R: the apply function for arrays, lapply for lists, mapply, etc.; split. Detailed example with a complicated data set: the relation between strikes and parliamentary politics.
    Lecture 13, Split/apply/combine II: using plyr. Abstracting the split/apply/combine pattern: using a single function to appropriately split up the input, apply the function, and combine the results, depending on the type of input and output data. Syntax details. Examples: standardizing measurements for regularly-sampled spatial data; standardizing measurements for irregularly-sampled spatial data; more fun with strikes and left-wing politics.
    No homework
    No lab — go to mid-term! (Same location as labs)
    Exam 1
  8. Simulation (10/14, 10/16, no lab)
    Lecture 14, Simulation I: Random variable generation
    Lecture 15, Simulation II: Markov chains
    Reading: Matloff, chapter 8; R Cookbook, chapter 8; handout: Monte Carlo and Markov Chains
    No lab (mid-semester break)
  9. More Simulation (10/21, 10/23, lab 10/25)
    Lecture 16: Simulation III: Monte Carlo, Markov Chain Monte Carlo
    Lecture 17: Simulation IV: Quantifying uncertainty with simulations
    Reading: Handouts on Monte Carlo and Markov Chains and on Markov Chain Monte Carlo
    Optional reading: Charles Geyer, "Practical Markov Chain Monte Carlo", Statistical Science 7 (1992): 473--483; "One Long Run"; Burn-In is Unnecessary; On the Bogosity of MCMC Diagnostics
    Lab 7, moretti.csv
    Homework 7, due at 11:59 on Thursday, 31 October
  10. Optimization (10/28, 10/30, lab 11/1)
    Lecture 18: Deterministic, Unconstrained Optimization. The trade-off of approximation versus time. Derivative-based optimization. Nelder-Mead/simplex method for derivative-free optimization. Peculiarities of optimizing statistical functionals: don't bother optimizing much within the margin of error; finding that margin.
    Lecture 19: Stochastic and Constrained Optimization. Optimization under constraints; using Lagrange multipliers to turn constrained optimization into unconstrained optimization. The Lagrange method for inequality constraints. Lagrange multipliers as shadow prices. Mathematical programming. Barrier methods for inequality constraints. The correspondence between constrained and penalized optimization. Difficulties of optimizing statistical functions when the data is large. Sampling as an alternative to averaging over the whole data: stochastic gradient descent et al. Simulated annealing to escape local minima.
    Optional reading: Francis Spufford, Red Plenty (cf.); Léon Bottou and Olivier Bosquet, "The Tradeoffs of Large Scale Learning"
    Lab 8: assignment, ckm.csv
    Homework 8, due at 11:59 pm on Thursday, 7 November
  11. Working with character data (11/4, 11/6, lab 11/8)
    Lecture 20: Basics of character manipulation. Characters and strings; strings as parts of larger data structures; extracting and replacing substrings; splitting strings into vectors; combining vectors into strings; tabulating counts of string tokens by string type; why we need flexible string patterns. Example text: al2.txt
    Lecture 21: Regular Expressions: Regular expressions are symbolic representations of patterns of text. The grammar and interpretation of regular expressions. R functions using regular expressions: for splitting strings, for finding matches in strings, for extracting matching text, for substituting according to patterns. Example documents: ANSS.csv.html,alice-bob.txt, ss.html
    Lab 9
    Homework 9, due at 11:59 pm on Thursday, 14 November Cancelled
    Readings: Matloff, chapter 11; R Cookbook, chapter 7; Spector, chapter 7
    Optional readings: Bradnam and Korf, sections 4.26--4.28, 5.3, 6.1
    Final project options and instructions. Final project preferences due by 11:59 pm on Thursday, 14 November.
  12. Regular expressions and web scraping (11/11, 11/13, lab 11/15)
    Lecture 22: Importing Data from Web Pages
    Lecture 23: Review on text processing
    Readings: Spector, chapter 7
    Optional readings: Bradnam and Korf, 4.26--4.28, 5.3, 6.1
    Lab 10: assignment, rich-1.html, rich-2.html, rich-3.html, rich-4.html
    Homework 10, due at 11:59 pm on Thursday, 21 November: assignment, hw-10.R
  13. Reshaping data and database access (lecture 11/18, 11/20, lab 11/22)
    Lecture 24: Change of representation: dealing with text as vectors
    Lecture 25: Databases: tables, fields, records; queries; the structured query language (SQL); SELECT; JOIN; accessing through R. Example database file (30MB): baseball.db
    Readings: Spector, chapter 3
    Final project selection
    Lab 11
    Homework 11, due at 11:59 pm on Tuesday, 26 November (note date)
  14. Simulation encore (11/25, no lab)
    Lecture 26: Simulation IV: Matching simulation models to data
    No lab
    No new homework
  15. Conclusion (12/2, 12/4, 12/7)
    Lecture 27: Computational complexity; going beyond R
    Lecture 28: Computing for statistics
    No lab 12/7: work on your projects
    No homework: work on your projects
    Readings: Spector, TBD; Matloff, TBD
    Optional readings: Spufford, Red Plenty; Bradnam and Karf, Unix and Perl to the Rescue; Chambers
  16. Final projects
    Final presentations will be held during our final exam period, 5:30--8:30 pm on Monday, 9 December, in Scaife Hall 219 (our usual classroom). Attendance for the whole period is mandatory. Make sure your presentation is turned in, as a PDF file, by 5 pm on that day.
    Final reports, and working code, are due by 5 pm on Friday, 13 December 2013. Remember to include your estimate of the share of the work done by each member of your team, and the grade you think each of your team-mates should get.

Course Mechanics and Grading

There will be two lectures every week (with exceptions only for holidays), and a weekly in-class lab. There will also be homework nearly every week, an in-class mid-term in place of one of the labs, and a final group project. Grades will be calculated as follows:

Homework

There will be a homework assignment nearly every week. The lowest three homework grades will be dropped; no credit will be given for late work under any circumstances. No extensions will be given. If you cannot finish by the deadline, turn in what you have done for partial credit.

You will be required to do programming in R. You must have reliable access to a computer running a reasonably up to date version of R. If this is a problem, contact the instructors as soon as possible.

All homework must be turned in electronically as plain text files (no Word, no PDF, etc.), with file names clearly indicating your Andrew ID and the assignment number. No credit will be given for homework with the wrong format. For programming assignments, your code must be ready to run. Code which does not run may or may not be given partial credit, at our discretion; the more we have to work to figure out why your code does not run, the less credit you will get.

Final Project

Students will be assigned to small groups to work on a final project. You will select project topics from a list provided by the instructor. (Multiple groups can take on the same project.) Each group will cooperate on writing code, documenting it, writing a report, and making a presentation on the project in during the final exam period.

Textbooks

There are three required books:
  1. Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
  2. Phil Spector, Data Manipulation with R
  3. Paul Teetor, The R Cookbook
The first two will serve as our textbooks; the third is an extremely valuable reference work. You will need all three.

Four other books are optional but recommended:

All the books should be available at the university book store, and of course from online stores.

Some R Resources

R is a free, open-source software package/programming language for statistical computing. There are many online resources for learning about it and working with it, in addition to the textbooks:

The website Software Carpentry is not specifically R related, but contains a lot of valuable advice and information on scientific programming.

RStudio is an "integrated development environment" (IDE) for R. Using it is not required, but strongly recommended, because it gives everyone, on all computers, a common working environment.

Physically Disabled and Learning Disabled Students

Jean Jennings Bartik and the ENIAC The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

Collaboration, Copying and Plagiarism

You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both writing and code. Copying from other students, from books, from websites, or from solutions for previous versions of the class, (1) does nothing to help you learn how to program, (2) is easy for us to detect, and (3) has serious negative consequences for you, as outlined in the university's policy on cheating and plagiarism. If, after reading the policy, you are unclear on what is acceptable, please ask an instructor.

The Old 36-350

If you came to this page by a search engine, you may be looking for the data-mining class which used to be numbered 36-350. It is now 36-462, and is taught in the spring semester.

You might also be looking for the 2011 or 2012 iterations of this class.