36-350, Statistical Computing, Fall 2011

Instructors: Prof. Cosma Shalizi and Prof. Vincent Vu
TAs: Mr. Devon Shurick, Mr. Darren Hommrighausen, and Mr. F. Spencer Koerner
Lecture: Mondays and Wednesdays, Porter Hall A18B
Labs: Fridays, Wean Hall 5202
Office hours: Mondays, 3–4 pm (Prof. Shalizi); Tuesdays, 2–3 pm (Prof. Vu); Tuesdays, 5:30–7:30 pm (TA, in FMS 320)
This is the 2011 version of the class. You are probably looking for the latest one.

Description

Computational data analysis is an essential part of modern statistics. Competent statisticians must be able not just to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.

Students will learn the core ideas of programming (functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction) through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.

The class will be taught in the R language.

Pre-requisites

This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential, as is some prior acquaintance with statistical software. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are "Computing at Carnegie Mellon" (or consent of instructor), plus either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).

Calendar and topics

Subject to revision.
  1. Data types and data structures (lectures on 8/29, 8/31, lab)
    Lecture 1: Introduction to the class, basic data types, vector and array data structures
    Lecture 2: Matrices and matrix operations; lists; data frames; structures of structures
    Homework assignment 1, due Wednesday, 7 September; solutions
    Lab 1, solutions
  2. Flow control and looping (lecture on 9/7, lab)
    Lecture 3: Conditioning the calculation on the data; iteration to repeat similar calculations; avoiding iteration with "vectorized" operations and functions.
    Homework assignment 2, due Wednesday, 14 September; accompanying R file. Solutions and their R.
    Lab 2, solutions
  3. Writing and calling functions (9/12, 9/14, lab)
    Lecture 4: Declaring functions to tie together related commands. Arguments (inputs) and return values (outputs). Named arguments and defaults. Interfaces. (R code)
    Lecture 5: Using multiple functions for related tasks; to re-use work; to break big problems down into smaller ones. (R)
    Homework 3, due at the start of class on Wednesday, 21 September 2011. Solutions, R
    Lab 3, solutions
  4. More function-writing: top-down design, scoping (9/19, 9/21, lab)
    Lecture 6, Top-down design: recursively solving problems by writing functions to integrate the work of sub-functions that solve sub-problems. Example with linear regression.
    Lecture 7, Scope: Names, scoping rules and environments. Example with the homework.
    Homework 4, due at 11:59 pm on Tuesday, 27 September 2011. Solutions (R)
    Lab 4 (R), solutions (their R)
  5. Debugging and testing (9/26, 9/28, lab)
    Lecture 8, Debugging; characterizing and localizing bugs; common errors; programming for debugging.
    Lecture 9, Testing: purpose of testing; tests of particular cases vs. cross-checking tests; cycling between testing and programming.
    Homework 5 (R), due at 11:59 pm on Tuesday, 4 October 2011. Solutions, R
    Lab 5, solutions (R)
  6. Functions as objects (10/3, 10/5, lab)
    Lecture 10, Functions as arguments; in R, functions are objects like everything else, so they can be arguments to other functions; examples like gradient and gradient.descent. R
    Lecture 11, Functions as values. In R, functions are objects, so they can be returned by other functions. Examples of predictors, mathematical operators, and the creation of functions from expressions for plotting surfaces. R
    Homework 6 (R), due at 11:59 pm on Tuesday, 11 October. Solutions (R)
    Lab 6, solutions
  7. Split/apply/combine (10/10, 10/12); mid-term on 10/14 instead of lab
    Lecture 12, The split, apply, combine pattern I; base R approaches. Example using Masters 2011 Golf Tournament. (R, data)
    Lecture 13: Review and Q&A
    No homework
    No lab — go to mid-term! (Hunt Library Cluster, lower level)
    Midterm: exam and solution
  8. Split/apply/combine, abstraction (10/17, 10/19, no lab)
    Lecture 14, split/apply/combine II. (R, data)
    Lecture 15, abstraction and refactoring (R)
    Homework 7 (data), due at 11:59 pm on Tuesday, 1 November. Solutions (R)
    No lab (mid-semester break)
  9. Simulation (10/24, 10/26, lab)
    Lecture 16, Simulation I: Random variable generation
    Lecture 17: exam de-briefing
    Lab 7
  10. Optimization (10/31, 11/2, lab)
    Lecture 18, simulation II: Monte Carlo and Markov chains
    Lecture 19, simulation III: Mixing and Markov Chain Monte Carlo
    Homework 8, due at 11:59 pm on Thursday, 10 November. Solutions (R)
    Lab 8 (solutions)
  11. Working with character data (11/7, 11/9, lab)
    Lecture 20, Basics of character manipulation
    Lecture 21, Regular Expressions I
    Lab 9 (partial solutions)
    Final project: project descriptions
  12. Regular expressions and web scraping (11/14, 11/16, lab)
    Lecture 22, Regular Expressions II. (R)
    Lecture 23, Importing Data from Web Pages I. (R)
    Homework 9, due at 11:59 pm on Wednesday, 23 November. Solutions (R)
    Lab 10, Importing Data from Web Pages II
  13. More on web scraping (lecture 11/21, no lab)
    Lecture 24, Importing Data from Web Pages II+. (R)
    No homework. Happy Thanksgiving!
    No lab
  14. Databases (11/28, 11/30, no lab)
    Lecture 25, Databases I: Overview and Intro to SQL
    Lecture 26, Databases II: Intro to SQL, joining tables, accessing DBs from R
    Homework 10 (data), due at 11:59 pm on Friday, 9 December. Solutions (R)
    Lab 11 (CANCELLED)
  15. Presentations (12/5, 12/7, 12/9); attendance is mandatory this week
    Presentations
    Presentations
    Presentations --- Note that the location for this day is Wean 5415!

Course Mechanics and Grading

There will be two lectures every week (with exceptions only for holidays), and a weekly in-class lab. There will also be homework nearly every week, an in-class mid-term in place of one of the labs, and a final exam. Grades will be calculated as follows:

Homework

There will be a homework assignment nearly every week. The lowest three homework grades will be dropped; no credit will be given for late work under any circumstances.

You will be required to do programming in R. You must have reliable access to a computer running a reasonably up-to-date version of R. If this is a problem, contact the instructors as soon as possible.

All homework must be turned in electronically as plain text files (no Word, no PDF, etc.), with file names clearly indicating your Andrew ID and the assignment number. No credit will be given for homework with the wrong format. For programming assignments, your code must be ready to run. Code which does not run may or may not be given partial credit, at our discretion; the more we have to work to figure out why your code does not run, the less credit you will get.

Textbooks

The two required books are W. John Braun and Duncan J. Murdoch, A First Course in Statistical Programming with R, and Paul Teetor, The R Cookbook. The first will serve as our textbook; the second is an extremely valuable reference work. We expect you to have both. A third book, John M. Chambers, Software for Data Analysis: Programming with R, is optional. All three should be available at the university book store, and of course online.

Some R Resources

R is a free, open-source software package/programming language for statistical computing. There are many online resources for learning about it and working with it, in addition to the textbooks:

The website Software Carpentry is not specifically R-related, but contains a lot of valuable advice and information on scientific programming.

Physically Disabled and Learning Disabled Students

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

Collaboration, Copying and Plagiarism

You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both writing and code. Copying from other students (1) does nothing to help you learn how to program, (2) is easy for us to detect, and (3) has serious negative consequences for you, as outlined in the university's policy on cheating and plagiarism. If, after reading the policy, you are unclear on what is acceptable, please ask an instructor.

The Old 36-350

If you came to this page through a search engine, you may be looking for the data-mining class which used to be numbered 36-350. It is now 36-462, and is taught in the spring semester.