36-350, Statistical Computing, Fall 2013
|
Instructors | Prof. Cosma Shalizi |
TAs | Mr. Michael Vespe |
| Mr. Jerzy Wieczorek |
Lecture | Mondays and Wednesdays 10:30--11:20, Scaife Hall 219 |
Labs | Fridays, 10:30--11:20, Baker Hall 140 CF |
Office hours | Mondays, 11:30--12:30, Baker Hall 229C |
| Wednesdays, 2:00--3:00, FMS 320 |
| Thursdays, 1:00--2:00, Baker Hall 229C |
| Thursdays, 2:00--3:00, FMS 320 |
|
|
Description
Computational data analysis is an essential part of modern statistics.
Competent statisticians must not just be able to run existing programs, but to
understand the principles on which they work. They must also be able to read,
modify and write code, so that they can assemble the computational tools needed
to solve their data-analysis problems, rather than distorting problems to fit
tools provided by others. This class is an introduction to programming,
targeted at statistics majors with minimal programming knowledge, which will
give them the skills to grasp how statistical software works, tweak it to suit
their needs, recombine existing pieces of code, and when needed create their
own programs.
Students will learn the core of ideas of programming — functions,
objects, data structures, flow control, input and output, debugging, logical
design and abstraction — through writing code to assist in numerical and
graphical statistical analyses. Students will in particular learn how to write
maintainable code, and to test code for correctness. They will then learn how
to set up stochastic simulations, how to parallelize data analyses, how to
employ numerical optimization algorithms and diagnose their limitations, and
how to work with and filter large data sets. Since code is also an important
form of communication among scientists, students will learn how to comment and
organize code.
The class will be taught in the R
language.
Pre-requisites
This is an introduction to programming for statistics students. Prior exposure
to statistical thinking, to data analysis, and to basic probability concepts is
essential, as is some prior acquaintance with statistical software. Previous
programming experience is not assumed, but familiarity with the
computing system is. Formally, the pre-requisites are "Computing at Carnegie
Mellon" (or consent of instructor), plus one of either 36-202 or 36-208, with
36-225 as either a pre-requisite (preferable) or co-requisite (if need be).
The class may be unbearably redundant for those who already know a
lot about programming. The class will be utterly incomprehensible for
those who do not know statistics.
Calendar and topics
Subject to revision.
- Data types and data structures (first class meeting is lab on 8/30)
Lectures 1 and 2 consolidated: Introduction to the
class; basic data types; vector and array data structures; matrices and matrix
operations; lists; data frames; structures of structures
Homework assignment 1, due at 11:59 pm on Thursday, 5 September
Reading for the week: lecture slides; chapters 1 and 2 of Matloff
Lab 1
- Flow control and looping (lecture on 9/4, lab on 9/6)
Lecture 3, Control:
Conditioning the calculation on the data; iteration to repeat similar
calculations; avoiding iteration with "vectorized" operations and functions.
Reading for the week (before Friday lab): lecture slides; chapters 3--5 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
Lab 2: hw-02.R
Homework assignment 2, due at 11:59 pm on Thursday, 12 September
- Writing and calling functions (9/9, 9/11, lab 9/13)
Lecture 4, Writing Functions:
Functions tie together related commands. Arguments (inputs) and return values
(outputs). Named arguments and defaults.
Interfaces. R for
examples; gmp.dat
Lecture 5, Multiple
functions: Using multiple functions for related tasks; to re-use work; to
break big problems down into smaller ones. R
for examples
Homework 3, due at 11:59 pm on Thursday, 19 September
Reading for the week: sections 1.3, 7.3--7.5, 7.11, 7.13 of Matloff
Lab 3
- More function-writing: top-down design, testing (9/16, 9/18, lab 9/20)
Lecture 6, Top-down design:
Recursively solving problems by writing functions to integrate the work of
sub-functions that solve sub-problems. Example with linear regression.
Lecture 7, Testing: Why we
test our code; tests of particular cases vs. cross-checking tests; cycling
between testing and programming.
Reading for the week: sections 7.6 and 7.9 in Matloff
Homework 4, due at 11:59 pm on Thursday, 26 September
Lab 4 and R file
- Debugging and functions as objects (9/23, 9/25, lab 9/27)
Lecture 8, Debugging:
Debugging as differential diagnosis; characterizing and localizing bugs; common
errors; programming now for debugging later.
Undelivered optional
lecture, Scope: Names, scoping rules and environments. Examples,
including ones based on homework. R for examples
Lecture 9, Functions as
Objects: In R, functions are objects like everything else, so they can be
arguments to other functions
(examples: gradient, gradient.descent), and they can be
returned by other functions (examples: mathematical operators, predictors,
making functions from expressions for plotting surfaces). R code for examples
Homework 5, due at 11:59 pm on Thursday, 3 October: assignment, R code, hw-05.RData data file.
Reading for the week: chapter 13 in Matloff
Lab 5: assignment
- Simple optimization and refactoring (9/30, 10/2, lab 10/4)
Lecture 10, Simple
Optimization: Basics from calculus about minima. Taylor series. Gradient
descent and Newton's method. Curve-fitting by optimization. Illustrations
with optim and nls. R for examples
Lecture 11, Refactoring:
Re-working your code to make it clearer, more meaningful, and more easily
fixed and extended.
Reading: recipes 13.1 and 13.2 in The R Cookbook;
sections 14.1--14.3 in Matloff
Optional reading: I.1, II.1 and II.2 in Red Plenty
Homework 6, due at 11:59 pm on Tuesday, 8 October (note day!)
Lab 6
- Split/apply/combine (10/7, 10/9); mid-term on 10/11 instead of lab
Lecture 12, The split, apply, combine pattern, using base R. Design
patterns in general. The split/apply/combine pattern: break up a large data
set into smaller meaningful pieces; apply the same analysis to each piece;
combine the answers. Iteration as painful, clumsy split/apply/combine. Tools
for split/apply/combine in basic R: the apply function for
arrays, lapply for lists, mapply, etc.; split.
Detailed example with a complicated data set: the relation between strikes and
parliamentary politics.
Lecture 13, Split/apply/combine II: using plyr. Abstracting
the split/apply/combine pattern: using a single function to appropriately split
up the input, apply the function, and combine the results, depending on the
type of input and output data. Syntax details. Examples: standardizing
measurements for regularly-sampled spatial data; standardizing measurements for
irregularly-sampled spatial data; more fun with strikes and left-wing politics.
No homework
No lab — go to mid-term! (Same location as labs)
Exam 1
- Simulation (10/14, 10/16, no lab)
Lecture 14, Simulation I: Random variable generation
Lecture 15, Simulation II: Markov chains
Reading: Matloff, chapter 8; R Cookbook, chapter 8;
handout: Monte Carlo and Markov Chains
No lab (mid-semester break)
- More Simulation (10/21, 10/23, lab 10/25)
Lecture 16: Simulation III: Monte Carlo, Markov Chain Monte Carlo
Lecture 17: Simulation IV: Quantifying uncertainty with simulations
Reading: Handouts on Monte Carlo and Markov Chains and on Markov Chain Monte Carlo
Optional reading: Charles Geyer, "Practical Markov Chain Monte Carlo",
Statistical
Science 7 (1992):
473--483; "One
Long
Run"; Burn-In is
Unnecessary; On
the Bogosity of MCMC Diagnostics
Lab 7, moretti.csv
Homework 7, due at 11:59 on Thursday, 31 October
- Optimization (10/28, 10/30, lab 11/1)
Lecture 18: Deterministic,
Unconstrained Optimization. The trade-off of approximation versus time.
Derivative-based optimization. Nelder-Mead/simplex method for derivative-free
optimization. Peculiarities of optimizing statistical functionals: don't bother
optimizing much within the margin of error; finding that margin.
Lecture 19: Stochastic and
Constrained Optimization. Optimization under constraints; using Lagrange
multipliers to turn constrained optimization into unconstrained optimization.
The Lagrange method for inequality constraints. Lagrange multipliers as shadow
prices. Mathematical programming. Barrier methods for inequality constraints.
The correspondence between constrained and penalized optimization.
Difficulties of optimizing statistical functions when the data is large.
Sampling as an alternative to averaging over the whole data: stochastic
gradient descent et al. Simulated annealing to escape local minima.
Optional reading: Francis
Spufford, Red
Plenty
(cf.);
Léon Bottou and Olivier
Bosquet, "The
Tradeoffs of Large Scale Learning"
Lab 8: assignment, ckm.csv
Homework 8, due at 11:59 pm on Thursday, 7 November
- Working with character data (11/4, 11/6, lab 11/8)
Lecture 20: Basics of
character manipulation. Characters and strings; strings as parts of larger
data structures; extracting and replacing substrings; splitting strings into
vectors; combining vectors into strings; tabulating counts of string tokens by
string type; why we need flexible string patterns. Example text: al2.txt
Lecture 21: Regular
Expressions: Regular expressions are symbolic representations of patterns
of text. The grammar and interpretation of regular expressions. R functions
using regular expressions: for splitting strings, for finding matches in
strings, for extracting matching text, for substituting according to patterns.
Example
documents: ANSS.csv.html,alice-bob.txt,
ss.html
Lab 9
Homework 9, due at 11:59 pm on Thursday, 14 November Cancelled
Readings: Matloff, chapter 11; R Cookbook, chapter 7; Spector, chapter 7
Optional readings: Bradnam and Korf, sections 4.26--4.28, 5.3, 6.1
Final project options and
instructions. Final project preferences due by 11:59 pm on Thursday, 14
November.
- Regular expressions and web scraping (11/11, 11/13, lab 11/15)
Lecture 22: Importing Data from Web Pages
Lecture 23: Review on text processing
Readings: Spector, chapter 7
Optional readings: Bradnam and Korf, 4.26--4.28, 5.3, 6.1
Lab 10: assignment,
rich-1.html,
rich-2.html,
rich-3.html,
rich-4.html
Homework 10, due at 11:59 pm on Thursday, 21 November:
assignment, hw-10.R
- Reshaping data and database access (lecture 11/18, 11/20, lab 11/22)
Lecture 24: Change of representation: dealing with text as vectors
Lecture 25: Databases:
tables, fields, records; queries; the structured query language (SQL); SELECT;
JOIN; accessing through R. Example database file (30MB): baseball.db
Readings: Spector, chapter 3
Final project selection
Lab 11
Homework 11, due at 11:59 pm on Tuesday, 26 November (note date)
- Simulation encore (11/25, no lab)
Lecture 26: Simulation IV: Matching simulation models to data
No lab
No new homework
- Conclusion (12/2, 12/4, 12/7)
Lecture 27: Computational complexity; going beyond R
Lecture 28: Computing for statistics
No lab 12/7: work on your projects
No homework: work on your projects
Readings: Spector, TBD; Matloff, TBD
Optional readings: Spufford, Red Plenty; Bradnam and Karf, Unix and Perl to the Rescue; Chambers
- Final projects
Final presentations will be held during our final exam period, 5:30--8:30 pm on Monday, 9 December, in Scaife Hall 219 (our usual classroom). Attendance for the whole period is mandatory. Make sure your presentation is turned in, as a PDF file, by 5 pm on that day.
Final reports, and working code, are due by 5 pm on Friday, 13
December 2013. Remember to include your estimate of the share of the work
done by each member of your team, and the grade you think each of your team-mates should get.
Course Mechanics and Grading
There will be two lectures every week (with exceptions only for holidays), and
a weekly in-class lab. There will also be homework nearly every week, an
in-class mid-term in place of one of the labs, and a final group project.
Grades will be calculated as follows:
- Labs: 10%
- Homework: 30%
- Midterm: 20%
- Final project: 40%
Homework
There will be a homework assignment nearly every week. The lowest three
homework grades will be dropped; no credit will be given for late work under
any circumstances. No extensions will be given. If you cannot finish by the
deadline, turn in what you have done for partial credit.
You will be required to do programming in R. You must have reliable access
to a computer running a reasonably up to date version of R. If this is a
problem, contact the instructors as soon as possible.
All homework must be turned in electronically as plain text files (no Word,
no PDF, etc.), with file names clearly indicating your Andrew ID and the
assignment number. No credit will be given for homework with the wrong format.
For programming assignments, your code must be ready to run. Code which does
not run may or may not be given partial credit, at our discretion; the more we
have to work to figure out why your code does not run, the less credit you will
get.
Final Project
Students will be assigned to small groups to work on a final project. You
will select project topics from a list provided by the instructor. (Multiple
groups can take on the same project.) Each group will cooperate on writing
code, documenting it, writing a report, and making a presentation on the
project in during the final exam period.
Textbooks
There are three required books:
- Norman Matloff,
The Art of R Programming: A Tour
of Statistical Software Design
- Phil Spector, Data Manipulation with R
- Paul Teetor, The R
Cookbook
The first two will serve as our textbooks; the third is an
extremely valuable reference work. You will need all three.
Four other books are optional but recommended:
All the books should be available at the university book store, and of
course from online stores.
Some R Resources
R is a free, open-source software
package/programming language for statistical computing. There are many online
resources for learning about it and working with it, in addition to the
textbooks:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Google R Style Guide offers some rules for naming, spacing, etc., which are generally good ideas
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
The website Software Carpentry is
not specifically R related, but contains a lot of valuable advice and
information on scientific programming.
RStudio is an "integrated development
environment" (IDE) for R. Using it is not required, but strongly recommended,
because it gives everyone, on all computers, a common working environment.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
Collaboration, Copying and Plagiarism
You are encouraged to discuss course material, including assignments, with
your classmates. All work you turn in, however, must be your own. This
includes both writing and code. Copying from other students, from
books, from websites, or from solutions for previous versions of the class, (1)
does nothing to help you learn how to program, (2) is easy for us to detect,
and (3) has serious negative consequences for you, as outlined in the
university's policy
on cheating and plagiarism. If, after reading the policy, you are unclear
on what is acceptable, please ask an instructor.
The Old 36-350
If you came to this page by a search engine, you may be looking for
the data-mining class which used to be numbered 36-350.
It is now 36-462, and is taught in the spring semester.
You might also be looking for the 2011
or 2012 iterations of this class.