36-350, Statistical Computing, Fall 2011
This is the 2011 version of the class. You are probably looking for the latest one.
Description
Computational data analysis is an essential part of modern statistics.
Competent statisticians must be able not just to run existing programs, but to
understand the principles on which they work. They must also be able to read,
modify and write code, so that they can assemble the computational tools needed
to solve their data-analysis problems, rather than distorting problems to fit
tools provided by others. This class is an introduction to programming,
targeted at statistics majors with minimal programming knowledge, which will
give them the skills to grasp how statistical software works, tweak it to suit
their needs, recombine existing pieces of code, and when needed create their
own programs.
Students will learn the core ideas of programming (functions,
objects, data structures, flow control, input and output, debugging, logical
design and abstraction) through writing code to assist in numerical and
graphical statistical analyses. Students will in particular learn how to write
maintainable code, and to test code for correctness. They will then learn how
to set up stochastic simulations, how to parallelize data analyses, how to
employ numerical optimization algorithms and diagnose their limitations, and
how to work with and filter large data sets. Since code is also an important
form of communication among scientists, students will learn how to comment and
organize code.
The class will be taught in the R
language.
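For a flavor of what commented, tested R code looks like, here is a minimal sketch. The function name standard.error and its na.rm argument are hypothetical, chosen only for illustration; they are not taken from the course materials.

```r
# Hypothetical helper (not from the course materials):
# the standard error of the mean of a numeric vector.
# x: a numeric vector; na.rm: should missing values be dropped first?
standard.error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  sd(x) / sqrt(length(x))
}

# Test of a particular case, worked out by hand:
# sd(c(1, 2, 3)) is 1, so the standard error should be 1/sqrt(3)
stopifnot(isTRUE(all.equal(standard.error(c(1, 2, 3)), 1 / sqrt(3))))

# Cross-checking test: dropping an NA should match the NA-free answer
stopifnot(isTRUE(all.equal(standard.error(c(1, NA, 2, 3), na.rm = TRUE),
                           standard.error(c(1, 2, 3)))))
```

The comments state what the function expects and returns, and the stopifnot() calls halt with an error if the function ever stops giving the expected answers.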
Pre-requisites
This is an introduction to programming for statistics students. Prior exposure
to statistical thinking, to data analysis, and to basic probability concepts is
essential, as is some prior acquaintance with statistical software. Previous
programming experience is not assumed, but familiarity with the
computing system is. Formally, the pre-requisites are "Computing at Carnegie
Mellon" (or consent of instructor), plus either 36-202 or 36-208, with
36-225 as either a pre-requisite (preferable) or co-requisite (if need be).
Calendar and topics
Subject to revision.
- Data types and data structures (lectures on 8/29, 8/31, lab)
Lecture 1: Introduction to
the class, basic data types, vector and array data structures
Lecture 2: Matrices and matrix
operations; lists; data frames; structures of structures
Homework assignment 1, due Wednesday, 7
September; solutions
Lab 1,
solutions
- Flow control and looping (lecture on 9/7, lab)
Lecture 3: Conditioning the
calculation on the data; iteration to repeat similar calculations; avoiding
iteration with "vectorized" operations and functions.
Homework assignment 2, due Wednesday, 14
September; accompanying R
file. Solutions
and accompanying R file.
Lab 2,
solutions
- Writing and calling functions (9/12, 9/14, lab)
Lecture 4: Declaring
functions to tie together related commands. Arguments (inputs) and return
values (outputs). Named arguments and defaults. Interfaces.
(R code)
Lecture 5: Using multiple
functions for related tasks; to re-use work; to break big problems down into
smaller ones. (R)
Homework 3, due at the start of class on
Wednesday, 21 September
2011. Solutions, R
Lab 3,
solutions
- More function-writing: top-down design, scoping (9/19, 9/21, lab)
Lecture 6, Top-down design:
recursively solving problems by writing functions to integrate the work of
sub-functions that solve sub-problems. Example with linear regression.
Lecture 7, Scope: Names,
scoping rules and environments. Example with the homework.
Homework 4, due at 11:59 pm on Tuesday,
27 September 2011. Solutions
(R)
Lab 4
(R),
solutions (their R)
- Debugging and testing (9/26, 9/28, lab)
Lecture 8, Debugging:
characterizing and localizing bugs; common errors; programming for debugging.
Lecture 9, Testing: purpose
of testing; tests of particular cases vs. cross-checking tests; cycling between
testing and programming.
Homework 5
(R), due at 11:59 pm on Tuesday, 4 October
2011. Solutions, R
Lab 5, solutions (R)
- Functions as objects (10/3, 10/5, lab)
Lecture 10, Functions as
arguments. In R, functions are objects like everything else, so they can be
arguments to other functions; examples like gradient
and gradient.descent. R
Lecture 11, Functions as
values. In R, functions are objects, so they can be returned by other
functions. Examples of predictors, mathematical operators, and the creation of
functions from expressions for plotting surfaces.
R
Homework 6
(R), due at 11:59 pm on Tuesday, 11 October.
Solutions (R)
Lab
6, solutions
- Split/apply/combine (10/10, 10/12); mid-term on 10/14 instead of lab
Lecture 12, The
split, apply, combine pattern I; base R approaches. Example using Masters 2011 Golf
Tournament. (R,
data)
Lecture 13: Review and Q&A
No homework
No lab — go to mid-term! (Hunt Library Cluster, lower level)
Midterm: exam and solution
- Split/apply/combine, abstraction (10/17, 10/19, no lab)
Lecture 14, split/apply/combine II. (R,
data)
Lecture 15, abstraction and refactoring (R)
Homework 7 (data), due at 11:59 pm on Tuesday, 1 November.
Solutions (R)
No lab (mid-semester break)
- Simulation (10/24, 10/26, lab)
Lecture 16, Simulation I: Random variable generation
Lecture 17: exam de-briefing
Lab 7
- Optimization (10/31, 11/2, lab)
Lecture 18, simulation II:
Monte Carlo and Markov chains
Lecture 19, simulation III: Mixing and Markov Chain Monte Carlo
Homework 8, due at 11:59 pm on Thursday, 10 November. Solutions (R)
Lab 8 (solutions)
- Working with character data (11/7, 11/9, lab)
Lecture 20, Basics of character manipulation
Lecture 21, Regular Expressions I
Lab 9 (partial solutions)
Final project: project descriptions
- Regular expressions and web scraping (11/14, 11/16, lab)
Lecture 22, Regular Expressions II. (R)
Lecture 23, Importing Data from Web Pages I. (R)
Homework 9, due at 11:59 pm on Wednesday, 23 November.
Solutions (R)
Lab 10, Importing Data from Web Pages II
- More on web scraping (lecture 11/21, no lab)
Lecture 24, Importing Data from Web Pages II+. (R)
No homework. Happy Thanksgiving!
No lab
- Databases (11/28, 11/30, no lab)
Lecture 25, Databases I: Overview and Intro to SQL
Lecture 26, Databases II: Intro to SQL, joining tables, accessing DBs from R
Homework 10 (data), due at 11:59 pm on Friday, 9 December.
Solutions (R)
Lab 11 (CANCELLED)
- Presentations (12/5, 12/7, 12/9) -- attendance is mandatory this week
Presentations
Presentations
Presentations --- Note that the location for this day is Wean 5415!
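Two of the calendar topics above, avoiding iteration with "vectorized" operations (Lecture 3) and the split, apply, combine pattern in base R (Lecture 12), can be sketched briefly. This is only an illustrative sketch: the data frame scores and all variable names are made up here and do not come from the lecture materials or the golf data set.

```r
# Made-up data for illustration: scores for players in two groups
scores <- data.frame(group = c("a", "a", "b", "b", "b"),
                     score = c(70, 72, 68, 75, 71))

# Iteration: accumulate squared deviations from the mean, one at a time
total <- 0
for (s in scores$score) {
  total <- total + (s - mean(scores$score))^2
}

# Vectorized: the same calculation as one expression, with no explicit loop
total.vectorized <- sum((scores$score - mean(scores$score))^2)
stopifnot(isTRUE(all.equal(total, total.vectorized)))

# Split/apply/combine in base R: split the scores by group,
# apply mean() to each piece, and combine the results into a named vector
group.means <- sapply(split(scores$score, scores$group), mean)

# tapply() carries out the same three steps in a single call
group.means.tapply <- tapply(scores$score, scores$group, mean)
stopifnot(isTRUE(all.equal(unname(group.means),
                           as.vector(group.means.tapply))))
```

The vectorized form is usually both shorter and faster in R, and split() followed by sapply() makes each of the three steps of the pattern visible.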
Course Mechanics and Grading
There will be two lectures every week (with exceptions only for holidays), and
a weekly in-class lab. There will also be homework nearly every week, an
in-class mid-term in place of one of the labs, and a final exam. Grades will
be calculated as follows:
- Labs: 10%
- Homework: 30%
- Midterm: 20%
- Final exam: 40%
Homework
There will be a homework assignment nearly every week. The lowest three
homework grades will be dropped; no credit will be given for late work under
any circumstances.
You will be required to do programming in R. You must have reliable access
to a computer running a reasonably up-to-date version of R. If this is a
problem, contact the instructors as soon as possible.
All homework must be turned in electronically as plain text files (no Word,
no PDF, etc.), with file names clearly indicating your Andrew ID and the
assignment number. No credit will be given for homework with the wrong format.
For programming assignments, your code must be ready to run. Code which does
not run may or may not be given partial credit, at our discretion; the more we
have to work to figure out why your code does not run, the less credit you will
get.
Textbooks
The two required books are W. John Braun and Duncan
J. Murdoch, A First Course
in Statistical Programming with R, and Paul
Teetor, The R
Cookbook. The first will serve as our textbook; the second is an
extremely valuable reference work. We expect you to have both. A third book,
John
M. Chambers, Software
for Data Analysis: Programming with R, is optional.
All three should be available at the university book store, and of course
online.
Some R Resources
R is a free, open-source software
package/programming language for statistical computing. There are many online
resources for learning about it and working with it, in addition to the
textbooks:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
The website Software Carpentry is
not specifically R-related, but contains a lot of valuable advice and
information on scientific programming.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
Collaboration, Copying and Plagiarism
You are encouraged to discuss course material, including assignments, with
your classmates. All work you turn in, however, must be your own. This
includes both writing and code. Copying from other students (1) does
nothing to help you learn how to program, (2) is easy for us to detect, and (3)
has serious negative consequences for you, as outlined in the
university's policy
on cheating and plagiarism. If, after reading the policy, you are unclear
on what is acceptable, please ask an instructor.
The Old 36-350
If you came to this page by a search engine, you may be looking for
the data-mining class which used to be numbered 36-350.
It is now 36-462, and is taught in the spring semester.