36-350, Statistical Computing, Fall 2014
Instructors | Prof. Cosma Shalizi |
| Prof. Andrew Thomas |
TAs | Mr. Bryan Hooi |
| Mr. Samuel Ventura |
Lecture | Section 1, Mondays and Wednesdays 10:30--11:20, Gates 4102 |
| Section 2, Mondays and Wednesdays 11:30--12:20, Gates 4102 |
Labs | Sections A and B, Fridays, 10:30--11:20, Hunt Library computer labs |
| Sections C and D, Fridays, 11:30--12:20, Baker Hall 332P |
Office hours | Monday 9:20--10:20 Wean Hall 8110 (Mr. Hooi) |
| Monday 4:30--5:30 Baker Hall 132H (Prof. Thomas) |
| Thursday 1:00--2:30 Baker Hall 229A (Prof. Shalizi) |
| Friday 3:00--4:00 Wean Hall 8110 (Mr. Ventura) |
Description
Computational data analysis is an essential part of modern statistics.
Competent statisticians must be able not just to run existing programs, but to
understand the principles on which they work. They must also be able to read,
modify and write code, so that they can assemble the computational tools needed
to solve their data-analysis problems, rather than distorting problems to fit
tools provided by others. This class is an introduction to programming,
targeted at statistics majors with minimal programming knowledge, which will
give them the skills to grasp how statistical software works, tweak it to suit
their needs, recombine existing pieces of code, and when needed create their
own programs.
Students will learn the core ideas of programming (functions,
objects, data structures, flow control, input and output, debugging, logical
design and abstraction) through writing code to assist in numerical and
graphical statistical analyses. Students will in particular learn how to write
maintainable code, and to test code for correctness. They will then learn how
to set up stochastic simulations, how to parallelize data analyses, how to
employ numerical optimization algorithms and diagnose their limitations, and
how to work with and filter large data sets. Since code is also an important
form of communication among scientists, students will learn how to comment and
organize code.
The class will be taught in the R
language.
Pre-requisites
This is an introduction to programming for statistics students. Prior exposure
to statistical thinking, to data analysis, and to basic probability concepts is
essential. Previous programming experience is not assumed, but
familiarity with the computing system is. Formally, the pre-requisites are
"Computing at Carnegie Mellon" (or consent of instructor), plus one of either
36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or
co-requisite (if need be).
The class may be unbearably redundant for those who already know a
lot about programming. The class will be utterly incomprehensible for
those who do not know statistics.
Course Mechanics and Grading
There will be two lectures every week (with exceptions only for holidays), and
a weekly in-class lab. There will also be homework nearly every week, a
mid-term programming project, and a final group project.
Grades will be calculated as follows:
- Labs: 10%
- Homework: 30%
- Midterm project: 20%
- Final project: 40%
Final grades are based on demonstrated mastery of the material, not
relative standing in the class.
R and RStudio
R is a free, open-source programming
language for statistical computing. Almost all of our work in this class will
be done using R. You will need regular, reliable access to a computer running
an up-to-date version of R. If this is a problem, let the professors know
right away.
RStudio is a free, open-source R
programming environment. It contains a built-in code editor and many features
that make working with R easier, and it works the same way across different
operating systems. Use of RStudio is
required for the labs, and strongly recommended in general.
Assignment Formatting
All assignments must be turned in electronically, through Blackboard.
All assignments will involve writing a combination of code and actual prose.
You must submit your assignment in a format which allows for the combination of
the two, and the automatic execution of all your code. The easiest
way to do this is to use R Markdown.
Exceptions may be made, with prior permission, for those who want to
use Sweave or
(better) knitr. (If you don't know what
those are, plan to use R Markdown.)
Work submitted as Word files, PDFs, unformatted plain text, etc., will
receive an automatic grade of 0, without exceptions.
Every file you submit should have a name which includes your Andrew ID, and
clearly indicates the type of assignment (homework, lab, etc.) and its number.
Homework
There will be a homework assignment nearly every week. Each homework will
be graded out of three points: one point for making a good-faith effort at
every part of the assignment; one point for technically-correct, working
solutions to each part; and one point for clean,
well-formatted, easily readable code.
Due dates: unless otherwise noted in the calendar, all homework is
due at 11:59 pm on Monday, the week after it is assigned.
Revision: You are free to revise your homework assignments, after
they have been graded, and re-submit them to be re-graded.
Labs
There will be a 50-minute lab period every week on Friday morning. The labs
will be short exercises, generally related to that week's homework. Attendance
is mandatory.
Pair programming: An important part of programming is
collaboration. To help you practice this, the labs will be done through
"pair programming". You will be randomly paired with a different partner for
each lab, and during the first half of the lab, one of you will do all
the actual typing, while the other monitors and comments; during the second
half, you will switch roles with your partner.
Midterm Project
In place of an in-class midterm exam, there will be a solo programming
project. You will have two weeks to do this project, and will have to submit a
write-up containing both your executable code and its results, and an
explanation of how you approached the problem and why you chose that approach.
As with the homework, grading will give equal weight to completeness,
correctness, and comprehensibility.
Final Project
You will be assigned to small groups to work on a final project. You will
select project topics from a list provided by the professors. (Multiple groups
can take on the same project.) Each group will cooperate on writing code,
documenting it, writing a report, and making a presentation on the project
during the final exam period.
Peer assessment: One component of your final project grade will be
based on your team-mates' assessment of your contribution to the project.
Textbooks
There are three required books:
- Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
- Phil Spector, Data Manipulation with R
- Paul Teetor, The R Cookbook
The first two will serve as our textbooks; the third is an
extremely valuable reference work. You will need all three.
Four other books are optional but recommended:
All the books should be available at the university book store, and of
course from online stores.
Some R Resources
There are many online resources for learning about R and working with it,
in addition to the textbooks:
- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Google R Style Guide offers some rules for naming, spacing, etc., which are generally good ideas
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick Burns, The R Inferno. "If you are using R and you think you're in
hell, this is a map for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
The website Software Carpentry is
not specifically R related, but contains a lot of valuable advice and
information on scientific programming.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
Collaboration, Copying and Plagiarism
You are encouraged to discuss course material, including assignments, with
your classmates. All work you turn in, however, must be your own. This
includes both writing and code. Copying from other students, from
books, from websites, or from solutions for previous versions of the class (1)
does nothing to help you learn how to program, (2) is easy for us to detect,
and (3) has serious negative consequences for you, as outlined in the
university's policy
on cheating and plagiarism. If, after reading the policy, you are unclear
on what is acceptable, please ask an instructor.
The Old 36-350
If you came to this page by a search engine, you may be looking for
the data-mining class which used to be numbered 36-350.
It is now 36-462, and is taught in the spring semester.
You might also be looking for another year's iteration
of this class.
Note to Instructors
If you would like to use these materials for your own class, you are welcome to
do so, with attributions to the authors (see below) and links to this page. Asking for
permission isn't necessary, though letting us know about it is appreciated.
Preferred citation: Shalizi, C. R. and Thomas, A. C. (2014), "Statistical Computing
36-350: Beginning to Advanced Techniques in R", http://www.stat.cmu.edu/~cshalizi/statcomp/14
Calendar and topics
Subject to revision.
- Data types and data structures
- Lecture 1 (25 August): Simple data types and structures
- Course mechanics; the R console; basic data types; vectors, our first data structures
- Rpres file for this lecture (the R Markdown file used to build the presentation)
- Printable PDF
- Lecture 2 (27 August): Bigger data structures
- Arrays; matrices and matrix operations; lists; data frames; structures of structures
- Rpres
- Printable PDF
- Lab 1 (29 August)
- R Markdown file for the lab
- Homework: HW 1 assigned; nothing due
- R Markdown file for the assignment
- Reading for the week: chapters 1 and 2 of Matloff
- Flow control and looping
- Lecture 3 (Sept. 3): Data Frames and Control
- Data frames for tabular data; conditioning the calculation on the
data; iteration to repeat similar calculations; avoiding iteration with
"vectorized" operations and functions.
- Rpres file for this lecture
- Printable PDF
- Lab 2 (Sept. 5)
- R Markdown file for the lab
- Homework: HW 1 due; HW 2 assigned
- R Markdown file for the assignment
- Reading for the week: Chapters 3--5 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
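To illustrate the Lecture 3 theme of avoiding iteration with "vectorized" operations, here is a minimal sketch. It is not taken from the course materials; the data and the names temps_f and label are made up for the example.

```r
# Hypothetical data: temperatures in degrees Fahrenheit
temps_f <- c(32, 50, 68, 86, 104)

# Iteration: convert one element at a time
temps_c_loop <- numeric(length(temps_f))
for (i in seq_along(temps_f)) {
  temps_c_loop[i] <- (temps_f[i] - 32) * 5 / 9
}

# Vectorized: the same arithmetic applied to the whole vector at once
temps_c_vec <- (temps_f - 32) * 5 / 9

# Conditioning on the data without a loop: ifelse() works element-wise
label <- ifelse(temps_c_vec > 25, "hot", "mild")
```

Both versions give the same numbers; the vectorized form is shorter and leaves the looping to R itself.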
- Text
- Lecture 4 (Sept. 8): Text basics
- Characters, strings, text data. Extracting and replacing
substrings; splitting strings; building strings; counting strings.
- Printable PDF
- Rpres file for the lecture
- Lecture 5 (Sept. 10): Regular expressions
- "Regular expressions" are patterns of strings. Rules for building
regular expressions. R functions for finding matches, splitting strings, and
substituting according to patterns.
- Handout for the lecture, with additional examples
- Rpres file for the lecture
- Printable PDF
- Lab 3 (Sept. 12)
- rich.html file; R Markdown file for the lab. (Do not include the text of the questions in your write-up.)
- Homework: HW 2 due; HW 3 assigned
- NHLHockeySchedule2.html file for the homework
- Reading for the week: Matloff, chapter 11; R Cookbook, chapter 7; Spector, chapter 7; handout for lecture 5
- Optional readings: Bradnam and Korf, sections 4.26--4.28, 5.3, 6.1
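The three uses of regular expressions named under Lecture 5 (finding matches, splitting strings, substituting by pattern) might look like this in base R. This is an illustrative sketch only; the strings are invented and are not the lab or homework data.

```r
# Hypothetical text data, made up for illustration
lines <- c("Game 1: PIT 3, NYR 2", "no score here", "Game 2: BOS 10, MTL 4")

# Finding matches: which strings contain digits followed by a colon?
has_game <- grepl("[0-9]+:", lines)

# Extracting matches: every run of digits in each string
numbers <- regmatches(lines, gregexpr("[0-9]+", lines))

# Splitting: break the first string at a colon or comma plus any spaces
parts <- strsplit(lines[1], "[:,] *")[[1]]

# Substituting: replace each comma and the spaces after it with "; "
cleaned <- gsub(", +", "; ", lines)
```

Here regmatches() returns a list with one character vector per input string, which is empty where nothing matched.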
- Writing and calling functions
- Lecture 6 (Sept. 15): Writing functions
- Functions tie together related commands. Arguments (inputs) and
return values (outputs). Named arguments and defaults. Interfaces.
- gmp.dat file for the example; Rpres file for presentation
- Printable PDF
- Lecture 7 (Sept. 17): Multiple functions
- Using multiple functions for related tasks; to re-use work; to
break big problems down into smaller ones.
- Printable PDF
- R Markdown source for the slides
- Lab 4 (Sept. 19)
- R Markdown source for the lab
- Homework: HW 3 due; HW 4 assigned
- gmp-2013.dat data file for the last problem; R Markdown file for the assignment
- Reading for the week: sections 1.3, 7.3--7.5, 7.11, 7.13 of Matloff
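A small sketch of the ideas listed under Lecture 6: arguments, return values, named arguments and defaults. The function trimmed_range is hypothetical, written for this illustration, and does not come from the lecture examples.

```r
# Hypothetical function: the range of x after dropping a fraction `trim`
# from each tail. x is required; trim and na.rm have default values.
trimmed_range <- function(x, trim = 0, na.rm = TRUE) {
  # Check the inputs before doing any work
  stopifnot(is.numeric(x), trim >= 0, trim < 0.5)
  q <- quantile(x, probs = c(trim, 1 - trim), na.rm = na.rm, names = FALSE)
  q[2] - q[1]  # the value of the last expression is the return value
}

x <- c(1, 2, 3, 4, 100)
full <- trimmed_range(x)                  # uses the default trim = 0
middle <- trimmed_range(trim = 0.25, x = x)  # named arguments, any order
```

The interface is the argument list plus the returned number; callers do not need to know that quantile() is used inside.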
- Data from elsewhere
- Lecture 8 (Sept. 22): Getting data
- Reading and writing non-R formats. Importing data from the Web.
Scraping Web pages.
- Printable PDF version of slides; R Markdown source file for the slides
- Lecture 9 (Sept. 24): Dataframes with Regression Models
- Making dataframes readable. Plotting with dataframes. Basic statistics on dataframes. Fitting linear models with lm; formulas.
Fitting generalized linear models with glm.
- R Markdown source file for the slides
- Printable PDF
- Lab 5 (Sept. 26)
- wtid-report.csv
- Homework: HW 4 due; HW 5 assigned
- R Markdown file for the assignment
- Reading for the week: Matloff, chapter 10; Spector, chapter 2 (skipping sections 2.8--2.10)
- Fitting and using statistical models
- Lecture 10 (Sept. 29): Random number generation
- Sources of actually (?) random numbers. Pseudo-random number
generators; setting the seed. Basic R functions for parametric distributions.
- Printable PDF
- Lecture 11 (Oct. 1): Distributions as models
- Empirical-distribution-related R commands. R functions for
parametric distributions: d*, p*, q*, r*.
Fitting distributions: method of moments; generalized moments; maximum
likelihood. Diagnostics for distributions.
- Printable PDF
- Lab 6 (Oct. 3)
- R Markdown source file
- Homework: HW 5 due; no new homework
- Mid-semester project assigned, due at 11:59 pm on Thursday, 16 October
- midterm.zip (61 MB), archive of IMDB pages for the midterm
- Reading for the week: R Cookbook, chapter 11
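The d*, p*, q*, r* naming convention and method-of-moments fitting from Lecture 11 can be sketched with the gamma distribution. The choice of distribution, the parameter values and the seed are assumptions made for this example, not taken from the lecture.

```r
set.seed(350)  # arbitrary seed, so the simulated draws are reproducible

# r*: random draws; d*: density; p*: cumulative probability; q*: quantile
x <- rgamma(5000, shape = 2, rate = 0.5)
dens_at_3 <- dgamma(3, shape = 2, rate = 0.5)
prob_below_3 <- pgamma(3, shape = 2, rate = 0.5)
back_to_3 <- qgamma(prob_below_3, shape = 2, rate = 0.5)  # q* inverts p*

# Method of moments for the gamma: mean = shape / rate and
# variance = shape / rate^2, so solve for the two parameters
m <- mean(x)
v <- var(x)
rate_hat <- m / v
shape_hat <- m^2 / v
```

With this many draws the estimates should land close to the values used to simulate the data (shape 2, rate 0.5), though not exactly on them.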
- Changing My Shape, I Feel Like an Accident
- Lecture 12 (Oct. 6): Transformations
- Selective access to data. Applying the same function to all parts
of a data object. Transforming the data to suit the problem. Common numerical
transformations. Summarizing subsets of the data. Sorting, and ordering
dataframes. Transposition. Merging dataframes. Reshaping
dataframes from wide to long or long to wide.
- Printable PDF of lecture; Rpres source file for slides
- Data sets used in examples: fha.csv, ua.txt, snoqualmie.csv
- Lecture 13 (Oct. 8): Debugging
- Debugging as differential diagnosis; characterizing and localizing
bugs; common errors; programming now for debugging later. Tests and bugs.
- Printable PDF; Rpres source file for HTML slides
- Lab 7 (Oct. 10)
- Data files for lab: ckm_nodes.csv and ckm_network.dat
- Homework: no new homework due
- Reading for the week: chapters 8 and 9 in Spector (sections 9.3 and 9.7 optional); chapter 13 in Matloff; chapters 5 and 6 in The R Cookbook
- Optional reading: Hadley Wickham, "Reshaping Data with the reshape Package", Journal of Statistical Software 21 (2007): 12
- Leet Programming Skillz
- Lecture 14 (Oct. 13): Testing
- Why we test our code; tests of particular cases vs. cross-checking
tests; cycling between testing and programming.
- Printable PDF; Rpres source file
- Lecture 15 (Oct. 15): Top-down design
- Recursively solving problems by writing functions to integrate the
work of sub-functions that solve sub-problems. Advantages: demands less
thought to write or to read; simpler to debug or extend. Re-factoring to make
code which wasn't designed this way look like it was. Extended example with
the jack-knife.
- Printable PDF; Rpres source file
- No lab (mid-semester break)
- Homework: HW 6 assigned
- hw-06-supplement.R (containing deliberately buggy code); R Markdown file for the assignment
- Mid-semester project due at 11:59 pm on Thursday, 16 October
- Reading: Sections 7.6, 7.9, and 14.1--14.3 in Matloff
- Optional reading: Chambers, TBD
- Functions of functions, and optimization
- Lecture 16 (Oct. 20): Functions as objects
- In R, functions are objects like everything else, so they can be
arguments to other functions, and they can be returned by other functions.
Examples with curve, grad, gradient descent, and writing surface, a 2D counterpart to curve
- Printable PDF, Rpres source
- Lecture 17 (Oct. 22): Simple optimization
- Basics from calculus about minima. Taylor series. Gradient
descent and Newton's method. Scaling and big-O notation. Curve-fitting by
optimization. Illustrations with optim and nls. Bonus:
Nelder-Mead, a.k.a. the simplex method; coordinate descent.
- Printable PDF; Rpres source
- Lab 8 (Oct. 24)
- Homework: HW 6 due, HW 7 assigned
- R Markdown file for the homework
- Reading: Recipes 13.1--13.2 in The R Cookbook
- Optional reading: I.1, II.1 and II.2 in Red Plenty
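As a sketch of the Lecture 16 and 17 ideas (functions passed as arguments, functions returned by functions, gradient descent), the following uses only base R. The helpers num_grad, gradient_descent and make_quadratic are hypothetical names written for this illustration; in particular num_grad is a hand-rolled central-difference approximation, not the grad function mentioned in the lecture.

```r
# Numerical gradient by central differences; f is a function argument
num_grad <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(i) {
    step <- replace(numeric(length(x)), i, h)  # h in coordinate i, else 0
    (f(x + step) - f(x - step)) / (2 * h)
  })
}

# Fixed-step gradient descent: stop when the gradient is nearly zero
# or when max_iter steps have been taken
gradient_descent <- function(f, x0, step_size = 0.1, max_iter = 1000,
                             tol = 1e-8) {
  x <- x0
  for (iter in seq_len(max_iter)) {
    g <- num_grad(f, x)
    if (sqrt(sum(g^2)) < tol) break
    x <- x - step_size * g
  }
  x
}

# A function that returns a function: a bowl centered at `center`
make_quadratic <- function(center) {
  function(x) sum((x - center)^2)
}

f <- make_quadratic(c(1, -2))
x_min <- gradient_descent(f, x0 = c(0, 0))
```

For this simple bowl the result should be very close to the center c(1, -2); a fixed step size is a simplifying assumption and can fail on harder functions.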
- Optimization will continue while morale improves
- Lecture 18 (Oct. 27): Constrained optimization
- Optimization under constraints; using Lagrange multipliers to turn
constrained problems into unconstrained ones. Lagrange multipliers as "shadow
prices". Barrier methods for inequality constraints. The correspondence
between constrained and penalized optimization ("a fine is a price").
Statistical uses of penalized optimization: ridge, lasso and spline regression
as examples.
- Printable PDF, Rpres source
- Lecture 19 (Oct. 29): Stochastic optimization
- Optimization vs. "big data". Sampling as an alternative to using all
the data at once: stochastic gradient descent et al. Peculiarities of
optimizing statistical functionals: don't bother optimizing much within the
margin of error; finding that margin.
- Lab 9 (Oct. 31)
- lab-09.RData
- Homework: HW 7 due
- Reading
- Optional reading: Red Plenty (cf.); Léon Bottou and Olivier Bousquet, "The Tradeoffs of Large Scale Learning"
- Split/apply/combine
- Lecture 20 (Nov. 3): The split, apply, combine pattern, using base R
- Design patterns in general. The split/apply/combine pattern: break up
a large data set into smaller meaningful pieces; apply the same analysis to
each piece; combine the answers. Iteration as painful, clumsy
split/apply/combine. Tools for split/apply/combine in basic R:
the apply function for arrays, lapply for
lists, mapply, etc.; split; aggregate; subset.
- Lecture 21 (Nov. 5): Split/apply/combine, using plyr
- Abstracting the split/apply/combine pattern: using a single command
to appropriately split up the input, apply the function, and combine the
results, depending on the type of input and output data. Syntax details.
- Lab 10 (Nov. 7)
- debt.csv; R Markdown source file
- Homework: HW 8 assigned, none due
- hw-08.RData, R Markdown file
- Options for final project announced
- Preferences due by 12 November
- Reading for the week: Cookbook, chapter 6; Spector, chapter 8
- Optional: Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software 40 (2011): 1
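The split/apply/combine pattern from Lecture 20 can be sketched in base R as follows. The data frame is invented for the example, and the plyr version covered in Lecture 21 is deliberately left out.

```r
# Hypothetical data: one value per observation, grouped by a label
df <- data.frame(
  group = c("a", "a", "b", "b", "b", "c"),
  value = c(1, 3, 2, 4, 6, 10)
)

# Split: one vector of values per group
pieces <- split(df$value, df$group)

# Apply: the same function to each piece (lapply returns a list)
means_list <- lapply(pieces, mean)

# Combine: sapply simplifies the results to a named vector
means_vec <- sapply(pieces, mean)

# aggregate() does all three steps in one call, returning a data frame
agg <- aggregate(value ~ group, data = df, FUN = mean)

# The same calculation as an explicit loop, for comparison
means_loop <- numeric(0)
for (g in unique(df$group)) {
  means_loop[g] <- mean(df$value[df$group == g])
}
```

All four approaches give the same group means; the loop needs the most bookkeeping.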
- Databases
- Lecture 22 (Nov. 10): Split/Apply/Combine 3
- The high-level view of what split/apply/combine does. Thinking about how to split the data into pieces: concepts and R syntax. Thinking about the function to apply to each piece: concepts and R syntax. Illustrations.
- (No R Markdown source file for this lecture, because the lecturer couldn't figure out how to do the image manipulation in R Markdown; LaTeX source files available on request.)
- Homework: HW 8 due, HW 9 assigned
- Lecture 23 (Nov. 12): Databases
- Basic concepts of relational databases; how a database is like an R
dataframe. The client/server model. The structured query language (SQL) and
queries; SELECT and JOIN. R/SQL translations. Accessing databases through R.
- Handout: Databases, and Databases in R
- baseball.db database for examples from lecture and handout (30 MB)
- Printable PDF; Rpres source
- Final project preferences due
- Lab 11 (Nov. 14)
- Final project teams announced
- Reading for the week: Spector, chapter 3 (for databases)
- Simulation
- Lecture 24 (Nov. 17): Simulation I: Random variable generation and Markov chains
- R Markdown source file
- Homework: HW 9 due, HW 10 assigned
- R Markdown source file
- Lecture 25 (Nov. 19): Simulation II: Monte Carlo and Markov Chain Monte Carlo
- R Markdown source file
- Lab 12 (Nov. 21)
- Reading: Matloff, chapter 8; R Cookbook, chapter 8; handouts
- Markov Chain Monte Carlo
- Lecture 26 (Nov. 24): Simulation III: Simulations as Models
- Using simulations to replace probability calculations. Using simulations as statistical models. Live coding demo of a simulation model.
- Simulation code written by section 1 and by section 2
- Printable PDF; Rpres source file
- Homework: HW 10 due; no new homework
- No lab (Thanksgiving break)
- Reading for the week: Handouts
- Optional reading: Charles Geyer, "Practical Markov Chain Monte Carlo", Statistical Science 7 (1992): 473--483; "One Long Run"; "Burn-In is Unnecessary"; "On the Bogosity of MCMC Diagnostics"; Andrew Gelman and Donald Rubin, "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science 7 (1992): 457--472
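As one illustration of "using simulations to replace probability calculations" (Lecture 26), here is a sketch based on the classic birthday problem. The problem, the seed and the number of simulations are choices made for this example; they are not the live-coding demo from class.

```r
set.seed(36350)  # arbitrary seed, for reproducibility
n_sims <- 10000

# Simulate: draw 23 birthdays uniformly from 365 days and check
# whether any two coincide; repeat n_sims times
has_match <- replicate(n_sims, {
  bdays <- sample(365, size = 23, replace = TRUE)
  any(duplicated(bdays))
})

# Monte Carlo estimate of the probability, with its standard error
p_hat <- mean(has_match)
se <- sqrt(p_hat * (1 - p_hat) / n_sims)

# This problem happens to have an exact answer to compare against
p_exact <- 1 - prod((365 - 0:22) / 365)
```

The estimate should agree with the exact value to within a few standard errors; the simulation approach carries over to problems where no exact formula is available.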
- Conclusion of the class
- Lecture 27 (Dec. 1): Beyond R
- Limitations of R. Connecting to other languages and specialized tools.
- Lecture 28 (Dec. 3): Computing for statistics
- No lab: work on your projects
- No homework: work on your projects
- Reading for the week: Matloff, sections 14.3--14.6, 15.1
- Optional readings: Spufford, Red Plenty, TBD; Bradnam and Korf, Unix and Perl to the Rescue, TBD; Chambers, TBD
- Final projects
Final presentations will be held during our final exam period. Attendance for the whole period is mandatory. Submission instructions will be provided closer to the deadline.