Computational data analysis is an essential part of modern statistics. Competent statisticians must be able not just to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and, when needed, create their own programs.
Students will learn the core ideas of programming (functions, objects, data structures, input and output, debugging, logical design and abstraction) through writing code to assist in numerical and graphical statistical analyses. In particular, students will learn how to write maintainable code and how to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ basic numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.
The class will be taught in the R language.
This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are “Computing at Carnegie Mellon” (or consent of instructor), plus either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).
The class may be unbearably redundant for those who already know a lot about programming. The class will be utterly incomprehensible for those who do not know statistics.
Each week, there will be lectures on Monday and Wednesday (with exceptions only for holidays), and an in-class lab on Friday. There will also be homework nearly every week, due Wednesday night at midnight. There will be a midterm programming project and a final group project.
Grades will be calculated as follows:
Final grades are based on demonstrated mastery of the material, not relative standing in the class.
R is a free, open-source programming language for statistical computing. Almost all of our work in this class will be done using R. You will need regular, reliable access to a computer running an up-to-date version of R. If this is a problem, let the professors know right away.
R Studio is a free, open-source R programming environment. It contains a built-in code editor and many features that make working with R easier, and it works the same way across different operating systems. Most importantly, it integrates R Markdown seamlessly. Use of R Studio is required for the labs, and strongly recommended in general.
The required textbook is: R Cookbook, by Paul Teetor.
Other optional readings and supplementary materials can be found on the course website.
Four office hours will be held each week, spread across the week. The times and locations can be found on the course website.
Piazza will be used for class discussions. We highly encourage you to sign up; the signup link is: here.
Piazza can be a very successful medium for helpful, class-wide discussions, but without rules, discussions can also quickly get out of hand. Here are the rules for our Piazza group:
Private emails to the TAs and Professor about truly private matters (e.g., a request for an extension due to a family emergency) are of course OK. However, private emails that ask questions about course materials are discouraged. Such emails may be sent, but replies are not guaranteed. In seeking help, please use the Piazza discussion group and/or office hours.
All assignments (homeworks, labs, midterm and final project reports) must be turned in electronically, through Blackboard.
All assignments must be submitted in R Markdown format. Assignments involve a combination of code and written prose, and the R Markdown format allows the two to be combined in a single document.
Work submitted in R Markdown format that does not compile, i.e., fails “Knit HTML”, will receive an automatic grade of 0. Therefore, you must be absolutely certain that your submission passes “Knit HTML” before turning it in.
Work submitted as Word files, PDFs, unformatted plain text, etc., will receive an automatic grade of 0, without exceptions.
Every file you submit should include your name and your Andrew ID, and should clearly indicate the type of assignment (homework, lab, etc.) and its number, if appropriate.
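As a rough illustration of the requirements above, a submission might be laid out like the sketch below. This is not an official template: the title, name, Andrew ID, and chunk contents are placeholders invented for illustration. The only points taken from this syllabus are that the file is R Markdown, that it mixes prose with code, and that it identifies the author and the assignment.

````markdown
---
# All values in this header are placeholders
title: "Homework 1"
author: "Your Name (Andrew ID: yourid)"
output: html_document
---

Written explanations go here as ordinary text, between the code chunks.

```{r}
# Placeholder chunk: summarize a data set that ships with R
summary(cars)
```
````

Running “Knit HTML” in R Studio on a file like this should execute the chunk and produce a single HTML page containing the text, the code, and its output.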
There will be a homework assignment nearly every week. Each homework will be graded out of 8 points: 1 point for making a good-faith effort at every part of the assignment; 2 points for clean, well-formatted, easily readable code; and 5 points for technically correct, working solutions to each part.
Unless otherwise noted, all homework is due at 11:59pm on Wednesday nights (submitted on Blackboard) one week after it is assigned. No late homework will be accepted. The lowest homework score of the semester will be dropped.
There will be a 50-minute lab period every week on Friday morning. The labs will be short exercises, generally related to that week’s homework. Attendance is mandatory.
An important part of programming is collaboration. To help you practice this, the labs will be done through “pair programming”. You will be randomly paired with a different partner for each lab, and during the first half of the lab, one of you will do all the actual typing, while the other monitors and comments; during the second half, you will switch roles with your partner.
Unless otherwise noted, all labs are due at 11:59pm on Friday nights (submitted on Blackboard) the day they are released. Each student will submit their own write-up, with their partner’s name at the top. No late work will be accepted. The lowest lab score of the semester will be dropped.
In place of an in-class midterm exam, there will be a programming project. You will have two weeks to do this project, with a randomly assigned partner, and will have to submit a write-up containing your executable code, its results, and an explanation of how you approached the problem and why you chose that approach. The grading scheme will be similar to that of the homework (some weight on completeness and comprehensibility, more weight on correctness).
In place of an in-class final exam, there will again be a programming project, done with a randomly assigned partner. This is basically a longer and more challenging version of the midterm project. A fun competition for bonus points is likely to come out of this as well. More details to come.
You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both written explanations and code. Copying from other students, books, websites, or solutions from previous versions of the class (1) does nothing to help you learn how to program, (2) is easy for us to detect, and (3) has serious negative consequences for you, as outlined in the university’s policy on cheating and plagiarism. If, after reading the policy, you are unclear on what is acceptable, please ask the instructor.
(Note: the labs and final project operate a little differently, since they are explicitly done in groups, and each group will submit a single write-up. But for these works, the above still applies to copying material from books, websites, or solutions from previous versions of the class.)