Cosma Shalizi

36-402, Undergraduate Advanced Data Analysis, Section A

Spring 2019

Section A
Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100
Keen-eyed fellow investigators

The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

36-608: A small number of well-prepared graduate students from other departments can take this course by registering for it as 36-608. (Graduate students enrolling in 36-402, or undergraduates enrolling in 36-608, will be automatically dropped from the roster.) If you want to do so, please contact me to discuss whether you have the necessary preparation.

Section B

This year, there are two sections of 36-402. This syllabus is for Section A, taught by Prof. Shalizi; Section B is taught by Prof. Lee. The two sections are completely independent.

Prerequisites

36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-608.

Instructors

Professor: Cosma Shalizi, cshalizi [at] cmu.edu, Baker Hall 229C
Teaching assistants: TBD

Topics

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks
Yet More Linear Regression: what is regression, really?; what ordinary linear regression actually does; what it cannot do; extensions
Smoothing: kernel smoothing, including local polynomial regression; splines; additive models; kernel density estimation
Generalized linear and additive models: logistic regression; generalized linear models; generalized additive models
Latent variables and structured data: principal components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
Causality: graphical causal models; causal inference from randomized experiments; identification of causal effects from observations; estimation of causal effects; discovering causal structure
See the end of this syllabus for the current lecture schedule, subject to revision. Lecture handouts, slides, etc., will be linked there, as available.

Course Mechanics and Grading

Grades will not go away if you avert your eyes (photo by laurent KB on Flickr)

There are three reasons you will get assignments in this course. In order of decreasing importance:
  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows that you only thought you understood it.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be three kinds of assignment in this course.

In-class exercises
Most lectures will have in-class exercises. These will be short (10--20 minutes) assignments, emphasizing problem solving, done in class in small groups of two to five people. The assignments will be given out in class, and must be handed in on paper by the end of class. On most days, a randomly-selected group will be asked to present their solution to the class.
Homework
Most weeks will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve statistical theory, analyzing real data sets on the computer, and communicating the results. The in-class exercises will either be problems from that week's homework, or close enough that seeing how to do the exercise should tell you how to do some of the problems.
All homework will be submitted electronically through Canvas. Most weeks, homework will be due at 6:00 pm on Wednesday. There will be a few weeks, clearly noted on the syllabus and on the assignments, when Thursday lecture will be canceled and homework will be due at noon on Thursday, i.e., the end of the lecture period. (When this means that there are only six days for the next homework, it will be shortened accordingly.)
There are specific formatting requirements for homework --- see below.
Exams
There will be both a midterm and a final exam. Each of these will require you to analyze a real-world data set, answering questions posed about it in the exam, and to write up your analysis in the form of a scientific report. The exam assignments will provide the data set, the specific questions, and a rubric for your report.
Both exams will be take-home, and you will have at least one week to work on each, without homework (from this class anyway). Both exams will be cumulative.
Exams are to be submitted through Canvas, and follow the same formatting requirements as the homework --- see below.
The mid-term and final exam due dates will be set by the first day of classes, and will not change thereafter. If they present difficulties for you, please contact me as soon as possible.

Time expectations: You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

Grade boundaries will be as follows:
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R < 60

To be fair to everyone, these boundaries will be held to strictly.
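
As a rough illustration of how the half-open intervals above work, here is a minimal sketch in R. The function name `letter_grade` is made up for this example and is not part of any official course software; it assumes final scores are on a 0 to 100 scale.

```r
# Hypothetical helper (not official course code): map numeric course scores
# on a 0-100 scale to the letter grades in the table above.
letter_grade <- function(score) {
  stopifnot(is.numeric(score), all(score >= 0 & score <= 100, na.rm = TRUE))
  # right = FALSE makes each interval closed on the left: [a, b).
  # include.lowest = TRUE then closes the top interval on the right: [90, 100].
  cut(score,
      breaks = c(0, 60, 70, 80, 90, 100),
      labels = c("R", "D", "C", "B", "A"),
      right = FALSE, include.lowest = TRUE)
}

as.character(letter_grade(c(59.9, 60, 79.99, 80, 90, 100)))
# "R" "D" "C" "B" "A" "A"
```

Note that a score of 79.99 falls in [70, 80) and so maps to a C, which is what holding the boundaries strictly means.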

Grade changes and regrading: If you think that a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Lectures

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the graded in-class assignments which will help consolidate your understanding of the material, and help with your homework.

You will generally find it helpful to do the readings before coming to class.

Please don't use any electronic devices during lecture: no laptops, no tablets, no phones, no recording devices, no watches that do more than tell time. The only exception to this rule is for electronic assistive devices, properly documented with CMU's Office of Equal Opportunity Services.

(The no-electronics rule is not arbitrary meanness on my part. Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn't distracting your neighbors, who aren't as good at multitasking as you are.)

Exams

There will be a take-home mid-term exam (15% of your final grade), which will be due before spring break. You will have one week to work on the midterm, and there will be no homework that week. There will also be a take-home final exam (25%), due during exam week. The exams may require you to use any material already covered in the readings, lectures or assignments; the final will be cumulative.

Exams must also be submitted through Canvas, under the same rules about file formats as homework.

The due dates for exams will be fixed by the first day of classes, and will not change thereafter. Please try to schedule your obligations around them. If you know that the dates will be a problem for you, please contact me as soon as possible.

R, R Markdown, and Reproducibility

Caught in a thicket of syntax (photo by missysnowkitten on Flickr)

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). In this class, you'll be using it for every homework and exam assignment. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code are not acceptable; your document must be humanly readable.

All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code (and not some brilliant trick that you then forget, or obscure bug that you didn't notice). For help on using R Markdown, see "Using R Markdown for Class Reports".

Format Requirements for Homework and Exams

For each assignment, you should generally submit two, and only two, files: an R Markdown source file, integrating text, generated figures and R code, and the "knitted", humanly-readable document, in either PDF (preferred) or HTML format. (I cannot read Word files, and you will lose points if you submit them.) I will be re-running the R Markdown files of randomly selected students; you should expect to be picked for this about once in the semester. You will lose points if your R Markdown file does not, in fact, generate your knitted file (making obvious allowances for random numbers, etc.).
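
One way to keep random numbers from changing your results between runs is to fix the random seed near the top of your R Markdown file, so that re-knitting reproduces the same simulated values. The following is a minimal sketch; the seed value and the toy bootstrap are arbitrary illustrations, not something any particular assignment requires.

```r
# Illustrative only: fixing the seed makes simulation-based results repeat
# exactly when the document is re-knitted.
set.seed(36402)   # arbitrary seed, chosen just for this example
x <- rnorm(100, mean = 5, sd = 2)
# Toy bootstrap of the sample mean: resample x with replacement 1000 times
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# Re-running from the same seed reproduces exactly the same numbers
set.seed(36402)
x2 <- rnorm(100, mean = 5, sd = 2)
boot_means2 <- replicate(1000, mean(sample(x2, replace = TRUE)))
identical(boot_means, boot_means2)   # TRUE
```

Without the calls to `set.seed`, the two sets of bootstrap means would differ slightly from run to run.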

Some problems in the homework will require you to do math. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. (You should upload these additional files to Canvas.) You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by homework 12.
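
For a sense of what this looks like, here is a small sketch of math written in the text of an R Markdown file, using `$...$` for inline math and `$$...$$` for displayed equations. The particular formula (the ordinary least squares estimate) and the symbols are just an illustration, not taken from any assignment.

```latex
The ordinary least squares estimate is
$$
\hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y},
$$
and the fitted value at a point $x_0$ is $\hat{m}(x_0) = x_0 \cdot \hat{\beta}$.
```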

Canvas and Piazza

Homework and exams will be submitted electronically through Canvas, which will also be used as the gradebook. Some readings and course materials will also be distributed through Canvas.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. (Postings will not be anonymous to instructors.) Anonymity will go away for everyone if it is abused.

Solutions

Solutions for all homework will be available, after their due date, through Canvas. Please don't share them with anyone, even after the course has ended.

Office Hours

If you want help with computing, please bring a laptop.

Mondays 1:00--2:00 Ms. Dunn Porter Hall 223B
Tuesdays 3:00--4:00 Ms. Dunn Porter Hall 223B
Wednesdays 2:00--3:30 Prof. Shalizi Baker Hall 229C
Fridays 3:00--4:00 Ms. Dunn Porter Hall 223B

If you cannot make the regular office hours, or have concerns you'd rather discuss privately, please e-mail me about making an appointment.

Textbook

The primary textbook for the course will be the draft Advanced Data Analysis from an Elementary Point of View. Chapters will be linked to here as they become needed. Reading these chapters will greatly help you with the assignments.

In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7) is strongly suggested as a reference.

Cox and Donnelly, Principles of Applied Statistics (Cambridge University Press, 2011, ISBN 978-1-107-64445-8); Faraway, Extending the Linear Model with R (Chapman Hall/CRC Press, 2006, ISBN 978-1-58488-424-8; errata); and Venables and Ripley, Modern Applied Statistics with S (Springer, 2003; ISBN 9780387954578) will be optional. The campus bookstore should have copies of all of these.

Collaboration, Cheating and Plagiarism

Cheating leads to desolation and ruin (photo by paddyjoe on Flickr)

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss homework with other students in the class, though not to share work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students or the instructors until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

During the take-home exams, you are not allowed to discuss the content of the exams with anyone other than the instructors; in particular, you may not discuss the content of the exam with other students in the course.

Any use of solutions provided for any assignment in this course in previous years is strictly prohibited, both for homework and for exams. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with me.

Other Iterations of the Class

Some material is available from versions of this class taught in other years. As stated above, any use of solutions provided in earlier years is not only cheating, it is very easily detected cheating.

Schedule

Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available.

Identifying significant features from background (photo by Gord McKenna on Flickr)

Current revision of the complete textbook

January 15 (Tuesday): Lecture 1, Introduction to the class; regression
Reading: Chapter 1, plus materials for homework 0
Optional reading: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
Homework 0 (on collaboration and plagiarism): assignment. Readings for homework 0:
January 17 (Thursday): Lecture 2, The truth about linear regression
Reading: Chapter 2
R for what was to have been the in-class examples, had the projector worked
Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
Optional reading: Faraway, rest of chapter 1
Homework 1: Assignment
January 22 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
Reading: Chapter 3 (R code for that chapter)
R for in-class demo
Optional reading: Cox and Donnelly, ch. 6
January 24 (Thursday): Lecture 4, Smoothing methods in regression
Reading: Chapter 4 (R code for that chapter)
R for in-class demos
Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
Homework 0 due (at 6 pm the night before)
Homework 1 due (at 6 pm the night before)
Homework 2: assignment, uval.csv data set
January 29 (Tuesday): Lecture 5, Writing R Code
Slides from class, including solution to the in-class exercise (.Rmd)
Reading: Appendix on writing R code
January 31 (Thursday): Lecture 6, Simulation
Reading: Chapter 5 (R code for examples)
In-class examples (if class hadn't been canceled by the weather)
Homework 2 due (at 6:00 pm the night before)
Homework 3: assignment, stock_history.csv data set
February 5 (Tuesday): Lecture 7, The Bootstrap I
Slides (.Rmd)
Reading: Chapter 6 (R for selected examples in that chapter)
Optional reading: Cox and Donnelly, chapter 8
February 7 (Thursday): Lecture 8, The Bootstrap II: Hypothesis Testing
Slides (.Rmd)
Homework 3 due (at 6:00 pm the night before)
Homework 4: assignment, gmp-2006.csv data set
February 12 (Tuesday): Lecture 9, More Bootstrap Examples
Slides (.Rmd)
Reading: Chapter 6
February 14 (Thursday): Lecture 10, Multidimensional smoothing and the curse of dimensionality
Reading: Chapter 7 (R for selected examples in that chapter) and Chapter 8 (R for selected examples)
Optional reading: Faraway, section 11.2; Faraway, chapter 12
Homework 4 due (at 6:00 pm the night before)
Homework 5: assignment
February 19 (Tuesday): Lecture 11, Splines and Additive models
Slides (.Rmd for the slides)
Reading: As for Lecture 10
February 21 (Thursday): Lecture 12, Testing Regression Specifications
Slides (.Rmd)
Reading: Chapter 9
Optional reading: Cox and Donnelly, chapter 7
Homework 5 due (at 6:00 pm the night before)
Homework 6: Assignment
February 26 (Tuesday): Lecture 13, Heteroskedasticity, weighted least squares, and variance estimation
Reading: Chapter 10
Optional reading: Faraway, section 11.3
February 28 (Thursday): Lecture 14, Logistic Regression
Reading: Chapter 11
Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
Homework 6 due (at 6:00 pm the night before)
Midterm exam: assignment, ch.csv data set
March 5 (Tuesday): Lecture 15, Midterm review
March 7 (Thursday): Lecture 16, Generalized linear models and generalized additive models
Slides (.Rmd)
Reading: Chapter 12
Optional reading: Faraway, section 3.1 and chapter 6
Midterm exam due (at 6:00 pm the night before)
No homework this week --- enjoy spring break
March 12 and 14: Spring break, no class
March 19 (Tuesday): Lecture 17, Multivariate distributions
Reading: Appendix on multivariate distributions
March 21 (Thursday): Lecture 18, Density Estimation
Slides (.Rmd)
Reading: Chapter 14
Homework 7: assignment, n90_pol.csv data set
March 26 (Tuesday): Lecture 19, Factor Models
Reading: Chapter 17
March 28 (Thursday): Lecture 20, Factor Models II
Reading: Chapter 17
Homework 7 due (at 6:00 pm the night before)
Homework 8: Assignment; stockData.RData data file
April 2 (Tuesday): Lecture 21, Graphical Models
Reading: Chapter 20
April 4 (Thursday): Lecture 22, Graphical Causal Models I
Reading: Chapter 21
Optional reading: Cox and Donnelly, chapter 5
Homework 8 due (at 6:00 pm the night before)
Homework 9: Assignment; smoke.csv data file
April 9 (Tuesday): Lecture 23, Graphical Causal Models II
Reading: Chapter 21
Optional reading: Cox and Donnelly, chapters 6 and 9; Pearl, "Causal Inference in Statistics", sections 1, 2, and 3 through 3.2
April 11 (Thursday): Carnival, no class
Homework 9 due (at 6:00 pm the night before)
Homework 10: assignment, sesame.csv data set
April 16 (Tuesday): Lecture 24, Identifying Causal Effects from Observations I
Reading: Chapter 22
Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
April 18 (Thursday): Lecture 25, Estimating Causal Effects from Observations
Reading: Chapter 23
Homework 10 due (at 6 pm the night before)
Homework 11: Assignment
April 23 (Tuesday): Lecture 26, Discovering Causal Structure from Observations
Reading: Chapter 24
April 25 (Thursday): Lecture 27, Limitations of Causal Inference
Reading: Chapter 24
Homework 11 due (at 6:00 pm the night before)
Homework 12: assignment
April 30 (Tuesday): Lecture 28, Summary of the course
May 2 (Thursday): No lecture (extra time to work on final)
Homework 12 due (at 6:00 pm the night before)
Final exam assigned
May 9 (Thursday)
Final exam due at 6 pm
photo by barjack on Flickr