Lecture 1 — Introduction to the Course
36-465/665, Conceptual Foundations of Statistical Learning
2 February 2021
Conceptual Foundations of Statistical Learning
Statistical learning: how to fit predictive models to training data, usually by solving an optimization problem, so the model will probably predict well, on average, on new data
Conceptual foundations:
What are the essential concepts we need to make those words precise?
What are the basic tools that will let us reason about these concepts?
What are the key results we can achieve with those tools?
What This Class Is Not
Greatest hits collection of different models, fitting techniques, and applications
Take 36-462/662 or 10-601 or 10-701 instead
Latest and hottest models, fitting techniques, and applications
How to get the top score in a prediction contest
Ordinary statistical theory with the serial numbers filed off
Mathematically rigorous
Understanding \(>\) Rigor
Most classes on this are very mathematically advanced and rigorous
Typically presume: measure-theoretic probability, stochastic processes, functional analysis, combinatorics
This class aims to be easy to understand
Starting point: the math and stats you need to have taken 36-401, modern linear regression
Undergrad probability, mathematical statistics, calculus in multiple variables, linear algebra
Also: linear regression, so you have some feel of predictive modeling
We’ll build some additional math from there
Pro: it should be a lot easier to grasp the big ideas
Cons:
Loss of detail / precision
Many claims will be only roughly right (many conditions/qualifications needed to be exactly right)
You’ll need the rigor if you want to advance the theory
So what are we actually going to cover?
Prediction as a decision problem \(\Rightarrow\) basic decision theory
Fitting models to training data
Why the fit to training data is over-optimistic about future performance
Controlling that optimism using:
Probability theory
Measures of model complexity
Generalization-error bounds
The actual fitting/optimization process
Three case studies:
“Kernel machines” (looking for similar cases)
“Random features” (just about any function of the data will work)
Mixture models for densities (complicated curves are collections of bumps)
Course Mechanics
Syllabus: http://www.stat.cmu.edu/~cshalizi/sml/21
All assignments will be posted there
Information about readings and other course resources ditto
Canvas: gradebook, some readings that can’t go on the public web
Gradescope: turn in assignments
Piazza for question-answering
Assignments
Homework: Every week, Thursdays at 6 pm (Pittsburgh)
After-class exercises and questions: the day after most classes, short answers
Office hours
Me: Zoom Tuesdays 3:40–4:30 (= right after class)
Piazza Wednesdays 2:00–3:00
TA: On Piazza, day/time TBD (hopefully Thursdays)
Lectures
Clarification, amplification, examples, alternatives
Please ask questions
I will often ask you to solve some problem during lecture
Not graded, but students find the discussion of the solutions very helpful
And more helpful if you’ve tried it yourself
Not recorded
Assignments/Grading
After-class review questions and exercises (10%)
Weekly homework (90%)
NO exams
After-class review questions and exercises
10% of your total grade
Specific questions about stuff we went over in class, and/or in the readings
Turn in the day after each class
Due at 6 pm (Pittsburgh time) the next day, so \(> 24\) hr to do them
Should take \(\approx 10\) minutes
But it’s untimed
Usually short-answer questions, may sometimes be multiple choice
Hand-written math OK as needed
Scan or just take a picture with your phone and upload
Dark ink on white paper works best
All equally weighted
Lowest 4 dropped, no questions asked
Homework
90% of your total grade
Mostly theory / math, occasional computing
You can use any programming language, but don’t expect help for anything other than R
Always turned in electronically, as PDF, via Gradescope
Strongly advise using LaTeX or at least R Markdown
Due at 6 pm on Thursday (Pittsburgh time) every week
Except this first week (no homework)
and 13 April (to accommodate Carnival)
Lowest 3 dropped, no questions asked
If you turn in all 13, and get at least 60% on each, lowest 4 dropped
NO LATE HOMEWORK FOR ANY REASON
Turn in as many incomplete or rough versions as you like well ahead of the deadline
Collaboration, Cheating & Plagiarism
Everything you turn in for a grade needs to be your own work, or an acknowledged borrowing from an approved source
Full policy for this class is part of the syllabus
Read that policy and the CMU policy on academic integrity
Turn in Homework 0 on these policies by Thursday next week
Not part of your final grade, but you won’t get credit for any work until you complete it successfully
Any questions?
What are the big issues?
We want to make predictions
We don’t care about parameters
We want a model to do the prediction
We don’t care if the model is “true”
We want the model to predict well
What does that mean, exactly?
We want the model to predict well, on average
What does that mean, exactly?
We want the model to predict well, on average, on new data
How on Earth could we guarantee that?
Statistical Learning
Start with a class of models or “machines”
Functions that map input values (“features”) into predictions
And a data set
And a way of measuring how good or bad a prediction is
Pick the model in the class that best fits the training data
Perhaps subject to some constraints
Perhaps requiring this to run in polynomial time
These are optimization problems
We care about expected fit to new data (from the same distribution)
Theory: prove future performance isn’t much worse than training
Or even: isn’t much worse than the best possible
Practice: cross-validation
But why do we think CV works? (see the sketch below)
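To make that recipe concrete, here is a minimal sketch in R; everything in it (the sine curve, the polynomial model class, five folds) is an illustrative choice of mine, not something fixed by the slides:

```r
## Sketch of the recipe: a model class, a loss, fitting by optimization,
## and cross-validation to estimate the expected fit to new data
set.seed(36465)
n <- 100
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)           # training data from some distribution

degrees <- 1:10                            # model class: polynomials of degree 1..10
folds <- sample(rep(1:5, length.out = n))  # random 5-fold split

cv_mse <- sapply(degrees, function(d) {
  mean(sapply(1:5, function(k) {
    ## fit by least squares (an optimization problem) on 4/5 of the data
    fit <- lm(y ~ poly(x, d), subset = (folds != k))
    held_out <- (folds == k)
    ## measure badness of predictions (squared error) on the held-out fold
    mean((y[held_out] - predict(fit, newdata = data.frame(x = x[held_out])))^2)
  }))
})

degrees[which.min(cv_mse)]  # the degree CV estimates will predict best
```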
Linear regression, for example
Models: all linear functions of the input variables \(X\), trying to guess \(Y\)
Data set: \((x_1, y_1), \ldots, (x_n, y_n)\) pairs
Badness of predictions: mean squared error
Best fit: ordinary least squares (see the sketch after this list)
Constrained/penalized fits: ridge regression, lasso
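As a hedged illustration (the simulated data and all names below are mine, not the slides’), OLS is exactly the minimizer of mean squared error over the linear model class, and ridge regression is the same problem with a squared-norm penalty added:

```r
## OLS and ridge as explicit optimization problems, on simulated data
set.seed(36465)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, -1, 0.5)
y <- X %*% beta_true + rnorm(n)

Xd <- cbind(1, X)                             # design matrix with intercept
beta_ols <- solve(t(Xd) %*% Xd, t(Xd) %*% y)  # minimizes mean squared error
mean((y - Xd %*% beta_ols)^2)                 # training MSE of the best fit

lambda <- 1  # ridge: add a penalty (penalizing the intercept too, for brevity)
beta_ridge <- solve(t(Xd) %*% Xd + lambda * diag(p + 1), t(Xd) %*% y)
```

The first `solve()` reproduces what `lm(y ~ X)` computes; the constrained/penalized fits on the slide differ only in which optimization problem is being solved.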
Linear regression, for example
You know a lot about estimating and interpreting the coefficients
In this class, we don’t care about the coefficients
In this class, all we care about are the predictions
What’s the expected fit on new data?
Can we guarantee that the model will fit new data well?
Can we estimate how well it will fit new data? (see the sketch below)
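One way to see both questions at once, assuming we can simulate (the setup below is an illustrative choice of mine): fit on training data, then check the fit on a fresh draw from the same distribution; the gap is the over-optimism of the training fit:

```r
## Training fit vs. fit on new data from the same distribution
set.seed(36465)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)                  # only the first feature matters

fit <- lm(y ~ X)
mse_train <- mean(residuals(fit)^2)     # in-sample fit: looks good

X_new <- matrix(rnorm(n * p), n, p)     # new data, same distribution
y_new <- X_new[, 1] + rnorm(n)
mse_new <- mean((y_new - cbind(1, X_new) %*% coef(fit))^2)

c(train = mse_train, new = mse_new)     # new-data MSE is typically much larger
```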
Next time
Making ideas like “predictive model” and “fit to the data” precise, using decision theory
One framework will handle regression, classification, ranking, density estimation, …
Backup: Where did statistical learning come from?
Statistics + theoretical computer science + optimization
Plus some theoretical biology, cognitive science, physics…
Made practical by desktop computers
Didn’t exist in 1980, definitely there by 1995
Someone should write a history of this!
Statistics: cross-validation, convergence of random functions, decision theory
Computer science: models, emphasis on classifiers, concern with efficient algorithms
Mathematical programming: posing and solving optimization problems