Lecture 1 — Introduction to the Course
36-465/665, Conceptual Foundations of Statistical Learning
2 February 2021
Conceptual Foundations of Statistical Learning
Statistical learning: how to fit predictive models to training data, usually by solving an optimization problem, so the model will probably predict well, on average, on new data
Conceptual foundations:
What are the essential concepts we need to make those words precise?
What are the basic tools that will let us reason about these concepts?
What are the key results we can achieve with those tools?
What This Class Is Not
Greatest hits collection of different models, fitting techniques, and applications
Take 36-462/662 or 10-601 or 10-701 instead
Latest and hottest models, fitting techniques, and applications
How to get the top score in a prediction contest
Ordinary statistical theory with the serial numbers filed off
Mathematically rigorous
Understanding \(>\) Rigor
Most classes on this are very mathematically advanced and rigorous
Typically presume: measure-theoretic probability, stochastic processes, functional analysis, combinatorics
This class aims to be easy to understand
Starting point: the math and stats you need to have taken 36-401, modern linear regression
Undergrad probability, mathematical statistics, calculus in multiple variables, linear algebra
Also: linear regression, so you have some feel of predictive modeling
We’ll build some additional math from there
Pro: it should be a lot easier to grasp the big ideas
Cons:
Loss of detail / precision
Many claims will be only roughly right (many conditions/qualifications needed to be exactly right)
You’ll need the rigor if you want to advance the theory
So what are we actually going to cover?
Prediction as a decision problem \(\Rightarrow\) basic decision theory
Fitting models to training data
Why the fit to training data is over-optimistic about future performance
Controlling that optimism using:
Probability theory
Measures of model complexity
Generalization-error bounds
The actual fitting/optimization process
Three case studies:
“Kernel machines” (looking for similar cases)
“Random features” (just about any function of the data will work)
Mixture models for densities (complicated curves are collections of bumps)
Course Mechanics
Syllabus: http://www.stat.cmu.edu/~cshalizi/sml/21
All assignments will be posted there
Information about readings and other course resources ditto
Canvas: gradebook, some readings that can’t go on the public web
Gradescope: turn in assignments
Piazza for question-answering
Assignments
Homework: Every week, Thursdays at 6 pm (Pittsburgh)
After-class exercises and questions: the day after most classes, short answers
Office hours
Me: Zoom Tuesdays 3:40–4:30 (= right after class)
Piazza Wednesdays 2:00–3:00
TA: On Piazza, day/time TBD (hopefully Thursdays)
Lectures
Clarification, amplification, examples, alternatives
Please ask questions
I will often ask you to solve some problem during lecture
Not graded, but students find the discussion of the solutions very helpful
And more helpful if you’ve tried it yourself
Not recorded
Assignments/Grading
After-class review questions and exercises (10%)
Weekly homework (90%)
NO exams
After-class review questions and exercises
10% of your total grade
Specific questions about stuff we went over in class, and/or in the readings
Turn in the day after each class
Due at 6 pm (Pittsburgh time) the next day, so \(> 24\) hr to do them
Should take \(\approx 10\) minutes
But it’s untimed
Usually short-answer questions, may sometimes be multiple choice
Hand-written math OK as needed
Scan or just take a picture with your phone and upload
Dark ink on white paper works best
All equally weighted
Lowest 4 dropped, no questions asked
Homework
90% of your total grade
Mostly theory / math, occasional computing
You can use any programming language, but don’t expect help for anything other than R
Always turned in electronically, as PDF, via Gradescope
Strongly advise using LaTeX or at least R Markdown
Due at 6 pm on Thursday (Pittsburgh time) every week
Except this first week (no homework)
and 13 April (to accommodate Carnival)
Lowest 3 dropped, no questions asked
If you turn in all 13, and get at least 60% on each, lowest 4 dropped
NO LATE HOMEWORK FOR ANY REASON
Turn in as many incomplete or rough versions as you like well ahead of the deadline
Collaboration, Cheating & Plagiarism
Everything you turn in for a grade needs to be your own work, or an acknowledged borrowing from an approved source
Full policy for this class is part of the syllabus
Read that policy and the CMU policy on academic integrity
Turn in Homework 0 on these policies by Thursday next week
Not part of your final grade, but you won’t get credit for any work until you complete it successfully
Any questions?
What are the big issues?
We want to make predictions
We don’t care about parameters
We want a model to do the prediction
We don’t care if the model is “true”
We want the model to predict well
What does that mean, exactly?
We want the model to predict well, on average
What does that mean, exactly?
We want the model to predict well, on average, on new data
How on Earth could we guarantee that?
Statistical Learning
Start with a class of models or “machines”
Functions that map input values (“features”) into predictions
And a data set
And a way of measuring how good or bad a prediction is
Pick the model in the class that best fits the training data
Perhaps subject to some constraints
Perhaps requiring this to run in polynomial time
These are optimization problems
We care about expected fit to new data (from the same distribution)
Theory: prove future performance isn’t much worse than training
Or even: isn’t much worse than the best possible
Practice: cross-validation
But why do we think CV works? (see the sketch below)
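To make that recipe concrete, here is a minimal sketch in R; everything in it (the sine curve, the polynomial model class, five folds) is an illustrative choice of mine, not something fixed by the slides:

```r
## Sketch of the recipe: a model class, a loss, fitting by optimization,
## and cross-validation to estimate the expected fit to new data
set.seed(36465)
n <- 100
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)           # training data from some distribution

degrees <- 1:10                            # model class: polynomials of degree 1..10
folds <- sample(rep(1:5, length.out = n))  # random 5-fold split

cv_mse <- sapply(degrees, function(d) {
  mean(sapply(1:5, function(k) {
    ## fit by least squares (an optimization problem) on 4/5 of the data
    fit <- lm(y ~ poly(x, d), subset = (folds != k))
    held_out <- (folds == k)
    ## measure badness of predictions (squared error) on the held-out fold
    mean((y[held_out] - predict(fit, newdata = data.frame(x = x[held_out])))^2)
  }))
})

degrees[which.min(cv_mse)]  # the degree CV estimates will predict best
```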
Linear regression, for example
Models: all linear functions of the input variables \(X\), trying to guess \(Y\)
Data set: \((x_1, y_1), \ldots, (x_n, y_n)\) pairs
Badness of predictions: mean squared error
Best fit: ordinary least squares (see the sketch after this list)
Constrained/penalized fits: ridge regression, lasso
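As a hedged illustration (the simulated data and all names below are mine, not the slides’), OLS is exactly the minimizer of mean squared error over the linear model class, and ridge regression is the same problem with a squared-norm penalty added:

```r
## OLS and ridge as explicit optimization problems, on simulated data
set.seed(36465)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, -1, 0.5)
y <- X %*% beta_true + rnorm(n)

Xd <- cbind(1, X)                             # design matrix with intercept
beta_ols <- solve(t(Xd) %*% Xd, t(Xd) %*% y)  # minimizes mean squared error
mean((y - Xd %*% beta_ols)^2)                 # training MSE of the best fit

lambda <- 1  # ridge: add a penalty (penalizing the intercept too, for brevity)
beta_ridge <- solve(t(Xd) %*% Xd + lambda * diag(p + 1), t(Xd) %*% y)
```

The first `solve()` reproduces what `lm(y ~ X)` computes; the constrained/penalized fits on the slide differ only in which optimization problem is being solved.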
Linear regression, for example
You know a lot about estimating and interpreting the coefficients
In this class, we don’t care about the coefficients
In this class, all we care about are the predictions
What’s the expected fit on new data?
Can we guarantee that the model will fit new data well?
Can we estimate how well it will fit new data? (see the sketch below)
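One way to see both questions at once, assuming we can simulate (the setup below is an illustrative choice of mine): fit on training data, then check the fit on a fresh draw from the same distribution; the gap is the over-optimism of the training fit:

```r
## Training fit vs. fit on new data from the same distribution
set.seed(36465)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)                  # only the first feature matters

fit <- lm(y ~ X)
mse_train <- mean(residuals(fit)^2)     # in-sample fit: looks good

X_new <- matrix(rnorm(n * p), n, p)     # new data, same distribution
y_new <- X_new[, 1] + rnorm(n)
mse_new <- mean((y_new - cbind(1, X_new) %*% coef(fit))^2)

c(train = mse_train, new = mse_new)     # new-data MSE is typically much larger
```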
Next time
Making ideas like “predictive model” and “fit to the data” precise, using decision theory
One framework will handle regression, classification, ranking, density estimation, …
Backup: Where did statistical learning come from?
Statistics + theoretical computer science + optimization
Plus some theoretical biology, cognitive science, physics…
Made practical by desktop computers
Didn’t exist in 1980, definitely there by 1995
Someone should write a history of this!
Statistics: cross-validation, convergence of random functions, decision theory
Computer science: models, emphasis on classifiers, concern with efficient algorithms
Mathematical programming: posing and solving optimization problems