\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \]

Previously

We want to use data to learn rules which we can be confident will predict well on average on new cases
All the terms in that phrase have to be made precise
Today we’re going to focus on “rules” and “predict well on average”

Prediction

Prediction is a guess about some event we haven’t seen yet, but could see
- Inference, but to an observable, not a parameter of the distribution
- “The next roll of these 3 dice will be 18” vs. “The probability of getting 18 is 1/216”
We’re interested in predictions done according to rules
Rules are functions from inputs to outputs
- Doesn’t presume the actual target is a function of the inputs

Good and bad predictions

We need a way of saying whether a rule is working well or not
Predictions that come true are better than those that don’t
But are all mistakes equally bad?
- Predicting 6 inches of snow when the reality is 5 seems not as bad as predicting 10 inches, or 0 inches
- Predicting someone’s healthy when they’re sick seems worse than the other way around
This is where decision theory comes in

The elements of a decision problem

Possible actions \(A\)
Information \(X\), which we get to see before taking an action
States \(Y\) picked by Nature
A strategy \(s\) is a function from \(X\) (information) to \(A\) (action)
- There is usually some class of strategies \(S\) available
A loss function \(\Loss(y,a)\): how much it hurts to take action \(a\) when the state turns out to be \(y\)

The loss function is crucial but not enough on its own

The risk of a strategy

The risk of a strategy is its expected loss, averaging over \(X\) and \(Y\) \[ \Risk(s) = \Expect{\Loss(Y, s(X))} \]
This assumes that \(X\) and \(Y\) are both random variables with a joint distribution, say \(P(X,Y)\)
- For now, our actions and strategy don’t change \(P\)
- We’ll come back to decisions where our actions matter later in the course
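
A minimal simulation sketch of this definition. It assumes we can sample from the joint distribution, which in practice we cannot; the distribution in `draw_xy`, the rule `strategy`, and the squared-error `loss` are all illustrative choices, not part of the lecture.

```python
import random

def loss(y, a):
    """Squared-error loss (an illustrative choice of loss function)."""
    return (y - a) ** 2

def strategy(x):
    """A hypothetical prediction rule: guess y = 2x."""
    return 2.0 * x

def draw_xy(rng):
    """Draw one (x, y) pair from an assumed joint distribution P(X, Y)."""
    x = rng.gauss(0.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 0.5)   # noise with standard deviation 0.5
    return x, y

def estimate_risk(s, n=100_000, seed=0):
    """Monte Carlo estimate of r(s) = E[loss(Y, s(X))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = draw_xy(rng)
        total += loss(y, s(x))
    return total / n

risk_hat = estimate_risk(strategy)
print(f"estimated risk: {risk_hat:.4f}")
```

Under these assumptions the exact risk of `strategy` is the noise variance, \(0.5^2 = 0.25\), so the estimate should land close to that.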

Risk minimization

Loss is bad, risk is expected loss \(\Rightarrow\) try to minimize risk
Use the law of total expectations: \[ \Expect{\Loss(Y, s(X))} = \Expect{\Expect{\Loss(Y, s(X))|X}} \]
- Inner expectation is the conditional risk
Now define \[ \OptimalStrategy(x) \equiv \argmin_{a \in A}{\Expect{\Loss(Y, a)|X=x}} \]
- Take the action that minimizes the conditional expected loss
- “Do what’s best, given what you know”

Minimizing the conditional risk really is optimal

Minimizing the conditional risk everywhere minimizes the overall risk: \[ \OptimalStrategy = \argmin_{s: X \to A}{\Expect{\Loss(Y, s(X))}} \]
This is worth proving
It’s enough to show that for any other strategy \(s\), \(\Risk(s) - \Risk(\OptimalStrategy) \geq 0\) (why?) \[\begin{eqnarray} \Risk(s) - \Risk(\OptimalStrategy) & = & \Expect{\Loss(Y, s(X)) - \Loss(Y, \OptimalStrategy(X))}\\ & = & \Expect{\Expect{\Loss(Y, s(X)) - \Loss(Y, \OptimalStrategy(X))|X}} \end{eqnarray}\] but for each \(x\), \[ \Expect{\Loss(Y, s(x))|X=x} \geq \Expect{\Loss(Y, \OptimalStrategy(x))|X=x} \] (why?), so the inner conditional expectation is non-negative at every \(x\), and averaging it over \(X\) leaves it non-negative
Write \(\Risk_0\) for the minimal risk \(\Risk(\OptimalStrategy)\)
- Generally not 0
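
A small check of this result on a toy problem, where every strategy can be enumerated. The joint distribution table, the action set and the absolute-error loss are assumptions made for illustration only.

```python
from itertools import product

# Assumed joint distribution P(X = x, Y = y), keyed by (x, y).
joint = {
    (0, 0): 0.30, (0, 1): 0.15, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.10, (1, 2): 0.35,
}
xs = [0, 1]
actions = [0, 1, 2]

def loss(y, a):
    """Absolute-error loss (illustrative)."""
    return abs(y - a)

def conditional_risk(x, a):
    """E[loss(Y, a) | X = x]."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(p * loss(y, a) for (xi, y), p in joint.items() if xi == x) / px

def risk(s):
    """Overall risk of a strategy s, given as a dict from x to action."""
    return sum(p * loss(y, s[x]) for (x, y), p in joint.items())

# "Do what's best, given what you know": minimize the conditional risk at each x.
sigma = {x: min(actions, key=lambda a: conditional_risk(x, a)) for x in xs}

# Brute force over all 3**2 = 9 strategies from {0, 1} to {0, 1, 2}.
all_strategies = [dict(zip(xs, choice)) for choice in product(actions, repeat=len(xs))]
best_risk = min(risk(s) for s in all_strategies)

print(sigma, risk(sigma), best_risk)
```

The pointwise minimizer matches the brute-force optimum, and its risk (the \(\Risk_0\) of this toy problem) is strictly positive.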

Minimizing the risk in a class of strategies

Remember \(S\) is the class of strategies we can actually use
\(S\) typically doesn’t contain \(\OptimalStrategy\), so we do the best we can: \[ s^* = \argmin_{s \in S}{\Risk(s)} \]
\(\Risk(s^*) \geq \Risk_0\), maybe much larger, maybe only a little

The approximation-estimation trade-off

A basic decomposition: for any strategy \(s\), \[ \Risk(s) = \Risk_0 + (\Risk(s^*) - \Risk_0) + (\Risk(s) - \Risk(s^*)) \]
\(\Risk_0 =\) true minimum risk
\(\Risk(s^*) - \Risk_0 =\) approximation error (due to using \(S\))
\(\Risk(s) - \Risk(s^*) =\) estimation error (due to not using \(s^*\))
Generally:
- Making \(S\) larger reduces approximation error (better optimum)
- Making \(S\) larger increases estimation error (harder to find the optimum)
We will come back to this over and over through the course
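
A sketch of the decomposition on a toy problem where every term can be computed exactly. Everything here is assumed for illustration: \(X\) uniform on \([-1,1]\), \(Y = X^2 +\) noise, squared-error loss, \(S\) = linear rules, and \(s\) = a line fitted by least squares to 20 simulated points.

```python
import random

NOISE_VAR = 0.25  # assumed variance of the noise in Y

def risk_linear(b0, b1):
    """Exact squared-error risk of the rule b0 + b1*x under the assumed model.

    Uses E[X] = 0, E[X^2] = 1/3, E[X^3] = 0, E[X^4] = 1/5 for X ~ Uniform(-1, 1).
    """
    return NOISE_VAR + 1 / 5 - 2 * b0 / 3 + b0 ** 2 + b1 ** 2 / 3

def fit_line(xs, ys):
    """Ordinary least squares for a single regressor; returns (intercept, slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20)]
ys = [x ** 2 + rng.gauss(0.0, NOISE_VAR ** 0.5) for x in xs]
b0_hat, b1_hat = fit_line(xs, ys)

r0 = NOISE_VAR                     # risk of the true regression function x**2
r_star = risk_linear(1 / 3, 0.0)   # best linear rule: intercept E[Y] = 1/3, slope 0
r_s = risk_linear(b0_hat, b1_hat)  # risk of the line we actually fitted

approximation_error = r_star - r0
estimation_error = r_s - r_star
print(r0, approximation_error, estimation_error, r_s)
```

Here the approximation error is \(\Var{X^2} = 4/45\) however much data we have, while the estimation error depends on the particular sample and would shrink with more data.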

Back to prediction problems

Actions = predictions
Information = covariates, regressors, features (etc.)
States = the target variable we’re trying to predict
Strategy = prediction rule = function from information to actions
Loss function = ?

Different loss functions will give us different risks for the same strategy
Different loss functions will lead to different optimal prediction rules

Regression, for example

Actions = predictions = real numbers = guesses at the regressand
Information = vectors of real numbers = covariates, regressors (“independent variables”)
States = “dependent variable”, “regressand”
Strategy = prediction rule = regression function
Loss function = ?

The usual loss function is squared error, \(\Loss(y,a) = (a-y)^2\)
Risk then is expected squared error
The minimizer of \(\Expect{(Y-a)^2}\) is \(a=\Expect{Y}\)
The minimizer of \(\Expect{(Y-a)^2|X=x}\) is \(a=\Expect{Y|X=x}\)
The true or optimal regression function is \(m(x) = \Expect{Y|X=x}\), the conditional mean function
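
A short derivation of why the mean is the minimizer, as a sketch assuming \(Y\) has finite variance. Apply the identity \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\) to \(Z = Y - a\): \[\begin{eqnarray} \Expect{(Y-a)^2} & = & \left(\Expect{Y-a}\right)^2 + \Var{Y-a}\\ & = & \left(\Expect{Y} - a\right)^2 + \Var{Y} \end{eqnarray}\]
- The second term does not depend on \(a\); the first is zero exactly when \(a = \Expect{Y}\)
- The same argument, conditioning on \(X=x\) throughout, gives \(a = \Expect{Y|X=x}\), with minimum value \(\Var{Y|X=x}\)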

Linear regression, for example

Generally the conditional mean function is not linear in \(x\), and can be very far from linear
What if we’re only allowed to use linear functions of \(x\)?
We can do the algebra if \(X\) is a scalar (using \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\)): \[\begin{eqnarray} \Expect{(Y- b_0 - b_1 X)^2} & = & \left(\Expect{Y - b_0 - b_1 X}\right)^2 + \Var{Y - b_0 - b_1 X}\\ & = & \left(\Expect{Y} - b_0 - b_1 \Expect{X}\right)^2 + \Var{Y - b_1 X}\\ & = & \left(\Expect{Y} - b_0 - b_1 \Expect{X}\right)^2 + \Var{Y} + b_1^2 \Var{X} - 2b_1 \Cov{X,Y} \end{eqnarray}\]
Now minimize (and use Greek letters to mark the minimum): \[\begin{eqnarray} \beta_0 &= & \Expect{Y} - \beta_1 \Expect{X}\\ \beta_1 & = & \Cov{X,Y}/\Var{X}\\ s^*(x) & = & \beta_0 + \beta_1 x\\ & = & \Expect{Y} + \frac{\Cov{X,Y}}{\Var{X}}(x - \Expect{X}) \end{eqnarray}\] The expected squared error is \[ \Expect{(Y-s^*(X))^2} = \Var{Y} - \frac{(\Cov{X,Y})^2}{\Var{X}} = \Risk(s^*) \]
(Similarly for multivariate \(X\) but more linear algebra)
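
A numerical check of these formulas, as a sketch. The data-generating process is an assumption chosen to be nonlinear in \(x\), and all moments are taken under the empirical distribution of the simulated sample, where the algebra above holds exactly.

```python
import random
from statistics import fmean

rng = random.Random(0)
# Assumed data-generating process, deliberately nonlinear in x.
xs = [rng.uniform(0.0, 2.0) for _ in range(5000)]
ys = [x ** 2 + rng.gauss(0.0, 0.3) for x in xs]

def cov(us, vs):
    """Covariance under the empirical distribution (divides by n)."""
    ubar, vbar = fmean(us), fmean(vs)
    return fmean((u - ubar) * (v - vbar) for u, v in zip(us, vs))

# Optimal linear coefficients from the moment formulas; cov(xs, xs) is Var[X].
beta1 = cov(xs, ys) / cov(xs, xs)
beta0 = fmean(ys) - beta1 * fmean(xs)

def mse(b0, b1):
    """Expected squared error of the linear rule b0 + b1*x on the sample."""
    return fmean((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

risk_direct = mse(beta0, beta1)
risk_formula = cov(ys, ys) - cov(xs, ys) ** 2 / cov(xs, xs)
print(beta0, beta1, risk_direct, risk_formula)
```

The two risk values agree up to rounding, and moving either coefficient away from \((\beta_0, \beta_1)\) makes `mse` larger.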

Alternative loss functions

Remember all this is with squared error as the loss function
Absolute error, \(\Loss(y,a) = |y-a|\)
- Risk minimized with median, not mean
0-1 or Hamming error: \(0\) if \(y=a\), \(1\) if \(y\neq a\)
- Risk minimized with the mode
Huber’s robust error: squared error for small \(|y-a|\), switching over continuously to absolute (linear) error for large \(|y-a|\)
- No closed form
Tolerance region: zero error if \(|y-a| \leq \epsilon\), then growing (say) linearly in \(|y-a|\)
- Also no closed form
Asymmetric errors if over-shooting is better (or worse) than under-shooting
Some of these are easier to work with than others, but that doesn’t make them application-appropriate
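
A brute-force illustration of how the minimizer changes with the loss, as a sketch. The five values in `ys` and the grid of candidate actions are made up, and "risk" here means the average loss under the empirical distribution of `ys`.

```python
ys = [1, 1, 2, 4, 12]                     # mean 4, median 2, mode 1
candidates = [k / 2 for k in range(25)]   # candidate actions 0, 0.5, ..., 12

losses = {
    "squared": lambda y, a: (y - a) ** 2,
    "absolute": lambda y, a: abs(y - a),
    "zero_one": lambda y, a: 0 if y == a else 1,
}

def risk(loss, a):
    """Average loss of always taking action a, over the values in ys."""
    return sum(loss(y, a) for y in ys) / len(ys)

# For each loss function, search the grid for the risk-minimizing action.
best = {
    name: min(candidates, key=lambda a: risk(loss, a))
    for name, loss in losses.items()
}
print(best)
```

The same data gives three different "best guesses": the mean under squared error, the median under absolute error, and the mode under 0-1 error.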

Some losses for classifiers

Classification = predicting a categorical variable
0-1 loss: \(\Loss(y,a) = 0\) if \(a=y\), \(\Loss(y,a)=1\) if \(y\neq a\)
- Makes sense when the actions are class labels
- Minimized by predicting the most probable class
Weighted losses: \(\Loss(y,a) = L_{ya}\) for some matrix \(L\), whose entry \(L_{ya}\) says how bad it is to predict \(a\) when the reality is \(y\)
- e.g. “you said this person didn’t have cancer when they really did” vs. “you made this person go in for additional tests when they were fine”
- also makes sense when the actions are class labels
Maybe we predict the probability that \(Y=1\) (rather than \(Y=0\)) so \(A=[0,1]\)
Log loss: \(\Loss(y,a) = -y\log{a} - (1-y)\log{(1-a)}\)
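
A sketch of how a weighted loss changes the optimal label, for two classes. The cost matrix `weighted` (a missed case counted as ten times worse than a false alarm) and the probability `p_sick` are hypothetical numbers.

```python
def best_action(p1, loss_matrix):
    """Label in {0, 1} minimizing expected loss when P(Y=1 | x) = p1.

    loss_matrix[y][a] is the loss for predicting a when the truth is y.
    """
    probs = [1.0 - p1, p1]

    def expected_loss(a):
        return sum(probs[y] * loss_matrix[y][a] for y in (0, 1))

    return min((0, 1), key=expected_loss)

zero_one = [[0, 1], [1, 0]]
weighted = [[0, 1], [10, 0]]   # hypothetical: a miss costs 10, a false alarm costs 1

p_sick = 0.2
label_zero_one = best_action(p_sick, zero_one)
label_weighted = best_action(p_sick, weighted)
print(label_zero_one, label_weighted)
```

With these costs, predicting 1 has lower expected loss whenever \(10p > 1-p\), i.e. \(p > 1/11\), rather than \(p > 1/2\) as under 0-1 loss.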

0-1 loss vs. log loss

0-1 loss just cares about whether your probability is on the correct side of \(1/2\)
Log loss wants you to get the probability just right, and is more upset when you’re confident and wrong
Smooth functions (like log loss) are often easier to work with theoretically and computationally, but 0-1 is more forgiving of getting the distribution wrong…
Choosing a loss function is not something decision theory helps us with…
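
A sketch comparing the two losses on a few predicted probabilities when the truth is \(Y=1\), followed by a grid search for the probability that minimizes expected log loss. The specific probabilities and the grid are illustrative choices.

```python
import math

def log_loss(y, a):
    """Log loss for predicted probability a in (0, 1) and outcome y in {0, 1}."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def zero_one_loss(y, a):
    """0-1 loss after thresholding the predicted probability at 1/2."""
    return 0 if (a > 0.5) == (y == 1) else 1

# Truth is y = 1: compare confident and hesitant, right and wrong predictions.
table = [(a, zero_one_loss(1, a), log_loss(1, a)) for a in (0.99, 0.6, 0.4, 0.01)]
for a, z, l in table:
    print(f"a={a:<5} 0-1 loss={z}  log loss={l:.3f}")

def expected_log_loss(p, a):
    """Expected log loss of predicting a when the true P(Y=1) is p."""
    return p * log_loss(1, a) + (1 - p) * log_loss(0, a)

grid = [k / 100 for k in range(1, 100)]
best_a = min(grid, key=lambda a: expected_log_loss(0.7, a))
print(best_a)
```

0-1 loss cannot tell 0.6 from 0.99, or 0.4 from 0.01; log loss can, and penalizes the confident wrong answer most. The grid search recovers the true probability as the minimizer of expected log loss.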

Connecting to data

I promised we’d focus on the “rules” and “predict well on average” parts of “learn rules from data that will predict well, on average, on new cases”
Rules are strategies
“predict well on average” = low risk
Risk is defined as an expectation using the true distribution
We don’t know the true distribution
We just have limited data
Next time: trying to minimize risk using empirical data (“what could possibly go wrong?”)

Back-up: Alternatives to minimizing risk

Risk is expected loss
Other things we could minimize:
- Median loss
- 95th (99th, 99.9999th) percentile of loss
- Maximum loss (“minimax”)
- Probability of one specific type of error
We could not minimize at all:
- Any strategy with a risk (median loss, etc.) below some threshold is OK (“satisficing” instead of optimizing)
- Any strategy where \(\Prob{\Loss(Y, s(X)) > \epsilon} < \delta\) is OK
But risk is traditional:
- It makes sense if you’re working “actuarially”, looking for rules that will be OK applied across a large population
- Minimax can get pretty paranoid
- The math is clean
- Preferences that meet some axioms can be “rationalized” as minimizing risk
  - Some of the axioms are hard to swallow
- There’s a lot of tradition to draw on
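
A sketch of how the alternative criteria listed above can disagree about the same strategy. The ten loss values are hypothetical, and `percentile` uses the simple nearest-rank convention, one of several in use.

```python
import math
from statistics import fmean, median

# Hypothetical per-case losses of one strategy on ten cases.
losses = [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 1.3, 2.1, 9.0]

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least a fraction q at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def exceedance_prob(values, eps):
    """Fraction of cases whose loss exceeds eps."""
    return sum(v > eps for v in values) / len(values)

criteria = {
    "risk (mean loss)": fmean(losses),
    "median loss": median(losses),
    "95th percentile of loss": percentile(losses, 0.95),
    "maximum loss": max(losses),
    "P(loss > 1)": exceedance_prob(losses, 1.0),
}
for name, value in criteria.items():
    print(f"{name}: {value:.2f}")
```

One large loss dominates the mean and the maximum but barely moves the median, so a strategy that looks fine by one criterion can look poor by another.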

Back-up: Why decision theory?

Jerzy Neyman (2nd greatest statistician of the 20th century): forget about inductive inference, study rules of inductive behavior
Abraham Wald: reformulates inference as decision problems, shows how to connect to practical things like quality control and how to fight WWII
Statistical theorists everywhere after the war: yes! use decision theory to find optimal procedures for all the inference problems!
Statistical learning: inherited decision theory from theoretical statistics
- The people coming from computer science were, at least to begin with, fixated on what we’d call 0-1 loss for classification, and situations where the minimum risk was exactly 0

Predictions and Decision Theory