Reminder: statistical (regression) models

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called regression model for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate parameters. Why?

To assess model validity, predictor importance (inference)
To predict future \(Y\)’s from future \(X_1,\ldots,X_p\)’s (prediction)

Shifting tides: a focus on prediction

Classically, statistics has focused in large part on inference. The tides are shifting (at least in some part), and in many modern problems, the following view is taken:

Models are only approximations; some methods need not even have underlying models; let’s evaluate prediction accuracy, and let this determine model/method usefulness

This is (in some sense) one of the basic tenets of machine learning

“Early” influential paper

versus ?

Statistical prediction machines

Some methods for predicting \(Y\) from \(X_1,\ldots,X_p\) have (in a sense) no parameters at all. Perhaps better said: they are not motivated from writing down a statistical model like \(Y|X_1,\ldots,X_p \sim P_\theta\)

We’ll call these statistical prediction machines. Admittedly: not a real term, but it’s evocative of what they are doing, and there’s no real consensus terminology. You might also see these described as:

Model-free methods
Distribution-free methods
Machine learning methods

Comment: in a broad sense, most of these methods would have been completely unthinkable before the rise of high-performance computing

\(k\)-nearest neighbors

One of the simplest prediction machines: \(k\)-nearest neighbors regression

Given training data \(X_i=(X_{i1},\ldots,X_{ip})\) and \(Y_i\), \(i=1,\ldots,n\)
Given a new test point \(X^*=(X^*_1,\ldots,X^*_p)\)
Find \(k\)-nearest training points \(X_{(1)},\ldots,X_{(k)}\) to \(X^*\)
Use as our prediction \(\hat{Y^*}=\sum_{i=1}^k Y_{(i)}/k\)

Ask yourself: what happens when \(k=1\)? What happens when \(k=n\)?

Advantages: simple and flexible. Disadvantages: can be slow and cumbersome

From \(k\)-nearest neighbors to trees

Can think of \(k\)-nearest neighbors predictions as being simply given by averages within each element of what is called as Voronoi tesellation: these are polyhedra that partition the predictor space

Regression trees are similar but somewhat different. In a nutshell, they use (nested) rectangles instead of polyhedra. These rectangles are fit through sequential (greedy) split-point determinations

Advantage: easier to make predictions (from split-points). Disadvantage: less flexible

From trees to boosting

Boosting is a method built on top of regression trees in a clever way. To make predictions, can think of taking predictions from a sequence of trees, and combining them with weights (coefficients)

\(\beta_1 \cdot\) \(+\) \(\beta_2 \cdot\) \(+\ldots\)

Advantage: much more flexible than a single tree. Disadvantage: not generally interpretable …

Many, many others

There are many, many other statistical prediction methods out there; examples below. If you’re interesting in learning more, take 36-462 Data Mining, or one of the Introduction to Machine Learning Courses 10-401, 10-601, 10-701

Kernel machines
Random forests
Neural networks
etc.

Two world views?

versus

Conclusion

Neither view is completely right, of course
Inference, prediction are still both extremely important in their own right
And even still, there is much more to statistics than these two
Best you can do: be well-informed about all of your modeling options
Also be well-informed about how you can fit them and assess them!

Statistical Prediction Machines

Statistical Computing, 36-350

Friday December 2, 2016