You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them.
So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a regression model for \(Y\) given \(X_1,\ldots,X_p\). The goal is to estimate the parameters \(\theta\). Why?
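For example, the classical linear regression model posits \(Y|X_1,\ldots,X_p \sim N(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \sigma^2)\), so that the unknown parameters are \(\theta = (\beta_0,\beta_1,\ldots,\beta_p,\sigma^2)\)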
Classically, statistics has focused in large part on inference. The tides are shifting (at least in some part), and in many modern problems, the following view is taken:
Models are only approximations; some methods need not even have underlying models; let’s evaluate prediction accuracy, and let this determine model/method usefulness
This is (in some sense) one of the basic tenets of machine learning
Some methods for predicting \(Y\) from \(X_1,\ldots,X_p\) have (in a sense) no parameters at all. Perhaps better said: they are not motivated from writing down a statistical model like \(Y|X_1,\ldots,X_p \sim P_\theta\)
We’ll call these statistical prediction machines. Admittedly: not a real term, but it’s evocative of what they are doing, and there’s no real consensus terminology; you may see these described under various other names
Comment: in a broad sense, most of these methods would have been completely unthinkable before the rise of high-performance computing
One of the simplest prediction machines: \(k\)-nearest neighbors regression
Ask yourself: what happens when \(k=1\)? What happens when \(k=n\)?
Advantages: simple and flexible. Disadvantages: can be slow and cumbersome
Can think of \(k\)-nearest neighbors predictions as being simply given by averages within each element of what is called a Voronoi tessellation: these are polyhedra that partition the predictor space
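To make the mechanics concrete, here is a minimal sketch of \(k\)-nearest neighbors regression in Python (the function name knn_predict and the toy data are just for illustration, not part of any particular library):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Predict at each row of X_new by averaging the responses of the
    k nearest training points (Euclidean distance)."""
    preds = np.empty(len(X_new))
    for i, x in enumerate(X_new):
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]      # indices of the k closest training points
        preds[i] = y_train[nearest].mean()   # average their responses
    return preds

# toy example: noisy sine curve in one predictor
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(knn_predict(X, y, X[:5], k=5))
```

Note how the special cases fall out: with \(k=1\) each prediction is just the response of the single closest point, and with \(k=n\) every prediction is the overall mean of \(y\)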
Regression trees are similar but somewhat different. In a nutshell, they use (nested) rectangles instead of polyhedra. These rectangles are fit through sequential (greedy) split-point determinations
Advantage: easier to make predictions (from split-points). Disadvantage: less flexible
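As a rough sketch of one greedy split-point determination (a full regression tree would apply this recursively within each rectangle and across all predictors; best_split is an illustrative name, not a standard function):

```python
import numpy as np

def best_split(x, y):
    """Greedy split-point search on a single predictor: try each candidate
    split and keep the one minimizing the within-rectangle sum of squares."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_sse, best_point = np.inf, None
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_point = sse, (x[i - 1] + x[i]) / 2  # midpoint between neighbors
    return best_point

# toy example: a step function plus noise, so the best split should be near 5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = (x > 5).astype(float) + 0.1 * rng.normal(size=200)
print(best_split(x, y))
```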
Boosting is a method built on top of regression trees in a clever way. To make predictions, can think of taking predictions from a sequence of trees, and combining them with weights (coefficients)
\(\beta_1 \cdot \text{(tree 1)} \,+\, \beta_2 \cdot \text{(tree 2)} \,+\, \ldots\)
Advantage: much more flexible than a single tree. Disadvantage: not generally interpretable …
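One concrete way to see this combination of trees is least-squares boosting with small trees fit to residuals, sketched below (this uses scikit-learn's DecisionTreeRegressor as the base tree; the names boost and boost_predict, and the choice of learning rate, are just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, lr=0.1, depth=1):
    """Fit a sequence of small trees, each to the residuals of the current
    fit; predictions are a weighted sum of all trees' predictions."""
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, y - pred)            # fit the current residuals
        pred += lr * tree.predict(X)     # add this tree's (weighted) predictions
        trees.append(tree)
    return trees

def boost_predict(trees, X, lr=0.1):
    """Combine the sequence of trees with common weight lr."""
    return lr * sum(t.predict(X) for t in trees)

# toy example: noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
trees = boost(X, y)
print(boost_predict(trees, X[:5]))
```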
There are many, many other statistical prediction methods out there; examples below. If you’re interested in learning more, take 36-462 Data Mining, or one of the Introduction to Machine Learning courses 10-401, 10-601, 10-701