\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Prob}[1]{\Pr\left( #1 \right)} \]

Setting: Prediction, including both classification and regression

Let’s fix our setting. As usual, we have a database of \(n\) items, represented as vectors of \(p\) features. Following the usual notation for regression courses, we’ll write this as an \(n\times p\) matrix \(\mathbf{x}\); the vector for data-point \(i\) will be \(\vec{x}_i\). (To get to this point, we may have done some dimension reduction as a pre-processing step, but that won’t matter for us here.) Beyond these features, we have an additional variable for each item that we want to predict, based on the features. We’ll write it \(y_i\) for data-point \(i\), compiled into the \(n\times 1\) matrix \(\mathbf{y}\) (again, this is regression notation). This variable is called the label, outcome, target, output or (oddly) dependent variable (sometimes even just called the predictand).

A prediction here is going to be a function of the features which outputs a guess (“point prediction”) about the outcome or label.

Prediction quality, risk, optimal risk and optimal predictors

How good is the guess?

  • Regression: measure with expected squared error, so \(\Expect{(Y-m(\vec{X}))^2}\) indicates how bad the function \(m\) is.
  • Classification: measure with inaccuracy or error rate, so \(\Prob{Y \neq m(\vec{X})}\) indicates how bad the function \(m\) is.

This expected error on new data is called the risk. In predictive modeling, we want to learn functions with low risk.
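
We can’t compute these expectations exactly without knowing the data-generating distribution, but given a large fresh sample of \((\vec{x}, y)\) pairs from the same source, we can estimate the risk of any candidate function by averaging its losses. Here is a minimal sketch in Python; the function names and the use of plain numpy arrays are just my choices for illustration.

```python
import numpy as np

def regression_risk(m, X_new, y_new):
    """Estimate E[(Y - m(X))^2] by the mean squared error on fresh data."""
    preds = np.array([m(x) for x in X_new])
    return np.mean((y_new - preds) ** 2)

def classification_risk(c, X_new, y_new):
    """Estimate Pr(Y != c(X)) by the error rate on fresh data."""
    preds = np.array([c(x) for x in X_new])
    return np.mean(preds != y_new)
```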

There’s an optimal function, i.e., one whose risk is at least as small as that of any other function.

  • Regression: The optimum function is \(\mu(\vec{x}) = \Expect{Y|\vec{X}=\vec{x}}\). This is called the true regression function.
  • Classification: The optimum function1 depends on the conditional probability \(\Prob{Y=1|\vec{X}=\vec{x}} \equiv p(\vec{x})\). The function is \(c(\vec{x}) = 1\) if \(p(\vec{x}) \geq 0.5\) and \(c(\vec{x}) = 0\) otherwise.

Even the optimal function will not, in general, have zero risk.

  • Regression: Suppose \(Y=\mu(\vec{X}) + \epsilon\), where the noise \(\epsilon\) has \(\Expect{\epsilon|\vec{X}} = 0\), \(\Var{\epsilon|\vec{X}=\vec{x}} = \sigma^2(\vec{x})\). (In your linear-regression class, you would have assumed this, plus that \(\sigma^2\) is constant.) Then the risk of the true regression function is \(\Expect{\sigma^2(\vec{X})} > 0\).
  • Classification: Suppose \(p(\vec{x})\) isn’t either 0 or 1 everywhere. Then the probability of mis-classifying at \(\vec{x}\) is \(p(\vec{x})\) if \(p(\vec{x}) < 0.5\) (because there \(c(\vec{x}) = 0\)), and \(1-p(\vec{x})\) if \(p(\vec{x}) \geq 0.5\). A little thought shows that we can write this conditional probability of error in a unified way2 as \(\min{\left\{p(\vec{x}),1-p(\vec{x})\right\}}\). The risk of the optimal classifier is then \(\Expect{\min{\left\{p(\vec{X}),1-p(\vec{X})\right\}}}\). This minimal risk will be \(>0\) unless \(p(\vec{x})=0\) or \(=1\) almost everywhere.
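
To see where both of the regression claims above come from, here is the standard one-step decomposition of the conditional risk of an arbitrary function \(m\) (conditioning on \(\vec{X}=\vec{x}\) throughout): \[\begin{eqnarray} \Expect{(Y-m(\vec{x}))^2|\vec{X}=\vec{x}} & = & \Expect{\left((Y-\mu(\vec{x})) + (\mu(\vec{x}) - m(\vec{x}))\right)^2|\vec{X}=\vec{x}}\\ & = & \Expect{(Y-\mu(\vec{x}))^2|\vec{X}=\vec{x}} + (\mu(\vec{x}) - m(\vec{x}))^2\\ & = & \sigma^2(\vec{x}) + (\mu(\vec{x}) - m(\vec{x}))^2 \end{eqnarray}\] The cross term drops out because \(\Expect{Y-\mu(\vec{x})|\vec{X}=\vec{x}} = 0\). The conditional risk is minimized by taking \(m(\vec{x}) = \mu(\vec{x})\), and even at the minimum it equals \(\sigma^2(\vec{x})\); averaging over \(\vec{X}\) gives the overall risk \(\Expect{\sigma^2(\vec{X})}\).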

Unfortunately, the optimal function depends on the true distribution generating the data. So does the risk of the optimal function. What we want, then, is some way of estimating a function from the data which can learn what the true function is, or at least learn to predict almost as well as the true function, without having to know in advance too much about that function.

Nearest neighbors as a predictor

This is where nearest neighbors comes in.

In this context, “distance” always refers to distances between the \(p\)-dimensional feature vectors. The nearest neighbor of a vector \(\vec{x}\) is the \(\vec{x}_i\) closest to it. The \(k\) nearest neighbors are the \(k\) vectors \(\vec{x}_i\) closest to \(\vec{x}\). (Notice that these definitions make sense whether or not \(\vec{x}\) is also one of the \(\vec{x}_i\).) We will often need a way of keeping track of the indices of the neighbors, so we’ll write \(NN(\vec{x}, j)\) for the index of the \(j^{\mathrm{th}}\) nearest neighbor of \(\vec{x}\).

The k-nearest-neighbor estimate of the regression function is then the average value of the response over the \(k\) nearest neighbors: \[ \hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{y_{NN(\vec{x}, j)}} \] For classification, we similarly average the labels of neighbors to estimate \(p(\vec{x})\), \[ \hat{p}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{y_{NN(\vec{x}, j)}} \] and then threshold it: \[ \hat{c}(\vec{x}) = \mathbf{1}(\hat{p}(\vec{x}) \geq 0.5) \]
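
To make the definitions concrete, here is a minimal Python sketch of kNN prediction using brute-force Euclidean distances; the function names and the use of numpy are my own illustrative choices, not a recommended production implementation.

```python
import numpy as np

def knn_predict(x0, X, y, k):
    """Average the responses of the k training points nearest to x0.

    x0 -- query feature vector, shape (p,)
    X  -- training features, shape (n, p)
    y  -- training responses (or 0/1 labels), shape (n,)
    """
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to all n training points
    neighbors = np.argsort(dists)[:k]        # indices NN(x0, 1), ..., NN(x0, k)
    return y[neighbors].mean()               # hat-mu(x0), or hat-p(x0) for 0/1 labels

def knn_classify(x0, X, y, k):
    """Threshold the estimated conditional probability at 1/2."""
    return int(knn_predict(x0, X, y, k) >= 0.5)
```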

Analysis of 1-Nearest-Neighbors for Learning Noise-Free Functions

There are lots of estimation methods we could use. To decide on using this one, nearest neighbors, we should have some reason to think it will predict well. This is where theory comes in.

Start with the simplest, most extreme setting, to build ideas. We’ll assume that there is no noise in the outcomes (responses, labels). This means for regression that \(y_i = \mu(\vec{x}_i)\), and for classification that \(y_i = c(\vec{x}_i)\). If nearest neighbors can’t learn to predict here, it’s got to be toast; if it can, we’ll add noise back in. Let’s also make our lives simple by only looking at 1-nearest-neighbors, \(k=1\). To simplify notation, I’ll write \(NN\) as the index for the nearest neighbor of \(\vec{x}\), leaving the dependence on \(\vec{x}\) implicit.

In this setting — no noise, \(k=1\) — the error nearest neighbors will make for regression at \(\vec{x}\) will be \[ \mu(\vec{x}) - y_{NN} = \mu(\vec{x}) - \mu(\vec{x}_{NN}) \] so the risk will be \[ \Expect{(\mu(\vec{X}) - \mu(\vec{X}_{NN}))^2} \] Similarly, for classification, the risk will be \[ \Prob{c(\vec{X}) \neq c(\vec{X}_{NN})} \] We’d like these risks to go to 0 as \(n\rightarrow\infty\) (because here the optimal risk is zero). The equations above suggest that for regression we want the true function to be continuous, while for classification we want it to be piecewise constant. (In fact, piecewise continuity is usually enough for regression.) But even with this, we need to see that \(\vec{x}_{NN} \rightarrow \vec{x}\); otherwise continuity won’t help.

Convergence of the nearest neighbor

Requiring \(\vec{x}_{NN} \rightarrow \vec{x}\) is the same as requiring \(\|\vec{x}_{NN} - \vec{x}\| \rightarrow 0\). When will this happen?

Well, pick some positive distance \(\epsilon > 0\). What is the probability that \(\|\vec{x}_{NN} - \vec{x}\| > \epsilon\)? Ideally, we’d like this to go to zero as \(n\rightarrow\infty\), no matter how small that \(\epsilon\) might be; that would indicate that the nearest neighbor is approaching the point of interest.

A little thought should convince you that the nearest neighbor is more than \(\epsilon\) away from \(\vec{x}\) if and only if every \(\vec{x}_i\) is more than \(\epsilon\) away. So \[ \Prob{\|\vec{x}_{NN} - \vec{x}\| > \epsilon} = \Prob{\forall i, \|\vec{x}_i -\vec{x}\| > \epsilon} \]

At this point, we need to make an assumption about the feature vectors. We’ll assume they’re IID. Then the probability that all the feature vectors do the same thing (being far from \(\vec{x}\)) turns into the product of the probabilities that each of them does that thing: \[\begin{eqnarray} \Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} & = & \Prob{\forall i, \|\vec{X}_i -\vec{x}\| > \epsilon}\\ & = & \prod_{i=1}^{n}{\Prob{ \|\vec{X}_i -\vec{x}\| > \epsilon}}\\ & = & \left(\Prob{ \|\vec{X} -\vec{x}\| > \epsilon}\right)^n \end{eqnarray}\] This has got to go to zero as \(n\rightarrow\infty\) (which is what we want), unless the probability we’re raising to the power \(n\) is exactly 1. To get a handle on that, let’s re-write it a bit more: \[\begin{eqnarray} \Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} & = & \left(\Prob{ \|\vec{X} -\vec{x}\| > \epsilon}\right)^n\\ & = & \left(1 - \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon}\right)^n \end{eqnarray}\] So all we need is for there to be some positive probability of being within \(\epsilon\) of \(\vec{x}\). If we’re asking for a prediction at a point in the middle of a region of zero probability, nearest neighbors is not a great idea, but otherwise, we’re set.

We can be a little bit more detailed by approximating the probability in question. Assume \(\vec{X}\) follows a pdf \(f(\vec{u})\). Then the probability integrates the pdf over the radius-\(\epsilon\) ball centered on \(\vec{x}\), \[ \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon} = \int_{\vec{u}: \|\vec{u} - \vec{x}\| \leq \epsilon}{f(\vec{u}) d\vec{u}} \] Let’s assume \(\epsilon\) is small. (Ultimately we want it to shrink towards zero, after all.) Over a small enough ball, \(f(\vec{u})\) will be nearly constant, and equal to \(f(\vec{x})\). So \[ \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon} \approx c_p \epsilon^p f(\vec{x}) \] where \(c_p\) is a constant, geometrical factor (\(c_2 = \pi\), \(c_3 = \frac{4}{3}\pi\), etc.). That is, the probability of a small ball centered around \(\vec{x}\) is about \(f(\vec{x})\) times the volume of the ball.

Putting all this together, \[ \Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} \approx (1-c_p \epsilon^p f(\vec{x}))^n \]

Let’s make use of one last approximation, that \((1+h)^b \approx 1+bh\) when \(|h| \ll 1\). (Use the binomial theorem if you don’t believe me.) Then we get \[ \Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} \approx 1- n c_p \epsilon^p f(\vec{x}) \] This is going to zero for each fixed \(\epsilon\). If we want this to be constant — say, if we want to find an \(\epsilon\) which bounds the distance to the nearest neighbor with 50% confidence, the median nearest-neighbor distance — we’d need to say \[ 1- n c_p \epsilon^p f(\vec{x}) = \delta \] or \[ \epsilon = n^{-1/p} \left(\frac{(1-\delta)}{c_p f(\vec{x})}\right)^{1/p} \] So the typical distance to the nearest neighbor is shrinking to 0, at rate \(n^{-1/p}\). This \(\rightarrow 0\) as \(n\rightarrow\infty\), as desired.
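
If you want to see the \(n^{-1/p}\) scaling for yourself, a quick simulation will do it. In the sketch below I take the feature vectors to be uniform on \([0,1]^p\) and query at the center of the cube (both arbitrary choices for illustration); increasing \(n\) by a factor of 8 should shrink the median nearest-neighbor distance by a factor of roughly \(8^{1/p}\).

```python
import numpy as np

rng = np.random.default_rng(0)

def median_nn_distance(n, p, reps=200):
    """Median distance from x = (1/2, ..., 1/2) to its nearest neighbor
    among n feature vectors drawn uniformly from the unit cube [0,1]^p."""
    x = np.full(p, 0.5)
    dists = []
    for _ in range(reps):
        X = rng.random((n, p))                              # n IID feature vectors
        dists.append(np.linalg.norm(X - x, axis=1).min())   # nearest-neighbor distance
    return np.median(dists)

for p in (1, 2, 5):
    d_small, d_big = median_nn_distance(1000, p), median_nn_distance(8000, p)
    # An 8-fold increase in n should shrink the distance by about 8^(1/p)
    print(p, round(d_small / d_big, 2), round(8 ** (1 / p), 2))
```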

1NN is consistent for noise-free functions

To recap, because \(\|\vec{x}_{NN} - \vec{x}\| \rightarrow 0\) as \(n\rightarrow\infty\), if the true function is (piecewise) continuous, then 1NN will approximate it arbitrarily well given enough data. When an estimator converges on the truth as \(n\rightarrow\infty\), it’s called “consistent”, so we’ve just shown that nearest neighbors is consistent for learning noise-free functions.

Putting the noise back in for 1NN

What happens if we add in noise, but still use 1NN? In the regression case, \(Y=\mu(\vec{X})+\epsilon\), so \[\begin{eqnarray} \hat{\mu}(\vec{x}) & = & y_{NN}\\ & = & \mu(\vec{X}_{NN}) + \epsilon_{NN} \end{eqnarray}\] The error in predicting a new response at \(\vec{x}\), \(Y_{new}\), is thus \[\begin{eqnarray} Y_{new} - \hat{\mu}(\vec{x}) & = & \mu(\vec{x}) + \epsilon_{new} - \mu(\vec{X}_{NN}) - \epsilon_{NN}\\ & = & (\mu(\vec{x}) - \mu(\vec{X}_{NN})) + \epsilon_{new} - \epsilon_{NN} \end{eqnarray}\] As \(n\rightarrow\infty\), the \(\mu\) term in parentheses \(\rightarrow 0\), since \(\mu\) is continuous (by assumption) and the nearest neighbor converges on the point. So the error approaches \(\epsilon_{new} - \epsilon_{NN}\). Squaring, taking expectations, and remembering that the noises are uncorrelated, we get that the risk of 1NN regression at \(\vec{x}\) approaches \[ \Var{\epsilon_{new}|\vec{X}=\vec{x}} + (-1)^2\Var{\epsilon_{NN}|\vec{X}=\vec{x}} = 2\sigma^2(\vec{x}) \] The over-all risk of 1NN regression thus approaches \[ 2\Expect{\sigma^2(\vec{X})} \] as \(n\rightarrow\infty\). But the risk of the true regression function is already \(\Expect{\sigma^2(\vec{X})}\), so we’ve come within a factor of two of the optimum risk.3
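
A quick simulation makes the factor of two visible. Everything specific below (the sine curve for \(\mu\), uniform features, a constant noise standard deviation of 0.5) is just an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, sigma = 5000, 2000, 0.5
mu = np.sin                       # an arbitrary smooth "true" regression function

# Training and test sets from the same distribution, Y = mu(X) + noise
x_train = rng.uniform(0, 2 * np.pi, n)
y_train = mu(x_train) + rng.normal(0, sigma, n)
x_test = rng.uniform(0, 2 * np.pi, n_test)
y_test = mu(x_test) + rng.normal(0, sigma, n_test)

# 1NN prediction: copy the response of the nearest training point
nn_idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
risk_1nn = np.mean((y_test - y_train[nn_idx]) ** 2)

print(risk_1nn)          # should land near 2 * sigma**2 = 0.5 for large n
print(sigma ** 2)        # risk of the true regression function, 0.25
```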

For classification, the risk at a particular point \(\vec{x}\) is \[\begin{eqnarray} \Prob{Y_{new} \neq \hat{c}(\vec{x})} & = & \Prob{Y_{new} \neq Y_{NN}}\\ & = & \Prob{Y_{new}=1, Y_{NN}=0} + \Prob{Y_{new}=0, Y_{NN}=1}\\ & = & p(\vec{x})(1-p(\vec{x}_{NN})) + (1-p(\vec{x}))p(\vec{x}_{NN}) \end{eqnarray}\] As \(\vec{x}_{NN} \rightarrow \vec{x}\), this approaches \[ 2p(\vec{x})(1-p(\vec{x})) \] provided \(p\) is a continuous function (or at least piecewise continuous). Recall from earlier that the conditional risk of the optimal classification function is \(\min{\left\{p(\vec{x}), 1-p(\vec{x})\right\}}\), say \(r(\vec{x})\). So the conditional risk of 1NN approaches \(2r(\vec{x})(1-r(\vec{x})) \leq 2r(\vec{x})\). The over-all risk will thus approach \[ 2\Expect{p(\vec{X})(1-p(\vec{X}))} \] which is at most twice the risk of the optimal classifier.
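
The classification version is just as easy to check by simulation; the particular linear-in-\(x\) conditional probability below is, again, only an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_test = 5000, 2000

def p_fun(x):
    # Pr(Y = 1 | X = x); an arbitrary choice for illustration
    return 0.2 + 0.6 * x

# Features uniform on [0,1]; labels drawn according to p_fun
x_train = rng.random(n)
y_train = (rng.random(n) < p_fun(x_train)).astype(int)
x_test = rng.random(n_test)
y_test = (rng.random(n_test) < p_fun(x_test)).astype(int)

# 1NN classification: predict the label of the nearest training point
nn_idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
err_1nn = np.mean(y_train[nn_idx] != y_test)

p = p_fun(x_test)
print(err_1nn)                        # 1NN error rate
print(np.mean(2 * p * (1 - p)))       # asymptotic 1NN risk, 2 E[p(1-p)]
print(np.mean(np.minimum(p, 1 - p)))  # risk of the optimal classifier
```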

What about multiple neighbors?

Recall how we defined the predictions for \(k\)-nearest-neighbor regression4: \[ \hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{Y_{NN(\vec{x}, j)}} \] For every data point, \(Y=\mu(\vec{X})+\epsilon\), where quite generally \(\Expect{\epsilon|\vec{X}=\vec{x}} = 0\). So we can write \[ \hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{x}_{NN(\vec{x}, j)})} + \frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}} \] What we’d like the prediction to be is of course \(\mu(\vec{x})\), as before.

The last equation makes it clear that the error in kNN-regression has two sources:

  1. Evaluating the true regression function at the nearest neighbors. That is, we’re approximating the quantity we want (\(\mu(\vec{x})\)) by something else, namely \(\frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{x}_{NN(\vec{x}, j)})}\). We’ll call this the approximation error5.
  2. The noise in the response values for the nearest neighbors \(\left( \frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}\right)\). This is pure noise.

For 1NN, we controlled the approximation error by realizing that it goes to zero as \(\vec{x}\)’s nearest neighbor converges on \(\vec{x}\). You6 can extend the argument to show that the \(k^{\mathrm{th}}\) nearest neighbor does too, for any fixed \(k\). If the \(k^{\mathrm{th}}\) nearest neighbor is within \(\epsilon\) of \(\vec{x}\), then all of the \(k\) nearest neighbors must be too. Then continuity of \(\mu\) says that the approximation error \(\rightarrow 0\) as \(n\rightarrow\infty\).

As for the noise, it’s the average of \(k\) noise terms. If we assume the \(\epsilon\)s are uncorrelated across data points, we can say that \[ \Var{\frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}} = \frac{1}{k^2}\sum_{j=1}^{k}{\Var{\epsilon_{NN(\vec{x}, j)}}} \] If \(\Var{\epsilon|\vec{X}=\vec{u}} = \sigma^2(\vec{u})\), then all of those variances are converging on \(\sigma^2(\vec{x})\), and we get \[ \frac{\sigma^2(\vec{x})}{k} \] for the variance of the noise.

The over-all risk of kNN-regression at \(\vec{x}\) will thus tend, as \(n\rightarrow\infty\), to \[ (\text{system noise}) + (\text{approximation error}) + (\text{estimation noise}) \rightarrow \sigma^2(\vec{x}) + 0 + \frac{\sigma^2(\vec{x})}{k} = \left(1+\frac{1}{k}\right)\sigma^2(\vec{x}) \] That is, rather than having twice the optimum risk with \(k=1\), kNN regression gets only \(1+1/k\) of the optimum risk — at least as \(n\rightarrow\infty\).
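
Extending the earlier 1NN regression simulation to average over the \(k\) nearest neighbors shows the \((1+1/k)\sigma^2\) pattern emerging (same illustrative choices of \(\mu\), feature distribution, and \(\sigma\) as before):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_test, sigma = 10000, 1000, 0.5
mu = np.sin

x_train = rng.uniform(0, 2 * np.pi, n)
y_train = mu(x_train) + rng.normal(0, sigma, n)
x_test = rng.uniform(0, 2 * np.pi, n_test)
y_test = mu(x_test) + rng.normal(0, sigma, n_test)

# For each test point, training indices sorted from nearest to farthest
order = np.abs(x_test[:, None] - x_train[None, :]).argsort(axis=1)

for k in (1, 3, 10, 30):
    preds = y_train[order[:, :k]].mean(axis=1)            # kNN regression estimate
    risk = np.mean((y_test - preds) ** 2)
    print(k, round(risk, 3), (1 + 1 / k) * sigma ** 2)    # empirical vs. asymptotic risk
```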

That last phrase is of course why we don’t just automatically set \(k\) to be as large as possible. At any finite \(n\), we face a trade-off:

  • With a small \(k\), the neighbors we average over are all close to \(\vec{x}\), so the approximation error is small, but we average away little of the noise, leaving a variance of about \(\sigma^2(\vec{x})/k\).
  • With a large \(k\), we average away more of the noise, but we’re forced to use neighbors that are farther from \(\vec{x}\), so the approximation error grows.

This is a manifestation of one of the fundamental issues in statistics, the bias-variance tradeoff. When we are doing prediction, we don’t (usually) care about whether our errors come from bias or from variance, just about the over-all magnitude of the error. We will usually find that we want methods with some bias, because the error added by the bias is more than compensated for by the reduction in variance. We need some practical way of deciding how much bias we want to trade for less variance; this is what we’ll tackle next time.

Background

The pioneering theoretical analysis of nearest neighbors, covering both regression and classification as special cases of prediction-in-general, was done by Cover in the 1960s (Cover and Hart 1967; Cover 1968a, 1968b). What I’ve done above is basically “Cover made even simpler”. For more refined analyses of kNN classification and regression, see the appropriate chapters of Devroye, Györfi, and Lugosi (1996) and Györfi et al. (2002), respectively.

Historical note: “Find the most similar case with a known outcome, and guess that a new case will be similar” is such a natural idea that it’s almost impossible to trace its earliest history. The recognition that this idea could be a general, explicit statistical method, along with the name “nearest neighbors”, seems to go back to the 1950s (see Cover (1968a) for references). But because it’s such a natural idea, it keeps getting re-invented in different subjects: in nonlinear dynamics and the physics of chaotic systems, for instance, it was introduced in the 1980s as the “method of analogs” (see Kantz and Schreiber (1997) for references).

References

Cover, Thomas M. 1968a. “Estimation by the Nearest Neighbor Rule.” IEEE Transactions on Information Theory 14:50–55. http://www-isl.stanford.edu/~cover/papers/transIT/0050cove.pdf.

———. 1968b. “Rates of Convergence for Nearest Neighbor Procedures.” In Proceedings of the Hawaii International Conference on Systems Sciences, edited by B. K. Kinariwala and F. F. Kuo, 413–15. Honolulu: University of Hawaii Press. http://www-isl.stanford.edu/~cover/papers/paper009.pdf.

Cover, Thomas M., and P. E. Hart. 1967. “Nearest Neighbor Pattern Classification.” IEEE Transactions on Information Theory 13:21–27. http://www-isl.stanford.edu/~cover/papers/transIT/0021cove.pdf.

Devroye, Luc, László Györfi, and Gábor Lugosi. 1996. A Probabilistic Theory of Pattern Recognition. Berlin: Springer-Verlag.

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Kantz, Holger, and Thomas Schreiber. 1997. Nonlinear Time Series Analysis. Cambridge, England: Cambridge University Press.


  1. Some people call it the Bayes rule, even though it has nothing to do with Bayes’s rule, with Bayesian inference, or with Thomas Bayes.

  2. An alternative expression would be \(\frac{1}{2} - \left|p(\vec{x})-\frac{1}{2}\right|\), but this will be less useful later on.

  3. If the noise variance is constant, \(\sigma^2(\vec{x}) = \sigma^2\), this simplifies: the risk of 1NN regression approaches \(2\sigma^2\), while the risk of the true regression function is just \(\sigma^2\).

  4. The analysis for kNN-classification is very similar and comes to the same conclusion, but I don’t feel like writing everything out twice.

  5. If we think of the locations of the nearest neighbors as fixed, and only the responses \(Y\) as random, then we can call this “bias” in the technical sense, as the expected difference between the estimate \(\hat{\mu}(\vec{x})\) and the truth \(\mu(\vec{x})\). If we treat the locations of the nearest neighbors as random, then the bias would be \(\Expect{\frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{X}_{NN(\vec{x}, j)}) - \mu(\vec{x})}}\), which is a bit of a mess, though fortunately not something we’ll need to know in detail, as the next paragraph will explain.

  6. “You”, meaning “not me, at least not now”. The trick is however to realize that if the \(k\)th neighbor is more than \(\epsilon\) away from \(\vec{x}\), at least \(n-k+1\) of the data points must be more than \(\epsilon\) away. (Said differently, if the \(k\)th neighbor is within \(\epsilon\), then at least \(k-1\) other data points must also be within \(\epsilon\).) The probability of this happening is something we can calculate from a binomial distribution, with \(n\) trials and a success probability depending on the probability of a random point being in the \(\epsilon\)-ball around \(\vec{x}\).