\[ \newcommand{\Yhat}{\hat{Y}} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\indep}{\perp} \]
We’re trying to predict a categorical response, label or class \(Y\), usually binary (\(0\) or \(1\)), using a feature or vector of features, \(X\). The prediction \(\Yhat\) is a function of the feature \(x\), so we’ll also write \(\Yhat(x)\) for the guess at the label of a case with features \(x\). We will be very interested in \(\Prob{Y=1|X=x}\), and abbreviate it by \(p(x)\). The over-all mis-classification error rate, or inaccuracy, is \(\Prob{Y \neq \Yhat}\). This can be decomposed two ways. One way is in terms of the false positive rate, the false negative rate, and the “base rate” at which the two classes occur: \[ \Prob{Y \neq \Yhat} = \Prob{\Yhat=1|Y=0} \Prob{Y=0} + \Prob{\Yhat=0|Y=1}\Prob{Y=1} = FPR \times \Prob{Y=0} + FNR \times \Prob{Y=1} \] The other decomposition is in terms of (one minus) the positive predictive value, the negative predictive value, and the probability of making each prediction: \[ \Prob{Y \neq \Yhat} = \Prob{Y=1|\Yhat=0}\Prob{\Yhat=0} + \Prob{Y=0|\Yhat=1}\Prob{\Yhat=1} = (1-NPV) \Prob{\Yhat=0} + (1-PPV) \Prob{\Yhat=1} \]
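To make the two decompositions concrete, here is a minimal numerical check in Python. The confusion-matrix counts are made up for illustration; nothing here comes from the text.

```python
# Hypothetical confusion-matrix counts (made up): rows are the true class Y,
# columns are the prediction Yhat.
n_00, n_01 = 800, 50    # Y=0 cases: true negatives, false positives
n_10, n_11 = 30, 120    # Y=1 cases: false negatives, true positives
n = n_00 + n_01 + n_10 + n_11

# Direct inaccuracy: fraction of cases with Y != Yhat
inaccuracy = (n_01 + n_10) / n

# Decomposition 1: FPR * P(Y=0) + FNR * P(Y=1)
fpr, fnr = n_01 / (n_00 + n_01), n_10 / (n_10 + n_11)
p_y0, p_y1 = (n_00 + n_01) / n, (n_10 + n_11) / n
decomp1 = fpr * p_y0 + fnr * p_y1

# Decomposition 2: (1-NPV) * P(Yhat=0) + (1-PPV) * P(Yhat=1)
ppv, npv = n_11 / (n_01 + n_11), n_00 / (n_00 + n_10)
p_hat0, p_hat1 = (n_00 + n_10) / n, (n_01 + n_11) / n
decomp2 = (1 - npv) * p_hat0 + (1 - ppv) * p_hat1

print(inaccuracy, decomp1, decomp2)   # all three agree (0.08 here)
```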
If we want to minimize the inaccuracy, we should use the rule1 that \(\Yhat=1\) if \(p(x) \geq 0.5\) and \(\Yhat=0\) if \(p(x) < 0.5\). If we attach a cost to each error, say \(c_+\) to false positives and \(c_-\) to false negatives, then we should set \(\Yhat=1\) if \(p(x) \geq t(c_+, c_-)\) and \(\Yhat=0\) if \(p(x) < t(c_+, c_-)\). The shape of the threshold function \(t\) is such that \(t=0.5\) when and only when \(c_+ = c_-\). (Can you remember, or work out, how to find the threshold in terms of the costs \(c_+\) and \(c_-\)?)
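Rather than giving away the exercise, here is a short sketch that locates the cost-minimizing threshold numerically; the costs \(c_+\) and \(c_-\) are made-up values, not anything from the text.

```python
import numpy as np

# Made-up error costs: a false positive costs c_plus, a false negative c_minus.
c_plus, c_minus = 1.0, 3.0

# At a point with P(Y=1|X=x) = p, predicting 1 has expected cost c_plus*(1-p)
# (we only pay if Y is really 0), and predicting 0 has expected cost c_minus*p.
# The cost-minimizing rule predicts 1 whenever the first cost is <= the second.
p_grid = np.linspace(0, 1, 10001)
predict_one = c_plus * (1 - p_grid) <= c_minus * p_grid

# The numerically located threshold t(c_plus, c_minus):
print(p_grid[predict_one].min())   # ~0.25 here; it is 0.5 when c_plus == c_minus
```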
\(p(x)\) is the distribution of the class conditional on the features, \(\Prob{Y=1|X=x}\). It turns out that the success of classification depends on the “inverse” conditional probability, \(\Prob{X=x|Y=y}\). Let’s introduce the abbreviations \[\begin{eqnarray} f(x) \equiv \Prob{X=x|Y=1}\\ g(x) \equiv \Prob{X=x|Y=0} \end{eqnarray}\] (Some people would write \(f_+\) and \(f_-\), or \(f_1\) and \(f_0\), etc.) Now let’s re-write the conditional probability of the class given the features, in terms of the distribution of features in each class: \[\begin{eqnarray} p(x) & \equiv & \Prob{Y=1|X=x}\\ & = & \frac{\Prob{Y=1, X=x}}{\Prob{X=x}}\\ & = & \frac{\Prob{Y=1, X=x}}{\Prob{Y=1, X=x} + \Prob{Y=0, X=x}}\\ & = & \frac{\Prob{X=x|Y=1}\Prob{Y=1}}{\Prob{X=x|Y=1}\Prob{Y=1} + \Prob{X=x|Y=0}\Prob{Y=0}}\\ & = & \frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + g(x)\Prob{Y=0}} \end{eqnarray}\]
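Here is a small sketch of this calculation, assuming (purely for illustration) one-dimensional Gaussian class-conditional densities and a made-up base rate; none of these choices come from the text.

```python
import numpy as np
from scipy.stats import norm

# Illustrative assumptions: X is one-dimensional and Gaussian within each class.
f = norm(loc=2.0, scale=1.0).pdf    # f(x): density of X given Y=1
g = norm(loc=0.0, scale=1.0).pdf    # g(x): density of X given Y=0
pi1 = 0.3                           # base rate P(Y=1); P(Y=0) = 1 - pi1

def p(x):
    """P(Y=1 | X=x), written in terms of f, g and the base rate."""
    return f(x) * pi1 / (f(x) * pi1 + g(x) * (1 - pi1))

print(p(np.array([-1.0, 0.0, 1.0, 2.0, 3.0])))   # climbs from near 0 toward 1
```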
This is a bit of a mouthful, but we can get a handle on it by considering some extreme cases.
Suppose that \(f(x) = g(x)\) for all \(x\). Then we’d get \[ p(x) = \frac{\Prob{Y=1}}{\Prob{Y=1}+\Prob{Y=0}} = \Prob{Y=1} \] which is just the base rate. So we’d have \(\Prob{Y=1|X=x} = \Prob{Y=1}\). But this is the same as saying that \(Y\) and \(X\) are statistically independent, \(Y \indep X\), or that the mutual information is 0, \(I[X;Y] = 0\). Whatever threshold \(t\) we apply, we’d make the same prediction for everyone, regardless of their features.
On the other hand, suppose that at some point \(x\), \(f(x) > 0\) but \(g(x) = 0\). Then we have \[ p(x) =\frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + 0 \times \Prob{Y=0}} = 1 \] and we should set \(\Yhat=1\) here no matter what threshold we’re using. Similarly if \(f(x) = 0\) but \(g(x) > 0\), we should set \(\Yhat=0\) for any threshold. If every point \(x\) fell under one or the other of these two cases, we’d say that the two distributions had “disjoint support”, and we could classify every point with certainty.
Now let’s consider the more general case, where we’re applying a threshold \(t\) to \(p(x)\). The feature point \(x\) is one where we set \(\Yhat(x)=1\) when \(p(x) \geq t\), or \[\begin{eqnarray} p(x) & \geq & t\\ \frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + g(x)\Prob{Y=0}} & \geq & t\\ \frac{1}{1 + \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}}} & \geq & t\\ 1 + \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}} & \leq & \frac{1}{t}\\ \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}} & \leq & \frac{1-t}{t}\\ \frac{f(x) \Prob{Y=1}}{g(x)\Prob{Y=0}} & \geq & \frac{t}{1-t}\\ \frac{f(x)}{g(x)} \frac{\Prob{Y=1}}{\Prob{Y=0}} & \geq & \frac{t}{1-t} \end{eqnarray}\]
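The algebra above says that thresholding \(p(x)\) at \(t\) is the same decision rule as thresholding the ratio (times the prior odds) at \(t/(1-t)\); here is a quick numerical check, reusing the made-up Gaussian setup from the earlier sketch.

```python
import numpy as np
from scipy.stats import norm

# Same illustrative setup as before: Gaussian class-conditionals, made-up base rate.
f = norm(loc=2.0, scale=1.0).pdf
g = norm(loc=0.0, scale=1.0).pdf
pi1, pi0 = 0.3, 0.7
t = 0.6                                # an arbitrary threshold on p(x)

x = np.linspace(-3, 6, 2001)
p_x = f(x) * pi1 / (f(x) * pi1 + g(x) * pi0)

rule_posterior = p_x >= t                                  # threshold p(x) at t
rule_ratio = (f(x) / g(x)) * (pi1 / pi0) >= t / (1 - t)    # threshold the ratio

print(np.sum(rule_posterior != rule_ratio))   # 0: the two rules always agree
```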
Notice that this involves the base-rates (\(\Prob{Y=0}\), \(\Prob{Y=1}\)) as well as the distribution of features in each class. One way to interpret this last inequality is that the ratio \(f(x)/g(x)\) gauges the evidence (one way or the other) that the features \(x\) provide about the class membership of a particular case. How strong this evidence has to be before it pushes us into a decision depends on the threshold (\(t\)), but also on what we know about the base-rates. If the base-rates are very lop-sided, say \(\Prob{Y=0} \gg \Prob{Y=1}\), then we’d need stronger evidence (a higher ratio \(f(x)/g(x)\)) to push us into saying \(Y=1\), i.e., to setting \(\Yhat=1\).
Let’s revert to information theory. We’ve seen that we can’t classify at all when \(I[X;Y] = 0\) (at least not using \(X\)). More generally, \(I[X;Y] = H[Y]-H[Y|X]\), and \(H[Y|X]\) is precisely the entropy of the distribution \(p(X)\) (averaged over \(X\)). So larger values of \(I[X;Y]\) mean smaller values of \(H[Y|X]\), which in turn means that \(p(x)\) is closer to 0 or 1 for more values of \(x\), which means that more and more accurate classification is possible. When we consider classification trees, we will look directly at trying to (greedily) minimize \(H[Y|X]\).
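To see this numerically, here is a sketch that computes \(H[Y|X]\) and \(I[X;Y]\) from a small joint distribution over a discrete feature; the table is made up for illustration.

```python
import numpy as np

# Made-up joint distribution P(X=x, Y=y): rows index the 3 feature values,
# columns index Y in {0, 1}.  Entries sum to 1.
joint = np.array([[0.30, 0.05],
                  [0.20, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)               # marginal distribution of X
p_y = joint.sum(axis=0)               # marginal distribution of Y
p_y_given_x = joint / p_x[:, None]    # row i is the distribution of Y given X=i

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_Y = entropy(p_y)
H_Y_given_X = sum(p_x[i] * entropy(p_y_given_x[i]) for i in range(len(p_x)))
print(H_Y, H_Y_given_X, H_Y - H_Y_given_X)   # I[X;Y] > 0: p(x) varies with x
```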
Folklore in data mining is that you should worry more about finding the right features than about exactly what model or classifier you apply to them. We’ve now seen the rational basis for this folklore. Informative features are ones whose distributions differ substantially across classes. With un-informative features, it doesn’t matter which technique we apply, there’s just no information to be had. Once you’ve found informative features, it’s fairly routine to try out different standard techniques (linear classifiers, nearest neighbors, trees, etc., etc.) on the same feature set. In any particular problem, some of these techniques will have an easier time extracting information in the features. But the information has to be there in the features to begin with.
Another way to tackle the problem of classifier design is to frankly admit that we have two competing objectives, and try to balance them. One objective is to have low false positive rates, and the other objective is to have low false negative rates. For historical and rhetorical reasons, it’s conventional here to consider one minus the false negative rate, \[ \Prob{\Yhat=1|Y=1} \] which is also called the “power”, and written \(\beta\).
When we choose a classifier, we get a benefit in the form of power, \(\beta\), and pay a cost in terms of the false positive rate, FPR. To get a handle on this, remember that for every point \(x\), either \(\Yhat(x) = 1\) or \(\Yhat(x) = 0\). Call the set of points where \(\Yhat(x) = 1\), \(S\). (This is called the acceptance region or decision region, and its boundary is called the decision boundary or critical boundary.) Then \[ \beta(S) \equiv \Prob{\Yhat=1|Y=1} = \sum_{x \in S}{f(x)} ~ \text{or} ~ \int_{S}{f(x) dx} \] depending on whether the features are discrete or continuous. (I did the continuous version in class, but will do the discrete version here, just to encourage mental flexibility on your part.) This is our over-all benefit to using a classifier with the decision region \(S\). Against this, the cost, in the form of false positives, is \[ FPR(S) = \Prob{\Yhat=1|Y=0} = \sum_{x \in S}{g(x)} \]
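In the discrete case, both quantities are just sums over the acceptance region. A minimal sketch, with made-up class-conditional distributions over five feature values:

```python
# Made-up class-conditional probabilities for a feature taking values 0..4.
f = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.35}   # P(X=x | Y=1)
g = {0: 0.40, 1: 0.30, 2: 0.15, 3: 0.10, 4: 0.05}   # P(X=x | Y=0)

def power_and_fpr(S):
    """Power beta(S) and false positive rate FPR(S) for an acceptance region S."""
    return sum(f[x] for x in S), sum(g[x] for x in S)

print(power_and_fpr({3, 4}))      # accept only the values that favor Y=1
print(power_and_fpr({2, 3, 4}))   # a bigger S: more power, but a higher FPR
```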
Every possible classifier gives us some combination of power and false positive rate. We can imagine plotting power against false positive rate for each classifier. Some classifiers are bad: they have high false positive rates and low power. Many other classifiers will “dominate” those bad ones, because they have a lower false positive rate and higher power. But then there will be classifiers which are harder to rank: one has a strictly higher power, but the other has a strictly lower false positive rate. There it’s ambiguous which one we should prefer. On our plot, we’ll see a curve along which increasing the benefit (power) can only happen if we also increase the false positive rate. This curve is called the possibility frontier, or the Pareto frontier2. To get some grasp of this, think about saying \(\Yhat=1\) for every \(x\), i.e., setting \(S\) to be the whole feature space. This would give \(\beta=1\), but also \(FPR=1\). Against that, setting \(\Yhat=0\), or shrinking \(S\) to the empty set, would give \(FPR=0\), but also \(\beta=0\). So we’d usually3 expect the frontier to connect \((0,0)\) to \((1,1)\).
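Continuing the same made-up discrete example, we can trace out (a discrete version of) this frontier by growing \(S\) in order of decreasing \(f(x)/g(x)\):

```python
# Same made-up f and g as in the previous sketch.
f = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.35}
g = {0: 0.40, 1: 0.30, 2: 0.15, 3: 0.10, 4: 0.05}

# Adding feature values to S in order of decreasing f/g sweeps out the
# frontier of achievable (power, FPR) pairs, from (0,0) up to (1,1).
order = sorted(f, key=lambda x: f[x] / g[x], reverse=True)
S, frontier = set(), [(0.0, 0.0)]
for x in order:
    S.add(x)
    frontier.append((sum(f[v] for v in S), sum(g[v] for v in S)))
print(frontier)   # each step buys more power at the price of a higher FPR
```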
Now the usual way to deal with two competing objectives is to introduce a price, letting us say how much one unit of one objective is worth in terms of the other. Here the two objectives are high power and a low false positive rate, so we need to set a price for power in terms of false positives. Let’s call this price \(r\). So we want to maximize \[ \beta - r \times FPR \] or \[ \sum_{x\in S}{f(x)} - r \sum_{x \in S}{g(x)} = \sum_{x \in S}{f(x) - rg(x)} \] Now each summand is either positive, negative or 0. To make the sum as big as possible, we should include every \(x\) where the summand is positive in the set \(S\), and exclude every \(x\) where the summand is negative. (We don’t care about points where the summand is 0.) That is, what we want to do is, \[ \Yhat(x) = \left\{ \begin{array}{cc} 1 & f(x) - rg(x) \geq 0\\ 0 & f(x) - rg(x) < 0 \end{array} \right. \]
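This rule is a one-liner in code; continuing the made-up discrete example, with an arbitrary illustrative price \(r\):

```python
# Same made-up f and g as above; r is an arbitrary illustrative price.
f = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.35}
g = {0: 0.40, 1: 0.30, 2: 0.15, 3: 0.10, 4: 0.05}
r = 1.5

# Include x in S exactly when its summand f(x) - r*g(x) is non-negative.
S = {x for x in f if f[x] - r * g[x] >= 0}
print(S, sum(f[x] for x in S), sum(g[x] for x in S))   # region, power, FPR
```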
We might start instead by saying that we want to maximize power with a constraint on the false positive rate. That is, we set as our problem \[ \max_{S: FPR(S) \leq \alpha}{\beta(S)} \] where \(\alpha\) is our maximum allowed false positive rate: maybe \(\alpha=0.05\), or \(\alpha=0.01\), or \(\alpha=10^{-6}\) if we really don’t like false positives.
(Graphically, if we go back to our plot of false positive rate versus power, we’d be drawing a horizontal line at \(FPR=\alpha\), only considering points below that line, and then just taking the right-most point, which is the highest power classifier that obeys the constraint.)
The usual trick for solving constrained optimization problems is to turn them into unconstrained ones by using a Lagrange multiplier. We had the old objective function \[ \beta(S) = \sum_{x \in S}{f(x)} \] subject to the constraint \[ FPR(S) \leq \alpha \] or \[ FPR(S) - \alpha \leq 0 \] or \[ \sum_{x\in S}{g(x)} - \alpha \leq 0 \] The Lagrangian combines the objective function with a Lagrange multiplier times the constraint: \[ \beta(S) - r(FPR(S) - \alpha) \] The actual problem is now to maximize the Lagrangian over both \(S\) and the Lagrange multiplier, \[ \max_{S, r}{\beta(S) - r(FPR(S) - \alpha)} \] Equivalently, \[ \max_{S, r}{\sum_{x \in S}{f(x)} - r\sum_{x \in S}{g(x)} + r\alpha} \]
Now, when we do the maximization, there will be some value of the multiplier \(r\) which will enforce the constraint \(\alpha\), say \(r^*\). And once we know \(r^*\) what we’re doing is just \[ \max_{S}{\sum_{x\in S}{f(x) - r^* g(x)}} \] and we’ve seen how to do that: include \(x\) in \(S\) if, but only if, \[ \frac{f(x)}{g(x)} \geq r^* \] In other words, the Lagrange multiplier looks just like a price. Economists call a price which enforces a constraint a “shadow price”, and so \(r^*\) is the shadow price of power.
If we want to enforce a limit on the false positive rate, \(FPR \leq \alpha\), but then maximize the power, we should follow the rule \[ \Yhat(x) = \left\{ \begin{array}{cc} 1 & f(x)/g(x) \geq r\\ 0 & f(x)/g(x) < r \end{array} \right. \] for some threshold \(r\) which is a function of \(\alpha\).
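Putting the pieces together, here is a sketch of this recipe on the same made-up discrete example: grow \(S\) in order of decreasing \(f(x)/g(x)\), keep only regions obeying \(FPR \leq \alpha\), and take the most powerful one. (For discrete features the exact Neyman-Pearson solution may require randomizing at the boundary, which this sketch ignores.)

```python
# Same made-up f and g; alpha is the allowed false positive rate.
f = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.35}
g = {0: 0.40, 1: 0.30, 2: 0.15, 3: 0.10, 4: 0.05}
alpha = 0.20

best_S, best_beta = set(), 0.0
S = set()
for x in sorted(f, key=lambda v: f[v] / g[v], reverse=True):
    S.add(x)                                  # admit values in order of f/g
    if sum(g[v] for v in S) <= alpha:         # still within the FPR budget?
        best_S, best_beta = set(S), sum(f[v] for v in S)
print(best_S, best_beta)   # the implied threshold r is f(x)/g(x) at the last
                           # value admitted into best_S
```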
Notice that this classification rule does not involve the base rates at all. This is sensible, since neither the false positive rate nor the power involves the base rate.
In your other statistics classes, you’ll have seen a lot of hypothesis tests, many of which involve the likelihood ratio, or (equivalently) the log of the likelihood ratio. The way we’ve analyzed classifiers is exactly parallel to a hypothesis test. \(g(x)\) is the likelihood of the features \(X=x\) under the null hypothesis that \(Y=0\), while \(f(x)\) is the likelihood under the alternative hypothesis that \(Y=1\).
It may seem intuitive that a test should compare the likelihoods, but why compare them through their ratio rather than say \(f(x) - g(x)\), or for that matter \(f^2(x) - g^2(x)\)? We have now seen the answer: the ratio is, uniquely, what we need to worry about if we want to maximize power while controlling the false positive rate. As a result about hypothesis testing, this was first shown by Jerzy Neyman and Egon Pearson in the 1930s, and so it’s sometimes called “the Neyman-Pearson lemma” or “the Neyman-Pearson theorem”.
Our analysis of classifier performance has suggested three approaches to designing classifiers: (1) model the conditional probability \(p(x) = \Prob{Y=1|X=x}\) and apply a threshold to it; (2) model the class-conditional distributions \(f\) and \(g\) and apply a threshold to their ratio; or (3) pick a family of possible decision regions and directly search it for the region with the lowest error rate.
Strategy (1) relies on what’s called the posterior probability of being in each class. (The prior probability of being in class 1 is just the base rate, \(\Prob{Y=1}\).) Strategy (2) relies on the likelihood ratio, and is sometimes called a “Neyman-Pearson” approach. Strategy (3) is what we might call a “direct” approach, though I don’t think it has a common name.
As always, there are advantages and disadvantages to each approach.
Pros:
Cons:
Pros:
Cons:
To illustrate, imagine seeing data like the figure above, and trying to classify it with a linear classifier. That means trying to find a straight line which divides the positive, \(Y=1\) points from the negative, \(Y=0\) points. Typically, when people try to do this, they aim to find the line with the minimum classification error. You can convince yourself that there is no single straight line which will achieve an error rate of 0 on that data, but also that some lines are better than others. In symbols, we have a family4 \(\mathcal{S}\) of possible decision regions \(S\), and we want to find \[ \min_{S \in \mathcal{S}}{\frac{1}{n}\sum_{i=1}^{n}{\mathbb{1}\left(Y_i \neq \mathbb{1}(X_i \in S)\right)}} \] where the indicator function \(\mathbb{1}(X_i \in S) =1\) if \(X_i \in S\) and \(=0\) otherwise, and the outer indicator likewise counts 1 for each case we get wrong.
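Exactly minimizing this 0/1 error over all possible lines is computationally awkward, so in practice one fits a tractable linear surrogate. Here is a minimal sketch on synthetic two-Gaussian data, using scikit-learn's logistic regression as the surrogate; both the data and that choice are my illustration, not anything from the figure or the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the figure: two overlapping Gaussian classes, so no
# straight line achieves zero error.  All numbers are made up.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(n, 2)),    # class Y=0
               rng.normal(loc=[2, 1], scale=1.0, size=(n, 2))])   # class Y=1
y = np.repeat([0, 1], n)

clf = LogisticRegression().fit(X, y)        # a linear decision boundary
yhat = clf.predict(X)
print(np.mean(y != yhat))                   # in-sample error: small but not 0
```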
There is nothing magic about using linear classifiers. We might, for instance, instead try to divide the points by enclosing all the positive points inside a rectangle, and then (in this case) we could do better than with purely linear separators. This amounts to changing the family of possible regions \(\mathcal{S}\). Many families are defined through complicated functions of the features — a popular choice is to take a bunch of nonlinear transformations of the features, say \(\phi_1(x), \phi_2(x), \ldots \phi_p(x)\), bundle them into a vector \(\phi(x)\), and then apply linear classifiers5 to \(\phi\).
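A quick illustration of footnote 5's transformation, on synthetic data where the true boundary is a circle; the data and the use of logistic regression are my own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic labels: Y=1 inside the unit circle, Y=0 outside; no linear rule on
# the raw features can capture this.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

def phi(X):
    """The feature map from footnote 5: (x1, x2, x1^2, x2^2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2])

raw = LogisticRegression(max_iter=1000).fit(X, y)
lifted = LogisticRegression(max_iter=1000).fit(phi(X), y)
print(np.mean(raw.predict(X) != y))            # stuck well above zero
print(np.mean(lifted.predict(phi(X)) != y))    # near zero: linear in phi(x)
```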
In general, the bigger and more flexible we make \(\mathcal{S}\), the lower we will make the in-sample error, and the lower the generalization error can be. But there are two costs to using very flexible families.
If there is any noise in the labels at all, the \(S\) we get will, in part, reflect that noise. We say that \(S\) “memorizes” the noise, as well as any true signal about where the classification boundary should be. The next figure shows the same classification boundary, with the same values of the features, but with \(Y\) drawn independently from the same distribution \(p(x)\), and you can see that it’s not doing so well — it now misses some positive points and includes some negative points.
So flexibility or “capacity” has an advantage (we can fit more) but also a disadvantage (our results are less stable and more vulnerable to noise).
One way to combat the vulnerability to noise is to impose some sort of geometric constraint. A common one is to require a certain minimum distance between any point and the classification boundary — to insist on only using classifiers with a large “margin”. Such geometric constraints rule out shapes with very irregular, erratic boundaries, which is (one reason) why people talk about this as “regularizing” the problem of finding an optimal classifier. Just as we saw above, constraining classifiers to have a large margin is the same as penalizing classifiers based on their margin. Either way, we would typically use cross-validation to pick the size of the constraint or the penalty.
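As a sketch of what this looks like in practice, here is a large-margin linear classifier (a linear support vector machine) with the penalty strength chosen by cross-validation, run on the same kind of synthetic data as before; the library and the parameter grid are my own illustrative choices, not anything prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic two-Gaussian data again; everything here is made up.
rng = np.random.default_rng(2)
n = 200
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(n, 2)),
               rng.normal(loc=[2, 1], scale=1.0, size=(n, 2))])
y = np.repeat([0, 1], n)

# LinearSVC maximizes the margin subject to a penalty on margin violations;
# C controls that penalty, and cross-validation picks it, as the text suggests.
search = GridSearchCV(LinearSVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```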
(People have developed various ways of measuring model capacity, to help quantify this trade-off. Many of them come down to variations on seeing how well the family could seem to classify labels which were pure noise, i.e., how well it would seem to do when \(p(x)=0.5\) for all \(x\). These are important for theory, have helped guide the development of new classifiers, and provide important sanity checks on how well we can hope to do: see for instance Bartlett and Mendelson (2002). But in practice, people overwhelmingly use cross-validation to assess their classifiers.)
Pros:
Cons:
Goel, Hofman, and Sirer (2012) used the full browsing history of about a quarter of a million (US) Web users primarily to examine how different demographic groups — defined by age, sex, race, education and household income — used the Web differently. If we think of each demographic category as a label \(Y\), and which websites were visited (and how often) as features \(X\), they were primarily interested in \(\Prob{X=x|Y=y}\), and how this differed across demographic categories. For instance, people with a post-graduate degree visited news sites about three times as often as people with only a high school degree. (What’s \(X\) and what’s \(Y\) in that example?) It may or may not surprise you to learn that they found large differences in browsing behavior across demographic groups. To steal an example from the paper, men are much more likely than women to visit ESPN, and women are more likely than men to visit Lancôme.
You can now see where this is going. By point (1) in our summary above, the fact that \(\Prob{X=x|Y=y}\) differs across classes \(y\) means that we can use browsing behavior (\(X\)) to predict demographic classes (\(Y\)). Someone who knows what websites you browse can predict your age, sex, race, education, and household income. To demonstrate this, Goel, Hofman, and Sirer (2012) used the 10,000 most popular websites, creating a binary feature for each site, \(X_i=1\) if site \(i\) was visited at all during the study and \(X_i=0\) if not. They then used a linear classifier on these features, with one of the geometric margin constraints I mentioned. The next figure shows how well they were able to predict each of the five demographic variables.
Detail of Figure 8 from Goel, Hofman, and Sirer (2012), showing the ability of a (regularized) linear classifier to predict demographic variables based on web browsing history. Dots show the achieved accuracy, and the \(\times\) shows the frequency of the more common class.
I include this not because the precise accuracies matter — there’s no reason to think this is the best performance attainable, even with these features — but rather to prove the point that this kind of prediction can be done. It doesn’t matter why different demographic groups have different browsing habits, just that those distinctions make a difference. This lets us (or our machines) work backwards from browsing to accurate-but-not-perfect inferences about demographic categories.
Now imagine a recidivism prediction system which does not, officially or explicitly, consider sex, but does have access to the defendant’s web browsing history. (No such system exists, to the best of my knowledge, but there’s no intrinsic limit on its creation.) We know, from Goel et al., that sex can be predicted with about 80% accuracy from browsing history (at least). A nefarious designer who wanted to include sex as a predictor for recidivism, but to hide doing so, could therefore use browsing history to predict sex, and then include predicted sex in their model. A less nefarious designer might end up doing something equivalent without even realizing it, say by slightly increasing the predicted risk of those who visit ESPN and slightly reducing the prediction for those who visit Lancôme. Either designer might, when pressed, say that they’re not claiming to say why ESPN predicts recidivism, but facts are facts, and are you going to argue with the math?
In fact, we can go further. We know that younger people have a higher risk of violence than older people, that poorer people have a higher risk than richer people, that men have a higher risk than women, that blacks have a higher risk than whites6, that less educated people have a higher risk than more educated people7. A system which just used Web browsing to sort people into these five demographic categories could8, therefore, achieve non-trivial predictive power. You can even imagine designing such a system innocently, where we just try to boil down a large number of features into (say) a five-dimensional space, before using them to predict violence, without realizing that those five dimensions correspond to age, sex, race, income and education.
None of this really relies on the features being Web browsing history; anything whose distribution differs across demographic groups will do.
On the Neyman-Pearson approach to classifiers, see Scott and Nowak (2005), Rigollet and Tong (2011) and Tong (2013). The Neyman-Pearson lemma itself goes back to Neyman and Pearson (1933) (where they never call it a lemma). The heuristic cost-benefit derivation of it is, so far as I know, my own invention.
Allen, Danielle S. 2017. Cuz: The Life and Times of Michael A. New York: Liveright.
Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” Journal of Machine Learning Research 3:463–82. http://jmlr.csail.mit.edu/papers/v3/bartlett02a.html.
Dollard, John. 1937. Caste and Class in a Southern Town. New Haven, Connecticut: Yale University Press.
Goel, Sharad, Jake M. Hofman, and M. Irmak Sirer. 2012. “Who Does What on the Web: A Large-Scale Study of Browsing Behavior.” In Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012], edited by John G. Breslin, Nicole B. Ellison, James G. Shanahan, and Zeynep Tufekci. AAAI Press. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4660.
Leovy, Jill. 2015. Ghettoside: A True Story of Murder in America. New York: Spiegel; Grau.
Neyman, Jerzy, and Egon S. Pearson. 1933. “On the Problem of the Most Efficient Test of Statistical Hypotheses.” Philosophical Transactions of the Royal Society of London A 231:289–337. https://doi.org/10.1098/rsta.1933.0009.
Rigollet, Philippe, and Xin Tong. 2011. “Neyman-Pearson Classification, Convexity and Stochastic Constraints.” Journal of Machine Learning Research 12:2831–55. http://jmlr.org/papers/v12/rigollet11a.html.
Scott, Clayton, and Robert Nowak. 2005. “A Neyman-Pearson Approach to Statistical Learning.” IEEE Transactions on Information Theory 51:3806–19. https://doi.org/10.1109/TIT.2005.856955.
Tong, Xin. 2013. “A Plug-in Approach to Neyman-Pearson Classification.” Journal of Machine Learning Research 14:3011–40. http://jmlr.org/papers/v14/tong13a.html.
Strictly, if \(p(x)=0.5\), it doesn’t matter what \(\Yhat\) is, but I will always write \(\geq\) for definiteness.↩
After Vilfredo Pareto, an economist who pioneered the study of optimization under competing objectives.↩
If the distributions \(f\) and \(g\) don’t overlap, we can get 0 FPR with positive power, and/or power 1 without FPR 1.↩
The more usual jargon word here is “class”, which I used in lecture, but this collides with “class” for whether \(Y=1\) or \(Y=0\), so I’ll try to avoid it.↩
As an example, if \(x\) is two dimensional, and \(\phi(x) = (x_1, x_2, x_1^2, x_2^2)\), a linear classifier applied to \(\phi\) can pick out points inside (or outside) a circle, which you couldn’t do with the raw features. (What would the linear classifier for \(\phi\) look like?)↩
There are multiple reasons for this association. One is a long-standing history (cf. Dollard (1937)) of segregating African-Americans into neighborhoods which are under-policed (in the sense that violence often goes unpunished by the forces of the law) and over-policed (in the sense that interactions with the police are often hostile). This sets up a dynamic where people in those neighborhoods don’t trust the police, which makes the police ineffective, which makes being known for willingness to use violence a survival strategy, which etc., etc. Leovy (2015) gives a good account of this feedback loop from (mostly) the side of the police; Allen (2017) gives a glimpse of what it looks like from the other side.↩
Cathy O’Neil would remind us that many of these would flip around if we considered risk of financial crimes rather than violence.↩
I say “could”, because there’s some error in all these classifications, and it’s possible that these errors would cancel out the ability to predict violence from demographics.↩