\[ \newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)} \newcommand{\Expect}[1]{\mathrm{E}\left[ #1 \right]} \]

1 Overview

This lecture and the next are going to cover common sources of technical failure in data mining projects — the kinds of issues which lead to them just not working. (Whether they would be worth doing even if they did work is another story, for the last lecture.) Today we’ll look at three sources of technical failure which are pretty amenable to mathematical treatment:

  1. Weak or non-existent predictive relationships between our features and the outcome we’re trying to predict.
  2. Changing relationships between our features and our outcome.
  3. The intrinsic difficulty of learning relationships between lots of variables.

Next time we’ll cover issues about measurement, model design and interpretation. They’re less mathematical but actually more fundamental.

2 Weak or non-existent signals

2.1 Weak predictive relationships are harder to estimate than strong ones

  • The weaker the relationship between the outcome and the features, the more data we need to distinguish it from pure noise
    • If \(I[X;Y] \approx 0\), then \(X\) and \(Y\) will be nearly independent
    • Which means that \(\Prob{Y=y|X=x} \approx \Prob{Y=y}\) for all \(x\)
    • Which means that \(\Prob{Y=y|X=x_1} \approx \Prob{Y=y|X=x_2}\) for any two \(x_1, x_2\)
    • Which means we’ll need a lot of data to learn the difference between those two distributions
  • But: even if there is no relationship, most methods will do their best to find something
    • e.g., in a linear model, even if all true slope coefficients \(=0\), your estimates (with finite data) won’t be exactly 0
    • CART (with pruning) is one of the few methods I know which will, in fact, say “there’s nothing to see here”
  • The weak-signal problem is exacerbated if comparing already-low rates or proportions
    • Recall that if you estimate a proportion \(p\) on \(n\) samples, the variance is \(p(1-p)/n\). If \(p\) is small, \(1-p \approx 1\), and the variance is about \(p/n\).
    • The relative error (that is, the standard error divided by the estimate) will then be about \(\sqrt{p/n}/p = 1/\sqrt{np}\). Notice that as \(p\) gets smaller, the relative error gets bigger.
    • Similar algebra applies to detecting differences between small proportions; the relative error can be very large! (A short numerical sketch follows this list.)
    • One place this shows up: online advertising, where we know the rates are low, so detecting whether a change makes any difference can require enormous amounts of data, and is probably too costly even for the largest companies
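
To make the \(1/\sqrt{np}\) point concrete, here is a minimal sketch in Python. The specific rates, the 5% relative lift, and the use of a textbook two-sample power calculation are my choices for illustration, not numbers from the lecture.

```python
import numpy as np

def relative_error(p, n):
    """Relative standard error of an estimated proportion:
    sqrt(p(1-p)/n) / p, which is roughly 1/sqrt(n*p) when p is small."""
    return np.sqrt(p * (1 - p) / n) / p

# The same sample size gives much worse *relative* precision for rarer events
for p in [0.1, 0.01, 0.001]:
    print(f"p = {p}: relative error at n = 1e6 is {relative_error(p, 1e6):.3f}")

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Rough sample size per group to detect p1 vs p2 with a two-sample
    z-test at the 5% level and 80% power (standard textbook formula)."""
    pbar = (p1 + p2) / 2
    return (z_alpha + z_beta) ** 2 * 2 * pbar * (1 - pbar) / (p1 - p2) ** 2

# A hypothetical ad experiment: a 0.100% rate vs. a 5% relative improvement
print(f"observations needed per group: {n_per_group(0.00100, 0.00105):.2e}")
```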

2.2 Sometimes there is no predictive relationship

  • Sometimes the signal just doesn’t exist
    • Phrenology was a 19th century pseudo-science which claimed to predict intelligence, character, etc., from the shape of heads, and in particular how pronounced bumps on certain parts of the skull were
      • It was taken astonishingly seriously for a very long time, by people all over the world and the political spectrum, and you can trace its influence in (e.g.) how we still picture “criminality” (Pick 1989)
      • Almost all of the purported correlations were just stuff phrenologists made up
        • Which is almost a legitimate form of scientific method (“conjectures and refutations”, “hypothetico-deductivism”, “guess and check”), except that…
        • They didn’t properly test any of their conjectures, and when people did start to actually, systematically check, none of the correlations held up
      • There was a little kernel of scientific truth here: specific mental functions do depend on specific regions of the brain. This was discovered by studying the effects of localized brain damage, which is still the source of most of our knowledge about what different brain regions are for (Shallice and Cooper 2011). These discoveries in the early 1800s helped inspire phrenology (Harrington 1989), but it quickly got out of control
        • Leap from “damaging/removing this brain region impairs this function” to “size of this brain region controls how strong this function is”
        • Leap from “size of this brain region” to “size of bump on the skull”
      • Phrenology has (deservedly) become a by-word for pseudo-science
      • So when you see people trying to predict sexual orientation from photographs, or whether someone will be a good employee from a brief video, etc., you should be very skeptical, and there is indeed every reason to think that these methods are (currently) just BS
    • You should think very hard about why you are trying to relate these features to this outcome, and what your data would look like if there was in fact no signal there

3 Changing relationships between features and outcomes

  • Many names: “Data-set shift”, “Distribution shift”, “Covariate shift”, etc.
  • Let’s distinguish some cases (following Quiñonero-Candela et al. (2009))

3.1 Covariate shift

“Covariate shift” = \(\Prob{Y|X}\) stays the same but \(\Prob{X}\) changes

  • Not an issue if your estimate of \(\Prob{Y|X}\) is very accurate
  • But most models are really dealing with “local” approximations
    • e.g., linear regression is really trying to find the best linear approximation to the true regression function \(\Expect{Y|X}\) (lecture 2)
    • The best linear approximation changes as \(\Prob{X}\) changes, unless that true regression function is really linear
  • In general, the best local approximation will change as \(\Prob{X}\) changes
  • \(\therefore\) The model we learned with the old \(\Prob{X}\) won’t work as well under the new distribution of the covariate
  • One potential way to cope is by weighting data points
    • If you know that the old pdf is \(p(x)\) and the new pdf is \(q(x)\), give data point \(i\) a weight proportional to \(q(x_i)/p(x_i)\) when fitting your model, so that the re-weighted old data mimics the new distribution
    • There are some nice theoretical results about this (Cortes, Mansour, and Mohri 2010)
    • But it can be hazardous, since if \(p(x_i)\) is small and \(q(x_i)\) is not-so-small, you’re giving a lot of weight to just a handful of points (“the small denominator problem”)
    • And if you need to estimate the ratio \(q(x)/p(x)\), there are ways to do so, but they introduce extra risk of their own (see the work of Sugiyama and Kawanabe on covariate shift adaptation); a small sketch of the weighting idea follows this list
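
A minimal sketch of the re-weighting idea, under the (strong) assumption that we know both densities exactly. The nonlinear truth, the two Gaussian covariate distributions, and the weighted least-squares fit are all made up for illustration; the point is just that the weighted fit does better where the new distribution puts its mass.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

f = lambda x: np.sin(2 * x)     # nonlinear truth, so the best *linear*
                                # approximation depends on where X lives
old = stats.norm(0, 1)          # old covariate distribution (training data)
new = stats.norm(1.5, 0.5)      # new covariate distribution (deployment)

x = old.rvs(5000, random_state=rng)
y = f(x) + rng.normal(scale=0.1, size=x.size)
w = new.pdf(x) / old.pdf(x)     # weight = new density / old density

# Unweighted vs. importance-weighted least-squares linear fits
X = np.column_stack([np.ones_like(x), x])
beta_unw = np.linalg.lstsq(X, y, rcond=None)[0]
beta_w = np.linalg.lstsq(X * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None)[0]

# Evaluate both fits under the *new* covariate distribution
x_new = new.rvs(5000, random_state=rng)
y_new = f(x_new) + rng.normal(scale=0.1, size=x_new.size)
for name, b in [("unweighted", beta_unw), ("weighted", beta_w)]:
    print(name, "MSE under new distribution:",
          round(float(np.mean((y_new - (b[0] + b[1] * x_new)) ** 2)), 3))
```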

3.2 Prior probability shift or class balance shift

“Prior probability shift” or “class balance shift” = \(\Prob{X|Y}\) stays the same but \(\Prob{Y}\) changes

  • This will change \(\Prob{Y|X}\), and so where you should draw your classification boundaries to minimize mis-classifications, or the expected cost of mis-classifications
    • Remember that those boundaries depend on \(\Prob{Y|X}\), which in turn is a function of \(\Prob{X|Y}\) and \(\Prob{Y}\) (a numerical illustration follows this list)
    • It also, obviously, changes regressions
  • Re-weighting based on \(X\) doesn’t help here
  • You could try re-weighting on \(Y\)
  • Artificially balancing your data to have equal numbers of positive and negative cases \((\Prob{Y=0} \approx \Prob{Y=1})\) can help learn the difference between \(\Prob{X|Y=0}\) and \(\Prob{X|Y=1}\), but don’t expect that your error rate will really generalize to unbalanced data “in the wild”.
  • The “Neyman-Pearson” approach, of setting a limit on acceptable false positives and then minimizing false negatives, is more robust to this kind of shift than is just minimizing mis-classifications.
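
To see concretely how a shifting base rate moves the decision boundary even though \(\Prob{X|Y}\) is untouched, here is a minimal sketch; the two Gaussian class-conditionals and the particular base rates are made up for illustration.

```python
import numpy as np
from scipy import stats

f0 = stats.norm(0, 1)    # P(X | Y=0), fixed
f1 = stats.norm(2, 1)    # P(X | Y=1), fixed

def posterior_y1(x, prior_y1):
    """P(Y=1 | X=x) by Bayes' rule, for a given base rate P(Y=1)."""
    num = f1.pdf(x) * prior_y1
    return num / (num + f0.pdf(x) * (1 - prior_y1))

# The misclassification-minimizing boundary is where P(Y=1|X=x) crosses 1/2;
# watch it move as the base rate P(Y=1) changes
xs = np.linspace(-3, 6, 10_001)
for prior in [0.5, 0.1, 0.01]:
    post = posterior_y1(xs, prior)
    boundary = xs[np.argmin(np.abs(post - 0.5))]
    print(f"P(Y=1) = {prior:>4}: classify as 1 when x > {boundary:.2f}")
```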

3.3 Concept drift

“Concept drift”1 = \(\Prob{X}\) stays the same, but \(\Prob{Y|X}\) changes, or, similarly, \(\Prob{Y}\) stays the same but \(\Prob{X|Y}\) changes

  • That’s history (more or less).
  • An outstanding example: Google Flu Trends
    • Estimated the prevalence of influenza from search-engine activity
    • Stopped working after a while (Lazer et al. 2014b)
    • It’s never recovered (Lazer et al. 2014a).
    • Because the actual relationship between “this many people searched for these words” and “this many people have the flu” changed

3.4 Coping mechanisms

  • I’ve mentioned weighting
  • Another family of coping strategies is incremental learning, i.e., approaches that continually revise the model
    • One idea is to run many models in parallel:
      • Model 1 gets trained on data points \(1, 2, 3, \ldots n\) (increasing as \(n\) grows)
      • Model 2 gets trained on data points \(k+1, k+2, \ldots n\) (also increasing as \(n\) grows)
      • … Model \(r\) gets trained on data points \(rk+1, rk+2, \ldots n\)
      • Add a new model every \(k\) data points
      • Weight each model based on how well it’s done, and shift weight to the better models
      • Copes with many different kinds of shift, at some cost in efficiency if there are no shifts; a minimal sketch of this scheme follows this list
  • Note that cross-validation within a data set won’t detect this; you really do need to compare different data sets!
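
Here is a minimal sketch of the parallel-models scheme just described. To keep it self-contained I use running means as the “models” and a multiplicative (exponential-weights-style) update for the model weights; those specifics, and the class name, are my own illustration, not a prescription from the lecture.

```python
import numpy as np

class GrowingEnsemble:
    """Model r is trained on data points rk+1, ..., n (each training set keeps
    growing); a new model starts every k points; weight shifts to the models
    that have been predicting well."""

    def __init__(self, k=50, eta=0.1):
        self.k, self.eta, self.t = k, eta, 0
        self.sums, self.counts, self.weights = [], [], []

    def predict(self):
        if not self.sums:
            return 0.0
        preds = np.array(self.sums) / np.array(self.counts)   # each model's forecast
        w = np.array(self.weights)
        return float(np.average(preds, weights=w / w.sum()))

    def update(self, y):
        if self.sums:
            preds = np.array(self.sums) / np.array(self.counts)
            # Exponential-weights update: down-weight models with large errors
            self.weights = list(np.array(self.weights) * np.exp(-self.eta * (preds - y) ** 2))
        # Every existing model sees the new point
        self.sums = [s + y for s in self.sums]
        self.counts = [c + 1 for c in self.counts]
        if self.t % self.k == 0:                               # start a new model
            self.sums.append(y)
            self.counts.append(1)
            self.weights.append(np.mean(self.weights) if self.weights else 1.0)
        self.t += 1

# Toy stream whose mean shifts halfway through
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])
ens, errs = GrowingEnsemble(k=50), []
for y in stream:
    errs.append((ens.predict() - y) ** 2)
    ens.update(y)
print("MSE before the shift:", round(float(np.mean(errs[:500])), 2))
print("MSE after the shift: ", round(float(np.mean(errs[500:])), 2))
```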

4 Curse of dimensionality

  • Q: \(p\) features uniformly distributed on \([0,1]^p\), \(n=10^9\), what’s the expected number of points within \(\pm 0.005\) (on every axis) of the mid-point, as a function of \(p\)?
  • A: The (hyper-) volume of the target region is \((0.005\times 2)^p = 10^{-2p}\), so the expected number of points in the region is \(10^9 10^{-2p} = 10^{9-2p}\)
    • For \(p=1\), that’s \(10^7\), an immense amount of data
    • For \(p=3\), that’s \(10^3\), still a very respectable sample size
    • For \(p=4\), that’s \(10^1=10\), not nothing but not a lot
    • For \(p=10\), that’s \(10^{-11}\), meaning there’s a substantial probability of not having even one point in the target region
    • For \(p=100\), that’s \(10^{-191}\)
    • Not even worth calculating with thousands of features
  • There are lots of domains where we have thousands or hundreds of thousands of features: images, audio, genetics, brain scans, advertising tracking on the Internet…
  • Why that little calculation matters: Say we’re trying to estimate a \(p\)-dimensional function by averaging all the observations we have which are within a distance \(h\) of the point we’re interested in
    • The bias we get from averaging is \(O(h^2)\), regardless of the dimension
      • (Taylor-expand the function out to second order; you saw this in detail if you took 402 with me)
    • If we’ve got \(n\) samples, we expect \(O(nh^p)\) of them to be within the region we’re averaging over, so the variance of the average will be \(O(n^{-1}h^{-p})\)
    • Total error is bias squared plus variance, so it’s \(O(h^4) + O(n^{-1}h^{-p})\)
    • We can control \(h\), so adjust it to minimize the error: \[\begin{eqnarray} O(h^3) - O(n^{-1}h^{-p-1}) & = & 0\\ O(h^{p+4}) & = & O(n^{-1})\\ h & = & O(n^{-1/(p+4)}) \end{eqnarray}\]
    • The total error we get is \(O(n^{-4/(p+4)})\)
    • This is great if \(p=1\), but miserable if \(p=100\) (a quick numerical check of both calculations follows this list)
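
Both of the calculations above are easy to reproduce numerically; here is a minimal check (the constants hidden inside the \(O(\cdot)\)’s are simply set to 1, and the \(n\) and box size are the ones from the example).

```python
import numpy as np

n = 1e9

# Expected number of the n uniform points inside the box of half-width 0.005
# around the midpoint of [0,1]^p: n * (2 * 0.005)^p = 10^(9 - 2p)
for p in [1, 3, 4, 10, 100]:
    print(f"p = {p:>3}: expected points in the box = {n * 0.01 ** p:.3g}")

# Error rate of local averaging with the optimal bandwidth h = n^(-1/(p+4)):
# total error ~ n^(-4/(p+4)), treating all the hidden constants as 1
for p in [1, 10, 100]:
    print(f"p = {p:>3}: error scales like {n ** (-4 / (p + 4)):.2g} at n = 1e9")
```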

  • The curse of dimensionality is that the amount of data we need grows exponentially with the number of features we use
    • Above argument is just for averaging, but:
      • Almost all predictive modeling methods boil down to “average nearby points” (they just differ in how “nearby” gets defined, or maybe the precise form of averaging)
      • A more complicated argument shows that \(O(n^{-4/(4+p)})\) is in fact the best we can generally hope for (see backup)
      • The basic reason is that the number of possible functions explodes as \(p\) increases, but the amount of information in our sample does not
  • There are basically three ways to escape the curse:
    1. Hope that you already know the right function to use (it’s linear, or quadratic, or maybe an additive combination of smooth, 1-D functions)
      • If we can’t have this, maybe we can at least hope that we know what features really matter, so we don’t just blindly throw in irrelevant ones
    2. Hope that while there are \(p\) features, so the \(X\) vectors “live” in a \(p\)-dimensional space, the features are very strongly dependent on each other, so that the \(X\) vectors are on, or very close to, a \(q\)-dimensional subspace, with \(q \ll p\)
      • Use dimension reduction to try to find this subspace, or
      • Use a prediction method which automatically adapts to the “intrinsic” dimension, like k-nearest-neighbors (Kpotufe 2011) or some kinds of kernel regression (Kpotufe and Garg 2013)
    3. Use a strongly-constrained parametric model, as in (1), even though it’s wrong, and live with some bias that won’t go away even as \(n\rightarrow\infty\)
      • This is probably the best reason to still use linear models
  • The curse of dimensionality means that blindly doing data mining with tons of features will not end well
    • Of course, doing just this was seriously advocated by the editor of a leading tech magazine and a best-selling business book author (Anderson 2008), and it’s always been more or less implicit in a lot of pitches for “big data” and data mining
    • That this was nonsense was pointed out by everyone with any knowledge at the time, e.g., https://www.earningmyturns.org/2008/06/end-of-theory-data-deluge-makes.html
    • Or, you know, it’s a free country and you can try it out, but you should realize that the uncertainty will be huge

5 Summing Up

  1. Watch out for weak or even non-existent signals
    • What’s your margin of error, at your sample size? (use the bootstrap)
    • What would your data look like if there were really nothing there? (simulate; a sketch of both checks follows this list)
  2. Watch out for changing distributions
  3. Watch out for trying to estimate too much (curse of dimensionality)
    • Use actual knowledge to impose constraints
    • Be honest about your uncertainties (use the bootstrap again)
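
A minimal sketch of both checks. The data here are pure noise by construction, and the choice of statistic (a plain correlation) is just for illustration; the pattern, bootstrap for the margin of error and a permutation-style simulation for the “nothing there” benchmark, is the point.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)                    # no real signal, by construction
observed = np.corrcoef(x, y)[0, 1]

# 1. Margin of error at this sample size: bootstrap the statistic
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)      # resample rows with replacement
    boot.append(np.corrcoef(x[idx], y[idx])[0, 1])
lo, hi = np.percentile(boot, [2.5, 97.5])

# 2. What would the data look like if there were really nothing there?
#    Simulate the null by permuting y, which breaks any x-y relationship
null = [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(2000)]

print(f"observed correlation:   {observed:+.3f}")
print(f"bootstrap 95% interval: ({lo:+.3f}, {hi:+.3f})")
print(f"typical |corr| under 'nothing there': {np.quantile(np.abs(null), 0.95):.3f}")
```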

6 Backup: Why it’s so hard to beat the curse of dimensionality

  • We’ve seen that local-averaging methods will have a risk that shrinks like \(O(n^{-4/(4+p)})\)
  • What about other methods?
    • We don’t know what the true function is, so we should look at the maximum risk over all not-too-crazy functions
      • If we use linear regression, its risk is \(O(pn^{-1})\) if the true function is linear…
      • … but \(O(1)\) (constant in \(n\)) if the true function is non-linear (a small simulation after this list illustrates this plateau)
  • You can show that the minimum possible maximum risk (“minimax risk”) is \(O(n^{-4/(4+p)})\) (Györfi et al. 2002)
    • (That rate assumes the function has at least 2 derivatives)
    • So local averaging is doing about as well as you can hope in general
    • Getting away from “in general” means: ruling out some functions a priori, without even looking at the data
  • Why is that the minimax rate? The precise argument is technical, but the intuition is information-theoretic
    • We observe the function plus noise, \(Y=f(X)+\epsilon\), at \(n\) points
    • From \((Y_1, \ldots Y_n)\), we want to estimate the function \(f\)
    • This is like a communication channel: the receiver sees \((Y_1, \ldots Y_n)\) and wants to recover the “signal” \(f\)
    • How many bits of information are there in the \(Y\) values? \(nH[Y]\) bits
    • How many possible functions \(f\) are there? Clearly, infinitely many…
      • So let’s limit the range of \(f\) to some interval of length (say) \(\tau\)
      • And let’s even say we don’t care about the exact value of \(f(x)\), we just want to know whether it fits into a discrete bin of width \(\delta\)
      • So the number of possible functions at \(n\) points is going to be at most \((\tau/\delta)^n\)
      • And the amount of information in the \(Y\) values is at most \(n\log{(\tau/\delta)}\)
      • BUT we’re assuming the functions are smooth, so knowing \(f(x)\) limits the possible values of \(f(x^{\prime})\) if \(x^{\prime}\) is near by; it has to be within some distance \(\kappa < \tau\) of \(f(x)\), so after discretizing there are \(\kappa/\delta\) choices available
      • When I (the sender / Nature) set \(f(x)\), I don’t have complete freedom to set \(f(x^{\prime})\), it has to be close to what I chose for \(f(x)\), and you (the receiver / Learner) can use that
      • But every dimension for \(x\) gives me, the sender, an additional degree of freedom on which I can alter \(f(x^{\prime})\);
      • there will be \(\kappa/\delta\) choices for what to do when moving along axis 1, \(\kappa/\delta\) independent choices for what to do when moving along axis 2, and so on for every additional axis
    • The upshot is that the number of effectively different functions grows exponentially with \(p\): it’s \(O((\kappa/\delta)^p)\)
    • Decoding (learning) requires more information at the receiver than in the signal, so we need more and more samples (growing \(n\)) to pick out the right function
    • Actually getting the rate requires doing this math more precisely
      • In particular, figuring out \(\kappa\), and just how quickly you can let \(\delta \rightarrow 0\) while still having enough information in the \(Y\)’s to pick out the right function
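
To see the \(O(pn^{-1})\)-if-right versus \(O(1)\)-if-wrong contrast in action, here is a minimal one-dimensional simulation. The sinusoidal truth, the noise level, and the use of a k-nearest-neighbor average as the local method are all my choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(4 * x)                       # deliberately non-linear truth

def risks(n, n_test=500):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    xt = rng.uniform(0, 1, n_test)                # test points, compared to the
    ft = f(xt)                                    # noiseless true function values

    # Misspecified linear regression: its risk will level off, not go to zero
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    lin = np.mean((ft - (b[0] + b[1] * xt)) ** 2)

    # Local averaging (k-nearest-neighbor mean), with k growing slowly with n
    k = max(1, int(n ** 0.8) // 10)
    nearest = np.argpartition(np.abs(x[None, :] - xt[:, None]), k, axis=1)[:, :k]
    loc = np.mean((ft - y[nearest].mean(axis=1)) ** 2)
    return lin, loc

for n in [100, 1000, 10_000]:
    lin, loc = risks(n)
    print(f"n = {n:>6}: linear MSE = {lin:.3f}   local-average MSE = {loc:.4f}")
```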

6.1 Backup: Alternative geometric formulations of the curse of dimensionality

  • We looked at how many sampled data points we can expect to find within a small distance of a given location in the \(X\) space
  • Alternative formulation 1: as \(p\) grows, the distance to the nearest neighbor approaches the distance to the average neighbor (at fixed \(n\))
    • Helps explain why nearest-neighbor methods struggle in high dimensions (see the simulation sketch at the end of this section)
  • Most of the volume of a \(p\)-dimensional sphere (cube, etc.) is in a thin shell of width \(\epsilon\) near its surface
    • E.g., for a disk (= 2D sphere) of radius 1, the area is \(\pi\), the area not within the shell of width \(\epsilon\) is \(\pi(1-\epsilon)^2\), so the fraction not close to the surface is \((1-\epsilon)^2\)
    • For a sphere of radius 1, the volume is \(\frac{4}{3}\pi\), the volume not close to the surface is \(\frac{4}{3}\pi(1-\epsilon)^3\), the fraction not close to the surface is \((1-\epsilon)^3\)
    • In general, in \(p\) dimensions, the fraction of the volume more than \(\epsilon\) away from the surface is \((1-\epsilon)^p \rightarrow 0\) as \(p\rightarrow\infty\) (for any \(\epsilon > 0\))
  • Alternative formulation 2: A small “amplification” or “blow-up” of any set with positive probability contains most of the probability
    • Because: the volume of a ball grows so rapidly with its radius
  • These are pretty generic features of high-dimensional distributions, not specific to uniforms (Boucheron, Lugosi, and Massart 2013)
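
Both geometric statements are easy to check; here is a minimal simulation sketch with uniform draws on \([0,1]^p\) (the particular \(n\) and \(\epsilon\) are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 1000, 0.05

# Nearest-neighbor distance vs. average-neighbor distance, from one reference point
for p in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, p))
    d = np.linalg.norm(X[1:] - X[0], axis=1)      # distances from the first point
    print(f"p = {p:>4}: nearest / average neighbor distance = {d.min() / d.mean():.2f}")

# Fraction of a unit-radius ball's volume that is NOT within eps of the surface
for p in [2, 10, 100, 1000]:
    print(f"p = {p:>4}: fraction farther than eps from the surface = {(1 - eps) ** p:.3g}")
```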

References

Anderson, Chris. 2008. “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Wired 16 (17). http://www.wired.com/2008/06/pb-theory/.

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press.

Cortes, Corinna, Yishay Mansour, and Mehryar Mohri. 2010. “Learning Bounds for Importance Weighting.” In Advances in Neural Information Processing Systems 23 [NIPS 2010], edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 442–50. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/4156-learning-bounds-for-importance-weighting.

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Harrington, Anne. 1989. Medicine, Mind and the Double Brain: A Study in Nineteenth Century Thought. Princeton, New Jersey: Princeton University Press.

Kpotufe, Samory. 2011. “k-NN Regression Adapts to Local Intrinsic Dimension.” In Advances in Neural Information Processing Systems 24 [NIPS 2011], edited by John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando Pereira, and Kilian Q. Weinberger, 729–37. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/4455-k-nn-regression-adapts-to-local-intrinsic-dimension.

Kpotufe, Samory, and Vikas Garg. 2013. “Adaptivity to Local Smoothness and Dimension in Kernel Regression.” In Advances in Neural Information Processing Systems 26 [NIPS 2013], edited by C. J. C. Burges, Léon Bottou, Max Welling, Zoubin Ghahramani, and Kilian Q. Weinberger, 3075–83. Curran Associates. https://papers.nips.cc/paper/5103-adaptivity-to-local-smoothness-and-dimension-in-kernel-regression.

Pick, Daniel. 1989. Faces of Degeneration: A European Disorder, C. 1848 – C. 1918. Cambridge: Cambridge University Press.

Quiñonero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds. 2009. Dataset Shift in Machine Learning. Cambridge, Massachusetts: MIT Press.

Shallice, Tim, and Richard P. Cooper. 2011. The Organisation of Mind. Oxford: Oxford University Press.


  1. Why “concept drift”? Because some of the early work on classifiers in machine learning came out of work in artificial intelligence on learning “concepts”, which in turn was inspired by psychology, and the idea was that you’d mastered a concept, like “circle” or “triangle”, if you could correctly classify instances as belonging to the concept or not; this meant learning a mapping from the features \(X\) to binary labels. If the concept changed over time, the right mapping would change; hence “concept drift”.