Nearest Neighbors III — Mostly Computational

Agenda

  1. Wrap-up on k selection
  2. Computational costs of naive implementation of kNN
  3. Fast, approximate kNN search

Selecting k

A general trade-off to model selection

For cross-validation

For kNN
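
As a rough illustration of choosing \(k\) by cross-validation, here is a minimal leave-one-out sketch for one-dimensional kNN regression. Everything in it is an assumption made for illustration: the helper names `knn_predict` and `loocv_mse`, the simulated sine-plus-noise data, and the candidate values of \(k\). Very small \(k\) gives noisy predictions, while very large \(k\) averages away the signal, so the cross-validated error is typically smallest somewhere in between.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return sum(ys[i] for i in order[:k]) / k


def loocv_mse(xs, ys, k):
    """Leave-one-out cross-validated mean squared error for a given k."""
    total = 0.0
    for i in range(len(xs)):
        # Hold out point i and predict it from all the others.
        xs_rest = xs[:i] + xs[i + 1:]
        ys_rest = ys[:i] + ys[i + 1:]
        total += (ys[i] - knn_predict(xs[i], xs_rest, ys_rest, k)) ** 2
    return total / len(xs)


# Simulated data (an assumption for this sketch): a sine curve plus noise.
rng = random.Random(1)
xs = [i / 50 for i in range(100)]
ys = [math.sin(3 * x) + rng.gauss(0, 0.3) for x in xs]

# Score a handful of candidate k values and keep the one with the lowest error.
scores = {k: loocv_mse(xs, ys, k) for k in (1, 5, 15, 50, 99)}
best_k = min(scores, key=scores.get)
```

Note that each call to `loocv_mse` here runs a full naive neighbor search for every held-out point, which is part of why the computational cost of kNN matters.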

Computational costs of naive implementation

How expensive is it to run kNN?

Why is it \(O(n)\)?
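
To see where the linear cost comes from, here is a minimal brute-force sketch. It is not necessarily the implementation the lecture has in mind; the function name `knn_brute_force` and the toy data are made up for illustration. Each query computes one distance per stored point, so the work per query grows in proportion to \(n\) (times the cost of one distance computation).

```python
import heapq
import math


def knn_brute_force(query, points, k):
    """Return the indices of the k stored points closest to `query`."""
    # One Euclidean distance per stored point: n distance computations.
    dists = ((math.dist(query, pt), i) for i, pt in enumerate(points))
    # Keep only the k smallest, in increasing order of distance.
    return [i for _, i in heapq.nsmallest(k, dists)]


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = knn_brute_force((0.1, 0.0), points, k=2)
```

Using `heapq.nsmallest` rather than a full sort keeps the selection step at roughly \(O(n \log k)\), so the distance computations remain the dominant cost.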

Using fewer data points

Faster distance computation

Pre-selecting possible neighbors

Data structures: \(k\)-d trees

Using a \(k\)-d tree
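
A minimal sketch of a nearest-neighbor query on such a tree follows. It assumes the tree was built by median splits that cycle through the coordinates, and it stores nodes as plain dictionaries; both choices, and the names `build` and `nearest`, are illustrative assumptions rather than the lecture's own code. This is the exact, backtracking version of the search: it descends toward the query's cell first and visits the other side of a split only when that side could still contain a closer point. An approximate search would stop after the initial descent.

```python
import math


def build(points, depth=0):
    """Build a tree by splitting at the median, cycling through coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build(pts[:mid], depth + 1),
        "right": build(pts[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    # Signed distance from the query to this node's splitting plane.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (
        (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    )
    best = nearest(near, query, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build(points)
dist, closest = nearest(tree, (9.0, 2.0))
```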

Why is a \(k\)-d tree fast?

EXERCISE: How many levels do we need to go down to reach \(\approx k\) candidate neighbors?

Why is a \(k\)-d tree fast? (cont’d)

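SOLUTION: Each level of the tree halves the number of points in a cell, so after \(d\) levels a cell holds about \(n 2^{-d}\) of the \(n\) points. (Here \(d\) is the depth reached in the tree, not the dimension of the data.) Set \(n 2^{-d}\) equal to \(k\) and solve: \[\begin{aligned} n 2^{-d} & = k\\ \log_2{n} - d & = \log_2{k}\\ d & = \log_2{(n/k)} \end{aligned}\]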
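
A quick numerical check of this formula, with made-up values of \(n\) and \(k\) (the function name `depth_for_k_candidates` is hypothetical):

```python
import math


def depth_for_k_candidates(n, k):
    """Depth at which a cell holds about k of the n points, assuming even halving."""
    return math.log2(n / k)


n, k = 1_000_000, 10
d = depth_for_k_candidates(n, k)  # between 16 and 17 levels
cell_size = n * 2 ** (-d)         # recovers k, up to rounding error
```

So for a million points and ten candidate neighbors, descending about 17 levels is enough, compared with a million distance computations for the naive search.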

Building the \(k\)-d tree (one approach)
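
One common way to build the tree, sketched below, is to split at the median of one coordinate and cycle through the coordinates level by level. This is an assumption about which approach is meant; the `Node` type and the function names are illustrative only. Median splits keep the two halves balanced, which is what makes the "halve the points at every level" argument work.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class Node(NamedTuple):
    point: Point             # the point stored at this node (the median)
    axis: int                # coordinate used for the split at this node
    left: Optional["Node"]   # points with a smaller value on `axis`
    right: Optional["Node"]  # points with a larger (or equal) value on `axis`


def build_kd_tree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split on the median, cycling through the coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(
        point=pts[mid],
        axis=axis,
        left=build_kd_tree(pts[:mid], depth + 1),
        right=build_kd_tree(pts[mid + 1:], depth + 1),
    )


def tree_depth(node: Optional[Node]) -> int:
    """Number of levels in the tree."""
    if node is None:
        return 0
    return 1 + max(tree_depth(node.left), tree_depth(node.right))


points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kd_tree(points)
```

Re-sorting at every level is simple but wasteful; it is used here only to keep the sketch short.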

Locality-sensitive hashing

The random-hyperplane hash

The random-hyperplane hash (cont’d)
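
A minimal sketch of a random-hyperplane hash in the spirit of Charikar (2002): draw a few random directions, and record for each one which side of the corresponding hyperplane through the origin a point falls on. The function name `make_hyperplane_hash`, the number of bits, and the use of Gaussian directions are assumptions for illustration.

```python
import random


def make_hyperplane_hash(dim, n_bits, seed=0):
    """Return a hash function mapping a point to a tuple of n_bits sign bits."""
    rng = random.Random(seed)
    # Each random Gaussian vector is the normal of a hyperplane through the origin.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_fn(x):
        # One bit per hyperplane: which side of it does x lie on?
        return tuple(
            1 if sum(w * xi for w, xi in zip(plane, x)) >= 0 else 0
            for plane in planes
        )

    return hash_fn


h = make_hyperplane_hash(dim=3, n_bits=8)
a = (1.0, 2.0, 3.0)
b = (2.0, 4.0, 6.0)     # same direction as a
c = (-1.0, -2.0, -3.0)  # opposite direction to a
sig_a, sig_b, sig_c = h(a), h(b), h(c)
```

Points in the same direction always share a signature, points in opposite directions differ on every bit, and in general the chance that two points agree on a given bit shrinks as the angle between them grows. Candidate neighbors for a query are then the stored points whose signatures match the query's.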

The random-inner-product hash
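
A minimal sketch of one such hash, along the lines of the scheme in Datar et al. (2004): project the point onto a random Gaussian direction, add a random offset, and record which fixed-width bin the result lands in. Whether this matches the lecture's exact definition is an assumption, as are the names `make_inner_product_hash` and `collision_rate` and the bin width used below.

```python
import math
import random


def make_inner_product_hash(dim, width, seed=0):
    """Return a hash mapping a point to the integer bin of a random projection."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random direction
    b = rng.uniform(0.0, width)                    # random offset of the bins

    def hash_fn(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / width)

    return hash_fn


def collision_rate(p, q, width=4.0, trials=200):
    """Fraction of independently drawn hashes that put p and q in the same bin."""
    hits = 0
    for seed in range(trials):
        h = make_inner_product_hash(len(p), width, seed)
        hits += h(p) == h(q)
    return hits / trials


origin = (0.0, 0.0)
close_rate = collision_rate(origin, (0.1, 0.0))  # nearby pair
far_rate = collision_rate(origin, (10.0, 0.0))   # distant pair
```

Nearby points collide much more often than distant ones, which is the property that makes the bins useful for pre-selecting candidate neighbors.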

The cluster hash

Some common threads to the LSH techniques

Wrapping up

After-notes

After-notes (cont’d)

References

Azadkia, Mona. 2019. “Optimal Choice of \(k\) for \(k\)-Nearest Neighbor Regression.” E-print, arxiv:1909.05495. http://arxiv.org/abs/1909.05495.

Bentley, Jon Louis. 1975. “Multidimensional Binary Search Trees Used for Associative Searching.” Communications of the ACM 18:508–17. https://doi.org/10.1145/361002.361007.

Charikar, Moses S. 2002. “Similarity Estimation Techniques from Rounding Algorithms.” In Proceedings of the 34th Annual ACM Symposium on Theory of Computing [STOC ’02], edited by John Reif, 380–88. New York: ACM. https://doi.org/10.1145/509907.509965.

Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge, England: Cambridge University Press.

Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. “Locality-Sensitive Hashing Scheme Based on P-Stable Distributions.” In Proceedings of the 20th Annual Symposium on Computational Geometry [SCG ’04], edited by Jack Snoeyink and Jean-Daniel Boissonnat, 253–62. New York: ACM. https://doi.org/10.1145/997817.997857.

Gershenfeld, Neil. 1999. The Nature of Mathematical Modeling. Cambridge, England: Cambridge University Press.

Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. 1999. “Similarity Search in High Dimensions via Hashing.” In Proceedings of the 25th International Conference on Very Large Data Bases [VLDB ’99], edited by Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, 518–29. San Francisco: Morgan Kaufmann.

Leskovec, Jure, Anand Rajaraman, and Jeffrey D. Ullman. 2014. Mining of Massive Datasets. Second edition. Cambridge, England: Cambridge University Press. http://www.mmds.org.