Information Theory III — Information for Prediction

36-462/662, Data Mining, Fall 2019

Lecture 17 (23 October 2019)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

Information for Prediction

Predictive Information

Predictive Information vs. Accuracy

Predictive Information vs. Accuracy cont'd.

Reducing the Features

Sufficiency

Sufficiency cont’d.

From Sufficiency to the Information Bottleneck

Why This Matters

Dimension Reduction with a Target Variable

Dimension Reduction without a Target Variable

word2vec

Clustering

Summing up

Backup: Information for Continuous Variables

Replace sums with integrals, and probability mass functions with probability densities, as needed.
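For instance (standard definitions, stated here for reference rather than taken from the slides), the discrete entropy \(H[X] = -\sum_{x}{p(x)\log{p(x)}}\) becomes the differential entropy

\[ h[X] = -\int{p(x) \log{p(x)} \, dx} \]

and the mutual information becomes

\[ I[X;Y] = \int{p(x,y) \log{\frac{p(x,y)}{p(x)p(y)}} \, dx \, dy} \]

Unlike mutual information, differential entropy can be negative and is not invariant under changes of coordinates, so the sums-to-integrals substitution is safe for \(I[X;Y]\) but needs care for entropy itself.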

Backup: Conditional entropy and classification accuracy

Let \(M = \mathbf{1}\{Y \neq \hat{Y}\}\) indicate a prediction error. Since \(M\) is a function of \((Y, \hat{Y})\),

\[\begin{eqnarray}
H[M|Y,\hat{Y}] & = & 0\\
H[Y|\hat{Y}] & = & H[Y|\hat{Y}] + H[M|Y,\hat{Y}]\\
& = & H[Y,M|\hat{Y}]\\
& = & H[M|\hat{Y}] + H[Y|M,\hat{Y}]\\
& = & H[M|\hat{Y}] + \Pr(M=0)H[Y|M=0,\hat{Y}] + \Pr(M=1)H[Y|M=1,\hat{Y}]\\
& \leq & H[M] + \Pr(M=0)H[Y|M=0,\hat{Y}] + \Pr(M=1)H[Y|M=1,\hat{Y}]\\
& = & H[M] + \Pr(M=0)\cdot 0 + \Pr(M=1) H[Y|M=1,\hat{Y}]\\
& = & H[M] + \Pr(M=1) H[Y|M=1,\hat{Y}]\\
& \leq & H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}
\end{eqnarray}\]

Here \(H[Y|M=0,\hat{Y}] = 0\) because \(M=0\) means \(Y = \hat{Y}\), leaving no remaining uncertainty; and the final inequality holds because if \(M=1\), we know \(Y\neq \hat{Y}\), so there are at most \(|\mathcal{Y}|-1\) values \(Y\) could take.

Finally, remember that \(H[Y|\hat{Y}(X)] \geq H[Y|X]\), since \(\hat{Y}(X)\) is a function of \(X\) and so cannot carry more information about \(Y\) than \(X\) itself does.

This result is called Fano's inequality; Scarlett and Cevher (n.d.) survey its applications in statistical estimation.
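One standard rearrangement (using \(H[M] \leq 1\) bit) turns this into a lower bound on the error rate of any classifier built from \(X\): \(\Pr(Y \neq \hat{Y}) \geq \frac{H[Y|X] - 1}{\log_2{(|\mathcal{Y}|-1)}}\). As a quick numeric illustration, here is a minimal Python sketch; the function name and the example numbers are my own, not from the slides.

```python
import numpy as np

def fano_error_lower_bound(cond_entropy_bits, n_classes):
    """Fano lower bound on Pr(Y != Yhat) for any classifier using X.

    Rearranges H[Y|X] <= H[M] + Pr(M=1) * log2(n_classes - 1),
    with H[M] <= 1 bit, into
    Pr(M=1) >= (H[Y|X] - 1) / log2(n_classes - 1).
    """
    if n_classes < 3:
        # With 2 classes, log2(n_classes - 1) = 0 and this form degenerates
        raise ValueError("this form of the bound needs >= 3 classes")
    return max(0.0, (cond_entropy_bits - 1.0) / np.log2(n_classes - 1))

# 10 classes; suppose the features leave H[Y|X] = 2 bits of uncertainty
# about Y.  Then no classifier based on X can err less than ~31.5%:
print(fano_error_lower_bound(cond_entropy_bits=2.0, n_classes=10))
# (2 - 1) / log2(9) ~= 0.315
```

Note the bound is vacuous (zero) whenever \(H[Y|X] \leq 1\) bit; it only bites when the features leave substantial uncertainty about the class.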

Backup: Sufficiency and the Bottleneck, Take 2 (Shalizi and Crutchfield 2002)

Backup: word2vec in a little more detail

References

Goldberg, Yoav, and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word Embedding Method.” Electronic preprint, arXiv:1402.3722. https://arxiv.org/abs/1402.3722.

Levy, Omer, and Yoav Goldberg. 2014. “Neural Word Embedding as Implicit Matrix Factorization.” In Advances in Neural Information Processing Systems 27 [NIPS 2014], edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2177–85. Curran Associates. http://papers.nips.cc/paper/5477-neural-word-embedding-as.

Salmon, Wesley C. 1971. Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press.

———. 1984. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.

Scarlett, Jonathan, and Volkan Cevher. n.d. “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation.” In Information-Theoretic Methods in Data Science, edited by Yonina Eldar and Miguel Rodrigues. Cambridge, England: Cambridge University Press. https://arxiv.org/abs/1901.00555.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2002. “Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems 5:91–95. http://arxiv.org/abs/nlin.AO/0006025.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.