36-462/662, Data Mining, Fall 2019
Lecture 17 (23 October 2019)
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]
word2vec
Replace sums with integrals as needed.
\(X\) and \(Y\) both continuous: \[\begin{eqnarray} I[X;Y] & = & \int{p(x,y) \left(\log{\left(\frac{p(x,y)}{p(x)p(y)}\right)}\right) dx dy}\\ & = & \int{p(x) \left(\int{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)} dy}\right) dx} \end{eqnarray}\] with \(p\) being the pdf everywhere
\(X\) continuous, \(Y\) discrete: \[ I[X;Y] = \int{p(x) \left(\sum_{y}{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)}}\right) dx} \] with \(p(x)\) being the pdf, but \(p(y|x)\) and \(p(y)\) being the conditional and marginal pmfs
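As a rough numerical illustration of the mixed formula, here is a minimal sketch that assumes a toy model not taken from the lecture: \(Y\) is a fair coin and \(X|Y\) is Gaussian with unit variance and mean \(\pm\mu\). The function name `mixed_mutual_information`, the grid limits, and the trapezoid rule are all illustrative choices.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a N(mean, sd^2) random variable at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixed_mutual_information(mu, prior=0.5, lo=-12.0, hi=12.0, n=4801):
    """I[X;Y] in nats for the assumed toy model:
    Y ~ Bernoulli(prior), X | Y=1 ~ N(+mu, 1), X | Y=0 ~ N(-mu, 1).
    The outer integral over x is approximated by the trapezoid rule."""
    p_y = {0: 1.0 - prior, 1: prior}
    means = {0: -mu, 1: mu}
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        # joint "density" p(x, y) = p(y) p(x|y); p(x) is its sum over y
        joint = {y: p_y[y] * normal_pdf(x, means[y]) for y in p_y}
        p_x = sum(joint.values())
        if p_x == 0.0:
            continue
        # inner sum over y of p(y|x) log(p(y|x) / p(y))
        inner = 0.0
        for y in p_y:
            p_y_given_x = joint[y] / p_x
            if p_y_given_x > 0.0:
                inner += p_y_given_x * math.log(p_y_given_x / p_y[y])
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * p_x * inner * dx
    return total

for mu in (0.0, 1.0, 2.0, 5.0):
    print(mu, mixed_mutual_information(mu))
```

When \(\mu = 0\) the two conditional densities coincide, so \(X\) carries no information about \(Y\); as \(\mu\) grows the information approaches \(H[Y] = \log 2\) nats, which it cannot exceed.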
because if \(M=1\), we know \(Y\neq \hat{Y}\), so there are at most \(|\mathcal{Y}|-1\) values \(Y\) could take
Finally, remember that \(H[Y|\hat{Y}(X)] \geq H[Y|X]\), since \(\hat{Y}\) is a function of \(X\)
This result is called Fano’s inequality
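A small numerical check can make the inequality concrete. The sketch below assumes the standard form \(H[Y|X] \leq H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}\), with \(M\) the indicator that the guess \(\hat{Y}(X)\) is wrong; the full derivation is not reproduced in this excerpt. The joint distribution, the helper names `entropy` and `fano_check`, and the choice of the most-probable-value guess are all made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fano_check(joint):
    """joint[x][y] = P(X=x, Y=y), with at least two values of y.
    Returns (H[Y|X], error probability of guessing the most probable y
    given x, right-hand side of Fano's inequality for that error rate)."""
    n_y = len(joint[0])
    h_y_given_x = 0.0
    p_error = 0.0
    for row in joint:
        p_x = sum(row)
        if p_x == 0.0:
            continue
        h_y_given_x += p_x * entropy([p / p_x for p in row])
        # the guess is right with probability max(row), wrong otherwise
        p_error += p_x - max(row)
    bound = entropy([p_error, 1.0 - p_error]) + p_error * math.log(n_y - 1)
    return h_y_given_x, p_error, bound

# Hypothetical joint distribution over 3 values of X and 3 values of Y
joint = [
    [0.30, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.20],
]
h, p_err, bound = fano_check(joint)
print("H[Y|X] =", h, " P(error) =", p_err, " Fano bound =", bound)
```

Read in the other direction, the same inequality says that a large conditional entropy \(H[Y|X]\) forces a large error probability, whatever predictor is used.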
word2vec in a little more detail
word2vec software does something easier but related (Goldberg and Levy 2014) …
Goldberg, Yoav, and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” Electronic preprint, arxiv:1402.3722. https://arxiv.org/abs/1402.3722.
Levy, Omer, and Yoav Goldberg. 2014. “Neural Word Embedding as Implicit Matrix Factorization.” In Advances in Neural Information Processing Systems 27 [NIPS 2014], edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2177–85. Curran Associates. http://papers.nips.cc/paper/5477-neural-word-embedding-as.
Salmon, Wesley C. 1971. Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press.
———. 1984. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.
Scarlett, Jonathan, and Volkan Cevher. n.d. “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation.” In Information-Theoretic Methods in Data Science, edited by Yonina Eldar and Miguel Rodrigues. Cambridge, England: Cambridge University Press. https://arxiv.org/abs/1901.00555.
Shalizi, Cosma Rohilla, and James P. Crutchfield. 2002. “Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems 5:91–95. http://arxiv.org/abs/nlin.AO/0006025.
Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.