\[ \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \DeclareMathOperator*{\argmax}{argmax} \]
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\indep}{\perp \!\!\! \perp} \]
Use \(\indep\) (a doubled \perp) to indicate independence in LaTeX (as I did at the board)
\(I[X;Y]\) measures dependence between \(X\) and \(Y\)
Entropy tells us how many bits we need to describe/encode a random variable
Fano’s inequality: \(H[Y|X] \leq \Prob{E=1}\log_2{(m - 1)} + H[E]\), where \(E\) is the indicator that a guess of \(Y\) based on \(X\) is wrong, and \(m\) is the number of possible values of \(Y\)
Note: if \(m=2\) (binary \(Y\)), then \(\log_2{(m-1)} = 0\), and the bound reduces to \(H[Y|X] \leq H[E]\)
Fixing the desired mis-classification rate, the curve gives the maximum allowable \(H[Y|X]\). Fixing \(H[Y|X]\), the curve gives the minimum achievable mis-classification rate.
MORAL: \(H[Y|X]\) limits how well \(Y\) can be guessed from \(X\): it gives a lower bound on the achievable mis-classification rate
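The trade-off curve described above is easy enough to re-draw. Here is a minimal sketch in R (my own code, with \(m = 4\) classes picked arbitrarily for illustration): it plots the right-hand side of Fano’s inequality, i.e., the maximum allowable \(H[Y|X]\), against the mis-classification rate \(\Prob{E=1}\).

```r
## Sketch of the Fano trade-off curve (my own code, not from the board):
## for each mis-classification rate P(E=1) = p, the maximum allowable
## H[Y|X] is H[E] + p*log2(m-1); m = 4 is an arbitrary choice.
fano.curve <- function(p, m) {
  H.E <- ifelse(p <= 0 | p >= 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))  # H[E]
  H.E + p * log2(m - 1)
}
p <- seq(0, 1, length.out = 200)
plot(p, fano.curve(p, m = 4), type = "l",
     xlab = "mis-classification rate P(E=1)",
     ylab = "maximum allowable H[Y|X] (bits)")
```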
\(H[Y]-H[Y|X] = I[X;Y]\)
Mutual information is the amount of entropy removed from \(Y\) by conditioning on \(X\)
Some algebra shows (writing \(\alpha\) for the false positive rate and \(\beta\) for the false negative rate) \[\begin{eqnarray} D_X(P\|Q) & \geq & (1-\alpha)\log_2{\frac{1-\alpha}{\beta}} + \alpha\log_2{\frac{\alpha}{1-\beta}}\\ D_X(Q\|P) & \geq & \beta\log_2{\frac{\beta}{1-\alpha}} + (1-\beta)\log_2{\frac{1-\beta}{\alpha}} \end{eqnarray}\]
(See the appendix on information theory in Shalizi (n.d.), which is ripping off Kullback (1968))
Fixing the FNR, the curve gives the minimum necessary divergence; fixing the divergence, the curve gives the minimum attainable FNR. The curve depends on the false positive rate (\(\alpha\)).
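Again, the curve is easy to re-draw. A sketch of mine, plotting the first lower bound above as a function of the false negative rate \(\beta\), with the false positive rate fixed (arbitrarily) at \(\alpha = 0.05\); the function and variable names are made up:

```r
## Lower bound on D(P||Q) as a function of the false negative rate beta,
## at a fixed false positive rate alpha (alpha = 0.05 chosen arbitrarily).
div.bound <- function(beta, alpha) {
  (1 - alpha) * log2((1 - alpha) / beta) + alpha * log2(alpha / (1 - beta))
}
beta <- seq(0.001, 0.95, length.out = 200)
plot(beta, div.bound(beta, alpha = 0.05), type = "l",
     xlab = "false negative rate (beta)",
     ylab = "minimum necessary divergence (bits)")
```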
If the model is right, in the long run (\(n\rightarrow\infty\)), the true parameter value has the highest log-likelihood
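To spell out the connection (a quick sketch of the standard argument, in my own notation, with \(p_{\theta}\) for the model distribution at parameter \(\theta\) and \(\theta_0\) for the true value): by the law of large numbers, the log-likelihood per observation converges to its expected value, and the gap between the truth’s expected log-likelihood and any other parameter’s is exactly a divergence, which is \(\geq 0\) (as shown below):

\[ \Expect{\log_2{p_{\theta_0}(X)}} - \Expect{\log_2{p_{\theta}(X)}} = \sum_{x}{p_{\theta_0}(x) \log_2{\frac{p_{\theta_0}(x)}{p_{\theta}(x)}}} = D(p_{\theta_0} \| p_{\theta}) \geq 0 \]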
Here’s the obvious way to estimate an entropy, or a mutual information, or a conditional information (etc., etc.): estimate the underlying distribution (by empirical frequencies, a kernel density estimate, or whatever else you like), and then plug that estimate into the defining formula. This is the plug-in approach.
There are pros and cons to the plug-in approach:
The entropy package on CRAN implements a number of different plug-in estimators for discrete variables (with different estimators of the underlying distribution)
The mpmi package implements a plug-in estimator based on kernel density estimation
As a little example, say we look at the empirical entropy of a discrete random variable with \(m\) different possible values, calculated from a sample of size \(n\). That is, we use \[ \hat{H}[X] = -\sum_{x}{\frac{n_x}{n}\log_2{\frac{n_x}{n}}} \] where \(n_x\) is the number of times the value \(x\) appears in the sample.
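In code, the plug-in estimate is just a couple of lines; here is a minimal sketch of my own (the function name is made up), using the empirical frequencies as the estimated distribution:

```r
## Plug-in ("empirical") entropy of a discrete sample, in bits.
empirical.entropy <- function(x) {
  p.hat <- table(x) / length(x)   # empirical frequencies n_x / n
  -sum(p.hat * log2(p.hat))       # plug them into the definition of H[X]
}
x <- sample(1:4, size = 1000, replace = TRUE)  # a fair four-sided die
empirical.entropy(x)                           # should be close to log2(4) = 2
```

With only a thousand samples the estimate will typically come out a little below 2 bits, which is the downward bias mentioned next.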
The empirical entropy is biased downward, i.e., on average it under-estimates the true entropy (Victor 2000 works out the asymptotic bias); the entropy package accordingly includes a bias-corrected empirical entropy estimator
Alternatives involve somehow trying to avoid estimating the whole distribution, e.g., by working with nearest-neighbor statistics (Kozachenko and Leonenko 1987; Kraskov, Stögbauer, and Grassberger 2004; Mesner and Shalizi 2019)
The crucial fact is that the logarithm is a concave function. Geometrically, this means that if you plot the curve of \(\log{u}\), and then connect any two points on the curve by a straight line, the curve will lie above the line:
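The picture is two lines of R (mine; the particular points 0.5 and 4 are arbitrary):

```r
## The log curve and a chord between two of its points; the curve stays above the chord.
curve(log(x), from = 0.1, to = 5, xlab = "u", ylab = "log(u)")
segments(x0 = 0.5, y0 = log(0.5), x1 = 4, y1 = log(4), lty = "dashed")
```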
Algebraically, this means that for any two distinct positive numbers \(t_1\) and \(t_2\), and any \(w\), \(0 < w < 1\), we have \[ w \log{t_1} + (1-w) \log{t_2} < \log{(w t_1 + (1-w) t_2)} \] Of course, if we let \(w\) be either 0 or 1, we’d have equality between the right-hand side and the left-hand side, not inequality, but that’s the only way to get equality between them.
This extends to multiple points. If we pick any \(k\) positive numbers \(t_1, \ldots t_k\), not all equal, and a set of weights \(w_1, \ldots w_k\), with all \(w_i > 0\) and \(\sum_{i=1}^{k}{w_i} = 1\), then the weighted average of the logs of the \(t_i\)’s is \(<\) the log of the weighted average: \[ \sum_{i=1}^{k}{w_i \log{t_i}} < \log{\left(\sum_{i=1}^{k}{w_i t_i}\right)} \] Again, if we let one of the \(w_i=1\) and all the others \(=0\), we’d get equality, but otherwise, we have a strict inequality.
This even extends to integrals: for any probability density \(w(t)\) on positive values of \(t\), \[ \int{w(t)\log{t} dt} < \log{\left(\int{t w(t) dt}\right)} \] and, again, the only way to get equality is if the distribution puts probability 1 on one particular value of \(t\), and probability 0 on every other value. (The density needs to be a Dirac delta function.) For a non-degenerate distribution, the inequality is strict.
The general fact that the expected value of a concave function is \(\leq\) the function applied to the expected value is called Jensen’s inequality.
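If you want to see Jensen’s inequality numerically rather than take my word for it, here is a quick check of my own, using an exponential random variable (any positive random variable would do):

```r
## The average of the logs never exceeds the log of the average.
t <- rexp(1e5)                   # any positive random variable will do
c(mean(log(t)), log(mean(t)))    # the first entry should be the smaller one
```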
Now remember that \[ I[X;Y] = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x) p(y)}}} \] We want to prove that this is \(\geq 0\), so we’re going to prove that \(-I[X;Y] \leq 0\). Of course \[ -I[X;Y] = \sum_{x,y}{p(x,y) \log_2{\frac{p(x) p(y)}{p(x,y)}}} \] This is a weighted average of logarithms, so it’s \(<\) the logarithm of the weighted average: \[\begin{eqnarray} \sum_{x,y}{p(x,y) \log_2{\frac{p(x) p(y)}{p(x,y)}}} & < & \log_2{\sum_{x,y}{p(x,y) \frac{p(x) p(y)}{p(x,y)}}}\\ & = & \log_2{\sum_{x,y}{p(x)p(y)}}\\ & = & \log_2{1}\\ & = &0 \end{eqnarray}\] as desired. The only way to avoid this, and get equality, is for all of the log probability ratios to equal 0, i.e., for \(p(x,y) = p(x) p(y)\), i.e., for \(X\) and \(Y\) to be independent.
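A small numerical sanity check (my own code; the function names are made up) tying together the last few claims: the double-sum definition of \(I[X;Y]\) is non-negative, agrees with \(H[Y]-H[Y|X]\), and is (essentially) zero when \(X\) and \(Y\) are independent.

```r
## Mutual information of a discrete joint distribution, two ways.
mi.direct <- function(joint) {   # sum_{x,y} p(x,y) log2 p(x,y)/(p(x)p(y))
  indep <- outer(rowSums(joint), colSums(joint))
  sum(ifelse(joint > 0, joint * log2(joint / indep), 0))
}
entropy.bits <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
cond.entropy <- function(joint) {  # H[Y|X] = sum_x p(x) H[Y|X=x]
  px <- rowSums(joint)
  sum(px * apply(joint / px, 1, entropy.bits))
}
joint <- matrix(runif(12), nrow = 3); joint <- joint / sum(joint)  # a random p(x,y)
mi.direct(joint)                                    # >= 0
entropy.bits(colSums(joint)) - cond.entropy(joint)  # H[Y]-H[Y|X], same value
mi.direct(outer(rowSums(joint), colSums(joint)))    # ~ 0 under independence
```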
\(\log{u}\) is a concave function (see above) so we use the same reasoning about the average of a concave function (Jensen’s inequality) sketched above: \[\begin{eqnarray} \sum_{x}{p(x) \log_2{p(x)}} & < & \log_{2}{\sum_{x}{p(x) p(x)}}\\ & \leq & \log_{2}{\sum_{x}{ p(x) }}\\ & = & \log_{2}{1}\\ & = & 0\\ \end{eqnarray}\] (To go from the first line to the second, remember that we’re assuming \(X\) is discrete, so \(p(x) \leq 1\).) Since the left-hand side is just \(-H[X]\), this shows \(H[X] \geq 0\).
To make the first line an equality rather than an inequality, we need \(p(x)\) to put probability 1 on one particular value, which will also make the second line an equality.
If \(X\) takes at most \(m\) distinct values, then \(H[X] \leq \log_{2}{m}\), with equality only if \(p(x) = 1/m\) for all \(x\).
To prove this, we write \[ H[X] = \sum_{x}{p(x) \log_2{\frac{1}{p(x)}}} \] So, applying Jensen’s inequality, \[\begin{eqnarray} H[X] & \leq & \log_2{\left( \sum_{x}{p(x) \frac{1}{p(x)}} \right)}\\ & = & \log_{2}{\sum_{x}{1}}\\ & = & \log_{2}{m} \end{eqnarray}\] Again, the only way for the first line to be an equality rather than a strict inequality is for all the values of \(1/p(x)\) to be equal, which means all the values of \(p(x)\) must be equal, and so they must all be \(1/m\).
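A tiny numerical illustration of mine: with \(m=8\) possible values, the uniform distribution hits the bound \(\log_2{8} = 3\) bits, and any other distribution falls short.

```r
## Entropy of a uniform vs. a non-uniform distribution on m = 8 values.
entropy.bits <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
entropy.bits(rep(1/8, 8))                  # = 3 = log2(8)
entropy.bits(c(0.5, 0.3, rep(0.2/6, 6)))   # < 3
```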
Use Jensen’s inequality in the now-familiar way. Specifically, show that \(-D(p\|q) \leq 0\): \[\begin{eqnarray} -D(p\|q) & = & \sum_{x}{p(x) \log_2{\frac{q(x)}{p(x)}}}\\ & \leq & \log_2{\sum_{x}{p(x) \frac{q(x)}{p(x)}}}\\ & = & \log_2{\sum_{x}{q(x)}}\\ & = & \log_2{1}\\ & = & 0 \end{eqnarray}\] as was to be shown. Again, the only way for the \(\leq\) to be \(=\) is if the distribution of log probability ratios is degenerate at one particular value, which can only be 0, which would imply \(q(x) = p(x)\) for all \(x\).
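And one last numerical check of my own, for the divergence itself: it is strictly positive for two different distributions on the same support, and exactly zero when they coincide.

```r
## KL divergence (in bits) between two probability vectors on the same support.
kl.bits <- function(p, q) sum(ifelse(p > 0, p * log2(p / q), 0))
p <- c(0.5, 0.3, 0.2); q <- c(0.25, 0.25, 0.5)
kl.bits(p, q)   # > 0
kl.bits(p, p)   # = 0
```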
Ashby, W. Ross. 1956. An Introduction to Cybernetics. London: Chapman and Hall.
———. 1960. Design for a Brain: The Origins of Adaptive Behavior. 2nd ed. London: Chapman and Hall.
Davies, Daniel. 2018. Lying for Money: How Legendary Frauds Reveal the Workings of Our World. London: Profile Books.
Kozachenko, L. F., and N. N. Leonenko. 1987. “Sample Estimate of the Entropy of a Random Vector.” Problems of Information Transmission 23:95–101.
Kraskov, Alexander, Harald Stögbauer, and Peter Grassberger. 2004. “Estimating Mutual Information.” Physical Review E 69:066138. https://doi.org/10.1103/PhysRevE.69.066138.
Kullback, Solomon. 1968. Information Theory and Statistics. 2nd ed. New York: Dover Books.
Kullback, Solomon, and R. A. Leibler. 1951. “On Information and Sufficiency.” Annals of Mathematical Statistics 22:79–86. https://doi.org/10.1214/aoms/1177729694.
Mesner, Octavio César, and Cosma Rohilla Shalizi. 2019. “Conditional Mutual Information Estimation for Mixed Discrete and Continuous Variables with Nearest Neighbors.” E-print, arxiv.org:1912.03387. https://arxiv.org/abs/1912.03387.
Palomar, Daniel P., and Sergio Verdú. 2008. “Lautum Information.” IEEE Transactions on Information Theory 54:964–75. http://www.princeton.edu/~verdu/lautum.info.pdf.
Scarlett, Jonathan, and Volkan Cevher. n.d. “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation.” In Information-Theoretic Methods in Data Science, edited by Yonina Eldar and Miguel Rodrigues. Cambridge, England: Cambridge University Press. https://arxiv.org/abs/1901.00555.
Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.
Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27:379–423.
Touchette, Hugo, and Seth Lloyd. 1999. “Information-Theoretic Limits of Control.” Physical Review Letters 84:1156–9. http://arxiv.org/abs/chao-dyn/9905039.
———. 2004. “Information-Theoretic Approach to the Study of Control Systems.” Physica A 331:140–72. http://arxiv.org/abs/physics/0104007.
Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge, England: Cambridge University Press.
Victor, Jonathan D. 2000. “Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials.” Neural Computation 12:2797–2804. https://doi.org/10.1162/089976600300014728.
White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.