\[ \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \DeclareMathOperator*{\argmax}{argmax} \]

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\indep}{\perp \!\!\! \perp} \]

Prediction

Notation

  • \(p\) is going to stand, indifferently, for the probability mass function or the probability density function, depending on whether we’re dealing with discrete or continuous variables
  • \(p\) is also going to be used for marginal distributions, joint distributions and conditional distributions
    • So \(p(x)\) for the marginal mass/density of the random variable \(X\) at value \(x\), but \(p(y)\) for the marginal mass/density of the random variable \(Y\) at value \(y\)
    • More careful/pedantic notation would be \(p_X(x)\) vs \(p_Y(y)\) but I am hoping for context to keep things straight
  • Similarly, if I need to compare to another probability mass/density function, it’ll be \(q\)
  • There are places where it’s conventional to write \(P\) or \(Q\) for the probability distribution with pmf/pdf \(p\) or \(q\)

Prediction needs the conditional distribution

  • There is some marginal distribution for the response variable \(Y\), with probability mass/density function \(p(y)\)
  • We can use that for prediction, if nothing else is available
    • Regression: predict the expected value of \(y\), i.e., \(\sum_{y}{y p(y)}\) or \(\int{y p(y) dy}\)
    • Classification: predict the class with the highest probability, \(\argmax_{y}{p(y)}\)
  • Using another feature \(X\) only helps if it changes the distribution of \(y\)
  • That is, \(p(y|x)\) has to be different from \(p(y)\), at least sometimes
  • If \(p(y|x)\) isn’t constant in \(x\), we can use it to improve our predictions
    • Regression: predict \(\sum_{y}{y p(y|x)}\) or \(\int{y p(y|x) dy}\) (there’s a small R sketch of both prediction rules right after this list)
    • Classification: predict the \(y\) with the highest conditional probability, \(\argmax_{y}{p(y|x)}\)
  • If \(p(y|x)\) is constant in \(x\), we shouldn’t care about it
    • Example I gave in class: someone’s astrological sign, or blood type, won’t change the probability that they will repay a loan
    • Example I gave in class: two people’s astrological signs, and/or blood types, won’t change the probability that they will have a successful date
      • The example was a little bit too hasty: if one or more of them thinks that astrological signs and/or blood types matter for successful dates, then there might actually be an effect…
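Here’s a small R sketch (my addition, with a made-up joint distribution for a binary feature \(X\) and a three-level response \(Y\)) of the two prediction rules above; the point is just that the marginal and conditional predictions can disagree.

```r
# Made-up joint distribution p(x, y), for illustration only
p_xy <- matrix(c(0.30, 0.15, 0.05,
                 0.05, 0.10, 0.35),
               nrow = 2, byrow = TRUE,
               dimnames = list(x = c("x1", "x2"), y = c("y1", "y2", "y3")))
p_y <- colSums(p_xy)                               # marginal distribution of Y
names(which.max(p_y))                              # classify using the marginal alone
p_y_given_x <- sweep(p_xy, 1, rowSums(p_xy), "/")  # row i holds p(y | x = x_i)
apply(p_y_given_x, 1, function(r) names(which.max(r)))  # best guess for each value of x
# For a numeric Y, replace the argmax with a weighted average, e.g.
# y_vals <- c(1, 2, 3); sum(y_vals * p_y) vs. p_y_given_x %*% y_vals
```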

Statistical independence and conditional distributions

  • Recall from intro. prob. that two random variables \(X\) and \(Y\) are statistically independent when their joint probability is the product of their marginal probabilities: \[ X \indep Y \Leftrightarrow \forall x, y: p(x,y) = p(x) p(y) \]
    • We also say that “the joint distribution factors into the product of the marginals”, or just “the joint distribution factors into the marginals”
    • Here I am using the common \(\indep\) sign for independence
      • Annoyingly, while this has been the standard sign in written math for independence since at least the 1980s, it’s not part of basic LaTeX, or even the more common extension packages, but instead hacked together in a slightly fragile way (see the R Markdown source)
      • You may find it more reliable to just use \(\perp\) (\perp) to indicate independence in LaTeX (as I did at the board)
  • If random variables are not independent, then they are statistically dependent: \[ X \not\indep Y \Leftrightarrow \exists x, y: p(x,y) \neq p(x) p(y) \]
    • Notice that there only needs to be one pair of \(x\) and \(y\) values where \(p(x,y) \neq p(x) p(y)\)
      • Can you prove that if there are any such \((x,y)\) pairs, there must be at least two? That is, that there can’t be exactly one such pair?
  • We make you do a lot of problems with independent variables because it’s nice to just multiply the probabilities, but independence has a predictive meaning as well.
  • Claim: If \(X \indep Y\), then \(p(y|x) = p(y)\) for all \(x\) and \(y\). Proof: \[ p(y|x) = \frac{p(x,y)}{p(x)} = \frac{p(x) p(y)}{p(x)} = p(y) \]
    • The first step is just the definition of conditional probability, and the second uses the assumption of independence
  • It is also true that if \(p(y|x) = p(y)\) for all \(x\) and \(y\), then \(X \indep Y\). The proof is almost the same: \[ p(x,y) = p(y|x) p(x) = p(y) p(x) \]
    • (Re-write the definition of conditional probability, use the assumption)
  • So \(X \indep Y\) if and only if (“iff”) \(p(y|x) = p(y)\)
  • If \(X \indep Y\), then \(X\) is useless for predicting \(Y\)
  • If \(X \not\indep Y\), then \(X\) helps predict \(Y\) at least a little bit
  • Let’s try to quantify this, so we can compare features by how well they’d let us predict \(Y\)

Mutual information measures dependence

Mutual information defined

  • Let’s define the following quantity: \[ I[X;Y] \equiv \sum_{x}{\sum_{y}{p(x,y) \log_2{\frac{p(x,y)}{p(x) p(y)}}}} \]
    • If everything’s continuous, replace the sums with integrals
    • Presume that \(0\log{\frac{0}{q}} = 0\)
      • See backup if you don’t believe me
  • This is called the mutual information between \(X\) and \(Y\)
  • Try reading it “from the inside out”
    • \(\frac{p(x,y)}{p(x) p(y)}\) is a ratio of probabilities; it compares the actual joint probability to the product of the marginals, to what we’d see under independence
    • \(\frac{p(x,y)}{p(x) p(y)} > 1\) means that combination of \(x\) and \(y\) is more probable than we’d expect under independence
    • \(\frac{p(x,y)}{p(x) p(y)} < 1\) means that combination of \(x\) and \(y\) is suppressed, compared to what we’d expect under independence
    • Taking the log gives us positive contributions to the sum from combinations that are more probable than under independence, and negative contributions from combinations that are suppressed relative to independence
    • Finally, we weight the log probability ratios for each combination by the probability of that combination, to get an average
  • The log probability ratio \(\log_2{\frac{p(x,y)}{p(x) p(y)}}\) is sometimes called the pointwise mutual information (PMI) of the point \((x,y)\), so the over-all mutual information is the expected value of the PMI
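Here’s a short R sketch (mine, not from the lecture) that just evaluates the definition on a made-up \(2 \times 2\) joint table: compute the PMI at each combination, then average it under the joint distribution.

```r
# Made-up joint distribution of two binary variables
p_xy <- matrix(c(0.4, 0.1,
                 0.1, 0.4),
               nrow = 2, byrow = TRUE,
               dimnames = list(x = c("x1", "x2"), y = c("y1", "y2")))
p_x <- rowSums(p_xy)                  # marginal of X
p_y <- colSums(p_xy)                  # marginal of Y
pmi <- log2(p_xy / outer(p_x, p_y))   # pointwise mutual information at each (x, y)
I_xy <- sum(p_xy * pmi)               # mutual information = expected PMI
I_xy                                  # > 0: this table isn't the product of its marginals
# (If any cell had probability 0, we'd need the 0 * log(0/q) = 0 convention here)
```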

Some properties of mutual information

  1. \(I[X;Y] = I[Y;X]\) (it doesn’t matter whether we sum over \(y\) and then \(x\) or the other way around)
  2. If \(X\indep Y\), then \(I[X;Y] = 0\) (since the pointwise mutual information is exactly 0 at every point)
  3. \(I[X;Y] \geq 0\)
    • This is a really important fact, but the proof is a little tricky, so I skipped it in class; see the backup
  4. \(I[X;Y] = 0\) only if \(X \indep Y\)
    • See backup again, same argument establishes this

Mutual information measures dependence

  • \(I[X;Y] = 0 \Leftrightarrow X \indep Y\)
    • By (2) and (4) above
  • \(I[X;Y] > 0 \Leftrightarrow X \not\indep Y\)
    • By (2), (3) and (4) above

\(I[X;Y]\) measures dependence between \(X\) and \(Y\)

Mutual information in terms of prediction

  • We can re-write mutual information so it’s more clearly about prediction
    • The trick is to replace the joint distribution \(p(x,y)\) with \(p(y|x)p(x)\).
    • That factorization is just the definition of the conditional probability, it’s always true. \[\begin{eqnarray} I[X;Y] & = & \sum_{x}{\sum_{y}{p(x,y) \log_2{\frac{p(x,y)}{p(x) p(y)}}}}\\ & = & \sum_{x}{\sum_{y}{p(x,y) \log_2{\frac{p(y|x) p(x)}{p(x) p(y)}}}}\\ & = & \sum_{x}{\sum_{y}{p(x,y) \log_2{\frac{p(y|x)}{p(y)}}}}\\ \end{eqnarray}\]
  • The inner-most bit (inside the log) is now about how far the conditional distribution \(p(y|x)\) differs from the marginal distribution \(p(y)\)
    • We average those log ratios over the probabilities of different combinations
  • Pulling the same trick again, \[ I[X;Y] = \sum_{x}{p(x) \sum_{y}{p(y|x) \log_2{\frac{p(y|x)}{p(y)}}}} \]
    • The inner sum is now about how far the distribution of \(Y\) given \(X=x\) departs from the marginal distribution of \(Y\)
    • The outer sum says that we compute this for each \(x\), and then average over different \(x\)’s
    • The information is going to be large when different values of \(x\) make a big difference to the distribution of \(Y\)

Mutual information vs. entropy

Entropy

  • We saw the definition of entropy when we looked at trees: \[ H[X] \equiv -\sum_{x}{p(x) \log_2{p(x)}} \]
    • This is clearly \(=0\) for a degenerate distribution / random variable, one where \(p(x^*) = 1\) for one particular \(x^*\) (and so \(p(x) = 0\) if \(x\neq x^*\))
    • \(H[X] \geq 0\), and \(H[X] = 0\) only if \(X\) is degenerate
      • See backup
    • If \(X\) has \(m\) possible values, then \(H[X] \leq \log_2{m}\), and the only distribution where \(H[X] =\log_2{m}\) is the uniform distribution where \(p(x) = 1/m\) for each \(x\)
      • See backup
  • Entropy is a measure of the spread of the variable, or the uncertainty in the outcome
    • If it’s certain that \(X=x^*\), there’s no spread to its values, there’s no uncertainty about the outcome, and \(H[X]=0\)
    • If \(X\) is uniformly distributed, we’ve got as much spread as possible, as much uncertainty about the value of \(X\) as possible, and \(H[X] = \log_{2}{m}\), its maximum possible value
  • Of course there’s nothing special about \(X\), we can do this to any other (discrete) random variable, so \[ H[Y] = -\sum_{y}{p(y) \log_2{p(y)}} \]
  • We’ll come back to the issue of continuous distributions later

Coding interpretation of entropy

  • Suppose we want to encode or represent the value of \(X\) with a string of binary digits (or bits)
    • Which is how your electronics encode or represent every single thing they ever deal with…
  • If \(X\) can take \(m\) possible values, we can clearly always do this with \(\lceil \log_2{m} \rceil\) bits
    • The round-up notation \(\lceil \cdot \rceil\) is annoying to have to write, so I’ll drop it
  • But we could do better, on average, by giving shorter codes to more-common values of \(X\)
    • E.g., \(X\) is either \(a\), \(b\), \(c\) or \(d\), so we could always use 2 bits
    • But if \(p(a)=0.9\), \(p(b)=p(c)=p(d)=1/30\), then coding “a” as 0, “b” as 10, “c” as 110 and “d” as 111 gives an average code length of \(\frac{9}{10}\times 1 + \frac{1}{30}\times 2 + 2\times\frac{1}{30}\times 3 = \frac{7}{6} \approx 1.17\) bits.
      • Aside: I’ve chosen these so that none of the short codes appears at the beginning of one of the longer codes, so a sequence like \(0111 10 110000\) is uniquely decodable (as \(adbcaaa\)); using prefix-free codes like this isn’t essential, but doesn’t hurt
    • Moral: give more common values shorter codes
  • The average code length has to be at most \(\log_2{m}\), and there is clearly some lower limit
    • Because there aren’t a lot of short binary strings: at most 2 values of \(X\) can get 1-bit codes, at most 4 values can get 2-bit codes, etc.
  • Define \(c(x)\) as the length of the code assigned to the value \(x\). The expected code length is then \(\sum_{x}{p(x) c(x)}\)
  • A fundamental result from Shannon (1948): for any coding scheme, \[ \sum_{x}{p(x) c(x)} \geq -\sum_{x}{p(x) \log_2{p(x)}} = H[X] \]
    • The entropy is a lower bound on the number of bits needed to encode (or store, or describe) the value of \(X\)
  • A second fundamental result from Shannon (1948): if you know \(p\), you can use it to build a code which attains the lower bound
    • (More exactly, if you see a sequence of draws \(X_1, X_2, \ldots X_n\), and use this code to describe the whole sequence, the code length approaches \(nH[X]\) arbitrarily closely)
    • Extensions to handle dependent sequences of data, where the code can shift depending on previous values, are important (they’re how we compress files) but work similarly

Entropy tells us how many bits we need to describe/encode a random variable
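As a quick check on the \(a/b/c/d\) example above (my addition), the following R snippet computes the expected code length of the variable-length code and the entropy of the distribution; the code beats the 2 bits of a fixed-length code, but stays above the entropy, as Shannon’s bound says it must.

```r
p <- c(a = 0.9, b = 1/30, c = 1/30, d = 1/30)   # the distribution from the example
code_len <- c(a = 1, b = 2, c = 3, d = 3)       # lengths of the codes 0, 10, 110, 111
expected_length <- sum(p * code_len)            # average bits per symbol with this code
H <- -sum(p * log2(p))                          # the entropy lower bound
c(fixed = 2, variable = expected_length, entropy = H)
```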

Conditional entropy

  • We can apply the definition to a conditional distribution, say the conditional distribution of \(Y\) given \(X=x\) \[ H[Y|X=x] \equiv -\sum_{y}{p(y|x) \log_2{p(y|x)}} \]
    • This is the uncertainty in \(Y\), given that \(X=x\)
  • The conditional entropy of \(Y\) given \(X\) is the average value of \(H[Y|X=x]\): \[ H[Y|X] \equiv \sum_{x}{p(x) H[Y|X=x]} \]
    • This is the average uncertainty remaining in \(Y\), once we’ve conditioned on \(X\)
  • Some easy properties of \(H[Y|X]\):
    • \(H[Y|X] \geq 0\)
      • Because \(H[Y|X=x] \geq 0\), because it’s an entropy
    • If \(Y=f(X)\), then \(H[Y|X] =0\)
      • Because each \(H[Y|X=x] = 0\), because \(p(f(x)|x) = 1\)
    • If \(H[Y|X] = 0\), then \(Y=f(X)\) for some \(f\)
      • Because \(H[Y|X] = 0\) implies \(H[Y|X=x] = 0\) for all \(x\), and that in turn implies \(p(y^*|x) = 1\) for some \(y^*\) (for each \(x\)); that \(y^*\) is \(f(x)\)
  • A harder property: \(H[Y|X] \leq H[Y]\)
    • Proof: we’ll come back to this later
    • Slogan: “Conditioning never increases entropy”
      • Sometimes phrased “Conditioning reduces entropy”, but equality is possible
  • If \(X \indep Y\), then \(H[Y] = H[Y|X]\)
    • Conditioning on an independent variable doesn’t reduce entropy
    • Proof: direct calculation
    • We will see that \(H[Y] = H[Y|X]\) iff \(X \indep Y\)
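Here’s a small R sketch of these definitions (again with made-up numbers): compute \(H[Y|X=x]\) for each \(x\), average to get \(H[Y|X]\), and check that it doesn’t exceed \(H[Y]\).

```r
H_bits <- function(p) {        # entropy of a probability vector, in bits
  p <- p[p > 0]                # drop zero cells: 0 log 0 = 0 by convention
  -sum(p * log2(p))
}
p_xy <- matrix(c(0.30, 0.15, 0.05,
                 0.05, 0.10, 0.35),
               nrow = 2, byrow = TRUE,
               dimnames = list(x = c("x1", "x2"), y = c("y1", "y2", "y3")))
p_x <- rowSums(p_xy)
p_y <- colSums(p_xy)
p_y_given_x <- sweep(p_xy, 1, p_x, "/")           # row i holds p(y | x = x_i)
H_y_given_each_x <- apply(p_y_given_x, 1, H_bits) # H[Y | X = x] for each x
H_y_given_x <- sum(p_x * H_y_given_each_x)        # H[Y|X] = sum_x p(x) H[Y|X=x]
c(H_Y = H_bits(p_y), H_Y_given_X = H_y_given_x)   # the second should not exceed the first
```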

Coding interpretation of conditional entropy

  • “If we’ve already described \(X\), how many extra bits do we need to describe \(Y\)?”

Classification and conditional entropy

  • We observe \(X\) and want to guess \(Y\)
    • \(Y\) is discrete and takes \(m\) different levels
  • Our guess is some function \(f\) of \(X\), i.e., \(\hat{Y} = f(X)\)
  • Set \(E=1\) if \(f(X) \neq Y\) and \(=0\) otherwise, so \(\Prob{E=1} =\) misclassification rate

Fano’s inequality: \(H[Y|X] \leq \Prob{E=1}\log_2{(m - 1)} + H[E]\)

  • Note: if \(m=2\) (binary \(Y\)), then \(\log_2{(m-1)} = 0\), so the inequality simplifies to \(H[Y|X] \leq H[E]\)

  • The RHS is an increasing function of \(\Prob{E=1}\), i.e., of the misclassification rate
    • This tells us that if we want to keep the misclassification rate below a certain level, then \(H[Y|X]\) has to be small, at most the value on the RHS
    • Going the other way, starting from \(H[Y|X] > 0\), the misclassification rate can’t be too small (though there isn’t a nice way of solving for that minimum rate)

Fixing the desired mis-classification rate, the curve gives the maximum allowable \(H[Y|X]\). Fixing \(H[Y|X]\), the curve gives the minimum achievable mis-classification rate.
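The curve is easy to compute for yourself; here’s a little R sketch (my addition) of the right-hand side of Fano’s inequality as a function of the misclassification rate, for an arbitrary choice of \(m=4\) classes.

```r
binary_entropy <- function(p) {       # H[E] for an indicator with P(E=1) = p
  ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
fano_rhs <- function(p_err, m) {      # right-hand side of Fano's inequality
  p_err * log2(m - 1) + binary_entropy(p_err)
}
p_err <- seq(0, 0.5, by = 0.01)
plot(p_err, fano_rhs(p_err, m = 4), type = "l",
     xlab = "misclassification rate", ylab = "maximum allowable H[Y|X] (bits)")
# With m = 2, log2(m-1) = 0, so the bound reduces to H[Y|X] <= H[E]
```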

  • Scarlett and Cevher (n.d.) is a really great reference on Fano’s inequality and its statistical uses (which this just scratches)

MORAL: \(H[Y|X]\) gives a lower bound on how well \(Y\) can be guessed from \(X\)

Reducing entropy

  • \(H[Y] \geq H[Y|X]\), and \(H[Y] = H[Y|X]\) iff \(X \indep Y\)
  • We can say how much conditioning on \(X\) reduces the entropy of \(Y\): \[ H[Y] - H[Y|X] \]
  • This reduction in entropy \(\geq 0\), and the reduction in entropy \(=0\) iff \(X \indep Y\)

Reduction in entropy = Mutual information

  • \(H[Y] - H[Y|X] \geq 0\), \(=0\) iff \(X \indep Y\)
  • \(I[X;Y] \geq 0\), \(=0\) iff \(X \indep Y\)
  • Can we relate these two?
    • Of course we can!
\[\begin{eqnarray} H[Y] - H[Y|X] & = & -\sum_{y}{p(y)\log_2{p(y)}} + \sum_{x}{p(x)\sum_{y}{p(y|x) \log_2{p(y|x)}}}\\ & = & -\sum_{y}{\left(\log_2{p(y)}\right)\sum_{x}{p(x,y)}} + \sum_{x}{p(x)\sum_{y}{p(y|x) \log_2{p(y|x)}}}\\ & = & -\sum_{y}{\sum_{x}{p(x,y) \log_2{p(y)}}} + \sum_{x}{p(x)\sum_{y}{p(y|x) \log_2{p(y|x)}}}\\ & = & -\sum_{x}{\sum_{y}{p(x,y) \log_2{p(y)}}} + \sum_{x}{p(x)\sum_{y}{p(y|x) \log_2{p(y|x)}}}\\ & = & -\sum_{x}{p(x) \sum_{y}{p(y|x) \log_2{p(y)}}} + \sum_{x}{p(x)\sum_{y}{p(y|x) \log_2{p(y|x)}}}\\ & = & \sum_{x}{p(x)\sum_{y}{p(y|x) \log_2{ \frac{p(y|x)}{p(y)} }}}\\ & = & I[X;Y] \end{eqnarray}\]

\(H[Y]-H[Y|X] = I[X;Y]\)

Mutual information is amount of entropy removed from \(Y\) by conditioning on \(X\)
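If the algebra above feels like a blur, here’s a quick numerical check in R (my addition) that the two quantities really do coincide on a made-up joint table.

```r
p_xy <- matrix(c(0.30, 0.15, 0.05,
                 0.05, 0.10, 0.35), nrow = 2, byrow = TRUE)
p_x <- rowSums(p_xy); p_y <- colSums(p_xy)
H_bits <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
H_y_given_x <- sum(p_x * apply(sweep(p_xy, 1, p_x, "/"), 1, H_bits))  # H[Y|X]
I_xy <- sum(p_xy * log2(p_xy / outer(p_x, p_y)))                      # I[X;Y]
c(H_bits(p_y) - H_y_given_x, I_xy)   # the two numbers agree
```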

An upper limit to mutual information

  • Since \(I[X;Y]=I[Y;X]\), \[ H[Y] - H[Y|X] = H[X] - H[X|Y] \]
    • If we can predict \(Y\) from \(X\), we can predict \(X\) from \(Y\)
  • Consequence: \(I[X;Y] \leq H[X]\) and \(I[X;Y] \leq H[Y]\)
    • \(H[X]\) is the upper limit on how much entropy we can remove from \(Y\) by conditioning

An aside about control

  • “The law of requisite variety” (Ashby 1956): If you want \(H[Y|X] = 0\), you must have \(H[X] \geq H[Y]\)
    • That doesn’t mean any \(X\) with \(H[X] > H[Y]\) will give \(H[Y|X] = 0\)!
  • If you want to control \(Y\) perfectly, your control variables must have at least \(H[Y]\) bits of entropy
  • If your control variables \(X\) have \(H[X] < H[Y]\), perfect control isn’t possible, because \(H[Y|X] > 0\)
    • This has really deep consequences for control engineering (Touchette and Lloyd 1999, 2004), neuroscience and general biology (Ashby 1960), studying organizations, fraud (Davies 2018)
      • A dictator (or CEO) cannot actually control everything that happens in his country (or company), because he just doesn’t have the capacity

Continuous variables

  • If \(X\) and \(Y\) are continuous, we can use the same definition of mutual information, only replacing sums with integrals: \[ I[X;Y] = \int{\int{\left(\log_2{\frac{p(x,y)}{p(x)p(y)}}\right) p(x,y) dx dy}} \]
    • This is still \(\geq 0\), and \(I[X;Y]=0\) iff \(X \indep Y\)
    • If you look at the proof of \(I[X;Y] \geq 0\) in the backup, you’ll see it never uses the assumption that \(X\) and \(Y\) are discrete
  • If we define the continuous or “differential” entropy by \[ H[X] = -\int{\left(\log_2{p(x)}\right) p(x) dx} \] and similarly for \(H[Y|X=x]\) and \(H[Y|X]\), then we can still say that \[ I[X;Y] = H[Y]-H[Y|X] = H[X] - H[X|Y] \]
    • \(H[X]\) is still a measure of spread or uncertainty…
    • Warning: \(H[X]\) can be negative in the continuous case
      • The proof that \(H[X] \geq 0\) in the back-up does use the fact that \(X\) is discrete, and specifically that \(p(x) \leq 1\) (where?)
      • \(H[X] = -\infty\) for a degenerate distribution
    • Warning: for discrete variables, \(H[X] = H[f(X)]\) if \(f\) is 1-1; this is not true for continuous variables
  • Entropy is a less useful concept for continuous variables, but information still works just fine
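As a concrete illustration of the warning that differential entropy can be negative (my addition, a standard calculation rather than anything from the lecture): for a Gaussian \(X \sim N(\mu, \sigma^2)\), plugging the density into the definition gives \[ H[X] = \frac{1}{2}\log_2{\left( 2 \pi e \sigma^2 \right)} \] which is negative as soon as \(\sigma^2 < 1/(2\pi e) \approx 0.059\), and which \(\rightarrow -\infty\) as the distribution collapses towards a point mass.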

Conditional independence and conditional mutual information

Entropy, log-likelihood, divergence

Cross-entropy and divergence

  • Stick to the discrete case for now; replace sums with integrals as appropriate for the continuous case
  • Entropy of a variable \(Z\) is \[ H[Z] = -\sum_{z}{p(z)\log_2{p(z)}} \]
  • What happens if we put some other distribution inside the log? That is, what happens if we try to calculate \[ -\sum_{z}{p(z)\log_{2}{q(z)}} \] for some other probability function \(q\)?
    • This is called the cross-entropy between \(p\) and \(q\)
    • The notation isn’t very fixed here, but you often see \(J[Z; p\|q]\) or \(J[Z; P\|Q]\)
      • Many people omit the \(Z\) when it’s clear from context
      • The double bar is intended as a kind of reminder that \(p\) and \(q\) (or \(P\) and \(Q\)) are the same kind of thing, but that order matters here
      • The double bar here isn’t a conditioning sign, or a fragment of the norm of a vector
  • Claim: if \(q \neq p\), then \[ -\sum_{z}{p(z)\log_2{p(z)}} < -\sum_{z}{p(z)\log_{2}{q(z)}} \] or \[ J[Z; P\|Q] > H_P[Z] \]
    • “The cross-entropy is always higher than the entropy”
    • This fact is sometimes called the Gibbs inequality
    • Coding interpretation: if you use the wrong probability distribution to design your coding scheme, your description length will be longer than it needs to be
  • Reformulated claim: \[ \sum_{z}{p(z)\log_2{\frac{p(z)}{q(z)}}} \geq 0 \] with equality iff \(p=q\)
    • Proof: see backup
    • \(p(z)/q(z)\) is a probability ratio, \(>1\) at points \(z\) which \(p\) says are more likely than \(q\) does, \(<1\) where \(q\) puts more probability than \(p\) does
    • Taking the log means we get \(+\) where \(p(z) > q(z)\) and \(-\) where \(q(z) > p(z)\)
    • We then average the log probability ratios over possible values of \(z\), weighted by how probable \(p\) thinks they are
  • Let’s define \[ D_{Z}(P\|Q) \equiv \sum_{z}{p(z)\log_2{\frac{p(z)}{q(z)}}} \]
    • Again, the \(Z\) is often omitted when it’s clear from context
    • The capital letters \(P\) and \(Q\) indicate that it’s really the distributions that matter, and not the particular probability mass/density functions
    • \(D(P\|Q)\) is an expected log-probability ratio
    • Note: \(D(P\|Q) \neq D(Q\|P)\)
      • Unequal in general, that is; equality can happen for some particular pairs of distributions
  • We call \(D(P\|Q)\) the “divergence of \(Q\) from \(P\)”
    • The mathematicians won’t call something a “distance” unless it’s a symmetric function, so they won’t let us use that word here
    • Alternate names: Kullback-Leibler divergence (after Kullback and Leibler (1951)), KL divergence, KL information, KL information number, relative entropy
    • Some people write \(D(p, q)\), or \(D(P\|Q)\), or \(D(P,Q)\), or \(D(P:Q)\), etc., etc.
    • \(D(P\|Q)\) measures how far \(Q\) is from \(P\)
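Here’s a small R sketch of the definition (my addition, with two made-up distributions on three values), mainly to show that \(D(P\|Q)\) and \(D(Q\|P)\) are both \(\geq 0\) but generally unequal.

```r
kl_divergence <- function(p, q) {     # D(P||Q) in bits, for discrete distributions
  stopifnot(all(q[p > 0] > 0))        # D(P||Q) is infinite if q = 0 somewhere p > 0
  keep <- p > 0                       # 0 * log(0/q) = 0 by convention
  sum(p[keep] * log2(p[keep] / q[keep]))
}
p <- c(0.9, 0.05, 0.05)               # a made-up, rather concentrated distribution
q <- rep(1/3, 3)                      # the uniform distribution on the same 3 values
c(D_PQ = kl_divergence(p, q), D_QP = kl_divergence(q, p))   # both >= 0, not equal
```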

Mutual information is a divergence

  • Take \(z=(x,y)\)
  • Take \(p(z)=p(x,y)\), the actual joint distribution of \(X\) and \(Y\)
  • Take \(q(z)=p(x)p(y)\), the joint distribution we’d have if \(X\) and \(Y\) were independent but had the same marginal distributions
  • With those choices, \(D_Z(P\|Q)\) is exactly \(I[X;Y]\): mutual information is the divergence of the joint distribution from the product of its marginals
  • We could also work with \[ \sum_{x,y}{p(x)p(y)\log_2{\frac{p(x)p(y)}{p(x,y)}}} \] the lautum information (Palomar and Verdú 2008), which is also \(\geq 0\), and \(=0\) iff \(X \indep Y\), but that doesn’t play so nicely with entropies

Divergence and hypothesis testing / classification

  • Suppose that when \(Y=0\), \(X\) has the distribution \(P\), and that when \(Y=1\), \(X\) has distribution \(Q\)
  • Say \(\alpha\) is the false positive rate and \(\beta\) is the false negative rate
  • Some algebra shows \[\begin{eqnarray} D_X(P\|Q) & \geq & (1-\alpha)\log_2{\frac{1-\alpha}{\beta}} + \alpha\log_2{\frac{\alpha}{1-\beta}}\\ D_X(Q\|P) & \geq & \beta\log_2{\frac{\beta}{1-\alpha}} + (1-\beta)\log_2{\frac{1-\beta}{\alpha}} \end{eqnarray}\]

  • (See the appendix on information theory in Shalizi (n.d.), which is ripping off Kullback (1968))

  • Those inequalities don’t nicely invert to give the best possible values of \(\alpha\) and \(\beta\), but:
    • For fixed false positive rate, \(D_X(P\|Q)\) controls the minimum possible false negative rate
    • For fixed false negative rate, \(D_X(Q\|P)\) controls the minimum possible false positive rate

Fixing the FNR, the curve gives the minimum necessary divergence; fixing the divergence, the curve gives the minimum attainable FNR. The curve depends on the false positive rate (\(\alpha\)).

  • We can also apply this to hypothesis testing (null hypothesis \(P\) vs. alternative hypothesis \(Q\)); the divergence tells us about the optimal size and power of hypothesis tests

Log-likelihood and divergence

  • Suppose we’ve got a family of probability distributions, with a parameter \(\theta\)
    • In general, \(\theta\) is a vector (say, mean and variance for a Gaussian), but that’s not important here so I’m not going to complicate the notation
  • Each value of \(\theta\) gives us a probability mass/density function, say \(q(z;\theta)\)
    • When I want to talk about the distribution (rather than the probability function), I’ll write \(Q(\theta)\)
  • We see independent, identically distributed data \(z_1, \ldots z_n\)
    • The IID assumption isn’t essential but will simplify book-keeping
  • Each distribution in the family assigns some probability to each data point
  • The likelihood of the parameter value \(\theta\) is the over-all probability it assigns to the data, \[ \prod_{i=1}^{n}{q(z_i;\theta)} \]
  • The log-likelihood of \(\theta\) is the log of the likelihood: \[ L(\theta) \equiv \sum_{i=1}^{n}{\log_2{q(z_i;\theta)}} \]
    • Normally we use natural logs, but using \(\log_2\) just changes everything by a constant factor, and will connect us to our information-theoretic quantities
  • \(L(\theta)\) is an extensive quantity (it scales proportionally with \(n\))
  • Let’s look at \(L(\theta)/n\), which will be an intensive quantity, more comparable across data sets of different size: \[ \ell(\theta) \equiv \frac{1}{n}L(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\log_2{q(z_i;\theta)}} \]
  • \(\ell(\theta)\) is a sample average, so the law of large numbers tells us that it will converge to an expected value: as \(n\rightarrow\infty\), \[ \ell(\theta) \rightarrow \Expect{\log_2{q(Z;\theta)}} \]
  • What’s that expectation?
  • Case 1: Suppose the data really did come from the model family, with parameter \(\theta\)
    • Then \[\begin{eqnarray} \Expect{\log_2{q(Z;\theta)}} & = & \sum_{z}{q(z;\theta) \log_{2}{q(z;\theta)}}\\ & = & -H_{Q(\theta)}[Z] ~, \end{eqnarray}\] the negative entropy of \(Z\) under the distribution \(Q(\theta)\)
  • Case 2: The data came from the model family, but from a different parameter value, say \(\theta_0\)
    • Then \[\begin{eqnarray} \Expect{\log_2{q(Z;\theta)}} & = & \sum_{z}{q(z;\theta_0) \log_2{q(z;\theta)}}\\ & = & \sum_{z}{q(z;\theta_0)\log_2{\left(\frac{q(z;\theta)}{q(z;\theta_0)}q(z;\theta_0)\right)}}\\ & = & -H_{Q(\theta_0)}[Z] - D(Q(\theta_0) \| Q(\theta)) \end{eqnarray}\] i.e., the negative entropy under \(Q(\theta_0)\), minus the divergence of \(Q(\theta)\) from \(Q(\theta_0)\)
  • Remember that \(D(P\|Q) >0\) unless \(P=Q\), so:

If the model is right, in the long run (\(n\rightarrow\infty\)), the true parameter value has the highest log-likelihood

  • This is almost enough to show that maximum likelihood estimates converge on the true parameter value
    • We also need to rule out the possibility that values of \(\theta\) far from \(\theta_0\) have small divergence
    • For an elementary approach to this, see Vaart (1998)
  • Case 3: the model family is just wrong, and the true distribution is \(P\)
    • By the same arguments as in case 2, \[ \Expect{\log_2{q(Z;\theta)}} = -H_P[Z] - D(P\|Q_{\theta}) \]
    • Even if the model is wrong, the parameter value with the highest long-run log-likelihood is the one which gives the smallest divergence from the truth
    • We can define this as \[ \theta^* = \argmin_{\theta}{D(P\|Q_{\theta})} \]
    • So \(\theta^*\) will have the highest long-run log-likelihood
      • Econometricians (like White (1994)) call this the pseudo-true parameter value
    • But that long-run log-likelihood won’t be as good as the (negative) entropy under the true distribution
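Here’s an R sketch of Case 2 (my addition): Bernoulli data with true parameter \(\theta_0 = 0.3\), evaluated under a wrong parameter value. The per-observation log-likelihood settles down to \(-H_{Q(\theta_0)}[Z] - D(Q(\theta_0)\|Q(\theta))\), and the true parameter does better in the long run.

```r
set.seed(17)
theta_0 <- 0.3                                  # true Bernoulli parameter
n <- 1e5
z <- rbinom(n, size = 1, prob = theta_0)        # simulated IID data
avg_loglik <- function(theta) {                 # per-observation log2-likelihood
  mean(ifelse(z == 1, log2(theta), log2(1 - theta)))
}
H_0 <- -(theta_0 * log2(theta_0) + (1 - theta_0) * log2(1 - theta_0))
D_0 <- function(theta) {                        # D(Q(theta_0) || Q(theta)), in bits
  theta_0 * log2(theta_0 / theta) + (1 - theta_0) * log2((1 - theta_0) / (1 - theta))
}
theta <- 0.5                                    # some other ("wrong") parameter value
c(empirical = avg_loglik(theta), theory = -H_0 - D_0(theta))    # should be close
c(at_truth = avg_loglik(theta_0), at_wrong = avg_loglik(theta)) # truth does better
```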

Estimating entropy and information

Plug-in estimates

Here’s the obvious way to estimate an entropy, or a mutual information, or a conditional information (etc., etc.):

  1. Get an estimate \(\hat{p}\) of the true distribution \(p\)
    1. For discrete variables, the obvious estimate is the empirical distribution, where \(\hat{p}(x) = n_x/n\) (the number of times \(X=x\) divided by the sample size); this is also the maximum likelihood estimate of an unrestricted multinomial distribution
    2. Any consistent estimator can be used, however
  2. Substitute \(\hat{p}\) for \(p\) in the formula for \(H\), \(I\), etc., as needed
    1. This is why it’s called the “plug-in” estimate
    2. If we plug in the empirical distribution, we get the empirical entropy (or information, etc.)
    3. Because \(H\), \(I\), etc., are continuous in the probabilities, and \(\hat{p} \rightarrow p\), the plug-in estimate will converge on the true \(H\) (or \(I\))

There are pros and cons to the plug-in approach:

  • Pro: it’s straightforward and it will converge on the truth with enough data (if you use a consistent estimator of the distribution)
    • The entropy package on CRAN implements a number of different plug-in estimators for discrete variables (with different estimators of the underlying distribution)
    • The mpmi package implements a plug-in estimator based on kernel density estimation
  • Con: estimating a high-dimensional distribution is intrinsically hard, in that estimators either converge very slowly or have bias (or both!), and errors in the distribution estimate propagate into errors in the entropy (or information, etc.)

As a little example, say we look at the empirical entropy of a discrete random variable with \(m\) different possible values, calculated from a sample size of \(n\). That is, we use \[ \hat{H}[X] = -\sum_{x}{\frac{n_x}{n}\log_2{\frac{n_x}{n}}} \]

  • This is negatively biased (Victor 2000): \[ \Expect{\hat{H}[X]} = H[X] - \frac{1}{2\log{2}}\frac{m-1}{n} + O(n^{-2}) \]
    • This is analogous to the way the sample variance is a negatively biased estimate of the population variance
    • Remarkably, the size of the first-order bias doesn’t involve the true distribution or even the true entropy
    • The entropy package includes a bias-corrected empirical entropy estimator
  • Similarly, the empirical mutual information is positively biased, with bias \(\frac{1}{2\log{2}}\frac{(m_x m_y -1) - (m_x-1) - (m_y-1)}{n}\)
  • Dealing with multiple variables is like increasing the number of possible values, so this makes the bias worse and worse
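Here’s a quick simulation in R (my addition) of the negative bias of the empirical entropy: sampling from the uniform distribution on \(m = 10\) values, whose true entropy is \(\log_2{10}\), and comparing the average plug-in estimate to the truth and to the first-order bias formula above.

```r
set.seed(42)
m <- 10; n <- 100                               # few observations per possible value
true_H <- log2(m)                               # entropy of the uniform distribution
empirical_H <- function() {
  x <- sample(1:m, n, replace = TRUE)           # a sample from the uniform distribution
  phat <- table(x) / n                          # empirical distribution (zero cells drop out)
  -sum(phat * log2(phat))                       # plug-in ("empirical") entropy
}
c(observed_bias = mean(replicate(5000, empirical_H())) - true_H,   # negative
  predicted_bias = -(m - 1) / (2 * n * log(2)))                    # first-order formula
```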

Non-plug-in estimates

Alternatives involve somehow trying to avoid estimating the whole distribution

  • We want \(\Expect{\log_2{\frac{1}{p(X)}}}\)
  • If we can somehow relate this expectation to something else, we can use that something else as the basis of our estimator
  • Example: using nearest neighbors for estimating entropy and mutual information
    • An observation ultimately due to Kozachenko and Leonenko (1987): the distance between \(\vec{x}_i\) and its nearest neighbor is proportional to \(1/p(\vec{x}_i)\)
      • Similarly for the distance to the \(k\)th nearest neighbor, etc.
    • So \(H[\vec{X}] \approx \Expect{\log{\|\vec{X}-\vec{X}_{NN}\|}}\)
      • There are some proportionality factors and some additive constants you need to worry about
    • So we can estimate \(H\) using just the average distance between each data point and its nearest neighbor
    • Extended to estimate mutual information by Kraskov, Stögbauer, and Grassberger (2004)
    • Extended to estimate conditional mutual information by Mesner and Shalizi (2019)
  • There are lots of other approaches, all based on different bits of math that try to avoid estimating the whole distribution
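Here’s a rough R sketch of the nearest-neighbor idea (my addition). It deliberately ignores the proportionality factors and additive constants mentioned above; since those depend only on \(n\) and the dimension, they cancel in differences, so the average log nearest-neighbor distance should differ by about 1 bit between samples from \(N(0,1)\) and \(N(0,4)\), matching the 1-bit difference in their differential entropies.

```r
set.seed(7)
n <- 1000
mean_log2_nn_dist <- function(x) {
  d <- as.matrix(dist(x))            # all pairwise distances (fine for n this small)
  diag(d) <- Inf                     # a point doesn't count as its own neighbor
  mean(log2(apply(d, 1, min)))       # average log2 distance to the nearest neighbor
}
x_narrow <- rnorm(n, sd = 1)
x_wide   <- rnorm(n, sd = 2)         # doubling the sd adds exactly 1 bit of entropy
mean_log2_nn_dist(x_wide) - mean_log2_nn_dist(x_narrow)   # should be near 1
```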

Backup material

\(0\log{0}\) and \(0\log{\frac{0}{q}}\) are both \(0\)

  • \(0\log{\frac{0}{q}} = 0\log{0} - 0\log{q} = 0\log{0}\), so we only have to show this for \(0\log{0}\)
  • We could just plug \(u=0\) into \(u\log{u}\), but that gives us \(0 \times (-\infty)\), which evaluates to “?”
  • More productively, we want the limit \(\lim_{u\rightarrow 0}{u\log{u}}\)
    • \(u\) and \(\log{u}\) are both continuous functions on \(u > 0\), so their product is also continuous there
    • The natural way to assign a value at \(u = 0\) is to extend by continuity, i.e., to take the limit (that’s basically what “continuous” means)
  • Visually, it’s obvious:

  • The way we evaluate a limit when it turns into an “indeterminate form” like this is to use L’Hopital’s rule, \(\lim{\frac{f(x)}{g(x)}} =\lim{\frac{df/dx}{dg/dx}}\), so \[\begin{eqnarray} u\log{u} & = & \frac{\log{u}}{1/u}\\ \lim_{u\rightarrow 0}{u\log{u}} & = & \lim_{u\rightarrow 0}{\frac{\log{u}}{1/u}}\\ & = & \lim_{u\rightarrow 0}{\frac{1/u}{-1/u^2}}\\ & = & \lim_{u\rightarrow 0}{\frac{-u^2}{u}}\\ & = & \lim_{u\rightarrow 0}{-u}\\ & = & 0 \end{eqnarray}\] as promised.

Why is \(I[X;Y] \geq 0\)?

The crucial fact is that the logarithm is a concave function. Geometrically, this means that if you plot the curve of \(\log{u}\), and then connect any two points on the curve by a straight line, the curve will lie above the line:

Algebraically1, this means that for any two numbers \(t_1\) and \(t_2\), and any \(w\), \(0 < w < 1\), we have \[ w \log{t_1} + (1-w) \log{t_2} < \log{(w t_1 + (1-w) t_2)} \] Of course, if we let \(w\) be either 0 or 1, or if \(t_1 = t_2\), we’d have equality between the two sides, not inequality, but those are the only ways to get equality between them.

This extends to multiple points2. If we pick any \(k\) numbers \(t_1, \ldots t_k\), and a set of weights \(w_1, \ldots w_k\), with all \(w_i > 0\) but \(\sum_{i=1}^{k}{w_i} = 1\), then the weighted average of the logs of the \(t_i\)’s is \(<\) the log of the weighted average: \[ \sum_{i=1}^{k}{w_i \log{t_i}} < \log{\left(\sum_{i=1}^{k}{w_i t_i}\right)} \] Again, if we let one of the \(w_i=1\) and all the others \(=0\), or if all the \(t_i\) are equal, we’d get equality, but otherwise we have a strict inequality.

This even extends to integrals3: for any probability density \(w(t)\), \[ \int{w(t)\log{t} dt} < \log{\left(\int{t w(t) dt}\right)} \] and, again, the only way to get equality is if the distribution puts probability 1 on one particular value of \(t\), and probability 0 on every other value. (The density needs to be a Dirac delta function.) For a non-degenerate distribution, the inequality is strict.

The general fact that the expected value of a concave function is \(<\) the function applied to the expected value is called Jensen’s inequality.

Now remember that \[ I[X;Y] = \sum_{x,y}{p(x,y) \log_2{\frac{p(x,y)}{p(x) p(y)}}} \] We want to prove that this is \(\geq 0\), so we’re going to prove that \(-I[X;Y] \leq 0\). Of course \[ -I[X;Y] = \sum_{x,y}{p(x,y) \log_2{\frac{p(x) p(y)}{p(x,y)}}} \] This is a weighted average of logarithms, so it’s \(<\) the logarithm of the weighted average: \[\begin{eqnarray} \sum_{x,y}{p(x,y) \log_2{\frac{p(x) p(y)}{p(x,y)}}} & < & \log_2{\sum_{x,y}{p(x,y) \frac{p(x) p(y)}{p(x,y)}}}\\ & = & \log_2{\sum_{x,y}{p(x)p(y)}}\\ & = & \log_2{1}\\ & = &0 \end{eqnarray}\] as desired. The only way to avoid this, and get equality, is for all of the log probability ratios to equal 0, i.e., for \(p(x,y) = p(x) p(y)\), i.e., for \(X\) and \(Y\) to be independent.

\(H[X] \geq 0\)

\(\log{u}\) is a concave function (see above) so we use the same reasoning about the average of a concave function (Jensen’s inequality) sketched above: \[\begin{eqnarray} \sum_{x}{p(x) \log_2{p(x)}} & < & \log_{2}{\sum_{x}{p(x) p(x)}}\\ & \leq & \log_{2}{\sum_{x}{ p(x) }}\\ & = & \log_{2}{1}\\ & = & 0\\ \end{eqnarray}\] (To go from the first line to the second, remember that we’re assuming \(X\) is discrete, so \(p(x) \leq 1\).)

To make the first line an equality rather than an inequality, we need \(p(x)\) to put probability 1 on one particular value, which will also make the second line an equality.

\(H[X] \leq \log_2{m}\)

If \(X\) takes at most \(m\) distinct values, then \(H[X] \leq \log_{2}{m}\), with equality only if \(p(x) = 1/m\) for all \(x\).

To prove this, we write \[ H[X] = \sum_{x}{p(x) \log_2{\frac{1}{p(x)}}} \] So, applying Jensen’s inequality, \[\begin{eqnarray} H[X] & \leq & \log_2{\left( \sum_{x}{p(x) \frac{1}{p(x)}} \right)}\\ & = & \log_{2}{\sum_{x}{1}}\\ & = & \log_{2}{m} \end{eqnarray}\] Again, the only way for the first line to be an equality rather than a strict inequality is for all the values of \(1/p(x)\) to be equal, which means all the values of \(p(x)\) must be equal, and so they must all be \(1/m\).

\(D(p\|q) \geq 0\), with equality iff \(p=q\)

Use Jensen’s inequality in the now-familiar way. Specifically, show that \(-D(p\|q) \leq 0\): \[\begin{eqnarray} -D(p\|q) & = & \sum_{x}{p(x) \log_2{\frac{q(x)}{p(x)}}}\\ & \leq & \log_2{\sum_{x}{p(x) \frac{q(x)}{p(x)}}}\\ & = & \log_2{\sum_{x}{q(x)}}\\ & = & \log_2{1}\\ & = & 0 \end{eqnarray}\] as was to be shown. Again, the only way for the \(\leq\) to be \(=\) is if the distribution of log probability ratios is degenerate at one particular value, which can only be 0, which would imply \(q(x) = p(x)\) for all \(x\).

References

Ashby, W. Ross. 1956. An Introduction to Cybernetics. London: Chapman; Hall.

———. 1960. Design for a Brain: The Origins of Adaptive Behavior. 2nd ed. London: Chapman; Hall.

Davies, Daniel. 2018. Lying for Money: How Legendary Frauds Reveal the Workings of Our World. London: Profile Books.

Kozachenko, L. F., and N. N. Leonenko. 1987. “Sample Estimate of the Entropy of a Random Vector.” Problems of Information Transmission 23:95–101.

Kraskov, Alexander, Harald Stögbauer, and Peter Grassberger. 2004. “Estimating Mutual Information.” Physical Review E 69:066138. https://doi.org/10.1103/PhysRevE.69.066138.

Kullback, Solomon. 1968. Information Theory and Statistics. 2nd ed. New York: Dover Books.

Kullback, Solomon, and R. A. Leibler. 1951. “On Information and Sufficiency.” Annals of Mathematical Statistics 22:79–86. https://doi.org/10.1214/aoms/1177729694.

Mesner, Octavio César, and Cosma Rohilla Shalizi. 2019. “Conditional Mutual Information Estimation for Mixed Discrete and Continuous Variables with Nearest Neighbors.” E-print, arxiv.org:1912.03387. https://arxiv.org/abs/1912.03387.

Palomar, Daniel P., and Sergio Verdú. 2008. “Lautum Information.” IEEE Transactions on Information Theory 54:964–75. http://www.princeton.edu/~verdu/lautum.info.pdf.

Scarlett, Jonathan, and Volkan Cevher. n.d. “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation.” In Information-Theoretic Methods in Data Science, edited by Yonina Eldar and Miguel Rodrigues. Cambridge, England: Cambridge University Press. https://arxiv.org/abs/1901.00555.

Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27:379–423.

Touchette, Hugo, and Seth Lloyd. 1999. “Information-Theoretic Limits of Control.” Physical Review Letters 84:1156–9. http://arxiv.org/abs/chao-dyn/9905039.

———. 2004. “Information-Theoretic Approach to the Study of Control Systems.” Physica A 331:140–72. http://arxiv.org/abs/physics/0104007.

Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge, England: Cambridge University Press.

Victor, Jonathan D. 2000. “Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials.” Neural Computation 12:2797–2804. https://doi.org/10.1162/089976600300014728.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.


  1. Why is this the algebraic version of the geometric statement about the curve and its cord?

  2. Can you prove this, starting from the two-point version?

  3. Can you prove this, starting from the multi-point version of the inequality? (Hint: the integral is a limit of sums.)