Density estimation
36-402
14 March 2024
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)}
\]
Data: body weights of cats
library(MASS)
data(cats)
head(cats$Bwt)
## [1] 2.0 2.0 2.0 2.1 2.1 2.1
summary(cats$Bwt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.300 2.700 2.724 3.025 3.900
The empirical cumulative distribution function
- ECDF: \[
\hat{F}_n(a) = \frac{1}{n}\sum_{i=1}^{n}{\Indicator{x_i \leq a}}
\]
- Makes a jump at each distinct value seen in the data
- Constant in between the jumps
Computationally: ecdf
plot(ecdf(cats$Bwt), xlab = "Body weight (kg)", ylab = "Cumulative probability",
main = "Empirical CDF of cats' body weights")

The ECDF estimates the CDF
- Suppose \(x_1, x_2, \ldots x_n\) are IID with CDF \(F\). Then for any \(a\), \[
\hat{F}_n(a) \rightarrow F(a)
\] by the law of large numbers
The CDF is not enough

- What’s the probability of a cat weighing between \(3.21\) and \(3.22\) kg?
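The ECDF can only answer "exactly zero" here: since weights are recorded to the nearest 0.1 kg, no observation falls in this interval, even though the true probability isn't zero. A quick check:
# The ECDF puts all its mass on observed values, so this narrow interval
# between recorded weights gets probability exactly zero
Fn <- ecdf(cats$Bwt)
Fn(3.22) - Fn(3.21)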
The CDF is not enough (cont’d)
- More generally, pdf is derivative of CDF, \(f(x) = dF/dx\)
- Derivative of an empirical CDF is either 0 or \(\infty\)
- How do we estimate a pdf?
What about a histogram?
hist(cats$Bwt, xlab = "Body weight (kg)", ylab = "Number of cats", breaks = 20)

- Gives us an idea of where we have more or fewer data points…
Fitting a parametric model
- Pick a parametric pdf that looks sensible
- For example, the gamma distribution with shape parameter \(a\) and scale \(s\) is \[
f(x) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}
\]
- Exponential is the special case \(a=1\)
- \(a > 1\) is a hump with an exponential right tail

Fitting a parametric model (cont’d)
- Now maximize the likelihood \(=\) probability density at the data
- For many standard distributions, the function fitdistr in the package MASS has pre-programmed maximum likelihood estimates
- As it happens, fitdistr knows about gamma distributions too (an example follows the optim fit below)
- It’s also easy to maximize the log-likelihood yourself
# Negative mean log-likelihood of a gamma distribution at the given shape
# and scale; optim() minimizes this, which maximizes the likelihood
gamma.loglike <- function(params, data = cats$Bwt) {
    -mean(log(dgamma(x = data, shape = params[1], scale = params[2])))
}
cats.gamma <- optim(fn = gamma.loglike, par = c(1, 1))
cats.gamma$value
## [1] 0.6677077
cats.gamma$par
## [1] 32.65187332 0.08341489
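For comparison, here is (I believe) how the fitdistr route mentioned above looks; note that fitdistr parameterizes the gamma by shape and rate, so its scale is the reciprocal of the reported rate:
# Same fit via MASS::fitdistr; reports shape and rate (scale = 1/rate)
fitdistr(cats$Bwt, densfun = "gamma")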
Compare the fitted density to the histogram
plot(hist(cats$Bwt, plot = FALSE, breaks = 20), freq = FALSE, xlab = "Body weight (kg)",
xlim = c(0, 2 * max(cats$Bwt)))
curve(dgamma(x, shape = cats.gamma$par[1], scale = cats.gamma$par[2]), add = TRUE)

This is not a very good fit on the left, maybe tolerable on the right…
Have a cat picture

The histogram is already a density estimate
- Estimated density is constant on bins
- \(N_i =\) number of samples in bin \(i\)
- \(i(x)=\) which bin \(x\) falls into
- \(h =\) width of the bins
\[
\hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n}
\]
- This is a mixture of more basic distributions:
- Uniform distributions on each bin
- Mixing weight of bin \(i\) is \(N_i/n\)
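In fact, hist() already reports exactly this \(\hat{f}_{hist}\); a quick check (the variable names here are just for illustration):
# hist()'s density component should equal N_i / (n*h), bin by bin
cats.hist <- hist(cats$Bwt, breaks = 20, plot = FALSE)
h <- diff(cats.hist$breaks)[1]     # common bin width
n <- length(cats$Bwt)
all.equal(cats.hist$density, cats.hist$counts / (n * h))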
Why this works
Probability of being in an interval of length \(h\) around \(x\) will be \[
\int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h
\] for very small \(h\). So \[
N_i \sim \mathrm{Binom}(n, f(x) h)
\] and \[
\hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h)
\] which \(\rightarrow f(x)\) as \(n \rightarrow \infty\)
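Spelling out the binomial mean and variance (the approximations become exact as \(h \rightarrow 0\)): \[
\Expect{\hat{f}_{hist}(x)} \approx \frac{n f(x) h}{nh} = f(x),
\qquad
\mathrm{Var}\left[\hat{f}_{hist}(x)\right] \approx \frac{n f(x) h (1 - f(x)h)}{(nh)^2} = \frac{f(x)(1-f(x)h)}{nh}
\] so the variance shrinks like \(1/nh\), which is the first error term on the next slide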
How bad is a histogram?
- Error from variance: \(\propto 1/nh\)
- Error from bias: \(\propto h^2 (f^{\prime}(x))^2\)
- Taylor-expand inside the integral on the previous slide to first order; the bias will cancel out if we’re exactly at the mid-point of the interval, otherwise it’s as stated
- Total error for estimating the density, \(O(h^2) + O(1/nh)\)
- Best \(h = O(n^{-1/3})\)
- Best total error, \(O(n^{-2/3})\)
- vs. \(O(n^{-1})\) if we’ve got a correct parametric model
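Where the best bandwidth comes from: writing the total error as \(c_1 h^2 + c_2/(nh)\) for constants \(c_1, c_2\) that depend on \(f\), and minimizing over \(h\), \[
\frac{d}{dh}\left(c_1 h^2 + \frac{c_2}{nh}\right) = 2 c_1 h - \frac{c_2}{n h^2} = 0
\quad \Rightarrow \quad
h = \left(\frac{c_2}{2 c_1 n}\right)^{1/3} = O(n^{-1/3})
\] and substituting back in gives a total error of order \(n^{-2/3}\)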
Have a cat picture

Kernels
Idea: “Smooth out” each observation into a little distribution
Kernel functions are pdfs
Put a copy of the kernel at each observation; add them up
\[
\hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)}
\]
- This is a mixture:
- Basic distributions are the scaled copies of the kernel, centered at the \(x_i\)s
- Weight of each mixture component \(=1/n\)
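To make the formula concrete, here is a KDE written out by hand with a Gaussian kernel (the function name and the bandwidth h = 0.05 are just for illustration):
# (1/n) * sum over i of (1/h) K((x - x_i)/h), with K the standard Gaussian pdf
kde.by.hand <- function(x, data = cats$Bwt, h = 0.05) {
    sapply(x, function(a) mean(dnorm((a - data)/h)/h))
}
curve(kde.by.hand(x), from = 1.5, to = 4.5, xlab = "Body weight (kg)", ylab = "Estimated density")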
How this works for cats
Say \(h=0.01\)

How this works for cats
Say \(h=0.02\)

How this works for cats
Say \(h=0.05\)
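Plots like these can be approximated with base R's density(), where (for the default Gaussian kernel) bw plays the role of \(h\); the exact kernel behind the slide figures isn't specified, so take this as a sketch:
# Overlay kernel density estimates at the three bandwidths
plot(density(cats$Bwt, bw = 0.05), main = "", xlab = "Body weight (kg)")
lines(density(cats$Bwt, bw = 0.02), lty = "dashed")
lines(density(cats$Bwt, bw = 0.01), lty = "dotted")
legend("topright", legend = c("h = 0.05", "h = 0.02", "h = 0.01"),
       lty = c("solid", "dashed", "dotted"))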

How do we pick the bandwidth?
- Cross-validation
- Cross-validate what?
- Integrated squared error: \(\int{(f(x) - \hat{f}(x))^2 dx}\)
- Or (negative, normalized) log-likelihood \(-\int{\log{(\hat{f}(x))} f(x) dx}\)
Evaluating the log-likelihood
- The integral is an expected value: \[
-\int{\log{(\hat{f}(x))} f(x) dx} = \Expect{-\log{\hat{f}(X)}}
\]
- With \(n\) IID samples \(x_i\), \[
\Expect{-\log{\hat{f}(X)}} \approx -\frac{1}{n}\sum_{i=1}^{n}{\log{\hat{f}(x_i)}}
\] by LLN
- A similar but more complicated trick for ISE; see backup slides
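A hedged sketch of what this looks like for a hand-rolled Gaussian-kernel KDE: score each candidate bandwidth by the leave-one-out log-likelihood and keep the best one (the function and the candidate grid are illustrative, not what np does internally):
# Average leave-one-out log density of the data under a Gaussian-kernel KDE
loo.loglike <- function(h, data = cats$Bwt) {
    mean(sapply(seq_along(data), function(i) {
        log(mean(dnorm((data[i] - data[-i])/h)/h))
    }))
}
hs <- seq(0.02, 0.3, by = 0.01)          # candidate bandwidths
hs[which.max(sapply(hs, loo.loglike))]   # bandwidth with the best held-out fit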
Computationally: npudens
library(np)
bwt.np <- npudens(cats$Bwt)
plot(bwt.np, xlab = "Body weight (kg)")

Selected bandwidth: about 0.11 kg
Joint PDFs
- What about multiple variables? We want \(f(x_1, x_2, \ldots x_p) = f(\vec{x})\)
plot(Hwt ~ Bwt, data = cats, xlab = "Body weight (kg)", ylab = "Heart weight (g)")

- Parametric models work just the same way as before
- Kernel density estimates work almost the same way
Joint KDEs
- Use a product of kernels, one for each variable \[
\hat{f}_{kern}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}
\]
- Need as many bandwidths as there are variables
- Pick bandwidths for best prediction of both variables
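Written out at a single illustrative point (Gaussian kernels; the point and the bandwidths \(h_1, h_2\) are arbitrary):
# Product-kernel estimate of the joint density at (3 kg, 11 g)
h1 <- 0.1; h2 <- 1
mean(dnorm((3 - cats$Bwt)/h1)/h1 * dnorm((11 - cats$Hwt)/h2)/h2)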
Computationally: npudens again
cats.np <- npudens(~Bwt + Hwt, data = cats)
plot(cats.np, view = "fixed", theta = 10, phi = 45, xlab = "Body weight (kg)", ylab = "Heart weight (g)")

EXERCISE: Where did those ridges come from?


- Q1: What happened to the bandwidth on bodyweight (Bwt) going from 1D to 2D? Did it increase, decrease, stay the same? How can you tell?
- Q2: Why did that bandwidth do that?
SOLUTIONS
- Q1: What happened to the bandwidth on bodyweight (Bwt)?
- It got a lot smaller; we can tell because the distribution is plainly very spiky along the Bwt axis
bwt.np$bws$bw
## [1] 0.1098765
cats.np$bws$bw
## [1] 0.008333236 1.485781401
- Q2: Why did that bandwidth change?
- We almost never see the same bandwidths for the same variable in different distributions: it changes if we’re doing marginal vs. joint, it changes in different joint distributions with different other variables, etc.
- We’re optimizing for different objective functions!
- Here: There’s a strong relationship between Bwt and Hwt; if we want to get both right, we can’t average over as wide a range of Bwt
Have a cat picture

Conditional PDFs
- Conditional pdf: \[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f(x_1)}
\] so find the joint and marginal pdfs and divide
- Conditional KDE: \[
\hat{f}_{kern}(x_2|x_1) = \frac{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}}{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)}}
\]
- Pick both bandwidths to give the best prediction of \(x_2\)
- i.e., look at the conditional log-likelihood of \(x_2\) given \(x_1\)
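As a sketch, the same ratio can be written out by hand with Gaussian kernels (the bandwidths here are placeholders, not the ones npcdens would pick):
# Conditional KDE = joint KDE / marginal KDE, with Gaussian kernels
cond.kde <- function(x2, x1, h1 = 0.1, h2 = 1, data = cats) {
    joint <- mean(dnorm((x1 - data$Bwt)/h1)/h1 * dnorm((x2 - data$Hwt)/h2)/h2)
    marginal <- mean(dnorm((x1 - data$Bwt)/h1)/h1)
    joint/marginal
}
cond.kde(x2 = 11, x1 = 3)   # estimated density of an 11 g heart, given a 3 kg cat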
Computationally: npcdens
cats.hwt.on.bwt <- npcdens(Hwt ~ Bwt, data = cats)
plot(cats.hwt.on.bwt, view = "fixed", theta = 10, phi = 65)

How well does kernel density estimation work?
- The same sort of analysis we did for kernel regression will work here too
- More book-keeping — see the textbook
- Integrated squared error of KDE is \(O(n^{-4/5})\) in 1D
- Worse than a well-specified parametric model, \(O(n^{-1})\)
- Better than a mis-specified parametric model, \(O(1)\)
- Better than a histogram, \(O(n^{-2/3})\)
- In \(p\) dimensions, ISE of joint KDE is \(O(n^{-4/(4+p)})\)
- Nothing which is universally consistent does any better
- Curse of dimensionality again
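Where the \(p\)-dimensional rate comes from: with second-order product kernels and all bandwidths of the same order \(h\), the squared bias is \(O(h^4)\) and the variance is \(O(1/nh^p)\), so \[
h_{opt} = O\left(n^{-1/(4+p)}\right)
\quad \Rightarrow \quad
O(h_{opt}^4) + O\left(\frac{1}{n h_{opt}^p}\right) = O\left(n^{-4/(4+p)}\right)
\] with \(p = 1\) recovering the \(O(n^{-4/5})\) rate above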
Summing up
- It’s easy to learn the CDF
- It’s straightforward to estimate parametric distributions
- maximize the log-likelihood
- Histograms are a first step to a non-parametric density estimate
- Kind of ugly and statistically inefficient, though
- Kernel density estimation lets us learn marginal pdfs, joint pdfs, conditional pdfs…
- The curse of dimensionality is always with us though
READING: Chapter 14
Backup: More about the ECDF
- Glivenko-Cantelli theorem: If \(X\)’s are IID with CDF \(F\), then \[
\max_{a}{|\hat{F}_n(a) - F(a)|} \rightarrow 0
\]
- This is a stronger result than just saying “for each \(a\), \(|\hat{F}_n(a)-F(a)| \rightarrow 0\)”
- Kolmogorov-Smirnov (KS) test: To test whether \(X\)’s are IID with CDF \(F_0\), use \(d_{KS} = \max_{a}{|\hat{F}_n(a)-F_0(a)|}\)
- If \(X \sim F_0\), then \(d_{KS} \rightarrow 0\) as \(n\rightarrow \infty\)
- If \(X \not\sim F_0\), then \(d_{KS} \rightarrow\) some positive number
- The sampling distribution of \(d_{KS}\) under the null turns out to be the same for all \(F_0\) (at least for big \(n\))
- Needs modification if you estimate the \(F_0\) from data as well
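For instance, R's built-in ks.test() can compare the data to the gamma fit from earlier; per the last bullet, the reported p-value isn't quite right because the parameters were estimated from the same data (and ties from the 0.1 kg rounding will also draw a warning):
# KS test against the fitted gamma; treat the p-value as approximate here
ks.test(cats$Bwt, "pgamma", shape = cats.gamma$par[1], scale = cats.gamma$par[2])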
Backup: Parametric density estimates
- We have a parametric model for the density \(f(x;\theta)\) with a finite-dimensional vector of parameters \(\theta\)
- We maximize the log-likelihood to get \(\hat{\theta}\)
- Equivalently we minimize the negative normalized log-likelihood \[
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}
\]
- This is a sample mean so LLN says \(L(\theta) \rightarrow \Expect{L(\theta)} \equiv \lambda(\theta)\)
- You can show that if the data came from the model with some true parameter \(\theta_0\), then \(\lambda(\theta_0)\) is the unique global minimum
Backup: More about parametric density estimates
- Quite generally, the MLE \(\hat{\theta} \rightarrow \theta_0\)
- Almost as generally, for large \(n\), \(\hat{\theta} \sim \mathcal{N}(\theta_0, \mathbf{J}^{-1}(\theta_0)/n)\)
- Hence \(\hat{\theta} = \theta_0 + O(1/\sqrt{n})\)
- Here \(\mathbf{J}(\theta) =\) matrix of 2nd partial derivatives of \(\lambda\) = Fisher information
- Can plug in \(\mathbf{J}(\hat{\theta})\) to get approximate confidence intervals, etc.
- Doesn’t matter if the data is Gaussian — \(L(\theta)\) is a sample average so by the CLT it’s (approximately) Gaussian, and that propagates
- Things are more complicated if the model is mis-specified
- More details in Appendix D of the textbook
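A quick sketch of the plug-in recipe with the gamma fit from earlier: ask optim() for the Hessian of \(L\) at the optimum, which estimates \(\mathbf{J}\), and invert it for approximate standard errors (the object names here are illustrative):
# Re-fit, asking for the numerically differentiated Hessian at the optimum
cats.gamma.h <- optim(fn = gamma.loglike, par = c(1, 1), hessian = TRUE)
n <- length(cats$Bwt)
# Var(theta-hat) is approximately solve(J)/n, so the approximate SEs are
sqrt(diag(solve(cats.gamma.h$hessian)/n))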
Backup: Evaluating ISE
\[\begin{eqnarray}
\int{(f(x) - \hat{f}(x))^2 dx} & = & \int{f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x) dx}\\
& = & \int{f^2(x) dx} - 2\int{\hat{f}(x) f(x) dx} + \int{\hat{f}^2(x) dx}\\
& = & (A) + (B) + (C)\\
& = & (\text{same for all estimates})\\
& & - (\text{depends on truth and estimate})\\
& & + \text{(depends only on estimate)}
\end{eqnarray}\]
- If we’re comparing different estimates, term (A) doesn’t matter, so we can ignore it
- Term (C) depends only on our estimate, and we can get it by (say) numerical integration
- That leaves term (B), and with \(n\) independent samples \(x_i\), \[
\int{\hat{f}(x) f(x) dx} \approx \frac{1}{n}\sum_{i=1}^{n}{\hat{f}(x_i)}
\] by LLN
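A sketch of using this in practice with a hand-rolled Gaussian-kernel KDE: fit on one part of the data, get term (C) by numerical integration, estimate the integral in term (B) from the held-out part, and compare estimates by (C) minus twice (B); the split, bandwidth, and integration limits below are all arbitrary:
set.seed(1)                                   # illustrative random split
idx <- sample(length(cats$Bwt), 100)
train <- cats$Bwt[idx]; test <- cats$Bwt[-idx]
fhat <- function(x, h = 0.1) sapply(x, function(a) mean(dnorm((a - train)/h)/h))
C <- integrate(function(x) fhat(x)^2, lower = 0, upper = 10)$value  # term (C)
B <- mean(fhat(test))                         # estimates the integral in term (B)
C - 2*B                                       # ISE up to the constant term (A)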