Density estimation
21 March 2019
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)}
\]
Data: body weights of cats
library(MASS)
data(cats)
head(cats$Bwt)
## [1] 2.0 2.0 2.0 2.1 2.1 2.1
summary(cats$Bwt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.300 2.700 2.724 3.025 3.900
The empirical cumulative distribution function
- ECDF: \[
\hat{F}_n(a) = \frac{1}{n}\sum_{i=1}^{n}{\Indicator{x_i \leq a}}
\]
- Makes a jump at each distinct value seen in the data
- Constant in between the jumps
Computationally: ecdf
plot(ecdf(cats$Bwt), xlab = "Body weight (kg)", ylab = "Cumulative probability")
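To see the definition in action, here is a hand-rolled version of \(\hat{F}_n\) checked against ecdf (a minimal sketch; the evaluation point 2.5 kg is arbitrary):
# Hand-rolled ECDF at a = 2.5: the average of the indicators
mean(cats$Bwt <= 2.5)
# ecdf() returns a function; evaluating it at 2.5 should give the same number
ecdf(cats$Bwt)(2.5)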
The ECDF estimates the CDF
- Suppose \(x_1, x_2, \ldots x_n\) are IID with CDF \(F\). Then for any \(a\), \[
\hat{F}_n(a) \rightarrow F(a)
\] by law of large numbers
The CDF is not enough
- What’s the probability of a cat weighing between \(3.21\) and \(3.22\) kg?
- Q from discussion: What about running a smoother (e.g., a spline) through the ECDF?
- A: You could do that, but you’d need to make sure the curve was always monotone, between 0 and 1, etc.
- Once we estimate the density, we can recover the CDF by integrating
The CDF is not enough (cont'd)
- More generally, pdf is derivative of CDF, \(f(x) = dF/dx\)
- Derivative of an empirical CDF is either 0 or \(\infty\)
- How do we estimate a pdf?
What about a histogram?
hist(cats$Bwt, xlab = "Body weight (kg)", ylab = "Number of cats", breaks = 20)
- Gives us an idea of where we have more or fewer data points…
Fitting a parametric model
- Pick a parametric pdf that looks sensible
- For example, the gamma distribution with shape parameter \(a\) and scale \(s\) is \[
f(x) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}
\]
- Exponential is the special case \(a=1\)
- \(a > 1\) is a hump with an exponential right tail
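To see the two regimes (a quick sketch; the shape and scale values here are arbitrary):
# a = 1 is the exponential; a = 5 gives a hump with an exponential right tail
curve(dgamma(x, shape = 1, scale = 1), from = 0, to = 15, ylab = "f(x)")
curve(dgamma(x, shape = 5, scale = 1), add = TRUE, lty = "dashed")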
Fitting a parametric model (cont’d)
- Now maximize the likelihood
- For some (but not all) standard distributions, the function fitdistr in the package MASS has pre-programmed maximum likelihood estimates
- You saw this in homework 3 for the \(t\) distribution
- As it happens, fitdistr knows about gamma distributions too (a quick cross-check appears after the code below)
- It’s also easy to maximize the log-likelihood yourself
# Define the (negative, normalized) log-likelihood function
gamma.loglike <- function(params, data = cats$Bwt) {
-mean(log(dgamma(x = data, shape = params[1], scale = params[2])))
}
# Ask R to optimize it, with a starting guess about the parameters
cats.gamma <- optim(fn = gamma.loglike, par = c(1, 1))
# What is the best (negative, normalized) log-likelihood?
cats.gamma$value
## [1] 0.6677077
# What parameters did optim find for us?
cats.gamma$par
## [1] 32.65187332 0.08341489
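The promised cross-check with fitdistr (a sketch, assuming MASS is still loaded; note that fitdistr parameterizes the gamma by shape and rate \(= 1/\text{scale}\)):
# Should report a shape near cats.gamma$par[1] and a rate near 1/cats.gamma$par[2]
fitdistr(cats$Bwt, densfun = "gamma")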
Compare the fitted density to the histogram
plot(hist(cats$Bwt, plot = FALSE, breaks = 20), freq = FALSE, xlab = "Body weight (kg)")
curve(dgamma(x, shape = cats.gamma$par[1], scale = cats.gamma$par[2]), add = TRUE)
The histogram is already a density estimate
- Estimated density is constant on bins
- \(N_i =\) number of samples in bin \(i\)
- \(i(x)=\) which bin \(x\) falls into
- \(h =\) width of the bins
\[
\hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n}
\]
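To connect the formula to R (a sketch; the bins below are equal-width with \(h = 0.1\) kg):
# Hand-rolled histogram density estimate
h <- 0.1
bins <- seq(2, 4, by = h)
N <- table(cut(cats$Bwt, breaks = bins, include.lowest = TRUE))  # N_i for each bin
fhat.hist <- as.numeric(N) / (nrow(cats) * h)  # N_i / (n h)
# hist() computes the same estimate; it's stored in the $density component
hist(cats$Bwt, breaks = bins, plot = FALSE)$density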
Why this works
Probability of falling into a region of length \(h\) around \(x\) will be \[
\int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h
\] for very small \(h\). So \[
N_i \sim \mathrm{Binom}(n, f(x) h)
\] and \[
\hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h)
\] which \(\rightarrow f(x)\) as \(n \rightarrow \infty\), provided \(h \rightarrow 0\) slowly enough that \(nh \rightarrow \infty\)
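A tiny simulation of this argument (a sketch; the known \(f\) here is the standard Gaussian, evaluated at \(x = 0\)):
# Count draws landing within h/2 of x = 0; N.bin ~ Binom(n, f(0) h)
n <- 1e4; h <- 0.05
x.sim <- rnorm(n)
N.bin <- sum(abs(x.sim) <= h / 2)
N.bin / (n * h)  # should be close to the true density at 0
dnorm(0)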
How bad is a histogram?
Error from variance: \(\propto 1/nh\)
Error from bias: \(\propto h^2 (f^{\prime}(x))^2\)
Total error for estimating the density, \(O(h^2) + O(1/nh)\)
Best \(h = O(n^{-1/3})\)
Best total error, \(O(n^{-2/3})\)
vs. \(O(n^{-1})\) if we’ve got a correct parametric model
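Where the \(n^{-1/3}\) comes from (writing the two error terms as \(C_1 h^2\) and \(C_2/nh\) for some constants \(C_1, C_2\)): minimize over \(h\), \[
\frac{d}{dh}\left(C_1 h^2 + \frac{C_2}{nh}\right) = 2 C_1 h - \frac{C_2}{n h^2} = 0 \quad\Rightarrow\quad h = \left(\frac{C_2}{2 C_1 n}\right)^{1/3} = O(n^{-1/3})
\] and substituting this \(h\) back in makes both terms \(O(n^{-2/3})\)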
Q from discussion: What about variable-width bins?
A: That can help, but it’s even better to get away from fixed bins altogether
Kernels
Idea: “Smooth out” each observation into a little distribution
Kernel functions are pdfs
Put a copy of the kernel at each observation; add them up
\[
\hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)}
\]
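The formula translates directly into code (a sketch, taking \(K\) to be the Gaussian pdf, i.e. dnorm):
# Kernel density estimate at a single point x
kde <- function(x, data, h) {
  mean(dnorm((x - data) / h)) / h
}
kde(2.7, data = cats$Bwt, h = 0.1)
# R's built-in density() computes (approximately) the same thing on a grid
density(cats$Bwt, bw = 0.1, kernel = "gaussian")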
How this works for cats
Say \(h=0.01\)
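One way to draw the resulting estimate (a sketch, assuming a Gaussian kernel):
# With so small a bandwidth, each observation shows up as a narrow bump
plot(density(cats$Bwt, bw = 0.01), xlab = "Body weight (kg)", main = "")
rug(cats$Bwt)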
How do we pick the bandwidth?
- Cross-validation
- Cross-validate what?
- Integrated squared error: \(\int{(f(x) - \hat{f}(x))^2 dx}\)
- Or (negative, normalized) log-likelihood \(\int{-\log{(\hat{f}(x))} f(x) dx}\)
Evaluating the log-likelihood
- The integral is an expected value: \[
\int{-\log{(\hat{f}(x))} f(x) dx} = \Expect{-\log{\hat{f}(X)}}
\]
- With \(n\) IID samples \(x_i\), \[
\Expect{-\log{\hat{f}(X)}} \approx \frac{1}{n}\sum_{i=1}^{n}{-\log{\hat{f}(x_i)}}
\] by LLN
- A similar but more complicated trick for ISE; see backup slides
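Putting the two pieces together by hand (a minimal leave-one-out sketch with a Gaussian kernel; the bandwidth grid is arbitrary):
# Leave-one-out cross-validated negative log-likelihood as a function of h
loo.nll <- function(h, data = cats$Bwt) {
  n <- length(data)
  mean(sapply(1:n, function(i) {
    -log(mean(dnorm((data[i] - data[-i]) / h)) / h)
  }))
}
h.grid <- seq(0.05, 0.5, by = 0.05)
nll <- sapply(h.grid, loo.nll)
h.grid[which.min(nll)]  # bandwidth with the best held-out log-likelihood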
Computationally: npudens
library(np)
bwt.np <- npudens(cats$Bwt)
plot(bwt.np, xlab = "Body weight (kg)")
Joint PDFs
- What about multiple variables? We want \(f(x_1, x_2, \ldots x_p) = f(\vec{x})\)
plot(Hwt ~ Bwt, data = cats, xlab = "Body weight (kg)", ylab = "Heart weight (g)")
- Parametric models work just the same way as before
- Kernel density estimates work almost the same way
Joint KDEs
- Use a product of kernels, one for each variable \[
\hat{f}_{kern}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}
\]
- Need as many bandwidths as there are variables
- Pick bandwidths for best prediction of both variables
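In code, the product-kernel formula above becomes (a sketch with Gaussian kernels; the evaluation point and bandwidths are arbitrary, not cross-validated):
# Joint density estimate for (Bwt, Hwt) at a single point (x1, x2)
kde2 <- function(x1, x2, d1 = cats$Bwt, d2 = cats$Hwt, h1, h2) {
  mean(dnorm((x1 - d1) / h1) * dnorm((x2 - d2) / h2)) / (h1 * h2)
}
kde2(2.7, 10, h1 = 0.15, h2 = 0.6)  # density at 2.7 kg body weight, 10 g heart weight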
Computationally: npudens again
cats.np <- npudens(~Bwt + Hwt, data = cats)
plot(cats.np, view = "fixed", theta = 10, phi = 45)
EXERCISE: Where do the ridges come from?
- Q1: What happened to the bandwidth on bodyweight (Bwt)?
- Q2: Why did that bandwidth change?
- There's a strong relationship between Bwt and Hwt; if we want to get both right, we can't average over as wide a range of Bwt
Conditional PDFs
- Conditional pdf: \[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f(x_1)}
\] so find the joint and marginal pdfs and divide
- Conditional KDE (see the code sketch after this list): \[
\hat{f}_{kern}(x_2|x_1) = \frac{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}}{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)}}
\]
- Pick bandwidths for best prediction of \(x_2\)
- i.e., look at the conditional log-likelihood of \(x_2\) given \(x_1\)
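Dividing the hand-rolled joint and marginal estimates gives the same thing (a sketch reusing kde2 and kde from above, with the same arbitrary bandwidths):
# Estimated conditional density of Hwt given Bwt = 2.7 kg, evaluated at Hwt = 10 g
kde2(2.7, 10, h1 = 0.15, h2 = 0.6) / kde(2.7, data = cats$Bwt, h = 0.15)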
Computationally: npcdens
cats.hwt.on.bwt <- npcdens(Hwt ~ Bwt, data = cats)
plot(cats.hwt.on.bwt, view = "fixed", theta = 10, phi = 65)
How well does kernel density estimation work?
- The same sort of analysis we did for kernel regression will work here too
- More book-keeping — see the text
- Integrated squared error of KDE is \(O(n^{-4/5})\) in 1D
- Worse than a well-specified parametric model, \(O(n^{-1})\)
- Better than a mis-specified parametric model, \(O(1)\)
- Better than a histogram, \(O(n^{-2/3})\)
- In \(p\) dimensions, ISE of joint KDE is \(O(n^{-4/(4+p)})\)
- Nothing which is universally consistent does any better
- Curse of dimensionality again
Summing up
- It’s easy to learn the CDF
- It’s easy to estimate parametric distributions
- maximize the log-likelihood
- Histograms are a first step to a non-parametric density estimate
- Kind of ugly and statistically inefficient, though
- Kernel density estimation lets us learn marginal pdfs, joint pdfs, conditional pdfs…
- The curse of dimensionality is always with us though
READING: Chapter 14
Backup: More about the ECDF
- Glivenko-Cantelli theorem: If \(X\)’s are IID with CDF \(F\), then \[
\max_{a}{|\hat{F}_n(a) - F(a)|} \rightarrow 0
\]
- This is a stronger result than just saying “for each \(a\), \(|\hat{F}_n(a)-F(a)| \rightarrow 0\)”
- Kolmogorov-Smirnov (KS) test: To test whether \(X\)’s are IID with CDF \(F_0\), use \(d_{KS} = \max_{a}{|\hat{F}_n(a)-F_0(a)|}\)
- If \(X \sim F_0\), then \(d_{KS} \rightarrow 0\) as \(n\rightarrow \infty\)
- If \(X \not\sim F_0\), then \(d_{KS} \rightarrow\) some positive number
- The sampling distribution of \(d_{KS}\) under the null turns out to be the same for all \(F_0\) (at least for big \(n\))
- Needs modification if you estimate the \(F_0\) from data as well
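In R, ks.test implements this (a sketch, comparing the body weights to the gamma fitted earlier, so the caveat in the last bullet applies):
# The p-value is optimistic because the gamma parameters were estimated from
# the same data; ks.test will also warn about ties in the rounded weights
ks.test(cats$Bwt, "pgamma", shape = cats.gamma$par[1], scale = cats.gamma$par[2])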
Backup: Parametric density estimates
- We have a parametric model for the density \(f(x;\theta)\) with a finite-dimensional vector of parameters \(\theta\)
- We maximize the log-likelihood to get \(\hat{\theta}\)
- Equivalently we minimize the negative normalized log-likelihood \[
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}
\]
- This is a sample mean so LLN says \(L(\theta) \rightarrow \Expect{L(\theta)} \equiv \lambda(\theta)\)
- You can show that if the data came from the model with some true parameter \(\theta_0\), then \(\lambda(\theta_0)\) is the unique global minimum
Backup: More about parametric density estimates
- Quite generally, the MLE \(\hat{\theta} \rightarrow \theta_0\)
- Almost as generally, for large \(n\), \(\hat{\theta} \sim \mathcal{N}(\theta_0, \mathbf{J}^{-1}(\theta_0)/n)\)
- Hence \(\hat{\theta} = \theta_0 + O(1/\sqrt{n})\)
- Here \(\mathbf{J}(\theta) =\) matrix of 2nd partial derivatives of \(\lambda\) = Fisher information
- Can plug in \(\mathbf{J}(\hat{\theta})\) to get approximate confidence intervals, etc.
- Doesn’t matter if the data is Gaussian — \(L(\theta)\) is a sample average so by the CLT it’s (approximately) Gaussian, and that propagates
- Things are more complicated if the model is mis-specified
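Tying this back to the optim fit from the main slides (a sketch; gamma.fit is just a new name here, and the Hessian of \(L\) at the optimum estimates \(\mathbf{J}\)):
# Re-run the fit, asking optim for the Hessian as well
gamma.fit <- optim(fn = gamma.loglike, par = c(1, 1), hessian = TRUE)
# Approximate variance of the MLE is J^{-1}/n; square roots give standard errors
sqrt(diag(solve(gamma.fit$hessian) / nrow(cats)))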
Backup: Evaluating ISE
\[\begin{eqnarray}
\int{(f(x) - \hat{f}(x))^2 dx} & = & \int{f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x) dx}\\
& = & \int{f^2(x) dx} - 2\int{\hat{f}(x) f(x) dx} + \int{\hat{f}^2(x) dx}\\
& = & (A) + (B) + (C)\\
& = & (\text{same for all estimates}) - (\text{depends on truth and estimate}) + \text{(depends only on estimate)}
\end{eqnarray}\]
- If we’re comparing different estimates, term (A) doesn’t matter, so we can ignore it
- Term (C) depends only on our estimate, and we can get it by (say) numerical integration
- That leaves term (B), and with \(n\) independent samples \(x_i\), \[
\int{\hat{f}(x) f(x) dx} \approx \frac{1}{n}\sum_{i=1}^{n}{\hat{f}(x_i)}
\] by LLN
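Concretely, for a kernel density estimate (a sketch reusing the Gaussian-kernel kde from the main slides, with an arbitrary bandwidth; the integration limits just need to cover the data comfortably):
fhat <- function(x) sapply(x, kde, data = cats$Bwt, h = 0.1)
# Term (C): numerically integrate the squared estimate
C.term <- integrate(function(x) fhat(x)^2, lower = 1, upper = 5)$value
# Term (B): approximate the cross term by the sample average
B.term <- -2 * mean(fhat(cats$Bwt))
C.term + B.term  # ISE up to the constant term (A)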