Density estimation

21 March 2019

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)} \]

Data: body weights of cats

library(MASS)
data(cats)
head(cats$Bwt)
## [1] 2.0 2.0 2.0 2.1 2.1 2.1
summary(cats$Bwt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.300   2.700   2.724   3.025   3.900

The empirical cumulative distribution function

Computationally: ecdf

plot(ecdf(cats$Bwt), xlab = "Body weight (kg)", ylab = "Cumulative probability")

The ECDF estimates the CDF
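In symbols, the ECDF is \[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n}{\Indicator{x_i \leq x}} \] i.e., the fraction of observations at or below \(x\). A minimal sketch computing it by hand (the function name `Fhat` is my own) and checking it against R's built-in `ecdf`:

```r
library(MASS)
data(cats)
# ECDF by hand: the fraction of observations <= x
Fhat <- function(x, data = cats$Bwt) {
    mean(data <= x)
}
# Should agree with R's built-in ecdf at any point
Fhat(2.7)
ecdf(cats$Bwt)(2.7)
```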

The CDF is not enough


What about a histogram?

hist(cats$Bwt, xlab = "Body weight (kg)", ylab = "Number of cats", breaks = 20)

Fitting a parametric model

Fitting a parametric model (cont’d)

# Define the (negative, normalized) log-likelihood function
gamma.loglike <- function(params, data = cats$Bwt) {
    -mean(log(dgamma(x = data, shape = params[1], scale = params[2])))
}
# Ask R to optimize it, with a starting guess about the parameters
cats.gamma <- optim(fn = gamma.loglike, par = c(1, 1))
# What is the best (negative, normalized) log-likelihood?
cats.gamma$value
## [1] 0.6677077
# What parameters did optim find for us?
cats.gamma$par
## [1] 32.65187332  0.08341489
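As a cross-check (not part of the slides' pipeline), `MASS::fitdistr` maximizes the same Gamma likelihood directly; note that it parameterizes by shape and rate \(= 1/\text{scale}\):

```r
library(MASS)
data(cats)
# Maximum-likelihood Gamma fit; fitdistr uses rate = 1/scale
cats.gamma.fd <- fitdistr(cats$Bwt, densfun = "gamma")
cats.gamma.fd$estimate
# shape should be near optim's ~32.65, and 1/rate near its scale ~0.083
```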

Compare the fitted density to the histogram

plot(hist(cats$Bwt, plot = FALSE, breaks = 20), freq = FALSE, xlab = "Body weight (kg)")
curve(dgamma(x, shape = cats.gamma$par[1], scale = cats.gamma$par[2]), add = TRUE)

The histogram is already a density estimate

\[ \hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n} \]

where \(N_{i(x)}\) is the number of observations in the bin containing \(x\), and \(h\) is the bin width

Why this works

Probability of falling into a region of length \(h\) around \(x\) will be \[ \int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h \] for very small \(h\). So \[ N_i \sim \mathrm{Binom}(n, f(x) h) \] and \[ \hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h) \] which \(\rightarrow f(x)\) as \(n \rightarrow \infty\)
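The binomial argument can be checked numerically: simulating from a known density (here the Gamma fitted above, so the parameters are the fitted ones, not assumptions), the bin count over \(nh\) should come out close to \(f(x)\):

```r
# Check that N_i / (n h) approaches f(x), simulating from a known density
set.seed(1)
n <- 1e5; h <- 0.05
x.sim <- rgamma(n, shape = 32.65, scale = 0.0834)
# Count observations in the bin of width h around x0
x0 <- 2.7
N.i <- sum(x.sim >= x0 - h/2 & x.sim < x0 + h/2)
N.i / (n * h)                               # histogram estimate at x0
dgamma(x0, shape = 32.65, scale = 0.0834)   # true density at x0
```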

How bad is a histogram?

Error from variance: \(\propto 1/nh\)

Error from bias: \(\propto h^2 (f^{\prime}(x))^2\)

Total error for estimating the density, \(O(h^2) + O(1/nh)\)

Best \(h = O(n^{-1/3})\)

Best total error, \(O(n^{-2/3})\)

vs. \(O(n^{-1})\) if we’ve got a correct parametric model
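Where the \(n^{-1/3}\) rate comes from: with total error \(C_1 h^2 + C_2/nh\) for constants \(C_1, C_2\), minimize over \(h\):

\[ \frac{d}{dh}\left( C_1 h^2 + \frac{C_2}{nh} \right) = 2 C_1 h - \frac{C_2}{n h^2} = 0 \Rightarrow h = \left( \frac{C_2}{2 C_1 n} \right)^{1/3} = O(n^{-1/3}) \]

Plugging this \(h\) back in, both terms are \(O(n^{-2/3})\), giving the best total error above.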

Q from discussion: What about variable-width bins?

A: That can help, but it’s even better to get away from fixed bins altogether

Kernels

Idea: “Smooth out” each observation into a little distribution

Kernel functions are pdfs

Put a copy of the kernel at each observation; add them up

\[ \hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)} \]

How this works for cats

Say \(h=0.01\)
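A sketch of the kernel formula applied by hand to the cats data, with a Gaussian kernel and \(h = 0.01\) (the function name `fhat.kern` is my own; `dnorm` plays the role of \(K\)):

```r
library(MASS)
data(cats)
# Kernel density estimate by hand: average of (1/h) K((x - x_i)/h)
fhat.kern <- function(x, data = cats$Bwt, h = 0.01) {
    sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}
curve(fhat.kern(x), from = 1.9, to = 4.0,
      xlab = "Body weight (kg)", ylab = "Estimated density")
```

With a bandwidth this small, the estimate is a forest of spikes at the recorded weights; the sum of kernels still integrates to 1.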

How do we pick the bandwidth?

Evaluating the log-likelihood
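One concrete version of this idea (a sketch of likelihood cross-validation, not necessarily `npudens`'s exact procedure): pick the \(h\) that maximizes the leave-one-out log-likelihood, where each \(x_i\) is scored under the density estimated from the other \(n-1\) points:

```r
library(MASS)
data(cats)
# Leave-one-out log-likelihood of bandwidth h (Gaussian kernel)
loo.loglike <- function(h, data = cats$Bwt) {
    n <- length(data)
    sum(sapply(1:n, function(i) {
        log(mean(dnorm((data[i] - data[-i]) / h)) / h)
    }))
}
# Try a grid of bandwidths and take the best
h.grid <- seq(0.05, 0.5, by = 0.05)
h.grid[which.max(sapply(h.grid, loo.loglike))]
```

Leaving out \(x_i\) matters: without it, the in-sample likelihood is maximized by \(h \rightarrow 0\), a spike on every observation.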

Computationally: npudens

library(np)
bwt.np <- npudens(cats$Bwt)
plot(bwt.np, xlab = "Body weight (kg)")

Joint PDFs

plot(Hwt ~ Bwt, data = cats, xlab = "Body weight (kg)", ylab = "Heart weight (g)")

Joint KDEs

Computationally: npudens again

cats.np <- npudens(~Bwt + Hwt, data = cats)
plot(cats.np, view = "fixed", theta = 10, phi = 45)

EXERCISE: Where do the ridges come from?

Conditional PDFs

Computationally: npcdens

cats.hwt.on.bwt <- npcdens(Hwt ~ Bwt, data = cats)
plot(cats.hwt.on.bwt, view = "fixed", theta = 10, phi = 65)

How well does kernel density estimation work?

Summing up

READING: Chapter 14

Backup: More about the ECDF

Backup: Parametric density estimates

Backup: More about parametric density estimates

Backup: Evaluating ISE

\[\begin{eqnarray} \int{(f(x) - \hat{f}(x))^2 dx} & = & \int{f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x) dx}\\ & = & \int{f^2(x) dx} - 2\int{\hat{f}(x) f(x) dx} + \int{\hat{f}^2(x) dx}\\ & = & (A) - (B) + (C)\\ & = & (\text{same for all estimates}) - (\text{depends on truth and estimate}) + (\text{depends only on estimate}) \end{eqnarray}\]
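Since \((A)\) is the same for every estimate and \((C)\) depends only on \(\hat{f}\) (so it can be computed directly), only \((B)\) needs estimating. The standard move is to notice it is an expectation and replace it with a leave-one-out average over the data:

\[ \int{\hat{f}(x) f(x) dx} = \Expect{\hat{f}(X)} \approx \frac{1}{n} \sum_{i=1}^{n}{\hat{f}^{(-i)}(x_i)} \]

where \(\hat{f}^{(-i)}\) is the estimate computed without observation \(i\); minimizing \((C) - (B)\) over the bandwidth is least-squares cross-validation.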