Density estimation
21 March 2019
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)}
\]
Data: body weights of cats
library(MASS)
data(cats)
head(cats$Bwt)
## [1] 2.0 2.0 2.0 2.1 2.1 2.1
summary(cats$Bwt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.300 2.700 2.724 3.025 3.900
The empirical cumulative distribution function
- ECDF: \[
\hat{F}_n(a) = \frac{1}{n}\sum_{i=1}^{n}{\Indicator{x_i \leq a}}
\]
- Makes a jump at each distinct value seen in the data
- Constant in between the jumps
Computationally: ecdf
plot(ecdf(cats$Bwt), xlab = "Body weight (kg)", ylab = "Cumulative probability")
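To see the definition in action, here is a hand-rolled version of \(\hat{F}_n\) checked against ecdf (a minimal sketch; the evaluation point 2.5 kg is arbitrary):
# Hand-rolled ECDF at a = 2.5: the average of the indicators
mean(cats$Bwt <= 2.5)
# ecdf() returns a function; evaluating it at 2.5 should give the same number
ecdf(cats$Bwt)(2.5)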
The ECDF estimates the CDF
- Suppose \(x_1, x_2, \ldots x_n\) are IID with CDF \(F\). Then for any \(a\), \[
\hat{F}_n(a) \rightarrow F(a)
\] by law of large numbers
The CDF is not enough
- What’s the probability of a cat weighing between \(3.21\) and \(3.22\) kg?
- Q from discussion: What about running a smoother (e.g., a spline) through the ECDF?
- A: You could do that, but you’d need to make sure the curve was always monotone, between 0 and 1, etc.
- Once we estimate the density, we can recover the CDF by integrating
The CDF is not enough (cont'd)
- More generally, pdf is derivative of CDF, \(f(x) = dF/dx\)
- Derivative of an empirical CDF is either 0 or \(\infty\)
- How do we estimate a pdf?
What about a histogram?
hist(cats$Bwt, xlab = "Body weight (kg)", ylab = "Number of cats", breaks = 20)
- Gives us an idea of where we have more or fewer data points…
Fitting a parametric model
- Pick a parametric pdf that looks sensible
- For example, the gamma distribution with shape parameter \(a\) and scale \(s\) is \[
f(x) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}
\]
- Exponential is the special case \(a=1\)
- \(a > 1\) is a hump with an exponential right tail
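To see the two regimes (a quick sketch; the shape and scale values here are arbitrary):
# a = 1 is the exponential; a = 5 gives a hump with an exponential right tail
curve(dgamma(x, shape = 1, scale = 1), from = 0, to = 15, ylab = "f(x)")
curve(dgamma(x, shape = 5, scale = 1), add = TRUE, lty = "dashed")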
Fitting a parametric model (cont’d)
- Now maximize the likelihood
- For some (but not all) standard distributions, the function fitdistr in the package MASS has pre-programmed maximum likelihood estimates
- You saw this in homework 3 for the \(t\) distribution
- As it happens, fitdistr knows about gamma distributions too (a quick cross-check appears after the code below)
- It’s also easy to maximize the log-likelihood yourself
# Define the (negative, normalized) log-likelihood function
gamma.loglike <- function(params, data = cats$Bwt) {
-mean(log(dgamma(x = data, shape = params[1], scale = params[2])))
}
# Ask R to optimize it, with a starting guess about the parameters
cats.gamma <- optim(fn = gamma.loglike, par = c(1, 1))
# What is the best (negative, normalized) log-likelihood?
cats.gamma$value
## [1] 0.6677077
# What parameters did optim find for us?
cats.gamma$par
## [1] 32.65187332 0.08341489
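The promised cross-check with fitdistr (a sketch, assuming MASS is still loaded; note that fitdistr parameterizes the gamma by shape and rate \(= 1/\text{scale}\)):
# Should report a shape near cats.gamma$par[1] and a rate near 1/cats.gamma$par[2]
fitdistr(cats$Bwt, densfun = "gamma")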
Compare the fitted density to the histogram
plot(hist(cats$Bwt, plot = FALSE, breaks = 20), freq = FALSE, xlab = "Body weight (kg)")
curve(dgamma(x, shape = cats.gamma$par[1], scale = cats.gamma$par[2]), add = TRUE)
The histogram is already a density estimate
- Estimated density is constant on bins
- \(N_i =\) number of samples in bin \(i\)
- \(i(x)=\) which bin \(x\) falls into
- \(h =\) width of the bins
\[
\hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n}
\]
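To connect the formula to R (a sketch; the bins below are equal-width with \(h = 0.1\) kg):
# Hand-rolled histogram density estimate
h <- 0.1
bins <- seq(2, 4, by = h)
N <- table(cut(cats$Bwt, breaks = bins, include.lowest = TRUE))  # N_i for each bin
fhat.hist <- as.numeric(N) / (nrow(cats) * h)  # N_i / (n h)
# hist() computes the same estimate; it's stored in the $density component
hist(cats$Bwt, breaks = bins, plot = FALSE)$density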
Why this works
Probability of falling into a region of length \(h\) around \(x\) will be \[
\int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h
\] for very small \(h\). So \[
N_i \sim \mathrm{Binom}(n, f(x) h)
\] and \[
\hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h)
\] which \(\rightarrow f(x)\) as \(n \rightarrow \infty\), provided \(h \rightarrow 0\) slowly enough that \(nh \rightarrow \infty\)
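A tiny simulation of this argument (a sketch; the known \(f\) here is the standard Gaussian, evaluated at \(x = 0\)):
# Count draws landing within h/2 of x = 0; N.bin ~ Binom(n, f(0) h)
n <- 1e4; h <- 0.05
x.sim <- rnorm(n)
N.bin <- sum(abs(x.sim) <= h / 2)
N.bin / (n * h)  # should be close to the true density at 0
dnorm(0)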
How bad is a histogram?
Error from variance: \(\propto 1/nh\)
Error from bias: \(\propto h^2 (f^{\prime}(x))^2\)
Total error for estimating the density, \(O(h^2) + O(1/nh)\)
Best \(h = O(n^{-1/3})\)
Best total error, \(O(n^{-2/3})\)
vs. \(O(n^{-1})\) if we’ve got a correct parametric model
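Where the \(n^{-1/3}\) comes from (writing the two error terms as \(C_1 h^2\) and \(C_2/nh\) for some constants \(C_1, C_2\)): minimize over \(h\), \[
\frac{d}{dh}\left(C_1 h^2 + \frac{C_2}{nh}\right) = 2 C_1 h - \frac{C_2}{n h^2} = 0 \quad\Rightarrow\quad h = \left(\frac{C_2}{2 C_1 n}\right)^{1/3} = O(n^{-1/3})
\] and substituting this \(h\) back in makes both terms \(O(n^{-2/3})\)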
Q from discussion: What about variable-width bins?
A: That can help, but it’s even better to get away from fixed bins altogether
Kernels
Idea: “Smooth out” each observation into a little distribution
Kernel functions are pdfs
Put a copy of the kernel at each observation; add them up
\[
\hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)}
\]
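The formula translates directly into code (a sketch, taking \(K\) to be the Gaussian pdf, i.e. dnorm):
# Kernel density estimate at a single point x
kde <- function(x, data, h) {
  mean(dnorm((x - data) / h)) / h
}
kde(2.7, data = cats$Bwt, h = 0.1)
# R's built-in density() computes (approximately) the same thing on a grid
density(cats$Bwt, bw = 0.1, kernel = "gaussian")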
How this works for cats
Say \(h=0.01\)
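One way to draw the resulting estimate (a sketch, assuming a Gaussian kernel):
# With so small a bandwidth, each observation shows up as a narrow bump
plot(density(cats$Bwt, bw = 0.01), xlab = "Body weight (kg)", main = "")
rug(cats$Bwt)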
How do we pick the bandwidth?
- Cross-validation
- Cross-validate what?
- Integrated squared error: \(\int{(f(x) - \hat{f}(x))^2 dx}\)
- Or (negative, normalized) log-likelihood \(\int{-\log{(\hat{f}(x))} f(x) dx}\)
Evaluating the log-likelihood
- The integral is an expected value: \[
\int{-\log{(\hat{f}(x))} f(x) dx} = \Expect{-\log{\hat{f}(X)}}
\]
- With \(n\) IID samples \(x_i\), \[
\Expect{-\log{\hat{f}(X)}} \approx \frac{1}{n}\sum_{i=1}^{n}{-\log{\hat{f}(x_i)}}
\] by LLN
- A similar but more complicated trick for ISE; see backup slides
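Putting the two pieces together by hand (a minimal leave-one-out sketch with a Gaussian kernel; the bandwidth grid is arbitrary):
# Leave-one-out cross-validated negative log-likelihood as a function of h
loo.nll <- function(h, data = cats$Bwt) {
  n <- length(data)
  mean(sapply(1:n, function(i) {
    -log(mean(dnorm((data[i] - data[-i]) / h)) / h)
  }))
}
h.grid <- seq(0.05, 0.5, by = 0.05)
nll <- sapply(h.grid, loo.nll)
h.grid[which.min(nll)]  # bandwidth with the best held-out log-likelihood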
Computationally: npudens
library(np)
bwt.np <- npudens(cats$Bwt)
plot(bwt.np, xlab = "Body weight (kg)")
Joint PDFs
- What about multiple variables? We want \(f(x_1, x_2, \ldots x_p) = f(\vec{x})\)
plot(Hwt ~ Bwt, data = cats, xlab = "Body weight (kg)", ylab = "Heart weight (g)")
- Parametric models work just the same way as before
- Kernel density estimates work almost the same way
Joint KDEs
- Use a product of kernels, one for each variable \[
\hat{f}_{kern}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}
\]
- Need as many bandwidths as there are variables
- Pick bandwidths for best prediction of both variables
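In code, the product-kernel formula above becomes (a sketch with Gaussian kernels; the evaluation point and bandwidths are arbitrary, not cross-validated):
# Joint density estimate for (Bwt, Hwt) at a single point (x1, x2)
kde2 <- function(x1, x2, d1 = cats$Bwt, d2 = cats$Hwt, h1, h2) {
  mean(dnorm((x1 - d1) / h1) * dnorm((x2 - d2) / h2)) / (h1 * h2)
}
kde2(2.7, 10, h1 = 0.15, h2 = 0.6)  # density at 2.7 kg body weight, 10 g heart weight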
Computationally: npudens again
cats.np <- npudens(~Bwt + Hwt, data = cats)
plot(cats.np, view = "fixed", theta = 10, phi = 45)
EXERCISE: Where do the ridges come from?
- Q1: What happened to the bandwidth on bodyweight (Bwt)?
- Q2: Why did that bandwidth change?
- There's a strong relationship between Bwt and Hwt; if we want to get both right, we can't average over as wide a range of Bwt
Conditional PDFs
- Conditional pdf: \[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f(x_1)}
\] so find the joint and marginal pdfs and divide
- Conditional KDE (see the code sketch after this list): \[
\hat{f}_{kern}(x_2|x_1) = \frac{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}}{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)}}
\]
- Pick bandwidths for best prediction of \(x_2\)
- i.e., look at the conditional log-likelihood of \(x_2\) given \(x_1\)
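Dividing the hand-rolled joint and marginal estimates gives the same thing (a sketch reusing kde2 and kde from above, with the same arbitrary bandwidths):
# Estimated conditional density of Hwt given Bwt = 2.7 kg, evaluated at Hwt = 10 g
kde2(2.7, 10, h1 = 0.15, h2 = 0.6) / kde(2.7, data = cats$Bwt, h = 0.15)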
Computationally: npcdens
cats.hwt.on.bwt <- npcdens(Hwt ~ Bwt, data = cats)
plot(cats.hwt.on.bwt, view = "fixed", theta = 10, phi = 65)
How well does kernel density estimation work?
- The same sort of analysis we did for kernel regression will work here too
- More book-keeping — see the text
- Integrated squared error of KDE is \(O(n^{-4/5})\) in 1D
- Worse than a well-specified parametric model, \(O(n^{-1})\)
- Better than a mis-specified parametric model, \(O(1)\)
- Better than a histogram, \(O(n^{-2/3})\)
- In \(p\) dimensions, ISE of joint KDE is \(O(n^{-4/(4+p)})\)
- Nothing which is universally consistent does any better
- Curse of dimensionality again
Summing up
- It’s easy to learn the CDF
- It’s easy to estimate parametric distributions
- maximize the log-likelihood
- Histograms are a first step to a non-parametric density estimate
- Kind of ugly and statistically inefficient, though
- Kernel density estimation lets us learn marginal pdfs, joint pdfs, conditional pdfs…
- The curse of dimensionality is always with us though
READING: Chapter 14
Backup: More about the ECDF
- Glivenko-Cantelli theorem: If \(X\)’s are IID with CDF \(F\), then \[
\max_{a}{|\hat{F}_n(a) - F(a)|} \rightarrow 0
\]
- This is a stronger result than just saying “for each \(a\), \(|\hat{F}_n(a)-F(a)| \rightarrow 0\)”
- Kolmogorov-Smirnov (KS) test: To test whether \(X\)’s are IID with CDF \(F_0\), use \(d_{KS} = \max_{a}{|\hat{F}_n(a)-F_0(a)|}\)
- If \(X \sim F_0\), then \(d_{KS} \rightarrow 0\) as \(n\rightarrow \infty\)
- If \(X \not\sim F_0\), then \(d_{KS} \rightarrow\) some positive number
- The sampling distribution of \(d_{KS}\) under the null turns out to be the same for all \(F_0\) (at least for big \(n\))
- Needs modification if you estimate the \(F_0\) from data as well
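In R, ks.test implements this (a sketch, comparing the body weights to the gamma fitted earlier, so the caveat in the last bullet applies):
# The p-value is optimistic because the gamma parameters were estimated from
# the same data; ks.test will also warn about ties in the rounded weights
ks.test(cats$Bwt, "pgamma", shape = cats.gamma$par[1], scale = cats.gamma$par[2])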
Backup: Parametric density estimates
- We have a parametric model for the density \(f(x;\theta)\) with a finite-dimensional vector of parameters \(\theta\)
- We maximize the log-likelihood to get \(\hat{\theta}\)
- Equivalently we minimize the negative normalized log-likelihood \[
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}
\]
- This is a sample mean so LLN says \(L(\theta) \rightarrow \Expect{L(\theta)} \equiv \lambda(\theta)\)
- You can show that if the data came from the model with some true parameter \(\theta_0\), then \(\lambda(\theta_0)\) is the unique global minimum
Backup: More about parametric density estimates
- Quite generally, the MLE \(\hat{\theta} \rightarrow \theta_0\)
- Almost as generally, for large \(n\), \(\hat{\theta} \sim \mathcal{N}(\theta_0, \mathbf{J}^{-1}(\theta_0)/n)\)
- Hence \(\hat{\theta} = \theta_0 + O(1/\sqrt{n})\)
- Here \(\mathbf{J}(\theta) =\) matrix of 2nd partial derivatives of \(\lambda\) = Fisher information
- Can plug in \(\mathbf{J}(\hat{\theta})\) to get approximate confidence intervals, etc.
- Doesn’t matter if the data is Gaussian — \(L(\theta)\) is a sample average so by the CLT it’s (approximately) Gaussian, and that propagates
- Things are more complicated if the model is mis-specified
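Tying this back to the optim fit from the main slides (a sketch; gamma.fit is just a new name here, and the Hessian of \(L\) at the optimum estimates \(\mathbf{J}\)):
# Re-run the fit, asking optim for the Hessian as well
gamma.fit <- optim(fn = gamma.loglike, par = c(1, 1), hessian = TRUE)
# Approximate variance of the MLE is J^{-1}/n; square roots give standard errors
sqrt(diag(solve(gamma.fit$hessian) / nrow(cats)))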
Backup: Evaluating ISE
\[\begin{eqnarray}
\int{(f(x) - \hat{f}(x))^2 dx} & = & \int{f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x) dx}\\
& = & \int{f^2(x) dx} - 2\int{\hat{f}(x) f(x) dx} + \int{\hat{f}^2(x) dx}\\
& = & (A) + (B) + (C)\\
& = & (\text{same for all estimates}) - (\text{depends on truth and estimate}) + \text{(depends only on estimate)}
\end{eqnarray}\]
- If we’re comparing different estimates, term (A) doesn’t matter, so we can ignore it
- Term (C) depends only on our estimate, and we can get it by (say) numerical integration
- That leaves term (B), and with \(n\) independent samples \(x_i\), \[
\int{\hat{f}(x) f(x) dx} \approx \frac{1}{n}\sum_{i=1}^{n}{\hat{f}(x_i)}
\] by LLN
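Concretely, for a kernel density estimate (a sketch reusing the Gaussian-kernel kde from the main slides, with an arbitrary bandwidth; the integration limits just need to cover the data comfortably):
fhat <- function(x) sapply(x, kde, data = cats$Bwt, h = 0.1)
# Term (C): numerically integrate the squared estimate
C.term <- integrate(function(x) fhat(x)^2, lower = 1, upper = 5)$value
# Term (B): approximate the cross term by the sample average
B.term <- -2 * mean(fhat(cats$Bwt))
C.term + B.term  # ISE up to the constant term (A)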