Density estimation
36-402
14 March 2024
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)}
\]
Data: body weights of cats
library(MASS)
data(cats)
head(cats$Bwt)
## [1] 2.0 2.0 2.0 2.1 2.1 2.1
summary(cats$Bwt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.300 2.700 2.724 3.025 3.900
The empirical cumulative distribution function
- ECDF: \[
\hat{F}_n(a) = \frac{1}{n}\sum_{i=1}^{n}{\Indicator{x_i \leq a}}
\]
- Makes a jump at each distinct value seen in the data
- Constant in between the jumps
Computationally: ecdf
plot(ecdf(cats$Bwt), xlab = "Body weight (kg)", ylab = "Cumulative probability",
main = "Empirical CDF of cats' body weights")

The ECDF estimates the CDF
- Suppose \(x_1, x_2, \ldots x_n\) are IID with CDF \(F\). Then for any \(a\), \[
\hat{F}_n(a) \rightarrow F(a)
\] by the law of large numbers
The CDF is not enough

- What’s the probability of a cat weighing between \(3.21\) and \(3.22\) kg?
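The ECDF can only answer "exactly zero" here: since weights are recorded to the nearest 0.1 kg, no observation falls in this interval, even though the true probability isn't zero. A quick check:
# The ECDF puts all its mass on observed values, so this narrow interval
# between recorded weights gets probability exactly zero
Fn <- ecdf(cats$Bwt)
Fn(3.22) - Fn(3.21)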
The CDF is not enough (cont’d)
- More generally, pdf is derivative of CDF, \(f(x) = dF/dx\)
- Derivative of an empirical CDF is either 0 or \(\infty\)
- How do we estimate a pdf?
What about a histogram?
hist(cats$Bwt, xlab = "Body weight (kg)", ylab = "Number of cats", breaks = 20)

- Gives us an idea of where we have more or fewer data points…
Fitting a parametric model
- Pick a parametric pdf that looks sensible
- For example, the gamma distribution with shape parameter \(a\) and scale \(s\) is \[
f(x) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}
\]
- Exponential is the special case \(a=1\)
- \(a > 1\) is a hump with an exponential right tail

Fitting a parametric model (cont’d)
- Now maximize the likelihood \(=\) probability density at the data
- For many standard distributions, the function fitdistr in the package MASS has pre-programmed maximum likelihood estimates
- As it happens, fitdistr knows about gamma distributions too (an example follows the optim fit below)
- It’s also easy to maximize the log-likelihood yourself
# Negative mean log-likelihood of a gamma distribution at the given shape
# and scale; optim() minimizes this, which maximizes the likelihood
gamma.loglike <- function(params, data = cats$Bwt) {
    -mean(log(dgamma(x = data, shape = params[1], scale = params[2])))
}
cats.gamma <- optim(fn = gamma.loglike, par = c(1, 1))
cats.gamma$value
## [1] 0.6677077
cats.gamma$par
## [1] 32.65187332 0.08341489
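For comparison, here is (I believe) how the fitdistr route mentioned above looks; note that fitdistr parameterizes the gamma by shape and rate, so its scale is the reciprocal of the reported rate:
# Same fit via MASS::fitdistr; reports shape and rate (scale = 1/rate)
fitdistr(cats$Bwt, densfun = "gamma")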
Compare the fitted density to the histogram
plot(hist(cats$Bwt, plot = FALSE, breaks = 20), freq = FALSE, xlab = "Body weight (kg)",
xlim = c(0, 2 * max(cats$Bwt)))
curve(dgamma(x, shape = cats.gamma$par[1], scale = cats.gamma$par[2]), add = TRUE)

This is not a very good fit on the left, maybe tolerable on the right…
Have a cat picture

The histogram is already a density estimate
- Estimated density is constant on bins
- \(N_i =\) number of samples in bin \(i\)
- \(i(x)=\) which bin \(x\) falls into
- \(h =\) width of the bins
\[
\hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n}
\]
- This is a mixture of more basic distributions:
- Uniform distributions on each bin
- Mixing weight of bin \(i\) is \(N_i/n\)
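In fact, hist() already reports exactly this \(\hat{f}_{hist}\); a quick check (the variable names here are just for illustration):
# hist()'s density component should equal N_i / (n*h), bin by bin
cats.hist <- hist(cats$Bwt, breaks = 20, plot = FALSE)
h <- diff(cats.hist$breaks)[1]     # common bin width
n <- length(cats$Bwt)
all.equal(cats.hist$density, cats.hist$counts / (n * h))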
Why this works
Probability of being in an interval of length \(h\) around \(x\) will be \[
\int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h
\] for very small \(h\). So \[
N_i \sim \mathrm{Binom}(n, f(x) h)
\] and \[
\hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h)
\] which \(\rightarrow f(x)\) as \(n \rightarrow \infty\)
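Spelling out the binomial mean and variance (the approximations become exact as \(h \rightarrow 0\)): \[
\Expect{\hat{f}_{hist}(x)} \approx \frac{n f(x) h}{nh} = f(x),
\qquad
\mathrm{Var}\left[\hat{f}_{hist}(x)\right] \approx \frac{n f(x) h (1 - f(x)h)}{(nh)^2} = \frac{f(x)(1-f(x)h)}{nh}
\] so the variance shrinks like \(1/nh\), which is the first error term on the next slide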
How bad is a histogram?
- Error from variance: \(\propto 1/nh\)
- Error from bias: \(\propto h^2 (f^{\prime}(x))^2\)
- Taylor-expand inside the integral on the previous slide to first order; the bias will cancel out if we’re exactly at the mid-point of the interval, otherwise it’s as stated
- Total error for estimating the density, \(O(h^2) + O(1/nh)\)
- Best \(h = O(n^{-1/3})\)
- Best total error, \(O(n^{-2/3})\)
- vs. \(O(n^{-1})\) if we’ve got a correct parametric model
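Where the best bandwidth comes from: writing the total error as \(c_1 h^2 + c_2/(nh)\) for constants \(c_1, c_2\) that depend on \(f\), and minimizing over \(h\), \[
\frac{d}{dh}\left(c_1 h^2 + \frac{c_2}{nh}\right) = 2 c_1 h - \frac{c_2}{n h^2} = 0
\quad \Rightarrow \quad
h = \left(\frac{c_2}{2 c_1 n}\right)^{1/3} = O(n^{-1/3})
\] and substituting back in gives a total error of order \(n^{-2/3}\)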
Have a cat picture

Kernels
Idea: “Smooth out” each observation into a little distribution
Kernel functions are pdfs
Put a copy of the kernel at each observation; add them up
\[
\hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)}
\]
- This is a mixture:
- Basic distributions are the scaled copies of the kernel, centered at the \(x_i\)s
- Weight of each mixture component \(=1/n\)
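To make the formula concrete, here is a KDE written out by hand with a Gaussian kernel (the function name and the bandwidth h = 0.05 are just for illustration):
# (1/n) * sum over i of (1/h) K((x - x_i)/h), with K the standard Gaussian pdf
kde.by.hand <- function(x, data = cats$Bwt, h = 0.05) {
    sapply(x, function(a) mean(dnorm((a - data)/h)/h))
}
curve(kde.by.hand(x), from = 1.5, to = 4.5, xlab = "Body weight (kg)", ylab = "Estimated density")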
How this works for cats
Say \(h=0.01\)

How this works for cats
Say \(h=0.02\)

How this works for cats
Say \(h=0.05\)
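Plots like these can be approximated with base R's density(), where (for the default Gaussian kernel) bw plays the role of \(h\); the exact kernel behind the slide figures isn't specified, so take this as a sketch:
# Overlay kernel density estimates at the three bandwidths
plot(density(cats$Bwt, bw = 0.05), main = "", xlab = "Body weight (kg)")
lines(density(cats$Bwt, bw = 0.02), lty = "dashed")
lines(density(cats$Bwt, bw = 0.01), lty = "dotted")
legend("topright", legend = c("h = 0.05", "h = 0.02", "h = 0.01"),
       lty = c("solid", "dashed", "dotted"))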

How do we pick the bandwidth?
- Cross-validation
- Cross-validate what?
- Integrated squared error: \(\int{(f(x) - \hat{f}(x))^2 dx}\)
- Or (negative, normalized) log-likelihood \(-\int{\log{(\hat{f}(x))} f(x) dx}\)
Evaluating the log-likelihood
- The integral is an expected value: \[
-\int{\log{(\hat{f}(x))} f(x) dx} = \Expect{-\log{\hat{f}(X)}}
\]
- With \(n\) IID samples \(x_i\), \[
\Expect{-\log{\hat{f}(X)}} \approx -\frac{1}{n}\sum_{i=1}^{n}{\log{\hat{f}(x_i)}}
\] by LLN
- A similar but more complicated trick for ISE; see backup slides
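A hedged sketch of what this looks like for a hand-rolled Gaussian-kernel KDE: score each candidate bandwidth by the leave-one-out log-likelihood and keep the best one (the function and the candidate grid are illustrative, not what np does internally):
# Average leave-one-out log density of the data under a Gaussian-kernel KDE
loo.loglike <- function(h, data = cats$Bwt) {
    mean(sapply(seq_along(data), function(i) {
        log(mean(dnorm((data[i] - data[-i])/h)/h))
    }))
}
hs <- seq(0.02, 0.3, by = 0.01)          # candidate bandwidths
hs[which.max(sapply(hs, loo.loglike))]   # bandwidth with the best held-out fit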
Computationally: npudens
library(np)
bwt.np <- npudens(cats$Bwt)
plot(bwt.np, xlab = "Body weight (kg)")

Selected bandwidth: about 0.11 kg
Joint PDFs
- What about multiple variables? We want \(f(x_1, x_2, \ldots x_p) = f(\vec{x})\)
plot(Hwt ~ Bwt, data = cats, xlab = "Body weight (kg)", ylab = "Heart weight (g)")

- Parametric models work just the same way as before
- Kernel density estimates work almost the same way
Joint KDEs
- Use a product of kernels, one for each variable \[
\hat{f}_{kern}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}
\]
- Need as many bandwidths as there are variables
- Pick bandwidths for best prediction of both variables
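Written out at a single illustrative point (Gaussian kernels; the point and the bandwidths \(h_1, h_2\) are arbitrary):
# Product-kernel estimate of the joint density at (3 kg, 11 g)
h1 <- 0.1; h2 <- 1
mean(dnorm((3 - cats$Bwt)/h1)/h1 * dnorm((11 - cats$Hwt)/h2)/h2)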
Computationally: npudens again
cats.np <- npudens(~Bwt + Hwt, data = cats)
plot(cats.np, view = "fixed", theta = 10, phi = 45, xlab = "Body weight (kg)", ylab = "Heart weight (g)")

EXERCISE: Where did those ridges come from?


- Q1: What happened to the bandwidth on bodyweight (Bwt) going from 1D to 2D? Did it increase, decrease, stay the same? How can you tell?
- Q2: Why did that bandwidth do that?
SOLUTIONS
- Q1: What happened to the bandwidth on bodyweight (Bwt)?
- It got a lot smaller; we can tell because the distribution is plainly very spiky along the Bwt axis
bwt.np$bws$bw
## [1] 0.1098765
cats.np$bws$bw
## [1] 0.008333236 1.485781401
- Q2: Why did that bandwidth change?
- We almost never see the same bandwidths for the same variable in different distributions: it changes if we’re doing marginal vs. joint, it changes in different joint distributions with different other variables, etc.
- We’re optimizing for different objective functions!
- Here: There’s a strong relationship between Bwt and Hwt; if we want to get both right, we can’t average over as wide a range of Bwt
Have a cat picture

Conditional PDFs
- Conditional pdf: \[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f(x_1)}
\] so find the joint and marginal pdfs and divide
- Conditional KDE: \[
\hat{f}_{kern}(x_2|x_1) = \frac{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}}{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)}}
\]
- Pick both bandwidths to give the best prediction of \(x_2\)
- i.e., look at the conditional log-likelihood of \(x_2\) given \(x_1\)
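As a sketch, the same ratio can be written out by hand with Gaussian kernels (the bandwidths here are placeholders, not the ones npcdens would pick):
# Conditional KDE = joint KDE / marginal KDE, with Gaussian kernels
cond.kde <- function(x2, x1, h1 = 0.1, h2 = 1, data = cats) {
    joint <- mean(dnorm((x1 - data$Bwt)/h1)/h1 * dnorm((x2 - data$Hwt)/h2)/h2)
    marginal <- mean(dnorm((x1 - data$Bwt)/h1)/h1)
    joint/marginal
}
cond.kde(x2 = 11, x1 = 3)   # estimated density of an 11 g heart, given a 3 kg cat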
Computationally: npcdens
cats.hwt.on.bwt <- npcdens(Hwt ~ Bwt, data = cats)
plot(cats.hwt.on.bwt, view = "fixed", theta = 10, phi = 65)

How well does kernel density estimation work?
- The same sort of analysis we did for kernel regression will work here too
- More book-keeping — see the textbook
- Integrated squared error of KDE is \(O(n^{-4/5})\) in 1D
- Worse than a well-specified parametric model, \(O(n^{-1})\)
- Better than a mis-specified parametric model, \(O(1)\)
- Better than a histogram, \(O(n^{-2/3})\)
- In \(p\) dimensions, ISE of joint KDE is \(O(n^{-4/(4+p)})\)
- Nothing which is universally consistent does any better
- Curse of dimensionality again
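Where the \(p\)-dimensional rate comes from: with second-order product kernels and all bandwidths of the same order \(h\), the squared bias is \(O(h^4)\) and the variance is \(O(1/nh^p)\), so \[
h_{opt} = O\left(n^{-1/(4+p)}\right)
\quad \Rightarrow \quad
O(h_{opt}^4) + O\left(\frac{1}{n h_{opt}^p}\right) = O\left(n^{-4/(4+p)}\right)
\] with \(p = 1\) recovering the \(O(n^{-4/5})\) rate above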
Summing up
- It’s easy to learn the CDF
- It’s straightforward to estimate parametric distributions
- maximize the log-likelihood
- Histograms are a first step to a non-parametric density estimate
- Kind of ugly and statistically inefficient, though
- Kernel density estimation lets us learn marginal pdfs, joint pdfs, conditional pdfs…
- The curse of dimensionality is always with us though
READING: Chapter 14
Backup: More about the ECDF
- Glivenko-Cantelli theorem: If \(X\)’s are IID with CDF \(F\), then \[
\max_{a}{|\hat{F}_n(a) - F(a)|} \rightarrow 0
\]
- This is a stronger result than just saying “for each \(a\), \(|\hat{F}_n(a)-F(a)| \rightarrow 0\)”
- Kolmogorov-Smirnov (KS) test: To test whether \(X\)’s are IID with CDF \(F_0\), use \(d_{KS} = \max_{a}{|\hat{F}_n(a)-F_0(a)|}\)
- If \(X \sim F_0\), then \(d_{KS} \rightarrow 0\) as \(n\rightarrow \infty\)
- If \(X \not\sim F_0\), then \(d_{KS} \rightarrow\) some positive number
- The sampling distribution of \(d_{KS}\) under the null turns out to be the same for all \(F_0\) (at least for big \(n\))
- Needs modification if you estimate the \(F_0\) from data as well
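For instance, R's built-in ks.test() can compare the data to the gamma fit from earlier; per the last bullet, the reported p-value isn't quite right because the parameters were estimated from the same data (and ties from the 0.1 kg rounding will also draw a warning):
# KS test against the fitted gamma; treat the p-value as approximate here
ks.test(cats$Bwt, "pgamma", shape = cats.gamma$par[1], scale = cats.gamma$par[2])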
Backup: Parametric density estimates
- We have a parametric model for the density \(f(x;\theta)\) with a finite-dimensional vector of parameters \(\theta\)
- We maximize the log-likelihood to get \(\hat{\theta}\)
- Equivalently we minimize the negative normalized log-likelihood \[
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}
\]
- This is a sample mean so LLN says \(L(\theta) \rightarrow \Expect{L(\theta)} \equiv \lambda(\theta)\)
- You can show that if the data came from the model with some true parameter \(\theta_0\), then \(\lambda(\theta_0)\) is the unique global minimum
Backup: More about parametric density estimates
- Quite generally, the MLE \(\hat{\theta} \rightarrow \theta_0\)
- Almost as generally, for large \(n\), \(\hat{\theta} \sim \mathcal{N}(\theta_0, \mathbf{J}^{-1}(\theta_0)/n)\)
- Hence \(\hat{\theta} = \theta_0 + O(1/\sqrt{n})\)
- Here \(\mathbf{J}(\theta) =\) matrix of 2nd partial derivatives of \(\lambda\) = Fisher information
- Can plug in \(\mathbf{J}(\hat{\theta})\) to get approximate confidence intervals, etc.
- Doesn’t matter if the data is Gaussian — \(L(\theta)\) is a sample average so by the CLT it’s (approximately) Gaussian, and that propagates
- Things are more complicated if the model is mis-specified
- More details in Appendix D of the textbook
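A quick sketch of the plug-in recipe with the gamma fit from earlier: ask optim() for the Hessian of \(L\) at the optimum, which estimates \(\mathbf{J}\), and invert it for approximate standard errors (the object names here are illustrative):
# Re-fit, asking for the numerically differentiated Hessian at the optimum
cats.gamma.h <- optim(fn = gamma.loglike, par = c(1, 1), hessian = TRUE)
n <- length(cats$Bwt)
# Var(theta-hat) is approximately solve(J)/n, so the approximate SEs are
sqrt(diag(solve(cats.gamma.h$hessian)/n))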
Backup: Evaluating ISE
\[\begin{eqnarray}
\int{(f(x) - \hat{f}(x))^2 dx} & = & \int{f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x) dx}\\
& = & \int{f^2(x) dx} - 2\int{\hat{f}(x) f(x) dx} + \int{\hat{f}^2(x) dx}\\
& = & (A) + (B) + (C)\\
& = & (\text{same for all estimates})\\
& & - (\text{depends on truth and estimate})\\
& & + \text{(depends only on estimate)}
\end{eqnarray}\]
- If we’re comparing different estimates, term (A) doesn’t matter, so we can ignore it
- Term (C) depends only on our estimate, and we can get it by (say) numerical integration
- That leaves term (B), and with \(n\) independent samples \(x_i\), \[
\int{\hat{f}(x) f(x) dx} \approx \frac{1}{n}\sum_{i=1}^{n}{\hat{f}(x_i)}
\] by LLN
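A sketch of using this in practice with a hand-rolled Gaussian-kernel KDE: fit on one part of the data, get term (C) by numerical integration, estimate the integral in term (B) from the held-out part, and compare estimates by (C) minus twice (B); the split, bandwidth, and integration limits below are all arbitrary:
set.seed(1)                                   # illustrative random split
idx <- sample(length(cats$Bwt), 100)
train <- cats$Bwt[idx]; test <- cats$Bwt[-idx]
fhat <- function(x, h = 0.1) sapply(x, function(a) mean(dnorm((a - train)/h)/h))
C <- integrate(function(x) fhat(x)^2, lower = 0, upper = 10)$value  # term (C)
B <- mean(fhat(test))                         # estimates the integral in term (B)
C - 2*B                                       # ISE up to the constant term (A)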