Inference I — Inference with Independent Data

36-467/36-667

13 October 2020 (Lecture 13)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \newcommand{\Indicator}[1]{\mathbb{I}\left( #1 \right)} \newcommand{\se}[1]{\mathrm{se}\left[ #1 \right]} \newcommand{\CrossEntropy}{\ell} \newcommand{\xmin}{x_{\mathrm{min}}} \]

Agenda

In our previous episodes

For the rest of today

What we want to infer

Some standard estimates work because of the law of large numbers

What do we mean by “works”?

The basic ingredient for consistency is the law of large numbers

  1. It’s enough to show that \[ \Expect{(\overline{X}_n-\Expect{X})^2} \rightarrow 0 ~\text{as} ~ n\rightarrow\infty \] (Why?)
  2. It’s enough to show that \(\Expect{\overline{X}_n} = \Expect{X}\) and \(\Var{\overline{X}_n} \rightarrow 0\) (Why? See the sketch just below this list.)
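One way to answer both “why”s, using the inequalities from the backup slides: by Markov’s inequality applied to \((\overline{X}_n-\Expect{X})^2\),

\[ \Prob{\left|\overline{X}_n-\Expect{X}\right| > \epsilon} = \Prob{(\overline{X}_n-\Expect{X})^2 > \epsilon^2} \leq \frac{\Expect{(\overline{X}_n-\Expect{X})^2}}{\epsilon^2} \]

so if the mean-squared error goes to 0, the probability of any fixed-size deviation goes to 0 as well. And if \(\Expect{\overline{X}_n} = \Expect{X}\), then \(\Expect{(\overline{X}_n-\Expect{X})^2} = \Var{\overline{X}_n}\), so unbiasedness plus vanishing variance is enough.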

Proof of the law of large numbers, cont’d.
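A minimal sketch of the remaining calculation, assuming the \(X_i\) are IID with \(\Var{X} = \sigma^2 < \infty\):

\[\begin{eqnarray} \Expect{\overline{X}_n} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{X_i}} = \Expect{X} ~ \text{(linearity)}\\ \Var{\overline{X}_n} & = & \frac{1}{n^2}\sum_{i=1}^{n}{\Var{X_i}} ~ \text{(independence)}\\ & = & \frac{\sigma^2}{n} \rightarrow 0 ~\text{as}~ n\rightarrow\infty \end{eqnarray}\]

so both conditions from the previous slide hold, and \(\overline{X}_n \rightarrow \Expect{X}\).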

Law of large numbers in general

This is part of why maximum likelihood works

Maximum likelihood can work even without moments

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^3\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^5\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for the Pareto distribution, showing convergence as \(n\rightarrow\infty\) along a single IID sequence
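A minimal sketch, in Python, of how figures like these can be produced, assuming the Pareto density \(p(x;\theta) = \frac{\theta-1}{\xmin}{\left(\frac{x}{\xmin}\right)}^{-\theta}\) for \(x \geq \xmin\) (the parameterization of Clauset, Shalizi, and Newman 2009). The helper names `rpareto` and `norm_loglike` are mine, and Python/matplotlib stand in for whatever the lecture actually used; the sampling step is the inverse-CDF method.

```python
# Sketch (not the lecture's own code): normalized Pareto log-likelihoods
# for several IID samples, plotted against theta.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(467)
xmin, theta_true, n, n_reps = 1.0, 1.5, 1000, 10
theta_grid = np.linspace(1.1, 3.0, 200)

def rpareto(n, theta, xmin, rng):
    """Draw n Pareto variates by inverting the CDF F(x) = 1 - (x/xmin)**(-(theta-1))."""
    u = rng.uniform(size=n)
    return xmin * u**(-1.0 / (theta - 1.0))

def norm_loglike(x, theta, xmin):
    """Normalized log-likelihood (1/n) sum_i log p(x_i; theta)."""
    return np.mean(np.log(theta - 1) - np.log(xmin) - theta * np.log(x / xmin))

for _ in range(n_reps):
    x = rpareto(n, theta_true, xmin, rng)
    curve = [norm_loglike(x, th, xmin) for th in theta_grid]
    plt.plot(theta_grid, curve, color="grey", alpha=0.7)
plt.axvline(theta_true, linestyle="--")   # true theta
plt.xlabel(r"$\theta$")
plt.ylabel("normalized log-likelihood")
plt.show()
```

Re-running with \(n = 10\) or \(n = 10^5\) gives the spread-out and tightly-collapsed versions of the figure, respectively: as \(n\) grows, the curves converge to the expected log-likelihood, whose maximum sits at the true \(\theta\).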

The more general pattern

(some disclaimers apply)

This applies pretty broadly

What about estimation error?

\[\begin{eqnarray} h(\Expect{A_n}, \Expect{B_n}) & \approx & h(A_n, B_n) + (\Expect{A_n} - A_n)\frac{\partial h}{\partial a} + (\Expect{B_n}-B_n)\frac{\partial h}{\partial b} ~ \text{(Taylor series)}\\ \hat{\psi}_n = h(A_n, B_n) & \approx & h(\Expect{A_n}, \Expect{B_n}) + (A_n - \Expect{A_n})\frac{\partial h}{\partial a} + (B_n - \Expect{B_n})\frac{\partial h}{\partial b}\\ \Var{\hat{\psi}_n} & \approx & {\left(\frac{\partial h}{\partial a}\right)}^2\Var{A_n} +{\left(\frac{\partial h}{\partial b}\right)}^2\Var{B_n} + 2\left(\frac{\partial h}{\partial a}\frac{\partial h}{\partial b}\right)\Cov{A_n, B_n} \end{eqnarray}\]
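A minimal numerical check of this propagation-of-error approximation, in Python. The particular choice \(h(a,b) = a/b\), the normal populations, and the independence of \(A_n\) and \(B_n\) are all illustrative assumptions, not from the lecture.

```python
# Sketch: compare the propagation-of-error variance approximation for
# psi_hat = h(A_n, B_n) = A_n / B_n against a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(36467)
n, reps = 500, 5000
mu_a, mu_b = 1.0, 2.0        # population means, so E[A_n] = mu_a, E[B_n] = mu_b
sd_a, sd_b = 1.0, 0.5

def h(a, b):
    return a / b

# Partial derivatives of h, evaluated at the expectations
dh_da = 1.0 / mu_b
dh_db = -mu_a / mu_b**2

# A_n and B_n are independent sample means here, so Cov(A_n, B_n) = 0
var_A, var_B = sd_a**2 / n, sd_b**2 / n
delta_var = dh_da**2 * var_A + dh_db**2 * var_B

# Monte Carlo: simulate many data sets, recompute psi_hat each time
psis = np.empty(reps)
for r in range(reps):
    A = rng.normal(mu_a, sd_a, size=n).mean()
    B = rng.normal(mu_b, sd_b, size=n).mean()
    psis[r] = h(A, B)

print("propagation-of-error variance:", delta_var)
print("Monte Carlo variance:         ", psis.var(ddof=1))
```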

Estimating by optimizing

\[\begin{eqnarray} 0 & = & \frac{dM_n}{d\psi}(\hat{\psi}_n) ~ \text{(optimum)}\\ 0 & \approx & \frac{dM_n}{d\psi}(\psi_0) + (\hat{\psi}_n - \psi_0)\frac{d^2 M_n}{d\psi^2}(\psi_0) ~ \text{(Taylor expansion)}\\ ( \hat{\psi}_n - \psi_0) \frac{d^2 M_n}{d\psi^2}(\psi_0) & \approx & -\frac{dM_n}{d\psi}(\psi_0)\\ \hat{\psi}_n - \psi_0 & \approx & -\frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)}\\ \hat{\psi}_n & \approx & \psi_0 - \frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)} \end{eqnarray}\]
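A sketch of this approximation in action, in Python, re-using the Pareto example from earlier as \(M_n\): here \(M_n\) is the normalized log-likelihood, \(\psi\) is \(\theta\), and \(\psi_0\) is the true value. For the Pareto model the MLE has the closed form \(\hat{\theta} = 1 + n/\sum_{i}{\log(x_i/\xmin)}\), which gives something exact to compare the one-step approximation against.

```python
# Sketch: the Taylor-expansion approximation psi_hat ~ psi_0 - M'(psi_0)/M''(psi_0),
# with M_n the normalized Pareto log-likelihood, compared to the exact MLE.
import numpy as np

rng = np.random.default_rng(13)
xmin, theta0, n = 1.0, 1.5, 10**4

x = xmin * rng.uniform(size=n)**(-1.0 / (theta0 - 1.0))  # Pareto draws via inverse CDF
L = np.mean(np.log(x / xmin))

def dM(theta):    # first derivative of the normalized log-likelihood
    return 1.0 / (theta - 1.0) - L

def d2M(theta):   # second derivative (here it doesn't depend on the data)
    return -1.0 / (theta - 1.0)**2

theta_onestep = theta0 - dM(theta0) / d2M(theta0)   # the approximation above
theta_mle = 1.0 + 1.0 / L                           # exact maximizer for the Pareto model

print("one-step approximation:", theta_onestep)
print("exact MLE:             ", theta_mle)
```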

Estimating by optimizing

Estimating by optimizing

(still in 1D)

Estimating by optimizing

In practice…

Special application to maximum likelihood
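A sketch of one standard consequence, applied to the running Pareto example: when the model is correctly specified, the variance of the optimization-based estimate reduces to the inverse of the observed Fisher information, i.e., of minus the second derivative of the log-likelihood at \(\hat{\theta}\). For the Pareto model, \(\frac{d^2}{d\theta^2}\log{p(x;\theta)} = -1/(\theta-1)^2\), so \(\se{\hat{\theta}} \approx (\hat{\theta}-1)/\sqrt{n}\) (Clauset, Shalizi, and Newman 2009). The Python below is an illustrative sketch under those assumptions.

```python
# Sketch: standard error of the Pareto MLE from the observed information,
# se(theta_hat) ~ 1 / sqrt(-d^2 logLik / d theta^2 at theta_hat) = (theta_hat - 1)/sqrt(n).
import numpy as np

rng = np.random.default_rng(2020)
xmin, theta_true, n = 1.0, 1.5, 10**4

x = xmin * rng.uniform(size=n)**(-1.0 / (theta_true - 1.0))
theta_hat = 1.0 + n / np.sum(np.log(x / xmin))    # closed-form MLE

observed_info = n / (theta_hat - 1.0)**2          # -d^2 logLik/d theta^2 at theta_hat
se = 1.0 / np.sqrt(observed_info)                 # = (theta_hat - 1)/sqrt(n)

print("theta_hat =", theta_hat, "  se =", se)
print("approximate 95% CI:", (theta_hat - 1.96 * se, theta_hat + 1.96 * se))
```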

Generalizing away from IID data

Summary

Backup: Modes of convergence

Backup: Chebyshev’s inequality

For any random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{|Z-\Expect{Z}| > \epsilon} \leq \frac{\Var{Z}}{\epsilon^2} \]

Proof: Apply Markov’s inequality to \((Z-\Expect{Z})^2\), which is \(\geq 0\) and has expectation \(\Var{Z}\).
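A quick Monte Carlo illustration in Python (the exponential distribution here is just an arbitrary test case): for \(Z \sim \mathrm{Exp}(1)\), \(\Expect{Z} = \Var{Z} = 1\), so the bound is \(1/\epsilon^2\).

```python
# Sketch: check P(|Z - E[Z]| > eps) <= Var(Z)/eps^2 by simulation for Z ~ Exponential(1).
import numpy as np

rng = np.random.default_rng(7)
z = rng.exponential(scale=1.0, size=10**6)
for eps in (1.0, 2.0, 3.0):
    lhs = np.mean(np.abs(z - 1.0) > eps)   # empirical tail probability
    rhs = 1.0 / eps**2                     # Chebyshev bound, Var(Z)/eps^2
    print(f"eps={eps}: empirical {lhs:.4f} <= bound {rhs:.4f}")
```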

Backup: Markov’s inequality

For any non-negative random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon} \]

Proof: \[\begin{eqnarray} Z & = & Z\Indicator{Z \geq \epsilon} + Z\Indicator{Z < \epsilon}\\ \Expect{Z} & = & \Expect{Z \Indicator{Z \geq \epsilon}} + \Expect{Z \Indicator{Z < \epsilon}}\\ & \geq & \Expect{Z \Indicator{Z \geq \epsilon}}\\ & \geq & \Expect{\epsilon \Indicator{Z \geq \epsilon}}\\ & = & \epsilon\Expect{\Indicator{Z \geq \epsilon}} = \epsilon \Prob{Z \geq \epsilon} \end{eqnarray}\]

Backup: Disclaimers to the “More general pattern”

Sketch proof

Backup: The Gibbs inequality

\[\begin{eqnarray} \int{f(x) \log{f(x)} dx} - \int{f(x) \log{g(x)} dx} & = & \int{f(x) (\log{f(x)} - \log{g(x)}) dx}\\ & = & \int{f(x) \log{\frac{f(x)}{g(x)}} dx}\\ & = & -\int{f(x) \log{\frac{g(x)}{f(x)}} dx}\\ & \geq & -\log{\int{f(x) \frac{g(x)}{f(x)} dx}} = \log{1} = 0 \end{eqnarray}\]

where the last line uses Jensen’s inequality

(proof for pmfs is entirely parallel)
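A quick numerical illustration in Python: the left-hand side above is the expectation, under \(f\), of \(\log{f(X)} - \log{g(X)}\), i.e., the Kullback-Leibler divergence, which the inequality says is non-negative. The two Gaussian densities below are arbitrary illustrative choices for \(f\) and \(g\).

```python
# Sketch: Monte Carlo check that int f log f - int f log g >= 0,
# with f = N(0, 1) and g = N(1, 2) (illustrative choices).
import numpy as np

def norm_logpdf(x, mu, sigma):
    """Log-density of the N(mu, sigma^2) distribution."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=10**6)                 # draws from f = N(0, 1)
kl_estimate = np.mean(norm_logpdf(x, 0.0, 1.0) - norm_logpdf(x, 1.0, 2.0))
print("estimated KL(f || g):", kl_estimate)          # non-negative; true value here is about 0.44
```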

Backup: Jensen’s inequality
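The statement as used in the Gibbs-inequality proof above: if \(\varphi\) is a concave function and \(Z\) a random variable, then

\[ \Expect{\varphi(Z)} \leq \varphi\left(\Expect{Z}\right) \]

(with the inequality reversed for convex \(\varphi\)). In the Gibbs proof, \(\varphi = \log\) and \(Z = g(X)/f(X)\), with \(X\) distributed according to \(f\).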

Backup: Further reading

References

Barndorff-Nielsen, O. E., and D. R. Cox. 1995. Inference and Asymptotics. London: Chapman and Hall.

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51:661–703. http://arxiv.org/abs/0706.1062.

Fisher, R. A. 1922. “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A 222:309–68. http://digital.library.adelaide.edu.au/dspace/handle/2440/15172.

Huber, Peter J. 1967. “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, edited by Lucien M. Le Cam and Jerzy Neyman, 1:221–33. Berkeley: University of California Press. http://projecteuclid.org/euclid.bsmsp/1200512988.

Koenker, Roger, and Kevin F. Hallock. 2001. “Quantile Regression.” Journal of Economic Perspectives 15:143–56. https://doi.org/10.1257/jep.15.4.143.

Stigler, Stephen M. 2007. “The Epic Story of Maximum Likelihood.” Statistical Science 22:598–620. https://doi.org/10.1214/07-STS249.

Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge, England: Cambridge University Press.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.

Zeileis, Achim. 2004. “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software 11 (10):1–17. https://doi.org/10.18637/jss.v011.i10.

———. 2006. “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software 16 (9):1–16. https://doi.org/10.18637/jss.v016.i09.