Inference I — Inference with Independent Data

36-467/36-667

13 October 2020 (Lecture 13)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \newcommand{\Indicator}[1]{\mathbb{I}\left( #1 \right)} \newcommand{\se}[1]{\mathrm{se}\left[ #1 \right]} \newcommand{\CrossEntropy}{\ell} \newcommand{\xmin}{x_{\mathrm{min}}} \]

Agenda

In our previous episodes

For the rest of today

What we want to infer

Some standard estimates work because of the law of large numbers

What do we mean by “works”?

The basic ingredient for consistency is the law of large numbers

  1. It’s enough to show that \[ \Expect{(\overline{X}_n-\Expect{X})^2} \rightarrow 0 ~\text{as} ~ n\rightarrow\infty \] (Why?)
  2. It’s enough to show that \(\Expect{\overline{X}_n} = \Expect{X}\) and \(\Var{\overline{X}_n} \rightarrow 0\) (Why? See the sketch just below this list.)
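One way to answer both “why”s, using the inequalities from the backup slides: by Markov’s inequality applied to \((\overline{X}_n-\Expect{X})^2\),

\[ \Prob{\left|\overline{X}_n-\Expect{X}\right| > \epsilon} = \Prob{(\overline{X}_n-\Expect{X})^2 > \epsilon^2} \leq \frac{\Expect{(\overline{X}_n-\Expect{X})^2}}{\epsilon^2} \]

so if the mean-squared error goes to 0, the probability of any fixed-size deviation goes to 0 as well. And if \(\Expect{\overline{X}_n} = \Expect{X}\), then \(\Expect{(\overline{X}_n-\Expect{X})^2} = \Var{\overline{X}_n}\), so unbiasedness plus vanishing variance is enough.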

Proof of the law of large numbers, cont’d.
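A minimal sketch of the remaining calculation, assuming the \(X_i\) are IID with \(\Var{X} = \sigma^2 < \infty\):

\[\begin{eqnarray} \Expect{\overline{X}_n} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{X_i}} = \Expect{X} ~ \text{(linearity)}\\ \Var{\overline{X}_n} & = & \frac{1}{n^2}\sum_{i=1}^{n}{\Var{X_i}} ~ \text{(independence)}\\ & = & \frac{\sigma^2}{n} \rightarrow 0 ~\text{as}~ n\rightarrow\infty \end{eqnarray}\]

so both conditions from the previous slide hold, and \(\overline{X}_n \rightarrow \Expect{X}\).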

Law of large numbers in general

This is part of why maximum likelihood works

Maximum likelihood can work even without moments

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^3\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^5\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for the Pareto distribution, showing convergence as \(n\rightarrow\infty\) along a single IID sequence
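A minimal sketch, in Python, of how figures like these can be produced, assuming the Pareto density \(p(x;\theta) = \frac{\theta-1}{\xmin}{\left(\frac{x}{\xmin}\right)}^{-\theta}\) for \(x \geq \xmin\) (the parameterization of Clauset, Shalizi, and Newman 2009). The helper names `rpareto` and `norm_loglike` are mine, and Python/matplotlib stand in for whatever the lecture actually used; the sampling step is the inverse-CDF method.

```python
# Sketch (not the lecture's own code): normalized Pareto log-likelihoods
# for several IID samples, plotted against theta.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(467)
xmin, theta_true, n, n_reps = 1.0, 1.5, 1000, 10
theta_grid = np.linspace(1.1, 3.0, 200)

def rpareto(n, theta, xmin, rng):
    """Draw n Pareto variates by inverting the CDF F(x) = 1 - (x/xmin)**(-(theta-1))."""
    u = rng.uniform(size=n)
    return xmin * u**(-1.0 / (theta - 1.0))

def norm_loglike(x, theta, xmin):
    """Normalized log-likelihood (1/n) sum_i log p(x_i; theta)."""
    return np.mean(np.log(theta - 1) - np.log(xmin) - theta * np.log(x / xmin))

for _ in range(n_reps):
    x = rpareto(n, theta_true, xmin, rng)
    curve = [norm_loglike(x, th, xmin) for th in theta_grid]
    plt.plot(theta_grid, curve, color="grey", alpha=0.7)
plt.axvline(theta_true, linestyle="--")   # true theta
plt.xlabel(r"$\theta$")
plt.ylabel("normalized log-likelihood")
plt.show()
```

Re-running with \(n = 10\) or \(n = 10^5\) gives the spread-out and tightly-collapsed versions of the figure, respectively: as \(n\) grows, the curves converge to the expected log-likelihood, whose maximum sits at the true \(\theta\).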

The more general pattern

(some disclaimers apply)

This applies pretty broadly

What about estimation error?

\[\begin{eqnarray} h(\Expect{A_n}, \Expect{B_n}) & \approx & h(A_n, B_n) + (\Expect{A_n} - A_n)\frac{\partial h}{\partial a} + (\Expect{B_n}-B_n)\frac{\partial h}{\partial b} ~ \text{(Taylor series)}\\ \hat{\psi}_n = h(A_n, B_n) & \approx & h(\Expect{A_n}, \Expect{B_n}) + (A_n - \Expect{A_n})\frac{\partial h}{\partial a} + (B_n - \Expect{B_n})\frac{\partial h}{\partial b}\\ \Var{\hat{\psi}_n} & \approx & {\left(\frac{\partial h}{\partial a}\right)}^2\Var{A_n} +{\left(\frac{\partial h}{\partial b}\right)}^2\Var{B_n} + 2\left(\frac{\partial h}{\partial a}\frac{\partial h}{\partial b}\right)\Cov{A_n, B_n} \end{eqnarray}\]
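A minimal numerical check of this propagation-of-error approximation, in Python. The particular choice \(h(a,b) = a/b\), the normal populations, and the independence of \(A_n\) and \(B_n\) are all illustrative assumptions, not from the lecture.

```python
# Sketch: compare the propagation-of-error variance approximation for
# psi_hat = h(A_n, B_n) = A_n / B_n against a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(36467)
n, reps = 500, 5000
mu_a, mu_b = 1.0, 2.0        # population means, so E[A_n] = mu_a, E[B_n] = mu_b
sd_a, sd_b = 1.0, 0.5

def h(a, b):
    return a / b

# Partial derivatives of h, evaluated at the expectations
dh_da = 1.0 / mu_b
dh_db = -mu_a / mu_b**2

# A_n and B_n are independent sample means here, so Cov(A_n, B_n) = 0
var_A, var_B = sd_a**2 / n, sd_b**2 / n
delta_var = dh_da**2 * var_A + dh_db**2 * var_B

# Monte Carlo: simulate many data sets, recompute psi_hat each time
psis = np.empty(reps)
for r in range(reps):
    A = rng.normal(mu_a, sd_a, size=n).mean()
    B = rng.normal(mu_b, sd_b, size=n).mean()
    psis[r] = h(A, B)

print("propagation-of-error variance:", delta_var)
print("Monte Carlo variance:         ", psis.var(ddof=1))
```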

Estimating by optimizing

\[\begin{eqnarray} 0 & = & \frac{dM_n}{d\psi}(\hat{\psi}_n) ~ \text{(optimum)}\\ 0 & \approx & \frac{dM_n}{d\psi}(\psi_0) + (\hat{\psi}_n - \psi_0)\frac{d^2 M_n}{d\psi^2}(\psi_0) ~ \text{(Taylor expansion)}\\ ( \hat{\psi}_n - \psi_0) \frac{d^2 M_n}{d\psi^2}(\psi_0) & \approx & -\frac{dM_n}{d\psi}(\psi_0)\\ \hat{\psi}_n - \psi_0 & \approx & -\frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)}\\ \hat{\psi}_n & \approx & \psi_0 - \frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)} \end{eqnarray}\]
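A sketch of this approximation in action, in Python, re-using the Pareto example from earlier as \(M_n\): here \(M_n\) is the normalized log-likelihood, \(\psi\) is \(\theta\), and \(\psi_0\) is the true value. For the Pareto model the MLE has the closed form \(\hat{\theta} = 1 + n/\sum_{i}{\log(x_i/\xmin)}\), which gives something exact to compare the one-step approximation against.

```python
# Sketch: the Taylor-expansion approximation psi_hat ~ psi_0 - M'(psi_0)/M''(psi_0),
# with M_n the normalized Pareto log-likelihood, compared to the exact MLE.
import numpy as np

rng = np.random.default_rng(13)
xmin, theta0, n = 1.0, 1.5, 10**4

x = xmin * rng.uniform(size=n)**(-1.0 / (theta0 - 1.0))  # Pareto draws via inverse CDF
L = np.mean(np.log(x / xmin))

def dM(theta):    # first derivative of the normalized log-likelihood
    return 1.0 / (theta - 1.0) - L

def d2M(theta):   # second derivative (here it doesn't depend on the data)
    return -1.0 / (theta - 1.0)**2

theta_onestep = theta0 - dM(theta0) / d2M(theta0)   # the approximation above
theta_mle = 1.0 + 1.0 / L                           # exact maximizer for the Pareto model

print("one-step approximation:", theta_onestep)
print("exact MLE:             ", theta_mle)
```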

Estimating by optimizing

Estimating by optimizing

(still in 1D)

Estimating by optimizing

In practice…

Special application to maximum likelihood
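A sketch of one standard consequence, applied to the running Pareto example: when the model is correctly specified, the variance of the optimization-based estimate reduces to the inverse of the observed Fisher information, i.e., of minus the second derivative of the log-likelihood at \(\hat{\theta}\). For the Pareto model, \(\frac{d^2}{d\theta^2}\log{p(x;\theta)} = -1/(\theta-1)^2\), so \(\se{\hat{\theta}} \approx (\hat{\theta}-1)/\sqrt{n}\) (Clauset, Shalizi, and Newman 2009). The Python below is an illustrative sketch under those assumptions.

```python
# Sketch: standard error of the Pareto MLE from the observed information,
# se(theta_hat) ~ 1 / sqrt(-d^2 logLik / d theta^2 at theta_hat) = (theta_hat - 1)/sqrt(n).
import numpy as np

rng = np.random.default_rng(2020)
xmin, theta_true, n = 1.0, 1.5, 10**4

x = xmin * rng.uniform(size=n)**(-1.0 / (theta_true - 1.0))
theta_hat = 1.0 + n / np.sum(np.log(x / xmin))    # closed-form MLE

observed_info = n / (theta_hat - 1.0)**2          # -d^2 logLik/d theta^2 at theta_hat
se = 1.0 / np.sqrt(observed_info)                 # = (theta_hat - 1)/sqrt(n)

print("theta_hat =", theta_hat, "  se =", se)
print("approximate 95% CI:", (theta_hat - 1.96 * se, theta_hat + 1.96 * se))
```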

Generalizing away from IID data

Summary

Backup: Modes of convergence

Backup: Chebyshev’s inequality

For any random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{|Z-\Expect{Z}| > \epsilon} \leq \frac{\Var{Z}}{\epsilon^2} \]

Proof: Apply Markov’s inequality to \((Z-\Expect{Z})^2\), which is \(\geq 0\) and has expectation \(\Var{Z}\).
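A quick Monte Carlo illustration in Python (the exponential distribution here is just an arbitrary test case): for \(Z \sim \mathrm{Exp}(1)\), \(\Expect{Z} = \Var{Z} = 1\), so the bound is \(1/\epsilon^2\).

```python
# Sketch: check P(|Z - E[Z]| > eps) <= Var(Z)/eps^2 by simulation for Z ~ Exponential(1).
import numpy as np

rng = np.random.default_rng(7)
z = rng.exponential(scale=1.0, size=10**6)
for eps in (1.0, 2.0, 3.0):
    lhs = np.mean(np.abs(z - 1.0) > eps)   # empirical tail probability
    rhs = 1.0 / eps**2                     # Chebyshev bound, Var(Z)/eps^2
    print(f"eps={eps}: empirical {lhs:.4f} <= bound {rhs:.4f}")
```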

Backup: Markov’s inequality

For any non-negative random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon} \]

Proof: \[\begin{eqnarray} Z & = & Z\Indicator{Z \geq \epsilon} + Z\Indicator{Z < \epsilon}\\ \Expect{Z} & = & \Expect{Z \Indicator{Z \geq \epsilon}} + \Expect{Z \Indicator{Z < \epsilon}}\\ & \geq & \Expect{Z \Indicator{Z \geq \epsilon}}\\ & \geq & \Expect{\epsilon \Indicator{Z \geq \epsilon}}\\ & = & \epsilon\Expect{\Indicator{Z \geq \epsilon}} = \epsilon \Prob{Z \geq \epsilon} \end{eqnarray}\]

Backup: Disclaimers to the “More general pattern”

Sketch proof

Backup: The Gibbs inequality

\[\begin{eqnarray} \int{f(x) \log{f(x)} dx} - \int{f(x) \log{g(x)} dx} & = & \int{f(x) (\log{f(x)} - \log{g(x)}) dx}\\ & = & \int{f(x) \log{\frac{f(x)}{g(x)}} dx}\\ & = & -\int{f(x) \log{\frac{g(x)}{f(x)}} dx}\\ & \geq & -\log{\int{f(x) \frac{g(x)}{f(x)} dx}} = \log{1} = 0 \end{eqnarray}\]

where the last line uses Jensen’s inequality

(proof for pmfs is entirely parallel)
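A quick numerical illustration in Python: the left-hand side above is the expectation, under \(f\), of \(\log{f(X)} - \log{g(X)}\), i.e., the Kullback-Leibler divergence, which the inequality says is non-negative. The two Gaussian densities below are arbitrary illustrative choices for \(f\) and \(g\).

```python
# Sketch: Monte Carlo check that int f log f - int f log g >= 0,
# with f = N(0, 1) and g = N(1, 2) (illustrative choices).
import numpy as np

def norm_logpdf(x, mu, sigma):
    """Log-density of the N(mu, sigma^2) distribution."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=10**6)                 # draws from f = N(0, 1)
kl_estimate = np.mean(norm_logpdf(x, 0.0, 1.0) - norm_logpdf(x, 1.0, 2.0))
print("estimated KL(f || g):", kl_estimate)          # non-negative; true value here is about 0.44
```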

Backup: Jensen’s inequality
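The statement as used in the Gibbs-inequality proof above: if \(\varphi\) is a concave function and \(Z\) a random variable, then

\[ \Expect{\varphi(Z)} \leq \varphi\left(\Expect{Z}\right) \]

(with the inequality reversed for convex \(\varphi\)). In the Gibbs proof, \(\varphi = \log\) and \(Z = g(X)/f(X)\), with \(X\) distributed according to \(f\).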

Backup: Further reading

References

Barndorff-Nielsen, O. E., and D. R. Cox. 1995. Inference and Asymptotics. London: Chapman and Hall.

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51:661–703. http://arxiv.org/abs/0706.1062.

Fisher, R. A. 1922. “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A 222:309–68. http://digital.library.adelaide.edu.au/dspace/handle/2440/15172.

Huber, Peter J. 1967. “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, edited by Lucien M. Le Cam and Jerzy Neyman, 1:221–33. Berkeley: University of California Press. http://projecteuclid.org/euclid.bsmsp/1200512988.

Koenker, Roger, and Kevin F. Hallock. 2001. “Quantile Regression.” Journal of Economic Perspectives 15:143–56. https://doi.org/10.1257/jep.15.4.143.

Stigler, Stephen M. 2007. “The Epic Story of Maximum Likelihood.” Statistical Science 22:598–620. https://doi.org/10.1214/07-STS249.

Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge, England: Cambridge University Press.

White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge, England: Cambridge University Press.

Zeileis, Achim. 2004. “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software 11 (10):1–17. https://doi.org/10.18637/jss.v011.i10.

———. 2006. “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software 16 (9):1–16. https://doi.org/10.18637/jss.v016.i09.