Inference I — Inference with Independent Data

36-467/36-667

23 October 2018

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \newcommand{\Indicator}[1]{\mathbb{I}\left( #1 \right)} \newcommand{\se}[1]{\mathrm{se}\left[ #1 \right]} \newcommand{\CrossEntropy}{\ell} \newcommand{\xmin}{x_{\mathrm{min}}} \]

Agenda

Spatial narratives

Drs. Jessica Benner and Emma Slayton, CMU Library

In our previous episodes

Until further notice…

What we want to infer

Some standard estimates work because of the law of large numbers

What do we mean by “works”?

The basic ingredient for consistency is the law of large numbers

Hint: \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\), for any random variable \(Z\)

Solution to the exercise

\[\begin{eqnarray} \Expect{\overline{X}_n} & = & \Expect{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\ & = & \frac{n\Expect{X_1}}{n} = \Expect{X}\\ \Var{\overline{X}_n} & = & \Var{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\ & = & \frac{\sum_{i=1}^{n}{\Var{X_i}}}{n^2} ~ \text{(by independence)}\\ & = & \frac{n \Var{X}}{n^2} = \frac{\Var{X}}{n} \rightarrow 0\\ \therefore \Expect{(\overline{X}_n-\Expect{X})^2} & = & \left(\Expect{\overline{X}_n}-\Expect{X}\right)^2 + \Var{\overline{X}_n}\\ & = & 0 + n^{-1}\Var{X} \rightarrow 0 \end{eqnarray}\]
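The calculation can also be checked by simulation. Here is a minimal sketch, comparing the simulated mean squared error of \(\overline{X}_n\) to \(\Var{X}/n\) for several \(n\); the Exponential(1) distribution is an illustrative choice (not from the slides) with \(\Expect{X} = \Var{X} = 1\).

```python
# Monte Carlo check that E[(Xbar_n - E[X])^2] ~= Var[X]/n for IID data.
# Illustrative choice (not from the slides): X ~ Exponential(1), so E[X] = Var[X] = 1.
import numpy as np

rng = np.random.default_rng(467)
n_reps = 10_000                        # independent samples per value of n

for n in (10, 100, 1000):
    samples = rng.exponential(scale=1.0, size=(n_reps, n))
    xbar = samples.mean(axis=1)        # one sample mean per replication
    mse = np.mean((xbar - 1.0) ** 2)   # simulated E[(Xbar_n - E[X])^2]
    print(f"n = {n:5d}   simulated MSE = {mse:.5f}   Var[X]/n = {1.0/n:.5f}")
```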

Law of large numbers in general

This is part of why maximum likelihood works

Maximum likelihood can work even without moments

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^3\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^5\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for the Pareto distribution, showing convergence as \(n\rightarrow\infty\) along a single IID sequence
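Figures like these can be reproduced along the following lines. This is a sketch, assuming the continuous power-law (Pareto) density \(p(x) = \frac{\theta - 1}{\xmin}\left(\frac{x}{\xmin}\right)^{-\theta}\) for \(x \geq \xmin\), as in Clauset, Shalizi, and Newman (2009); samples are drawn by inverting the CDF, and the normalized log-likelihood \(\frac{1}{n}\sum_{i=1}^{n}{\log{p(x_i;\theta)}}\) is plotted over a grid of \(\theta\) values.

```python
# Normalized log-likelihood (1/n) sum_i log p(x_i; theta) for IID Pareto samples
# of increasing size.  Assumed density (as in Clauset, Shalizi & Newman 2009):
#   p(x) = ((theta - 1)/xmin) * (x/xmin)^(-theta),  x >= xmin,
# with true theta = 1.5 and xmin = 1, matching the figures above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(36467)
theta_true, xmin = 1.5, 1.0
theta_grid = np.linspace(1.1, 3.0, 200)

def rpareto(n, theta, xmin, rng):
    # Inverse-CDF sampling: F(x) = 1 - (x/xmin)^(-(theta - 1))
    return xmin * rng.uniform(size=n) ** (-1.0 / (theta - 1.0))

def norm_loglike(x, theta, xmin):
    # (1/n) sum_i log p(x_i; theta)
    return np.log(theta - 1.0) - np.log(xmin) - theta * np.mean(np.log(x / xmin))

for n in (10, 10**3, 10**5):
    x = rpareto(n, theta_true, xmin, rng)
    plt.plot(theta_grid, [norm_loglike(x, th, xmin) for th in theta_grid],
             label=f"n = {n}")
plt.axvline(theta_true, linestyle="--")   # true theta
plt.xlabel(r"$\theta$")
plt.ylabel("normalized log-likelihood")
plt.legend()
plt.show()
```

As \(n\) grows, the normalized log-likelihood curves concentrate around a fixed limiting curve, and their maximizers approach the true \(\theta\).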

The more general pattern

(some disclaimers apply)

This applies pretty broadly

What about estimation error?

\[\begin{eqnarray} h(\Expect{A_n}, \Expect{B_n}) & \approx & h(A_n, B_n) + (\Expect{A_n} - A_n)\frac{\partial h}{\partial a} + (\Expect{B_n}-B_n)\frac{\partial h}{\partial b} ~ \text{(Taylor series)}\\ \hat{\psi}_n = h(A_n, B_n) & \approx & h(\Expect{A_n}, \Expect{B_n}) + (A_n - \Expect{A_n})\frac{\partial h}{\partial a} + (B_n - \Expect{B_n})\frac{\partial h}{\partial b}\\ \Var{\hat{\psi}_n} & \approx & {\left(\frac{\partial h}{\partial a}\right)}^2\Var{A_n} +{\left(\frac{\partial h}{\partial b}\right)}^2\Var{B_n} + 2\left(\frac{\partial h}{\partial a}\frac{\partial h}{\partial b}\right)\Cov{A_n, B_n} \end{eqnarray}\]
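As a numerical check of this propagation-of-error approximation, the sketch below takes the illustrative choice \(h(a,b) = a/b\), with \(A_n\) and \(B_n\) the sample means of two correlated Gaussian variables, and compares the delta-method variance (partial derivatives evaluated at the expectations) to the variance of \(\hat{\psi}_n\) across many simulated replications.

```python
# Monte Carlo check of the propagation-of-error (delta method) formula
#   Var[h(A_n,B_n)] ~= h_a^2 Var[A_n] + h_b^2 Var[B_n] + 2 h_a h_b Cov[A_n,B_n]
# for the illustrative choice h(a,b) = a/b, with A_n, B_n sample means of
# correlated Gaussians (means, covariance, and n are arbitrary choices).
import numpy as np

rng = np.random.default_rng(2018)
n, n_reps = 200, 20_000
mu = np.array([1.0, 2.0])                       # E[X], E[Y]
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])      # Var/Cov of (X, Y)

# Delta-method approximation, partials of h(a,b) = a/b at the expectations
h_a, h_b = 1.0 / mu[1], -mu[0] / mu[1] ** 2
var_delta = (h_a**2 * Sigma[0, 0] + h_b**2 * Sigma[1, 1]
             + 2 * h_a * h_b * Sigma[0, 1]) / n   # Var[A_n] = Sigma[0,0]/n, etc.

# Simulated variance of psi_hat = A_n / B_n over many replications
draws = rng.multivariate_normal(mu, Sigma, size=(n_reps, n))
psi_hat = draws[:, :, 0].mean(axis=1) / draws[:, :, 1].mean(axis=1)
print(f"delta-method variance: {var_delta:.6f}")
print(f"simulated variance:    {psi_hat.var():.6f}")
```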

Estimating by optimizing

Estimating by optimizing

Estimating by optimizing

(still in 1D)

Estimating by optimizing

In practice…

Special application to maximum likelihood

Generalizing away from IID data

Summary

Backup: Modes of convergence

Backup: Chebyshev’s inequality

For any random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{|Z-\Expect{Z}| > \epsilon} \leq \frac{\Var{Z}}{\epsilon^2} \]

Proof: Apply Markov’s inequality to \((Z-\Expect{Z})^2\), which is \(\geq 0\) and has expectation \(\Var{Z}\).
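Spelling out that step: \(|Z-\Expect{Z}| > \epsilon\) implies \((Z-\Expect{Z})^2 \geq \epsilon^2\), so

\[ \Prob{|Z-\Expect{Z}| > \epsilon} \leq \Prob{(Z-\Expect{Z})^2 \geq \epsilon^2} \leq \frac{\Expect{(Z-\Expect{Z})^2}}{\epsilon^2} = \frac{\Var{Z}}{\epsilon^2} \]

where the middle inequality is Markov's inequality (next slide).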

Backup: Markov’s inequality

For any non-negative random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon} \]

Proof: \[\begin{eqnarray} Z & = & Z\Indicator{Z \geq \epsilon} + Z\Indicator{Z < \epsilon}\\ \Expect{Z} & = & \Expect{Z \Indicator{Z \geq \epsilon}} + \Expect{Z \Indicator{Z < \epsilon}}\\ & \geq & \Expect{Z \Indicator{Z \geq \epsilon}}\\ & \geq & \Expect{\epsilon \Indicator{Z \geq \epsilon}}\\ & = & \epsilon\Expect{\Indicator{Z \geq \epsilon}} = \epsilon \Prob{Z \geq \epsilon} \end{eqnarray}\]

Backup: Disclaimers to the “More general pattern”

Backup: The Gibbs inequality

\[\begin{eqnarray} \int{f(x) \log{f(x)} dx} - \int{f(x) \log{g(x)} dx} & = & \int{f(x) (\log{f(x)} - \log{g(x)}) dx}\\ & = & \int{f(x) \log{\frac{f(x)}{g(x)}} dx}\\ & = & -\int{f(x) \log{\frac{g(x)}{f(x)}} dx}\\ & \geq & -\log{\int{f(x) \frac{g(x)}{f(x)} dx}} = -\log{1} = 0 \end{eqnarray}\]

where the last line uses Jensen’s inequality
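In more detail (assuming \(g\) is also a probability density, so \(\int{g(x) dx} = 1\)): because \(\log\) is concave, Jensen's inequality gives

\[ \int{f(x) \log{\frac{g(x)}{f(x)}} dx} \leq \log{\int{f(x) \frac{g(x)}{f(x)} dx}} = \log{\int{g(x) dx}} = \log{1} = 0 \]

and multiplying both sides by \(-1\) reverses the inequality, which is the last line above.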

Backup: Jensen’s inequality
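The statement used above: for a concave function \(\phi\) and any random variable \(W\) (in the Gibbs inequality, \(\phi = \log\) and \(W = g(X)/f(X)\) with \(X\) distributed according to \(f\)),

\[ \Expect{\phi(W)} \leq \phi\left(\Expect{W}\right) \]

and the inequality reverses for convex \(\phi\).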

References

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51:661–703. http://arxiv.org/abs/0706.1062.

Koenker, Roger, and Kevin F. Hallock. 2001. “Quantile Regression.” Journal of Economic Perspectives 15:143–56. https://doi.org/10.1257/jep.15.4.143.