Inference I — Inference with Independent Data
36-467/36-667
23 October 2018
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\newcommand{\Indicator}[1]{\mathbb{I}\left( #1 \right)}
\newcommand{\se}[1]{\mathrm{se}\left[ #1 \right]}
\newcommand{\CrossEntropy}{\ell}
\newcommand{\xmin}{x_{\mathrm{min}}}
\]
Agenda
- Spatial narratives
- Reminders: inference with IID data
- Generalization to dependent, heterogeneous data
Spatial narratives
Drs. Jessica Benner and Emma Slayton, CMU Library
In our previous episodes
- Many plausible ideas about how to estimate with dependent data
- Some models which generate dependent data
- How do we know that ideas work (in those models or elsewhere)?
- Strategy: go back to inference with independent data, and see what we can abstract
Until further notice…
- Assume \(X_1, X_2, \ldots X_n\) are independent and identically distributed (IID)
- Each \(X_i\) might have multiple dimensions, e.g., \(X_i = (Y_i, Z_i)\)
- We don’t know the common distribution of the \(X\)’s
- That distribution might be parametric with unknown parameter(s) \(\theta\), and pdf \(f(x;\theta)\)
- E.g., Gaussian, \(f(x;\theta) = \frac{1}{\sqrt{2\pi \theta_2}} e^{-(x-\theta_1)^2/(2\theta_2)}\)
- E.g., Pareto/power-law, \(f(x;\theta) = \frac{\theta - 1}{\xmin} {\left(\frac{x}{\xmin}\right)}^{-\theta}\) for \(x \geq \xmin\)
- The Pareto is a common model for “heavy tailed” data (Clauset, Shalizi, and Newman 2009); a small sampling sketch follows this list
- Or it might be nonparametric, meaning we don’t assume any parametric form
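The Pareto pdf above has the closed-form CDF \(F(x) = 1 - (x/\xmin)^{-(\theta-1)}\), so Pareto samples are easy to simulate by inverse-transform sampling. Here is a minimal Python sketch (the slides themselves contain no code; numpy and the helper name `rpareto` are my own choices):

```python
import numpy as np

def rpareto(n, theta=1.5, xmin=1.0, rng=None):
    """Draw n IID samples from f(x) = (theta-1)/xmin * (x/xmin)**(-theta), x >= xmin,
    by inverting the CDF F(x) = 1 - (x/xmin)**(-(theta-1))."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n)                     # U ~ Uniform(0, 1)
    return xmin * u ** (-1.0 / (theta - 1.0))   # = F^{-1}(1 - U); 1 - U is also uniform

# sanity check: the survival function P(X > x) should be (x/xmin)**(-(theta-1))
x = rpareto(100_000, theta=1.5, xmin=1.0)
print(np.mean(x > 10.0), 10.0 ** -0.5)          # both approximately 0.316
```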
What we want to infer
- Parameters, in a parametric model
- Moments
- Quantiles
- More complex things — optimal slope of \(Y_i\) on \(Z_i\)
- In general, some function \(\psi(p)\) of the true pdf \(p\)
Some standard estimates work because of the law of large numbers
- Obvious example: Use \(\overline{X}\) as an estimate of \(\Expect{X}\)
- Less obvious: \(\frac{n}{n+\nu}\overline{X}\) also works for any fixed \(\nu > 0\)
What do we mean by “works”?
- Consistency: \(\hat{\psi}_n \rightarrow \psi\) as \(n\rightarrow\infty\)
- Also nice to know:
- Bias: \(\Expect{\hat{\psi}_n} - \psi\) (should \(\rightarrow 0\))
- Variance: \(\Var{\hat{\psi}_n}\) (also should \(\rightarrow 0\))
- Sampling distribution of \(\hat{\psi}_n\) (to get confidence sets)
The basic ingredient for consistency is the law of large numbers
- Sample mean \(\overline{X}_n = n^{-1}\sum_{i=1}^{n}{X_i}\)
- Assume the \(X_i\) all have mean \(\Expect{X}\), variance \(\Var{X}\), and are uncorrelated
- Exercise: Prove that \[
\Expect{(\overline{X}_n-\Expect{X})^2} \rightarrow 0 ~\text{as} ~ n\rightarrow\infty
\]
Hint: \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\), for any random variable \(Z\)
Solution to the exercise
\[\begin{eqnarray}
\Expect{\overline{X}_n} & = & \Expect{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\
& = & \frac{n\Expect{X_1}}{n} = \Expect{X}\\
\Var{\overline{X}_n} & = & \Var{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\
& = & \frac{\sum_{i=1}^{n}{\Var{X_i}}}{n^2}\\
& = & \frac{n \Var{X}}{n^2} = \frac{\Var{X}}{n} \rightarrow 0\\
\therefore \Expect{(\overline{X}_n-\Expect{X})^2} & = & \left(\Expect{\overline{X}_n}-\Expect{X}\right)^2 + \Var{\overline{X}_n}\\
& = & 0 + n^{-1}\Var{X} \rightarrow 0
\end{eqnarray}\]
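A quick Monte Carlo check of this calculation (a sketch, not from the slides), using standard exponential draws so that \(\Var{X} = 1\); numpy and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(467)
n, reps = 100, 5000
# reps independent sample means, each over n IID Exponential(1) draws (Var[X] = 1)
xbars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
print(xbars.var(), 1.0 / n)   # the simulated Var of the sample mean should be close to Var[X]/n = 0.01
```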
Law of large numbers in general
- If the \(X_i\) are IID, then so are \(h(X_i)\), for any function \(h\), so (provided \(\Expect{h(X)}\) exists) \[
\overline{h(X)}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{h(X_i)} \rightarrow \Expect{h(X)}
\]
This is part of why maximum likelihood works
- The normalized log-likelihood is a (random) function of \(\theta\): \[
L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}
\]
- Apply the LLN: for each \(\theta\), \[
L_n(\theta) \rightarrow \Expect{\log{f(X;\theta)}} \equiv \CrossEntropy(\theta)
\]
- Fact: for any pdfs \(f\), \(g\), \[
\int{f(x) \log{f(x)} dx} \geq \int{f(x) \log{g(x)} dx}
\]
- a.k.a. “Gibbs inequality”
- (similarly for pmfs)
- So \(\CrossEntropy(\theta)\) is maximized at the true \(\theta\), \(\theta_0\): \[
\int{f(x;\theta_0) \log{f(x;\theta_0)} dx} \geq \int{f(x;\theta_0) \log{f(x;\theta)} dx}
\]
Maximum likelihood can work even without moments
- For the Pareto distribution, \(\Expect{X^k} = \infty\) if \(\theta \leq k+1\)
- So \(\Expect{X} = \infty\) if \(\theta \leq 2\), and \(\Var{X} = \infty\) if \(\theta \leq 3\)
- But \(\log{f}\) is well-behaved: \[
\log{f(x;\theta)} = \log{(\theta -1)} + (\theta-1)\log{\xmin} - \theta \log{x}
\]
- Exercise (off-line): Find \(\Expect{\log{X}}\) and \(\Var{\log{X}}\) in terms of \(\theta\) and \(\xmin\)
- What does the normalized log-likelihood function look like?
Convergence of the log-likelihood function (an example)
Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10\), \(\theta=1.5\), \(\xmin=1\))
Convergence of the log-likelihood function (an example)
Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^3\), \(\theta=1.5\), \(\xmin=1\))
Convergence of the log-likelihood function (an example)
Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^5\), \(\theta=1.5\), \(\xmin=1\))
Convergence of the log-likelihood function (an example)
Normalized log-likelihoods for the Pareto distribution, showing convergence as \(n\rightarrow\infty\) along a single IID sequence
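A sketch of how plots like the ones above could be generated (this is not the code behind the slides; numpy, matplotlib, the seed, and the helper names are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def rpareto(n, theta, xmin, rng):
    # inverse-transform sampler for f(x) = (theta-1)/xmin * (x/xmin)**(-theta)
    return xmin * rng.uniform(size=n) ** (-1.0 / (theta - 1.0))

def norm_loglike(x, theta, xmin):
    # normalized log-likelihood L_n(theta) = (1/n) sum_i log f(x_i; theta)
    return np.mean(np.log(theta - 1) + (theta - 1) * np.log(xmin) - theta * np.log(x))

rng = np.random.default_rng(36467)
theta0, xmin, n = 1.5, 1.0, 1000
thetas = np.linspace(1.1, 3.0, 200)
for _ in range(10):                             # 10 independent samples, one curve each
    x = rpareto(n, theta0, xmin, rng)
    plt.plot(thetas, [norm_loglike(x, th, xmin) for th in thetas], alpha=0.6)
plt.axvline(theta0, linestyle="--")             # the curves should peak near the true theta
plt.xlabel(r"$\theta$"); plt.ylabel(r"$L_n(\theta)$")
plt.show()
```

Increasing `n` should make the curves pile up on the limiting function \(\CrossEntropy(\theta)\).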
The more general pattern
- Assume (1): \(\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}\) for some random functions \(M_n\)
- If you maximize instead, minimize the negative
- Assume (2): \(M_n(\psi) \rightarrow m(\psi)\) as \(n\rightarrow \infty\)
- Assume (3): \(m(\psi)\) has a unique minimum at the true \(\psi_0\)
- Then, in general, \(\hat{\psi}_n \rightarrow \psi_0\)
(some disclaimers apply; see the backup slides)
This applies pretty broadly
- Estimating the expectation with the sample mean: \[\begin{eqnarray}
M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(X_i - \psi)^2}\\
m(\psi) & = & \Expect{(X-\psi)^2}
\end{eqnarray}\]
- Estimating a simple linear regression: \[\begin{eqnarray}
M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \psi_1 - \psi_2 Z_i)^2}\\
m(\psi) & = & \Expect{(Y-\psi_1 - \psi_2 Z)^2}
\end{eqnarray}\]
- Nonlinear least squares, with regression function \(\TrueRegFunc(z;\psi)\) \[\begin{eqnarray}
M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \TrueRegFunc(Z_i;\psi))^2}\\
m(\psi) & = & \Expect{(Y-\TrueRegFunc(Z;\psi))^2}
\end{eqnarray}\]
- Estimating the \(\alpha\) quantile (Koenker and Hallock 2001; numerical sketch after this list): \[\begin{eqnarray}
M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(X_i - \psi)\left(\alpha - \Indicator{X_i - \psi < 0}\right)}\\
m(\psi) & = & \Expect{(X - \psi)\left(\alpha - \Indicator{X - \psi < 0}\right)}
\end{eqnarray}\]
- Extends naturally to estimating conditional quantiles
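To see the last of these in action, here is a hedged Python sketch (not from the slides): estimate a quantile by numerically minimizing the sample objective and compare with `np.quantile`; scipy and the gamma example distribution are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def M_n(psi, x, alpha):
    # sample check-loss objective: (1/n) sum (x_i - psi) * (alpha - 1{x_i - psi < 0})
    u = x - psi
    return np.mean(u * (alpha - (u < 0)))

rng = np.random.default_rng(667)
x = rng.gamma(shape=2.0, size=10_000)
alpha = 0.9
res = minimize_scalar(M_n, args=(x, alpha), bounds=(x.min(), x.max()), method="bounded")
print(res.x, np.quantile(x, alpha))   # the two estimates should nearly coincide
```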
What about estimation error?
- If we’re using a mean, we can use the standard error of the mean: \[
\Var{\overline{X}_n} = \frac{1}{n}\Var{X} \Rightarrow \se{\overline{X}_n} = \sqrt{\frac{\Var{X}}{n}}
\]
- If \(\hat{\psi}_n = h(A_n,B_n)\) and we know \(\Var{A_n}\), \(\Var{B_n}\), use propagation of error, a.k.a. the delta method (a numerical sketch follows this slide):
\[\begin{eqnarray}
h(\Expect{A_n}, \Expect{B_n}) & \approx & h(A_n, B_n) + (\Expect{A_n} - A_n)\frac{\partial h}{\partial a} + (\Expect{B_n}-B_n)\frac{\partial h}{\partial b} ~ \text{(Taylor series)}\\
\hat{\psi}_n = h(A_n, B_n) & \approx & h(\Expect{A_n}, \Expect{B_n}) + (A_n - \Expect{A_n})\frac{\partial h}{\partial a} + (B_n - \Expect{B_n})\frac{\partial h}{\partial b}\\
\Var{\hat{\psi}_n} & \approx & {\left(\frac{\partial h}{\partial a}\right)}^2\Var{A_n} +{\left(\frac{\partial h}{\partial b}\right)}^2\Var{B_n}
+ 2\left(\frac{\partial h}{\partial a}\frac{\partial h}{\partial b}\right)\Cov{A_n, B_n}
\end{eqnarray}\]
- What if it’s some weird optimization problem?
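As a worked illustration of the propagation-of-error formula (a sketch, not from the slides), take \(h(a,b) = a/b\) with \(A_n\), \(B_n\) independent sample means, and compare the delta-method variance with a direct Monte Carlo estimate; numpy, the seed, and the particular distributions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2018)
n, reps = 200, 4000

# A_n = mean of Y ~ Exponential(1) (mean 1, variance 1)
# B_n = mean of Z ~ 3 + Exponential(1) (mean 4, variance 1), independent of Y
def ratio_of_means(n):
    return rng.exponential(size=n).mean() / (3 + rng.exponential(size=n)).mean()

# Monte Carlo variance of the estimator h(A_n, B_n) = A_n / B_n
est = np.array([ratio_of_means(n) for _ in range(reps)])

# delta method: dh/da = 1/E[B], dh/db = -E[A]/E[B]^2, and Cov[A_n, B_n] = 0 here
EA, EB, VarA, VarB = 1.0, 4.0, 1.0 / n, 1.0 / n
delta_var = (1 / EB) ** 2 * VarA + (EA / EB ** 2) ** 2 * VarB
print(est.var(), delta_var)   # the two variances should be of comparable size
```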
Estimating by optimizing
- Assume:
- \(\hat{\psi}_n\) minimizes some \(M_n(\psi)\)
- \(\psi_0\) minimizes \(m(\psi)\) (uniquely)
- \(M_n(\psi) \rightarrow m(\psi)\) (everywhere)
- \(\psi\) is one-dimensional (for now)
- \[\begin{eqnarray}
0 & = & \frac{dM_n}{d\psi}(\hat{\psi}_n) ~ \text{(optimum)}\\
0 & \approx & \frac{dM_n}{d\psi}(\psi_0) + (\hat{\psi}_n - \psi_0)\frac{d^2 M_n}{d\psi^2}(\psi_0) ~ \text{(Taylor expansion)}\\
( \hat{\psi}_n - \psi_0) \frac{d^2 M_n}{d\psi^2}(\psi_0) & \approx & -\frac{dM_n}{d\psi}(\psi_0)\\
\hat{\psi}_n - \psi_0 & \approx & -\frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)}\\
\hat{\psi}_n & \approx & \psi_0 - \frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)}
\end{eqnarray}\]
Estimating by optimizing
- Still assuming \(\psi\) is one-dimensional, write \(M_n^{\prime}\) and \(M_n^{\prime\prime}\) for the derivatives of \(M_n\) at \(\psi_0\) \[
\hat{\psi}_n \approx \psi_0 - \frac{M_n^{\prime}}{M_n^{\prime\prime}}
\] As \(n\rightarrow \infty\), \[\begin{eqnarray}
M_n^{\prime} & \rightarrow & m^{\prime}(\psi_0) = 0 ~ \text{(optimum)}\\
M_n^{\prime\prime} & \rightarrow & m^{\prime\prime}(\psi_0) > 0 ~ \text{(optimum)}
\end{eqnarray}\] so \[
\hat{\psi}_n \rightarrow \psi_0
\]
Estimating by optimizing
(still in 1D)
- What about variance? \[\begin{eqnarray}
\hat{\psi}_n & \approx & \psi_0 - \frac{M_n^{\prime}}{M_n^{\prime\prime}}\\
\Var{\hat{\psi}_n} & \approx & \Var{\frac{M_n^{\prime}}{M_n^{\prime\prime}}}\\
& \approx & \frac{\Var{M_n^{\prime}}}{(m^{\prime\prime})^2}\\
\se{\hat{\psi}_n} & \approx & \frac{\sqrt{\Var{M^{\prime}_n}}}{m^{\prime\prime}}
\end{eqnarray}\]
- Variance of \(\hat{\psi}_n\) goes down as the curvature goes up
- Variance of \(\hat{\psi}_n\) goes up with the variance in \(M_n\)
- If \(M_n\) is an average, with \(\Var{M_n} = O(1/n)\), then so is \(M^{\prime}_n\), with \(\Var{M_n^{\prime}} = O(1/n)\)
- Then \(\se{\hat{\psi}_n} = O(1/\sqrt{n})\)
- What about the distribution?
- If \(M_n \rightsquigarrow \mathcal{N}\), because of the CLT, then
- \(M_n^{\prime} \rightsquigarrow \mathcal{N}\) too (usually), so \(\hat{\psi}_n \approx \psi_0 - M_n^{\prime}/m^{\prime\prime}\) is asymptotically Gaussian as well
- (a worked example of these formulas follows)
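As a sanity check on these formulas (not on the original slides), apply them to the sample-mean objective from earlier, \(M_n(\psi) = \frac{1}{n}\sum_{i=1}^{n}{(X_i - \psi)^2}\):
\[\begin{eqnarray}
M_n^{\prime}(\psi) & = & -\frac{2}{n}\sum_{i=1}^{n}{(X_i - \psi)} = -2(\overline{X}_n - \psi), \qquad M_n^{\prime\prime}(\psi) = 2 = m^{\prime\prime}\\
\Var{M_n^{\prime}(\psi_0)} & = & 4\Var{\overline{X}_n} = \frac{4\Var{X}}{n}\\
\Var{\hat{\psi}_n} & \approx & \frac{\Var{M_n^{\prime}(\psi_0)}}{(m^{\prime\prime})^2} = \frac{4\Var{X}/n}{4} = \frac{\Var{X}}{n}
\end{eqnarray}\]
which recovers the usual variance of the sample mean.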
Estimating by optimizing
- More general case: \(\psi\) is a vector \[\begin{eqnarray}
\hat{\psi}_n & = & \argmin_{\psi}{M_n(\psi)}\\
0 & = & \nabla M_n(\hat{\psi}_n)\\
0 & \approx & \nabla M_n(\psi_0) + \nabla \nabla M_n(\psi_0) (\hat{\psi}_n-\psi_0)\\
\nabla \nabla M_n(\psi_0) & \equiv & \mathbf{H}_n(\psi_0) \rightarrow \mathbf{h}\\
\hat{\psi}_n & \approx & \psi_0 - \mathbf{h}^{-1} \nabla M_n(\psi_0)\\
\Var{\hat{\psi}_n } & \approx & \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1}
\end{eqnarray}\]
- \(\mathbf{h} = \nabla \nabla m(\psi_0)\), the Hessian of \(m\) at \(\psi_0\)
- Again, if \(\nabla M_n(\psi_0)\) is an average, CLT suggests \(\rightsquigarrow \mathcal{N}\)
In practice…
- Approximate \(\mathbf{h}\) by \(\nabla \nabla M_n(\hat{\psi}_n)\), say \(\mathbf{H}_n\)
- Approximate \(\Var{\nabla M_n(\psi_0)}\) by \(\mathbf{J}_n\), the sample covariance of the per-observation gradients divided by \(n\)
- Sandwich covariance matrix is \(\mathbf{H}_n^{-1}\mathbf{J}_n \mathbf{H}_n^{-1}\)
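A minimal numerical sketch of this recipe (not the course's own code): fit the simple-linear-regression objective from earlier by minimizing \(M_n\), then build \(\mathbf{H}_n\) and \(\mathbf{J}_n\) by finite differences and form the sandwich; numpy, scipy, the seed, and the step size `eps` are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n = 500
z = rng.normal(size=n)
y = 1.0 + 2.0 * z + rng.normal(scale=0.5, size=n)       # simulated data, true psi = (1, 2)

def per_obs_loss(psi):
    return (y - psi[0] - psi[1] * z) ** 2               # one squared error per observation

def M_n(psi):
    return np.mean(per_obs_loss(psi))

psi_hat = minimize(M_n, x0=np.zeros(2)).x
eps, I2 = 1e-4, np.eye(2)

# J_n: Var[grad M_n] estimated by the sample covariance of per-observation gradients / n
g = np.column_stack([(per_obs_loss(psi_hat + eps * I2[j]) -
                      per_obs_loss(psi_hat - eps * I2[j])) / (2 * eps) for j in range(2)])
J = np.cov(g, rowvar=False) / n

# H_n: numerical Hessian of M_n at psi_hat (second-order central differences)
H = np.array([[(M_n(psi_hat + eps * (I2[j] + I2[k])) - M_n(psi_hat + eps * (I2[j] - I2[k]))
                - M_n(psi_hat - eps * (I2[j] - I2[k])) + M_n(psi_hat - eps * (I2[j] + I2[k])))
               / (4 * eps ** 2) for k in range(2)] for j in range(2)])

sandwich = np.linalg.inv(H) @ J @ np.linalg.inv(H)
print(psi_hat)                       # should be near (1, 2)
print(np.sqrt(np.diag(sandwich)))    # sandwich standard errors of the two coefficients
```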
Special application to maximum likelihood
- For the MLE, take \(M_n(\theta) = -L_n(\theta)\), the negative normalized log-likelihood
- If the model is correctly specified, the information equality gives \(\mathbf{h} = -\Expect{\nabla \nabla \log{f(X;\theta_0)}} = \Var{\nabla \log{f(X;\theta_0)}} \equiv \mathbf{I}(\theta_0)\), the Fisher information per observation, and \(\Var{\nabla M_n(\theta_0)} = \mathbf{I}(\theta_0)/n\)
- So the sandwich collapses to the familiar \(\Var{\hat{\theta}_n} \approx n^{-1}\mathbf{I}^{-1}(\theta_0)\)
Generalizing away from IID data
- Estimate by optimizing some objective function \(M_n\) \[
\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}
\]
- Hope the objective function tends to a limit function \(m\) \[
M_n(\psi) \rightarrow m(\psi)
\]
- Hope that the limiting function has its minimum in the right place \[
\argmin_{\psi}{m(\psi)} = \psi_0
\]
- Then the exact same arguments apply: \[\begin{eqnarray}
\hat{\psi}_n & \rightarrow & \psi_0\\
\Var{\hat{\psi}_n} & \approx & \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1}
\end{eqnarray}\]
- We need to make sure our objective functions converge
- We’re going to need a law of large numbers for dependent data
Summary
- We usually estimate by minimizing an objective function
- If the objective function has a well-behaved limit, the estimate will converge
- The variance of the estimate depends on:
- The curvature of the objective function (\(\uparrow\) curvature \(\downarrow\) variance)
- The variance of the derivative of the objective function
- None of these ingredients actually requires independence
- Once we have convergence of our objective function, everything will fall into place
Backup: Modes of convergence
- Consistency is usually defined as “convergence in probability”: For any \(\epsilon > 0\), \[
\Prob{|\hat{\psi}_n - \psi| \geq \epsilon} \rightarrow 0
\]
- The main slides use \(L_2\) convergence, \[
\Expect{(\hat{\psi}_n - \psi)^2} \rightarrow 0
\]
- \(L_2\) convergence implies convergence in probability by Chebyshev’s inequality but not vice versa
- There is also “strong convergence” or “almost sure convergence”: \[
\Prob{\hat{\psi}_n \rightarrow \psi} = 1
\]
- A.S.-convergence implies convergence in probability but not vice versa
Backup: Chebyshev’s inequality
For any random variable \(Z\), and any \(\epsilon > 0\),
\[
\Prob{|Z-\Expect{Z}| > \epsilon} \leq \frac{\Var{Z}}{\epsilon^2}
\]
Proof: Apply Markov’s inequality to \((Z-\Expect{Z})^2\), which is \(\geq 0\) and has expectation \(\Var{Z}\).
Backup: Markov’s inequality
For any non-negative random variable \(Z\), and any \(\epsilon > 0\),
\[
\Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon}
\]
Proof: \[\begin{eqnarray}
Z & = & Z\Indicator{Z \geq \epsilon} + Z\Indicator{Z < \epsilon}\\
\Expect{Z} & = & \Expect{Z \Indicator{Z \geq \epsilon}} + \Expect{Z \Indicator{Z < \epsilon}}\\
& \geq & \Expect{Z \Indicator{Z \geq \epsilon}}\\
& \geq & \Expect{\epsilon \Indicator{Z \geq \epsilon}}\\
& = & \epsilon\Expect{\Indicator{Z \geq \epsilon}} = \epsilon \Prob{Z \geq \epsilon}
\end{eqnarray}\]
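A quick numerical sanity check of both inequalities (a sketch, not part of the slides), using exponential draws:

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.exponential(size=100_000)   # Z >= 0 with E[Z] = 1, Var[Z] = 1
eps = 3.0
# Markov: P(Z >= eps) <= E[Z]/eps
print(np.mean(z >= eps), z.mean() / eps)
# Chebyshev: P(|Z - E[Z]| > eps) <= Var[Z]/eps^2
print(np.mean(np.abs(z - z.mean()) > eps), z.var() / eps ** 2)
```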
Backup: Disclaimers to the “More general pattern”
- Need to assume (2’) that \(M_n(\psi) \rightarrow m(\psi)\) uniformly over \(\psi\)
- Or at least (2’’) uniformly over a region where \(\hat{\psi}_n\) will concentrate
- Need to assume (3’) that \(m(\psi)\) has a unique minimum at \(\psi_0\), and that this minimum is “well-separated”
- Roughly: \(m(\psi) - m(\psi_0)\) can be very small only if \(\psi\) is close to \(\psi_0\)
Backup: The Gibbs inequality
\[\begin{eqnarray}
\int{f(x) \log{f(x)} dx} - \int{f(x) \log{g(x)} dx} & = & \int{f(x) (\log{f(x)} - \log{g(x)}) dx}\\
& = & \int{f(x) \log{\frac{f(x)}{g(x)}} dx}\\
& = & -\int{f(x) \log{\frac{g(x)}{f(x)}} dx}\\
& \geq & -\log{\int{f(x) \frac{g(x)}{f(x)} dx}} = -\log{1} = 0
\end{eqnarray}\]
where the last line uses Jensen’s inequality
Backup: Jensen’s inequality
- A function \(h\) is convex when, for any \(w \in [0,1]\), \[
w h(x_1) + (1-w) h(x_2) \geq h(wx_1 + (1-w) x_2)
\]
- i.e., for any two points on the curve of \(h(x)\), the curve lies on or below the straight chord connecting them
- The function \(-\log{x}\) is convex (draw it!)
- Jensen’s inequality: for any convex function \(h\), \(\Expect{h(X)} \geq h(\Expect{X})\), under any distribution of \(X\)
- basically follows from the definition of convexity
- Average of points on the curve vs. the curve at the average value
- Extension: for convex \(h\) and any function \(q\) (convex or not), \(\Expect{h(q(X))} \geq h(\Expect{q(X)})\)
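A two-line numerical illustration of Jensen's inequality (not from the slides), taking \(h(x) = -\log{x}\) and \(X\) uniform on \([0.1, 2]\):

```python
import numpy as np

x = np.random.default_rng(1).uniform(0.1, 2.0, size=100_000)
print(np.mean(-np.log(x)), -np.log(np.mean(x)))   # E[-log X] >= -log E[X]
```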
References
Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51:661–703. http://arxiv.org/abs/0706.1062.
Koenker, Roger, and Kevin F. Hallock. 2001. “Quantile Regression.” Journal of Economic Perspectives 15 (4): 143–56.