title: ‘Lecture 11: Distributions as Models’ author: “36-350”" date: “1 October 2014”" output: beamer_presentation transition: none font-family: Garamond —

Previously

Agenda

Let’s Load Some Cheerful Data

data("cats",package="MASS")

StatsCat disdains the non-standard definition of Sobolev spaces in chapter 1

Cats may be too saccharine (unless we think about where cats$Hwt comes from)

Let’s Grab some Data

The “pdfetch” package grabs economic and financial information from public-domain sources, e.g., Federal Reserve or Yahoo Finance

require(pdfetch)
## Loading required package: pdfetch
# Corrections Corp. of America
CXW <- pdfetch_YAHOO("CXW",fields="adjclose",
  from=as.Date("2000-01-01",
  to=as.Date("2014-09-30")))

CCA detention center in Leavenworth, KS

Let’s Grab some Data (cont’d.)

summary(CXW)
##      Index                 CXW       
##  Min.   :2000-01-03   Min.   : 0.69  
##  1st Qu.:2003-09-12   1st Qu.: 6.25  
##  Median :2007-05-21   Median :13.65  
##  Mean   :2007-05-19   Mean   :14.26  
##  3rd Qu.:2011-01-24   3rd Qu.:19.47  
##  Max.   :2014-10-01   Max.   :36.12
plot(CXW)

plot of chunk unnamed-chunk-3

If the prison-industrial complex is too grim, refer back to the cats

Let’s Transform Some Data

The price \(p_t\) doesn’t matter, what matters are the returns \(r_t = \log{(p_t/p_{t-1})}\)

returns <- na.omit(as.vector(diff(log(CXW$CXW))))
summary(returns)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4020 -0.0105  0.0000  0.0005  0.0108  0.4430
plot(returns,type="l")

plot of chunk unnamed-chunk-4

The Data’s Distribution

quantile(x,probs) calculates the quantiles at probs from x

quantile(returns,c(0.25,0.5,0.75))
##      25%      50%      75% 
## -0.01046  0.00000  0.01079

Getting distributions from data (cont’d.)

ecdf() - e mpirical c umulative d istribution f unction; no assumptions but also no guess about distribution between the observations

In math, ECDF is often written as \(\widehat{F}\) or \(\widehat{F}_n\)

plot(ecdf(returns), main="Empirical CDF of CXW returns")

plot of chunk unnamed-chunk-6

Conceptually, quantile and ecdf are inverses to each other

Getting Probability Densities from Data

hist(x) calculates a histogram from x - divide the data range up into equal-width bins and count how many fall into each bin - Or divide bin counts by (total count)*(width of bin), and get an estimate of the probability density function (pdf)
Produces plot as a default side-effect

Probability Densities from Data (cont’d.)

hist(returns,n=101,probability=TRUE)

plot of chunk unnamed-chunk-7

Probability Densities from Data (cont’d.)

density(x) estimates the density of x by counting how many observations fall in a little window around each point, and then smoothing
“Bandwidth” \(\approx\) width of window around each point
Technically, a “kernel density estimate”
Take 36-402 in the spring for much more

Remember: density() is an estimate of the pdf, not The Truth

density returns a collection of \(x,y\) values, suitable for plotting

Probability Densities from Data (cont’d.)

plot(density(returns),main="Estimated pdf of CXW returns")

plot of chunk unnamed-chunk-8

Probability Densities from Data (cont’d.)

hist(returns,n=101,probability=TRUE)
lines(density(returns),lty="dashed")

plot of chunk unnamed-chunk-9

Getting distributions from data (cont’d.)

table() - tabulate outcomes, most useful for discrete spaces; remember to normalize if you want probabilities

plot(table(cats$Sex)/nrow(cats),ylab="probability")

plot of chunk unnamed-chunk-10

Who Cares About the Distribution of the Data?

R commands for distributions

?Distributions to see which distributions are built in

If you write your own, follow the conventions

Examples

dnorm(x=c(-1,0,1),mean=1,sd=0.1)
## [1] 5.521e-87 7.695e-22 3.989e+00
pnorm(q=c(2,-2)) # defaults to mean=0,sd=1
## [1] 0.97725 0.02275
dbinom(5,size=7,p=0.7,log=TRUE)
## [1] -1.147
qchisq(p=0.95,df=5)
## [1] 11.07
rt(n=4,df=2)
## [1]  0.1698  0.0581  0.1540 -0.3191

Displaying Probability Distributions

curve is very useful for the d, p, q functions:

curve(dgamma(x,shape=45,scale=1.9),from=0,to=200)

plot of chunk unnamed-chunk-12

N.B.: the r functions aren’t things it makes much sense to plot

How Do We Fit Distributional Models to the Data?

Method of Moments (MM), Closed Form

gamma.est_MM <- function(x) {
  m <- mean(x); v <- var(x)
  return(c(shape=m^2/v, scale=v/m))
}

MM, Numerically

gamma.mean <- function(shape,scale) { return(shape*scale) }
gamma.var <- function(shape,scale) { return(shape*scale^2) }
gamma.discrepancy <- function(shape,scale,x) {
  return((mean(x)-gamma.mean(shape,scale))^2 + (var(x)-gamma.mean(shape,scale))^2)
}

More Generally…

Maximum Likeihood

Likelihood in Code

loglike.foo <- function(params, x) {
  sum(dfoo(x=x,params,log=TRUE))
}

(will see how to do this for real later in class)

What Do We Do with the Likelihood?

Why Use the MLE?

fitdistr

MLE for one-dimensional distributions can be done through fitdistr in the MASS package

It knows about most the standard distributions, but you can also give it arbitrary probability density functions and it will try to maximize them
A starting value for the optimization is optional for some distributions, required for others (including user-defined densities)

Returns the parameter estimates and standard errors
SEs come from large-\(n\) approximations so use cautiously

fitdistr Examples

Fit the gamma distribution to the cats’ hearts:

require(MASS)
## Loading required package: MASS
fitdistr(cats$Hwt, densfun="gamma")
##    shape     rate 
##   20.300    1.910 
##  ( 2.373) ( 0.226)

Returns: estimates above, standard errors below

fitdistr Examples (cont’d.)

Fit the Students-\(t\) distribution to the returns:

fitdistr(returns,"t")
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
##        m           s          df    
##   0.0001751   0.0126459   1.8813709 
##  (0.0002681) (0.0003039) (0.0784463)

Here parameters are location (m), scale (s) and degrees of freedom (very heavy tails)

fitdistr Examples (cont’d.)

User-defined density:

dpareto <- function(x,exponent,xmin,log=FALSE) {
  f <- (exponent-1)/xmin * (x/xmin)^(-exponent)
  f <- ifelse(x<xmin,NA,f)
  if(!log) { return(f) } else (return(log(f)))
}
# Fit pareto to large absolute returns
  # Parameters given outside the "start" list are fixed
fitdistr(abs(returns)[abs(returns)>0.05], densfun=dpareto,
         start=list(exponent=2.5), xmin=0.05)
## Warning: one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly
##   exponent 
##   2.898    
##  (0.122)

Checking Your Estimator

gamma.est_MM(rgamma(100,shape=19,scale=45))
## shape scale 
## 21.09 41.80
gamma.est_MM(rgamma(1e5,shape=19,scale=45))
## shape scale 
## 18.99 45.05
gamma.est_MM(rgamma(1e6,shape=19,scale=45))
## shape scale 
## 18.99 45.03

Checking the Fit

Use your eyes: Graphic overlays of theory vs. data

plot(density(cats$Hwt))
cats.gamma <- gamma.est_MM(cats$Hwt)
curve(dgamma(x,shape=cats.gamma["shape"],scale=cats.gamma["scale"]),add=TRUE,col="blue")

plot of chunk unnamed-chunk-18

Checking the Fit (cont’d.)

# Really bad and good days for CXW shareholders, per model:
qnorm(c(0.01,0.99),mean=mean(returns),sd=sd(returns))
## [1] -0.07588  0.07695
# As it happened:
quantile(returns,c(0.01,0.99)) 
##       1%      99% 
## -0.09373  0.09415

Quantile-Quantile (QQ) Plots

qqnorm(returns); qqline(returns)

plot of chunk unnamed-chunk-20

QQ Plots (cont’d)

qqplot(returns,qt((1:500)/501,df=2))

plot of chunk unnamed-chunk-21

Calibration Plots

Making a Calibration Plot

plot(ecdf(pnorm(returns, mean=mean(returns), sd=sd(returns))),
     main="Calibration of Gaussian distribution for returns")
abline(0,1,col="grey")

plot of chunk unnamed-chunk-22 Again, way too many large changes (in either direction)

Calibration Plots (cont’d.)

CXW.t <- coefficients(fitdistr(returns,"t"))
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
plot(ecdf(pt((returns-CXW.t[1])/CXW.t[2], df=CXW.t[3])),
     main="Calibration of t distribution for returns")
abline(0,1,col="grey")

plot of chunk unnamed-chunk-23

Calibration Plots (cont’d.)

plot(ecdf(pgamma(cats$Hwt, shape=cats.gamma["shape"], scale=cats.gamma["scale"])),
     main="Calibration of gamma distribution for cats' hearts")
abline(0,1,col="grey")

plot of chunk unnamed-chunk-24

Calibration Plots (cont’d.)

Challenge: Write a general function for making a calibraton plot, taking a data vector, a cumulative probability function, and a parameter vector

Kolmogorov-Smirnov Test

KS Test, Data vs. Theory

ks.test(returns,pnorm,mean=0,sd=0.0125)
## Warning: ties should not be present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  returns
## D = 0.085, p-value < 2.2e-16
## alternative hypothesis: two-sided

KS Test, Data vs. Theory (cont’d.)

Hack: Estimate using (say) 90% of the data, and then check the fit on the remaining 10%

train <- sample(1:length(returns),size=round(0.9*length(returns)))
CWX.t_train <- coefficients(fitdistr(returns[train],"t"))
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
## Warning: NaNs produced
returns.test_standardized <- (returns[-train]-CWX.t_train[1])/CWX.t_train[2]
ks.test(returns.test_standardized,pt,df=CWX.t_train[3])
## Warning: ties should not be present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  returns.test_standardized
## D = 0.0544, p-value = 0.2224
## alternative hypothesis: two-sided

KS Test, Data vs. Data

n <- length(returns)
half <- round(n/2)
ks.test(returns[1:half], returns[(half+1):n])
## Warning: p-value will be approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  returns[1:half] and returns[(half + 1):n]
## D = 0.0599, p-value = 0.002569
## alternative hypothesis: two-sided

Chi-Squared Test for Discrete Distributions

Compare an actual table of counts to a hypothesized probability distribution

e.g., as many up days as down?

up_or_down <- ifelse(returns > 0, 1, -1)
# 1936 down days, 1772 up days
chisq.test(table(up_or_down),p=c(1/2,1/2))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(up_or_down)
## X-squared = 7.34, df = 1, p-value = 0.006743

Chi-Squared Test: Degrees of Freedom

Chi-Squared Test for Continuous Distributions

Chi-Squared for Continuous Data (cont’d.)

cats.hist <- hist(cats$Hwt,plot=FALSE)
cats.hist$breaks
## [1]  6  8 10 12 14 16 18 20 22
cats.hist$counts
## [1] 20 45 42 23 11  2  0  1

Chi-Squared for Continuous Data (cont’d.)

Use these for a \(\chi^2\) test:

# Why the padding by -Inf and Inf?
p <- diff(pgamma(c(-Inf,cats.hist$breaks,Inf),shape=cats.gamma["shape"],
                 scale=cats.gamma["scale"]))
# Why the padding by 0 and 0?
x2 <- chisq.test(c(0,cats.hist$counts,0),p=p)$statistic
## Warning: Chi-squared approximation may be incorrect
# Why +2? Why -length(cats.gamma)?
pchisq(x2,df=length(cats.hist$counts)+2 - length(cats.gamma))
## X-squared 
##    0.8546

Don’t need to run hist first; can also use cut to discretize (see ?cut)

Chi-Squared for Continuous Data (cont’d.)

Misc. Tests for Misc. Distributions

More Advanced and Formal Tests

Summary

Aside: Some Math for MM and GMM

Math for MM and GMM (cont’d.)

Math for MM and GMM (cont’d.)

Aside: Some Math for the MLE