---
title: "Lecture 11: Distributions as Models"
author: "36-350"
date: "1 October 2014"
output: beamer_presentation
transition: none
font-family: Garamond
---
data("cats",package="MASS")
Cats may be too saccharine (unless we think about where `cats$Hwt` comes from)
The “pdfetch” package grabs economic and financial information from public-domain sources, e.g., Federal Reserve or Yahoo Finance
require(pdfetch)
## Loading required package: pdfetch
# Corrections Corp. of America
CXW <- pdfetch_YAHOO("CXW", fields="adjclose",
                     from=as.Date("2000-01-01"),
                     to=as.Date("2014-09-30"))
summary(CXW)
## Index CXW
## Min. :2000-01-03 Min. : 0.69
## 1st Qu.:2003-09-12 1st Qu.: 6.25
## Median :2007-05-21 Median :13.65
## Mean :2007-05-19 Mean :14.26
## 3rd Qu.:2011-01-24 3rd Qu.:19.47
## Max. :2014-10-01 Max. :36.12
plot(CXW)
If the prison-industrial complex is too grim, refer back to the cats
The price \(p_t\) doesn’t matter; what matters are the returns \(r_t = \log{(p_t/p_{t-1})}\)
returns <- na.omit(as.vector(diff(log(CXW$CXW))))
summary(returns)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4020 -0.0105 0.0000 0.0005 0.0108 0.4430
plot(returns,type="l")
`quantile(x,probs)` calculates the quantiles of `x` at the probabilities `probs`
quantile(returns,c(0.25,0.5,0.75))
## 25% 50% 75%
## -0.01046 0.00000 0.01079
`ecdf()`
- **e**mpirical **c**umulative **d**istribution **f**unction; no assumptions, but also no guess about the distribution between the observations
In math, ECDF is often written as \(\widehat{F}\) or \(\widehat{F}_n\)
plot(ecdf(returns), main="Empirical CDF of CXW returns")
Conceptually, `quantile` and `ecdf` are inverses to each other
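A quick check of this (a sketch, not in the original slides): `ecdf()` returns a function, and evaluating it at a quantile should give back (roughly) the probability we started with.

Fhat <- ecdf(returns)          # ecdf() returns a function we can evaluate
Fhat(quantile(returns, 0.75))  # should be approximately 0.75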
`hist(x)`
calculates a histogram from `x`
- divide the data range up into equal-width bins and count how many observations fall into each bin
- or divide the bin counts by (total count)*(width of bin), and get an estimate of the probability density function (pdf)
Produces plot as a default side-effect
hist(returns,n=101,probability=TRUE)
`density(x)`
estimates the density of `x` by counting how many observations fall in a little window around each point, and then smoothing
“Bandwidth” \(\approx\) width of window around each point
Technically, a “kernel density estimate”
Take 36-402 in the spring for much more
Remember: `density()` is an estimate of the pdf, not The Truth

`density` returns a collection of \(x, y\) values, suitable for plotting
plot(density(returns),main="Estimated pdf of CXW returns")
hist(returns,n=101,probability=TRUE)
lines(density(returns),lty="dashed")
table()
- tabulate outcomes, most useful for discrete spaces; remember to normalize if you want probabilities
plot(table(cats$Sex)/nrow(cats),ylab="probability")
- `dfoo` = the probability **d**ensity (if continuous) or probability mass function of *foo* (pdf or pmf)
- `pfoo` = the cumulative **p**robability function (CDF)
- `qfoo` = the **q**uantile function (inverse to CDF)
- `rfoo` = draw **r**andom numbers from *foo* (first argument always the number of draws)
- `?Distributions` to see which distributions are built in
If you write your own, follow the conventions
dnorm(x=c(-1,0,1),mean=1,sd=0.1)
## [1] 5.521e-87 7.695e-22 3.989e+00
pnorm(q=c(2,-2)) # defaults to mean=0,sd=1
## [1] 0.97725 0.02275
dbinom(5,size=7,p=0.7,log=TRUE)
## [1] -1.147
qchisq(p=0.95,df=5)
## [1] 11.07
rt(n=4,df=2)
## [1] 0.1698 0.0581 0.1540 -0.3191
`curve` is very useful for the `d`, `p`, and `q` functions:
curve(dgamma(x,shape=45,scale=1.9),from=0,to=200)
N.B.: the `r` functions aren’t things it makes much sense to plot
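What does make sense is to summarize the draws, e.g., with a histogram, and compare against the corresponding density (a sketch, not from the slides, reusing the gamma parameters from the `curve` example above):

hist(rgamma(1000, shape=45, scale=1.9), probability=TRUE, breaks=50,
     main="1000 draws from rgamma vs. theoretical pdf")
curve(dgamma(x, shape=45, scale=1.9), add=TRUE, col="blue")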
gamma.est_MM <- function(x) {
m <- mean(x); v <- var(x)
return(c(shape=m^2/v, scale=v/m))
}
gamma.mean <- function(shape,scale) { return(shape*scale) }
gamma.var <- function(shape,scale) { return(shape*scale^2) }
gamma.discrepancy <- function(shape,scale,x) {
return((mean(x)-gamma.mean(shape,scale))^2 + (var(x)-gamma.var(shape,scale))^2)
}
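One way a discrepancy function like this gets used (a sketch, not code from the slides): hand it to a numerical optimizer and look for the parameter values that make the discrepancy smallest.

gamma.fit_min <- optim(par=c(shape=1, scale=1),
                       fn=function(p) gamma.discrepancy(shape=p[1], scale=p[2], x=cats$Hwt))
gamma.fit_min$par  # should land close to gamma.est_MM(cats$Hwt), though
                   # Nelder-Mead may need more iterations to get all the way there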
loglike.foo <- function(params, x) {
sum(dfoo(x=x,params,log=TRUE))
}
(will see how to do this for real later in class)
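For concreteness, here is the template filled in for the gamma distribution (a sketch; the shape-then-scale parameter order is my choice, not part of the slides):

loglike.gamma <- function(params, x) {
  # params[1] = shape, params[2] = scale
  sum(dgamma(x=x, shape=params[1], scale=params[2], log=TRUE))
}
loglike.gamma(c(20, 0.5), cats$Hwt)  # log-likelihood at one particular parameter setting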
Usually (but not always) efficient: converges on the truth at least as fast as anything else
Sometimes the data is too aggregated or mangled to use the MLE (as with the income data in lab 5)
MLE for one-dimensional distributions can be done through `fitdistr` in the `MASS` package
It knows about most of the standard distributions, but you can also give it arbitrary probability density functions and it will try to maximize the resulting likelihood
A starting value for the optimization is optional for some distributions, required for others (including user-defined densities)
Returns the parameter estimates and standard errors
SEs come from large-\(n\) approximations so use cautiously
Fit the gamma distribution to the cats’ hearts:
require(MASS)
## Loading required package: MASS
fitdistr(cats$Hwt, densfun="gamma")
## shape rate
## 20.300 1.910
## ( 2.373) ( 0.226)
Returns: estimates above, standard errors below
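If you need the numbers programmatically rather than printed (a sketch, not from the slides), the fit object’s components can be extracted:

cats.fit <- fitdistr(cats$Hwt, densfun="gamma")
cats.fit$estimate        # named vector of parameter estimates
cats.fit$sd              # the corresponding (approximate) standard errors
coefficients(cats.fit)   # same as cats.fit$estimate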
Fit the Student’s \(t\) distribution to the returns:
fitdistr(returns,"t")
## Warning: NaNs produced
## (warning repeated 16 more times)
## m s df
## 0.0001751 0.0126459 1.8813709
## (0.0002681) (0.0003039) (0.0784463)
Here parameters are location (m), scale (s) and degrees of freedom (very heavy tails)
User-defined density:
dpareto <- function(x, exponent, xmin, log=FALSE) {
  f <- (exponent-1)/xmin * (x/xmin)^(-exponent)
  f <- ifelse(x < xmin, NA, f)
  if (!log) { return(f) } else { return(log(f)) }
}
# Fit pareto to large absolute returns
# Parameters given outside the "start" list are fixed
fitdistr(abs(returns)[abs(returns)>0.05], densfun=dpareto,
start=list(exponent=2.5), xmin=0.05)
## Warning: one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly
## exponent
## 2.898
## (0.122)
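To follow the d/p/q/r naming conventions all the way (a sketch, not from the slides; it assumes the same parameterization as `dpareto` above), the matching cumulative, quantile, and random-number functions would be:

ppareto <- function(q, exponent, xmin) {
  # CDF: integrate the density from xmin up to q
  ifelse(q < xmin, 0, 1 - (q/xmin)^(-(exponent-1)))
}
qpareto <- function(p, exponent, xmin) {
  # quantile function: invert the CDF
  xmin * (1-p)^(-1/(exponent-1))
}
rpareto <- function(n, exponent, xmin) {
  # random draws by the inverse-transform method
  qpareto(runif(n), exponent, xmin)
}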
gamma.est_MM(rgamma(100,shape=19,scale=45))
## shape scale
## 21.09 41.80
gamma.est_MM(rgamma(1e5,shape=19,scale=45))
## shape scale
## 18.99 45.05
gamma.est_MM(rgamma(1e6,shape=19,scale=45))
## shape scale
## 18.99 45.03
Use your eyes: Graphic overlays of theory vs. data
plot(density(cats$Hwt))
cats.gamma <- gamma.est_MM(cats$Hwt)
curve(dgamma(x,shape=cats.gamma["shape"],scale=cats.gamma["scale"]),add=TRUE,col="blue")
# Really bad and good days for CXW shareholders, per model:
qnorm(c(0.01,0.99),mean=mean(returns),sd=sd(returns))
## [1] -0.07588 0.07695
# As it happened:
quantile(returns,c(0.01,0.99))
## 1% 99%
## -0.09373 0.09415
`qqnorm`, `qqline` are specialized for checking normality

qqnorm(returns); qqline(returns)
`qqplot(x,y)` will do a Q-Q plot of one vector against another

qqplot(returns,qt((1:500)/501,df=2))
plot(ecdf(pnorm(returns, mean=mean(returns), sd=sd(returns))),
main="Calibration of Gaussian distribution for returns")
abline(0,1,col="grey")
Again, way too many large changes (in either direction)
CXW.t <- coefficients(fitdistr(returns,"t"))
## Warning: NaNs produced
## (warning repeated 16 more times)
plot(ecdf(pt((returns-CXW.t[1])/CXW.t[2], df=CXW.t[3])),
main="Calibration of t distribution for returns")
abline(0,1,col="grey")
plot(ecdf(pgamma(cats$Hwt, shape=cats.gamma["shape"], scale=cats.gamma["scale"])),
main="Calibration of gamma distribution for cats' hearts")
abline(0,1,col="grey")
Challenge: Write a general function for making a calibration plot, taking a data vector, a cumulative probability function, and a parameter vector
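One possible sketch of such a function (not an official solution; the name `calib.plot` and the argument names are mine):

calib.plot <- function(x, pdist, params) {
  # pdist: a CDF like pnorm or pgamma; params: a named vector of its parameters
  u <- do.call(pdist, c(list(x), as.list(params)))
  plot(ecdf(u), main="Calibration plot")
  abline(0, 1, col="grey")
}
# e.g., calib.plot(cats$Hwt, pgamma, cats.gamma)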
ks.test(returns,pnorm,mean=0,sd=0.0125)
## Warning: ties should not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: returns
## D = 0.085, p-value < 2.2e-16
## alternative hypothesis: two-sided
Hack: Estimate using (say) 90% of the data, and then check the fit on the remaining 10%
train <- sample(1:length(returns),size=round(0.9*length(returns)))
CXW.t_train <- coefficients(fitdistr(returns[train],"t"))
## Warning: NaNs produced
## (warning repeated 33 more times)
returns.test_standardized <- (returns[-train]-CXW.t_train[1])/CXW.t_train[2]
ks.test(returns.test_standardized,pt,df=CXW.t_train[3])
## Warning: ties should not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: returns.test_standardized
## D = 0.0544, p-value = 0.2224
## alternative hypothesis: two-sided
n <- length(returns)
half <- round(n/2)
ks.test(returns[1:half], returns[(half+1):n])
## Warning: p-value will be approximate in the presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: returns[1:half] and returns[(half + 1):n]
## D = 0.0599, p-value = 0.002569
## alternative hypothesis: two-sided
Compare an actual table of counts to a hypothesized probability distribution
e.g., as many up days as down?
up_or_down <- ifelse(returns > 0, 1, -1)
# 1936 down days, 1772 up days
chisq.test(table(up_or_down),p=c(1/2,1/2))
##
## Chi-squared test for given probabilities
##
## data: table(up_or_down)
## X-squared = 7.34, df = 1, p-value = 0.006743
`chisq.test` assumes that all the probabilities in \(p\) were fixed, not estimated from the data used for testing, so \(df =\) number of cells in the table \(- 1\); if parameters are estimated from the same data first, the effective degrees of freedom shrink

For a continuous distribution:
- Bin the data and count how many observations fall in each bin; this is the `x` in `chisq.test()`
- Use the CDF function `pfoo` to calculate the theoretical probability of each bin; this is `p`
- Plug both into `chisq.test`

`hist()` gives us break points and counts:

cats.hist <- hist(cats$Hwt,plot=FALSE)
cats.hist$breaks
## [1] 6 8 10 12 14 16 18 20 22
cats.hist$counts
## [1] 20 45 42 23 11 2 0 1
Use these for a \(\chi^2\) test:
# Why the padding by -Inf and Inf?
p <- diff(pgamma(c(-Inf,cats.hist$breaks,Inf),shape=cats.gamma["shape"],
scale=cats.gamma["scale"]))
# Why the padding by 0 and 0?
x2 <- chisq.test(c(0,cats.hist$counts,0),p=p)$statistic
## Warning: Chi-squared approximation may be incorrect
# Why +2? Why -length(cats.gamma)?
pchisq(x2,df=length(cats.hist$counts)+2 - length(cats.gamma))
## X-squared
## 0.8546
Don’t need to run `hist` first; can also use `cut` to discretize (see `?cut`), as in the sketch below
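A sketch of the `cut` route (not from the slides), reusing the break points from above; `include.lowest=TRUE` matches `hist`’s default binning:

cats.cut <- cut(cats$Hwt, breaks=cats.hist$breaks, include.lowest=TRUE)
table(cats.cut)  # same counts as cats.hist$counts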
Alternatively, use `ks.test` with an independent test set, as above