Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are no homework questions here. (Lucky you! But don’t worry, you still have something to think about at home: you should be working on your final project…)
pros.df = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
dim(pros.df)
## [1] 97 10
head(pros.df)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
Randomly split the prostate cancer data frame into \(k=5\) folds of roughly equal size. (Make sure your code is general enough to handle an arbitrary number of folds \(k\); you will be asked to change the number of folds in questions that follow.) Report the number of observations that fall in each fold.
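One common way to do this (a sketch, not the only approach): repeat the fold labels 1 through k up to length n, then randomly permute them. The helper name `make.folds` below is hypothetical, not part of the lab.

```r
# A sketch of one way to split n observations into k folds of roughly
# equal size: repeat the labels 1..k up to length n, then permute them
# (make.folds is a hypothetical helper name)
make.folds = function(n, k) {
  sample(rep(1:k, length.out = n))
}

set.seed(0)
folds = make.folds(97, 5)  # 97 = nrow(pros.df)
table(folds)               # fold sizes should all be 19 or 20
```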
Over the folds you computed in the previous question, compute the cross-validation error of the linear model that regresses `lpsa` on `lcavol` and `lweight`.
Write a function `pros.cv()`, which takes three arguments: `df`, a data frame of prostate cancer measurements, with a default of `pros.df`; `k`, an integer determining the number of cross-validation folds, with a default of 5; and `seed`, an integer to be passed to `set.seed()` before defining the folds, with a default of NULL (meaning no seed shall be set). Your function should split up the given data `df` into `k` folds of roughly equal size, and using these folds, compute the cross-validation error of the linear model that regresses `lpsa` on `lcavol` and `lweight`. Its output should simply be this cross-validation error.
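A minimal sketch of such a function is below, demonstrated on a synthetic stand-in data frame (`toy.df` is made up for illustration; with the real data you would simply call `pros.cv()` with its defaults):

```r
# Sketch of pros.cv() under the interface described above
pros.cv = function(df = pros.df, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n = nrow(df)
  folds = sample(rep(1:k, length.out = n))  # random fold label per row
  pred = numeric(n)
  for (i in 1:k) {
    # train on all folds but i, predict on fold i
    fit = lm(lpsa ~ lcavol + lweight, data = df[folds != i, ])
    pred[folds == i] = predict(fit, newdata = df[folds == i, ])
  }
  mean((df$lpsa - pred)^2)  # the cross-validation error
}

# Synthetic stand-in for pros.df, just to demonstrate the call
set.seed(0)
toy.df = data.frame(lcavol = rnorm(97), lweight = rnorm(97))
toy.df$lpsa = 1 + 0.5*toy.df$lcavol + 0.3*toy.df$lweight + rnorm(97, sd = 0.5)
pros.cv(toy.df, k = 5, seed = 1)
```

Note that with the same `seed`, the folds (and hence the error estimate) are reproducible.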
Investigate the result of `pros.cv()` for different values of `k`, specifically, for `k` equal to 2, 5, 10, and 97. For each value, run `pros.cv()` some large number of times (say, 50) and report the average of the cross-validation error estimates, and the standard deviation of these estimates. Then, plot them in an informative way (say, a box plot with `boxplot()`). What do you notice? Is this surprising?
In general, is 2-fold cross-validation the same as sample-splitting? Why or why not?
In general, what can you say about the differences in cross-validation as the number of folds varies? What is different about cross-validation with 2 folds, versus 5 folds, versus \(n\) folds (with \(n\) being the number of observations in the data set)?
Advanced. Modify your function `pros.cv()` so that it takes another argument: `formula.str`, a string in the format of a formula specifying which linear model is to be evaluated by cross-validation, with the default being `"lpsa ~ lcavol + lweight"`. Demonstrate the use of your function for different formulas, i.e., different linear regression models.
hiv.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/hiv.dat")
dim(hiv.df)
## [1] 1073 241
hiv.df[1:5, c(1,sample(2:ncol(hiv.df),8))]
## y p215 p41 p126 p25 p116 p154 p230 p17
## 1 14.612804 1 1 0 0 0 0 0 0
## 2 25.527251 1 1 0 0 0 0 0 0
## 3 0.000000 0 0 0 0 0 0 0 0
## 4 7.918125 1 0 0 0 0 0 0 0
## 5 11.394335 1 0 0 0 0 0 0 0
Use 5-fold cross-validation to estimate the test error of a linear model that regresses the drug resistance on all of the genetic mutation indicators. (You will likely get some warnings about the linear model encountering a rank-deficient fit: why do these occur?)
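As a hint on the warnings: with 240 binary predictors and only about 858 rows in each training fold, some mutation indicators can be constant or exactly collinear within a fold, so `lm()` cannot identify all coefficients. A tiny synthetic illustration (the data here are made up):

```r
# Rank deficiency in miniature: x2 is an exact copy of x1, so lm() drops
# it (its coefficient is NA), and predict() on new data warns about
# prediction from a rank-deficient fit
set.seed(0)
toy = data.frame(y = rnorm(20), x1 = rep(0:1, 10))
toy$x2 = toy$x1  # exact duplicate column
fit = lm(y ~ ., data = toy)
coef(fit)  # the coefficient on x2 is NA
```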
Use 5-fold cross-validation to estimate the test error of a regression tree with the drug resistance measure as response and the genetic mutation indicators as predictors. To fit a regression tree, you can use the function `rpart()` from the package `rpart`. (Its notation is similar to `lm()`, both for training and prediction.) In terms of prediction accuracy, does the regression tree improve on the linear model?
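A sketch of the `rpart()` mechanics on synthetic 0/1 predictors standing in for the mutation indicators (the data frame below is made up; with the real data you would plug `hiv.df` into your cross-validation loop):

```r
library(rpart)  # recommended package, ships with R

set.seed(0)
n = 200
toy = data.frame(matrix(rbinom(n * 5, 1, 0.3), n, 5))  # fake 0/1 mutations
toy$y = 2 * toy$X1 - toy$X2 + rnorm(n)                 # fake resistance measure

tree.fit = rpart(y ~ ., data = toy)           # same formula notation as lm()
tree.pred = predict(tree.fit, newdata = toy)  # and the same predict() call
mean((toy$y - tree.pred)^2)                   # in-sample MSE, for illustration only
```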
Advanced. Use 5-fold cross-validation to estimate the test error of a gradient boosted machine with the drug resistance measure as response and the genetic mutation indicators as predictors. To fit a gradient boosted machine, you can use the function `xgboost()` from the package `xgboost`. (This might require a bit of tinkering to set up; you need to specify `objective="reg:linear"`; and if you'd like a concrete place to start with the boosting settings, then you can try `max_depth=20`, `nrounds=10`.) In terms of prediction accuracy, how does boosting fare, compared to a single tree?
Advanced. Implement your own function for \(k\)-nearest neighbors regression. Then, run 5-fold cross-validation to estimate test error, for a few choices of \(k\). Discuss your findings in comparison to those for the linear model, regression tree, and boosting.
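For reference, here is one possible from-scratch sketch of k-nearest neighbors regression (the function name `knn.reg` and the tiny test data are hypothetical):

```r
# k-nearest neighbors regression: predict at each query point by averaging
# the responses of the k closest training points (Euclidean distance)
knn.reg = function(x.train, y.train, x.test, k = 5) {
  x.train = as.matrix(x.train); x.test = as.matrix(x.test)
  apply(x.test, 1, function(x0) {
    d = sqrt(colSums((t(x.train) - x0)^2))  # distances to all training points
    mean(y.train[order(d)[1:k]])            # average the k nearest responses
  })
}

# Tiny check in one dimension: the 2 nearest neighbors of 0.05 are at 0 and 0.1
x = matrix(c(0, 0.1, 1, 1.1), ncol = 1)
y = c(0, 1, 10, 11)
knn.reg(x, y, matrix(0.05), k = 2)  # (0 + 1) / 2 = 0.5
```

Note the computation is vectorized over training points but loops (via `apply()`) over query points, which is plenty fast at this data size.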