Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are no homework questions here. (Lucky you! But don’t worry, you still have something to think about at home: you should be working on your final project…)
pros.df = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
dim(pros.df)
## [1] 97 10
head(pros.df)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
Randomly split the prostate cancer data frame into \(k=5\) folds of roughly equal size. (Make sure your code is general enough to handle an arbitrary number of folds \(k\); you will be asked to change the number of folds in questions that follow.) Report the number of observations that fall in each fold.
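One common way to do this (a sketch, not the only approach): repeat the fold labels 1 through k up to length n, then randomly permute them. The helper name `make.folds` below is hypothetical, not part of the lab.

```r
# A sketch of one way to split n observations into k folds of roughly
# equal size: repeat the labels 1..k up to length n, then permute them
# (make.folds is a hypothetical helper name)
make.folds = function(n, k) {
  sample(rep(1:k, length.out = n))
}

set.seed(0)
folds = make.folds(97, 5)  # 97 = nrow(pros.df)
table(folds)               # fold sizes should all be 19 or 20
```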
Over the folds you computed in the previous question, compute the cross-validation error of the linear model that regresses `lpsa` on `lcavol` and `lweight`.
Write a function `pros.cv()`, which takes three arguments: `df`, a data frame of prostate cancer measurements, with a default of `pros.df`; `k`, an integer determining the number of cross-validation folds, with a default of 5; and `seed`, an integer to be passed to `set.seed()` before defining the folds, with a default of NULL (meaning no seed shall be set). Your function should split up the given data `df` into `k` folds of roughly equal size, and using these folds, compute the cross-validation error of the linear model that regresses `lpsa` on `lcavol` and `lweight`. Its output should simply be this cross-validation error.
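A minimal sketch of such a function is below, demonstrated on a synthetic stand-in data frame (`toy.df` is made up for illustration; with the real data you would simply call `pros.cv()` with its defaults):

```r
# Sketch of pros.cv() under the interface described above
pros.cv = function(df = pros.df, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n = nrow(df)
  folds = sample(rep(1:k, length.out = n))  # random fold label per row
  pred = numeric(n)
  for (i in 1:k) {
    # train on all folds but i, predict on fold i
    fit = lm(lpsa ~ lcavol + lweight, data = df[folds != i, ])
    pred[folds == i] = predict(fit, newdata = df[folds == i, ])
  }
  mean((df$lpsa - pred)^2)  # the cross-validation error
}

# Synthetic stand-in for pros.df, just to demonstrate the call
set.seed(0)
toy.df = data.frame(lcavol = rnorm(97), lweight = rnorm(97))
toy.df$lpsa = 1 + 0.5*toy.df$lcavol + 0.3*toy.df$lweight + rnorm(97, sd = 0.5)
pros.cv(toy.df, k = 5, seed = 1)
```

Note that with the same `seed`, the folds (and hence the error estimate) are reproducible.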
Investigate the result of `pros.cv()` for different values of `k`, specifically, for `k` equal to 2, 5, 10, and 97. For each value, run `pros.cv()` some large number of times (say, 50) and report the average of the cross-validation error estimates, and the standard deviation of these estimates. Then, plot them in an informative way (say, a box plot with `boxplot()`). What do you notice? Is this surprising?
In general, is 2-fold cross-validation the same as sample-splitting? Why or why not?
In general, what can you say about the differences in cross-validation as the number of folds varies? What is different about cross-validation with 2 folds, versus 5 folds, versus \(n\) folds (with \(n\) being the number of observations in the data set)?
Advanced. Modify your function `pros.cv()` so that it takes another argument: `formula.str`, a string in the format of a formula specifying which linear model is to be evaluated by cross-validation, with the default being `"lpsa ~ lcavol + lweight"`. Demonstrate the use of your function for different formulas, i.e., different linear regression models.
hiv.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/hiv.dat")
dim(hiv.df)
## [1] 1073 241
hiv.df[1:5, c(1,sample(2:ncol(hiv.df),8))]
## y p215 p41 p126 p25 p116 p154 p230 p17
## 1 14.612804 1 1 0 0 0 0 0 0
## 2 25.527251 1 1 0 0 0 0 0 0
## 3 0.000000 0 0 0 0 0 0 0 0
## 4 7.918125 1 0 0 0 0 0 0 0
## 5 11.394335 1 0 0 0 0 0 0 0
Use 5-fold cross-validation to estimate the test error of a linear model that regresses the drug resistance on all of the genetic mutation indicators. (You will likely get some warnings about the linear model encountering a rank-deficient fit: why do these occur?)
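As a hint on the warnings: with 240 binary predictors and only about 858 rows in each training fold, some mutation indicators can be constant or exactly collinear within a fold, so `lm()` cannot identify all coefficients. A tiny synthetic illustration (the data here are made up):

```r
# Rank deficiency in miniature: x2 is an exact copy of x1, so lm() drops
# it (its coefficient is NA), and predict() on new data warns about
# prediction from a rank-deficient fit
set.seed(0)
toy = data.frame(y = rnorm(20), x1 = rep(0:1, 10))
toy$x2 = toy$x1  # exact duplicate column
fit = lm(y ~ ., data = toy)
coef(fit)  # the coefficient on x2 is NA
```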
Use 5-fold cross-validation to estimate the test error of a regression tree with the drug resistance measure as response and the genetic mutation indicators as predictors. To fit a regression tree, you can use the function `rpart()` from the package `rpart`. (Its notation is similar to `lm()`, both for training and prediction.) In terms of prediction accuracy, does the regression tree improve on the linear model?
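A sketch of the `rpart()` mechanics on synthetic 0/1 predictors standing in for the mutation indicators (the data frame below is made up; with the real data you would plug `hiv.df` into your cross-validation loop):

```r
library(rpart)  # recommended package, ships with R

set.seed(0)
n = 200
toy = data.frame(matrix(rbinom(n * 5, 1, 0.3), n, 5))  # fake 0/1 mutations
toy$y = 2 * toy$X1 - toy$X2 + rnorm(n)                 # fake resistance measure

tree.fit = rpart(y ~ ., data = toy)           # same formula notation as lm()
tree.pred = predict(tree.fit, newdata = toy)  # and the same predict() call
mean((toy$y - tree.pred)^2)                   # in-sample MSE, for illustration only
```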
Advanced. Use 5-fold cross-validation to estimate the test error of a gradient boosted machine with the drug resistance measure as response and the genetic mutation indicators as predictors. To fit a gradient boosted machine, you can use the function `xgboost()` from the package `xgboost`. (This might require a bit of tinkering to set up; you need to specify `objective="reg:linear"`; and if you'd like a concrete place to start with the boosting settings, then you can try `max_depth=20`, `nrounds=10`.) In terms of prediction accuracy, how does boosting fare, compared to a single tree?
Advanced. Implement your own function for \(k\)-nearest neighbors regression. Then, run 5-fold cross-validation to estimate test error, for a few choices of \(k\). Discuss your findings in comparison to those for the linear model, regression tree, and boosting.
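For reference, here is one possible from-scratch sketch of k-nearest neighbors regression (the function name `knn.reg` and the tiny test data are hypothetical):

```r
# k-nearest neighbors regression: predict at each query point by averaging
# the responses of the k closest training points (Euclidean distance)
knn.reg = function(x.train, y.train, x.test, k = 5) {
  x.train = as.matrix(x.train); x.test = as.matrix(x.test)
  apply(x.test, 1, function(x0) {
    d = sqrt(colSums((t(x.train) - x0)^2))  # distances to all training points
    mean(y.train[order(d)[1:k]])            # average the k nearest responses
  })
}

# Tiny check in one dimension: the 2 nearest neighbors of 0.05 are at 0 and 0.1
x = matrix(c(0, 0.1, 1, 1.1), ncol = 1)
y = c(0, 1, 10, 11)
knn.reg(x, y, matrix(0.05), k = 2)  # (0 + 1) / 2 = 0.5
```

Note the computation is vectorized over training points but loops (via `apply()`) over query points, which is plenty fast at this data size.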