Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (and completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday 10pm, this week.

This week’s agenda: understanding training and testing errors, implementing sample-splitting and cross-validation (optional), and trying a bunch of statistical prediction methods (also optional).

Practice with training and test errors

The code below generates and plots training and test data from a simple univariate linear model, as in lecture. (You don’t need to do anything yet.)

set.seed(1)
n = 30
x = sort(runif(n, -3, 3))
y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3))
y0 = 2*x0 + 2*rnorm(n)

par(mfrow=c(1,2))
xlim = range(c(x,x0)); ylim = range(c(y,y0))
plot(x, y, xlim=xlim, ylim=ylim, main="Training data")
plot(x0, y0, xlim=xlim, ylim=ylim, main="Test data")
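
As a minimal sketch of what "training error" and "test error" mean here (not the lab's required solution), the block below regenerates the same data, fits a linear model to the training set only, and computes the mean squared error on each set. The names `lm.fit`, `train.err`, `y0.hat`, and `test.err` are our own choices, not part of the lab.

```r
# Regenerate the training and test data exactly as in the code above
set.seed(1)
n = 30
x = sort(runif(n, -3, 3))
y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3))
y0 = 2*x0 + 2*rnorm(n)

# Fit a linear model using the training data only
lm.fit = lm(y ~ x)

# Training error: mean squared error on the points used for fitting
train.err = mean((y - fitted(lm.fit))^2)

# Test error: mean squared error on the held-out test points
y0.hat = predict(lm.fit, newdata=data.frame(x=x0))
test.err = mean((y0 - y0.hat)^2)

c(train=train.err, test=test.err)
```

Because least squares minimizes the training error over all lines, `train.err` can be no larger than the training error of the true line `2*x`; no such guarantee holds for `test.err`.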

Sample-splitting with the prostate cancer data

Below, we read in data on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). (You don’t need to do anything yet.)

pros.df = read.table(
  "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
dim(pros.df)
## [1] 97 10
head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE
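
The logical `train` column shown above already marks a split, so one way to use it is sketched below. To keep the block runnable without a download, it uses a simulated stand-in data frame, `sim.df`, rather than the prostate data. The helper `train.test.mse()` and the columns `u`, `v`, `resp` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: fit lm() on rows with train == TRUE, then report the
# mean squared error on the training rows and on the remaining (test) rows
train.test.mse = function(formula, df) {
  df.train = df[df$train, ]
  df.test = df[!df$train, ]
  fit = lm(formula, data=df.train)
  resp = all.vars(formula)[1]   # name of the response variable
  c(train = mean((df.train[[resp]] - predict(fit, newdata=df.train))^2),
    test = mean((df.test[[resp]] - predict(fit, newdata=df.test))^2))
}

# Simulated stand-in with a logical train column (NOT the prostate data)
set.seed(1)
sim.df = data.frame(u = rnorm(100), v = rnorm(100))
sim.df$resp = 1 + 2*sim.df$u - sim.df$v + rnorm(100)
sim.df$train = rep(c(TRUE, FALSE), times=c(70, 30))

errs = train.test.mse(resp ~ u + v, sim.df)
errs
```

On the real data, a call such as `train.test.mse(lpsa ~ lcavol + lweight, pros.df)` would follow the same pattern; that choice of predictors is arbitrary and only meant as an example.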

Sample-splitting with the wage data

Below, we read in data on 3000 individuals living in the mid-Atlantic region, measuring various demographic and economic variables (adapted from the book An Introduction to Statistical Learning). (You don’t have to do anything yet.)

wage.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/wage.csv", 
                   skip=16)
dim(wage.df)
## [1] 3000   11
head(wage.df, 5)
##        year age     sex           maritl     race       education
## 231655 2006  18 1. Male 1. Never Married 1. White    1. < HS Grad
## 86582  2004  24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003  45 1. Male       2. Married 1. White 3. Some College
## 155159 2003  43 1. Male       2. Married 3. Asian 4. College Grad
## 11443  2005  50 1. Male      4. Divorced 1. White      2. HS Grad
##                    region       jobclass         health health_ins
## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
## 155159 2. Middle Atlantic 2. Information 2. >=Very Good     1. Yes
## 11443  2. Middle Atlantic 2. Information      1. <=Good     1. Yes
##             wage
## 231655  75.04315
## 86582   70.47602
## 161300 130.98218
## 155159 154.68529
## 11443   75.04315
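
Unlike the prostate data, the wage data shown above has no `train` column, so a split has to be created. A rough sketch of a random half-and-half split follows, again on simulated stand-in data: the `age` and `wage` columns of `sim.df` are generated from a made-up relationship and are not the real wage data. The names `i.train` and `i.test` are our own.

```r
# Simulated stand-in (NOT the wage data): a made-up linear relationship
set.seed(2)
n = 200
sim.df = data.frame(age = runif(n, 18, 80))
sim.df$wage = 60 + 1.0*sim.df$age + rnorm(n, sd=20)

# Randomly assign half of the rows to training, and the rest to testing
i.train = sample(n, floor(n/2))
i.test = setdiff(1:n, i.train)

# Fit on the training rows, then measure error on the test rows
fit = lm(wage ~ age, data=sim.df[i.train, ])
test.err = mean((sim.df$wage[i.test] -
                 predict(fit, newdata=sim.df[i.test, ]))^2)
test.err
```

Since the split is random, the test error changes from one split to the next; rerunning the last two steps with a different seed gives a sense of that variability.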

Cross-validation with the prostate cancer data (optional)
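
For orientation, one common way to set up k-fold cross-validation is sketched here on simulated stand-in data (not the prostate data). The choice `k = 5`, the fold assignment via `sample(rep(...))`, and the names `folds` and `cv.errs` are assumptions made for this sketch rather than anything the lab prescribes.

```r
# Simulated stand-in (NOT the prostate data)
set.seed(3)
n = 97; k = 5
sim.df = data.frame(u = rnorm(n))
sim.df$resp = 2*sim.df$u + rnorm(n)

# Assign each row to one of k folds, as evenly as possible, in random order
folds = sample(rep(1:k, length.out=n))

# For each fold: fit on the other k-1 folds, test on the held-out fold
cv.errs = numeric(k)
for (j in 1:k) {
  fit = lm(resp ~ u, data=sim.df[folds != j, ])
  held.out = sim.df[folds == j, ]
  cv.errs[j] = mean((held.out$resp - predict(fit, newdata=held.out))^2)
}

# Cross-validation estimate of test error: average over the k folds
mean(cv.errs)
```

Every row is used for testing exactly once, which is what distinguishes this from a single sample split.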

Making predictions with the HIV data set (optional)

Below, we read in some data on HIV from Rhee et al. (2003), “Human immunodeficiency virus reverse transcriptase and protease sequence database”. There are 1073 observations of the following nature. The response variable (first column) is a measure of drug resistance, for a particular HIV drug. The 240 predictor variables (all but the first column) are each binary indicators of the presence/absence of mutation at a particular gene mutation site. The goal is to predict HIV drug resistance from this genetic mutation information. (You don’t have to do anything yet.)

hiv.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/hiv.dat")
dim(hiv.df)
## [1] 1073  241
hiv.df[1:5, c(1,sample(2:ncol(hiv.df),8))]
##           y p71 p46 p211 p120 p207 p45 p178 p169
## 1 14.612804   0   0    1    0    0   0    0    0
## 2 25.527251   0   0    1    0    0   0    0    0
## 3  0.000000   0   0    0    0    1   0    0    0
## 4  7.918125   0   0    0    0    0   0    0    0
## 5 11.394335   0   0    0    0    0   0    0    0
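
With many binary predictors, as in the data shown above, the formula `y ~ .` is a convenient way to regress on every other column. The sketch below illustrates this on a small simulated stand-in (10 binary columns, not the HIV data); the coefficients used to generate `y` are made up, and linear regression is only one of many prediction methods that could be tried.

```r
# Simulated binary mutation indicators (stand-in, NOT the HIV data)
set.seed(4)
n = 200; p = 10
sim.x = matrix(rbinom(n*p, 1, 0.2), n, p)
colnames(sim.x) = paste0("p", 1:p)
sim.df = data.frame(y = 5*sim.x[,1] + 3*sim.x[,2] + rnorm(n), sim.x)

# Random split: 150 rows for training, the remaining 50 for testing
i.train = sample(n, 150)

# "y ~ ." regresses y on all other columns of the data frame
fit = lm(y ~ ., data=sim.df[i.train, ])
y.hat = predict(fit, newdata=sim.df[-i.train, ])
test.err = mean((sim.df$y[-i.train] - y.hat)^2)
test.err
```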