Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (and completed outside of class time if need be). You can collaborate with your classmates, but you must list their names above, and you must submit your own lab as a knitted PDF file on Gradescope by 9pm on Monday of next week.

This week’s agenda: understanding training and testing errors, implementing sample-splitting and cross-validation, and trying a bunch of statistical prediction methods (optional).

Q1. Practice with training and test errors

The code below generates and plots training and test data from a simple univariate linear model, as in lecture. (You don’t need to do anything yet.)

set.seed(1)
n = 30
x = sort(runif(n, -3, 3))
y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3))
y0 = 2*x0 + 2*rnorm(n)

par(mfrow=c(1,2))
xlim = range(c(x,x0)); ylim = range(c(y,y0))
plot(x, y, xlim=xlim, ylim=ylim, main="Training data")
plot(x0, y0, xlim=xlim, ylim=ylim, main="Test data")

# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
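
As a point of reference for the training and test sets generated above, here is a minimal sketch of how training and test errors might be computed for a linear fit. The names `lm.1`, `train.err`, and `test.err` are illustrative choices, not names prescribed by the lab, and mean squared error is assumed as the error measure.

```r
# Regenerate the training and test sets exactly as in the lab code above
set.seed(1)
n = 30
x = sort(runif(n, -3, 3))
y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3))
y0 = 2*x0 + 2*rnorm(n)

# Fit a linear model using the training data only
lm.1 = lm(y ~ x)

# Predict at the training inputs and at the test inputs
yhat = predict(lm.1, data.frame(x=x))
y0hat = predict(lm.1, data.frame(x=x0))

# Mean squared error on each set (assumed error measure)
train.err = mean((y - yhat)^2)
test.err = mean((y0 - y0hat)^2)
c(train=train.err, test=test.err)
```

The key point is that the model is fit once, on `x` and `y`, and the test points `x0` and `y0` are used only for evaluation.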

Q2. Sample-splitting with the prostate cancer data

Below we read in the prostate cancer data set that we looked at in previous labs.

pros.df = read.table(
  "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
dim(pros.df)
## [1] 97 10
head(pros.df, 3)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829 TRUE
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189 TRUE
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189 TRUE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
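
Sample-splitting can be sketched as follows: fit on the rows marked as training data, then measure error on the remaining rows. The helper `split.test.err` and the simulated data frame `sim.df` are hypothetical, introduced only so that the example runs without downloading anything. With the prostate data, one might pass `pros.df$train` as the indicator, though which predictors to include is a separate choice.

```r
# Hypothetical helper: fit lm() on rows where is.train is TRUE,
# then return the mean squared error on the remaining (test) rows
split.test.err = function(formula, df, is.train) {
  fit = lm(formula, data=df[is.train, ])
  test.df = df[!is.train, ]
  resp = all.vars(formula)[1]  # name of the response variable
  mean((test.df[[resp]] - predict(fit, test.df))^2)
}

# Simulated data standing in for a real data set (an assumption)
set.seed(2)
n.sim = 100
sim.df = data.frame(x1=rnorm(n.sim), x2=rnorm(n.sim))
sim.df$y = 1 + 2*sim.df$x1 - sim.df$x2 + rnorm(n.sim)

# Random 50/50 split into training and test rows
is.train = sample(rep(c(TRUE, FALSE), length.out=n.sim))
err = split.test.err(y ~ x1 + x2, sim.df, is.train)
err
```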

Q3. Sample-splitting with the wage data

Below we read in the wage data set that we looked at in previous labs.

wage.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/wage.csv", 
                   skip=16)
dim(wage.df)
## [1] 3000   11
head(wage.df, 5)
## year age sex maritl race education region
## 231655 2006 18 1. Male 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Male 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 1. Male 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 1. Male 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 1. Male 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## jobclass health health_ins wage
## 231655 1. Industrial 1. <=Good 2. No 75.04315
## 86582 2. Information 2. >=Very Good 2. No 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 154.68529
## 11443 2. Information 1. <=Good 1. Yes 75.04315
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE

Q4. Cross-validation with the prostate cancer data

# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
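
For orientation, k-fold cross-validation can be sketched as below: assign each row to one of k folds at random, hold out each fold in turn, fit on the rest, and average the held-out errors. The helper `cv.err`, the default `k=5`, and the simulated data frame `cv.df` are all assumptions made for illustration, not part of the lab's required solution.

```r
# Hypothetical helper: k-fold cross-validation estimate of test MSE for lm()
cv.err = function(formula, df, k=5) {
  n = nrow(df)
  folds = sample(rep(1:k, length.out=n))  # random fold label for each row
  resp = all.vars(formula)[1]             # name of the response variable
  fold.errs = numeric(k)
  for (i in 1:k) {
    held.out = folds == i
    fit = lm(formula, data=df[!held.out, ])  # fit on the other k-1 folds
    pred = predict(fit, df[held.out, ])      # predict on the held-out fold
    fold.errs[i] = mean((df[[resp]][held.out] - pred)^2)
  }
  mean(fold.errs)  # average error over the k folds
}

# Simulated data standing in for a real data set (an assumption)
set.seed(3)
cv.df = data.frame(x1=rnorm(100), x2=rnorm(100))
cv.df$y = 1 + 2*cv.df$x1 - cv.df$x2 + rnorm(100)

# Compare two candidate models by their cross-validated error
cv.small = cv.err(y ~ x1, cv.df)
cv.full = cv.err(y ~ x1 + x2, cv.df)
c(small=cv.small, full=cv.full)
```

Unlike a single split, every observation is used for both fitting and evaluation, just never in the same fold.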

Q5. Making predictions on the HIV data set (optional)

Below, we read in some data on HIV from Rhee et al. (2003), “Human immunodeficiency virus reverse transcriptase and protease sequence database”. There are 1073 observations. The response variable (first column) is a measure of drug resistance for a particular HIV drug. The 240 predictor variables (all but the first column) are binary indicators of the presence or absence of a mutation at a particular gene mutation site. The goal is to predict HIV drug resistance from this genetic mutation information. (You don’t have to do anything yet.)

hiv.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/hiv.dat")
dim(hiv.df)
## [1] 1073  241
hiv.df[1:5, c(1,sample(2:ncol(hiv.df),8))]
##           y p50 p239 p219 p135 p111 p20 p121 p152
## 1 14.612804   0    0    0    0    0   0    0    0
## 2 25.527251   0    0    0    0    0   1    0    0
## 3  0.000000   0    0    0    0    0   0    0    0
## 4  7.918125   0    0    1    0    0   0    0    0
## 5 11.394335   0    0    1    1    0   0    0    0
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE