Name:
Andrew ID:
You must work alone.
You will be submitting a knitted HTML file to Canvas, as usual, by Sunday December 9 at 11:59pm.
For full credit on each question, make sure you follow exactly what is asked, and answer each prompt. The total is 66 points (and there are 10 bonus points).
You may only ask clarification questions on Piazza; you may not ask questions for help. This is a final exam, not a lab assignment.
This should go without saying (but we have had several problems in past years): do not cheat. It ends poorly for everyone.
1a (4 points). Simulate 1000 pairs \((X_1,X_2)\) according to \[
X_1 \sim N(0,1.5) \quad \text{and} \quad X_2 \sim N(0,1), \quad \text{independently}.
\] (Here \(N(\mu,\sigma^2)\) denotes the normal distribution with mean \(\mu\) and variance \(\sigma^2\).) Call this data “group A”, and save the data points in a matrix called a.mat, of dimension 1000 x 2. Simulate 1000 pairs \((X_1,X_2)\) according to \[
X_1 \sim N(2.5,1) \quad \text{and} \quad X_2 \sim N(3,1.5), \quad \text{independently}.
\] Call this data “group B”, and save the data points in a matrix b.mat, of dimension 1000 x 2. Compute the column means and column standard deviations of a.mat and b.mat, and show that they’re close to what you would expect.
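For reference, a minimal sketch of one possible way to do this simulation (note that rnorm() takes the standard deviation, not the variance, so variance 1.5 corresponds to sd=sqrt(1.5); the call to set.seed() is just for reproducibility):
set.seed(0)
a.mat = cbind(rnorm(1000, mean=0, sd=sqrt(1.5)), rnorm(1000, mean=0, sd=1))
b.mat = cbind(rnorm(1000, mean=2.5, sd=1), rnorm(1000, mean=3, sd=sqrt(1.5)))
colMeans(a.mat)      # Should be close to (0, 0)
apply(a.mat, 2, sd)  # Should be close to (sqrt(1.5), 1)
colMeans(b.mat)      # Should be close to (2.5, 3)
apply(b.mat, 2, sd)  # Should be close to (1, sqrt(1.5))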
1b (4 points). On a single plot, plot \(X_2\) versus \(X_1\), using all the points from groups A and B. Color the points from group A in black, and those from group B in red. Set the x label to “X1”, the y label to “X2”, and leave the title blank. Make sure the axis limits are just big enough to show all the points on the plot, and set these programmatically (do not set these using hard constants). Add a legend to explain the color coding. Do you see significant overlap in the points from the two groups?
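For reference, one possible approach (a sketch, assuming a.mat and b.mat from Q1a), using range() to set the axis limits programmatically:
plot(a.mat[,1], a.mat[,2], col="black", pch=1, xlab="X1", ylab="X2", main="",
     xlim=range(c(a.mat[,1], b.mat[,1])), ylim=range(c(a.mat[,2], b.mat[,2])))
points(b.mat[,1], b.mat[,2], col="red", pch=1)
legend("topleft", legend=c("Group A", "Group B"), col=c("black", "red"), pch=1)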
1c (3 points). Define a data frame dat of dimension 2000 x 3, as follows. First, stack the rows of a.mat and b.mat together, to get a data frame of dimension 2000 x 2. Then, append a column of 0s and 1s that indicates whether the points (rows) come from group A or B (with 0 for A and 1 for B). Finally, name the columns of dat to be X1, X2, Y, respectively. Show the first 3 and last 3 rows of dat.
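For reference, a minimal sketch (assuming a.mat and b.mat from Q1a):
dat = data.frame(rbind(a.mat, b.mat), c(rep(0, 1000), rep(1, 1000)))
colnames(dat) = c("X1", "X2", "Y")
head(dat, 3)
tail(dat, 3)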
1d (2 points). Perform a logistic regression of Y (response) on X1 and X2 (variables), using the data dat. Use the function glm() with family="binomial", and save the output object as log.mod. Allow for an intercept in the model (the default). Report the intercept, and the coefficients for X1 and X2.
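For reference, a minimal sketch:
log.mod = glm(Y ~ X1 + X2, data=dat, family="binomial")
coef(log.mod)  # Intercept, then coefficients for X1 and X2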
1e (5 points). Let \(\beta_1,\beta_2\) be the coefficients of \(X_1,X_2\) that you computed in the previous logistic regression, and let \(\beta_0\) be the intercept. Consider the line defined by the equation \[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0. \] Via simple algebra, rewrite this equation as \[ X_2 = a + b X_1, \] for particular values of \(a,b\) that you will compute. Then reproduce your plot from Q1b, and plot the line \(X_2 = a + b X_1\) on top, as a thick blue dashed line. (In case you’re curious, this is called the decision boundary from your logistic regression.) Modify your legend so that your line is explained. Does your line look like a reasonable separator of groups A and B?
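For reference, once you have computed a and b, the line can be overlaid on the Q1b plot with abline(); a sketch (the exact legend details are up to you):
abline(a=a, b=b, col="blue", lwd=3, lty=2)
legend("topleft", legend=c("Group A", "Group B", "Decision boundary"),
       col=c("black", "red", "blue"), pch=c(1, 1, NA), lty=c(NA, NA, 2), lwd=c(NA, NA, 3))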
2a (2 points). Refer back to the logistic regression object you computed in Q1d. Using summary(), form a vector of the standard errors of the intercept and coefficients. Display this vector (it should be of length 3).
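For reference, a minimal sketch: the coefficient table from summary() has a "Std. Error" column.
se.vec = summary(log.mod)$coefficients[, "Std. Error"]
se.vec  # Length 3: intercept, X1, X2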
2b. Perform this iterative strategy (in case you’re curious, this is known as bootstrapping) on your data set from Q1c. Hint: a for() loop is probably simplest. What are the standard deviations you get for the intercept and \(X_1\) and \(X_2\) coefficients? Are they close to the standard errors you found in Q2a? (They should be.)
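For reference, a hedged sketch of a bootstrap loop (the number of repetitions below, 500, is just a placeholder; use whatever the prompt for Q2b specifies):
n = nrow(dat)
B = 500  # Hypothetical number of bootstrap repetitions
boot.coefs = matrix(0, nrow=B, ncol=3)
for (b in 1:B) {
  ind = sample(n, n, replace=TRUE)  # Resample rows with replacement
  boot.mod = glm(Y ~ X1 + X2, data=dat[ind, ], family="binomial")
  boot.coefs[b, ] = coef(boot.mod)
}
apply(boot.coefs, 2, sd)  # Compare to the standard errors from Q2a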
2c (2 points). Rerun your code from Q2b, but now sample without replacement (before, you were sampling with replacement). What are the standard errors you estimate in the end? And does this make sense to you?
3a (3 points). Below is a function predict.fun() that your instructor wrote to get predictions out of a logistic regression model that was fit on two variables. The user passes in mod, a fitted logistic regression object (from glm()), and x1 and x2, values of the two variables at which he/she wants to make predictions. The function is supposed to return the predicted class (0 or 1) at each value of x1 and x2 that is passed. Note that x1 and x2 can be vectors; so, e.g., if these are of length 5, then the function should also return a vector of length 5, giving the class predictions for each of the 5 pairs of x1 and x2 points.
Unfortunately, your instructor was sloppy and introduced bugs in his implementation. Fix them, then check that the function returns the outputs as specified in the comments for the examples below. Note that log.mod is your fitted logistic regression object from Q1d. When done, remove eval=FALSE in the option for the code chunk. Hint: the argument type="response" in the call to predict() is not a bug. This makes sure that the predict() function returns probabilities from the logistic model; if the probability is above 1/2, we predict class 1, and if it is below 1/2, we predict class 0.
predict.fun = function(mod, x1, x2) {
prob = predict(mod, newdata=data.frame(x1=x1, x2=x2), type="response")
if (prob <= 0.5) return(0)
else return(1)
}
predict.fun(log.mod, -10, -10) # Should return 0
predict.fun(log.mod, 10, 10) # Should return 1
predict.fun(log.mod, c(-10,10), c(-10,10)) # Should return c(0,1)
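For reference, an illustration of the general technique only (not necessarily the intended fix): a vectorized way to threshold a vector of probabilities at 1/2 is
probs = c(0.1, 0.7, 0.4)  # Hypothetical probabilities
as.numeric(probs > 0.5)   # Returns c(0, 1, 0)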
3b. Below is another function predict.fun2() that your instructor wrote, which serves the same purpose (and has the same argument structure) as predict.fun(), but uses a different strategy. The previous one calls predict() on a fitted logistic regression object; this one directly extracts the coefficients from the logistic regression object, call them \(\beta_0,\beta_1,\beta_2\) in math notation, and then computes the linear combination \[
\beta_0 + \beta_1 X_1 + \beta_2 X_2
\] for each value of the variables \(X_1\) and \(X_2\) in consideration. It then compares this linear combination to 0: if below 0, it predicts class 0, and if above 0, it predicts class 1. There are several bugs in predict.fun2(). Fix them, check that you get the outputs as specified in the comments for the examples below, then remove eval=FALSE.
predict.fun2 = function(mod, x1, x2) {
link = sum(c(x1,x2) * coef(mod))
if (link <= 0) return(0)
else return(1)
}
predict.fun2(log.mod, -10, -10) # Should return 0
predict.fun2(log.mod, 10, 10) # Should return 1
predict.fun2(log.mod, c(-10,10), c(-10,10)) # Should return c(0,1)
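For reference, again just an illustration of the general idea (not necessarily the intended fix): with an intercept and two coefficients, the linear combination can be computed for whole vectors at once, e.g.,
beta = c(1, -2, 0.5)              # Hypothetical (intercept, X1, X2) coefficients
x1 = c(-10, 10); x2 = c(-10, 10)
beta[1] + beta[2]*x1 + beta[3]*x2 # One value per (x1, x2) pair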
3c (8 points). Below is a function plot.predictions() that your instructor wrote that is supposed to take a fitted logistic regression model mod, and make predictions from this model over a grid of x1 and x2 values, where the grid is 100 x 100 and has endpoints defined by x1.lim and x2.lim. The function is then supposed to produce a plot that colors the points on this grid by their predicted classes, with col0 for class 0 and col1 for class 1 (these are both strings to be passed in by the user, but with default values of “black” and “red”, respectively).
Your instructor didn’t introduce any bugs (phew!) but stopped short and didn’t finish the implementation. Complete the missing code, as indicated by the comments. Then run it with appropriate arguments to produce a plot based on your logistic regression model log.mod from Q1d. Set the grid endpoints x1.lim and x2.lim so that your plot is just big enough to capture all the data from Q1a. Then, on top of the plot you get from plot.predictions(), plot the data points themselves, and the decision boundary, just as you did in Q1e (you may copy over your code from Q1e here; just make sure everything is on one plot in the end).
plot.predictions = function(mod, x1.lim, x2.lim, col0="black", col1="red") {
# Define x1 and x2 variables over the grid. Fill in the ?? below, then
# uncomment the lines
#x1.grid = seq(??, ??, len=100)
#x2.grid = seq(??, ??, len=100)
x1 = rep(x1.grid, times=100)
x2 = rep(x2.grid, each=100)
# Call predict.fun() to get the predicted classes. (You can change this
# to predict.fun2() if you want; just use whichever one you got working)
pred = predict.fun(mod, x1, x2)
# Define the colors based on the predicted classes. Remember that col0
# is for class 0, and col1 is for class 1. Fill in the ??, uncomment
#cols = ifelse(??, ??, ??)
# Plot the grid points with the right colors. Fill in the ??, uncomment
#plot(??, ??, col=cols, xlab="X1", ylab="X2", pch=20, cex=0.1)
}
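For reference, a small standalone illustration of the grid-building pattern used above (with a 5 x 5 grid and an arbitrary coloring rule, just to show the mechanics of seq(), rep(), and ifelse()):
g1 = seq(0, 1, len=5)
g2 = seq(0, 2, len=5)
x1 = rep(g1, times=5)
x2 = rep(g2, each=5)
cols = ifelse(x1 + x2 > 1.5, "red", "black")  # Arbitrary rule, for illustration
plot(x1, x2, col=cols, xlab="X1", ylab="X2", pch=20)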
4a. Read the horse colic data set into R using read.table(); make sure to set stringsAsFactors=FALSE; save the resulting data frame as horse.df; check that it has dimension 300 x 28; and set its column names to the string vector nms below. Print out the first 5 rows (and all columns) of horse.df to the console. To read more about the data set, visit https://archive.ics.uci.edu/ml/datasets/Horse+Colic.
nms = c("Surgery", "Age", "Hosp_ID", "Temp_Rect", "Pulse", "Resp", "Temp_Extr",
        "Pulse_Peri", "Mucous", "Cap_Time", "Pain", "Peris", "Ab_Dist", "Tube",
        "Reflux", "Reflux_PH", "Feces", "Abdomen", "Cell_Vol", "Total_Protein",
        "Ab_Appear", "Ab_Total_Prot", "Outcome", "Lesion", "Lesion_Site",
        "Lesion_Type", "Lesion_Subt", "Lesion_Code")
4b (3 points). How many columns in horse.df are of character type, and how many are of integer type? How many columns have missing values (represented by “?” in this data set), and what are they (what are their column names)? The UCI website reports that about 30% of this data set is missing. Is this accurate? What do you calculate the missingness percentage to be?
4c (3 points). Convert all of the “?” values in horse.df to NA values. After this, convert all of the character columns to numeric, but make sure that the horse.df object is still a data frame (just explicitly cast it to one, if you need to; otherwise you won’t be able to use tidyverse + pipes in what follows). When done, print the first 5 rows (and all columns).
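For reference, a minimal sketch in base R:
horse.df[horse.df == "?"] = NA                           # "?" becomes NA
horse.df = as.data.frame(lapply(horse.df, as.numeric))   # Character columns to numeric
horse.df[1:5, ]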
In the following questions, use the tidyverse (specifically, dplyr verbs) along with pipes to construct your answer. Each part should require just a single flow of pipes.
4d (3 points). Arrange the horses (the rows in horse.df) by Surgery (1 means yes surgery, 2 means no surgery), and then by descending values of Pulse. Report the Surgery, Pulse, Temp_Rect (rectal temperature), and Ab_Dist (abdominal distention: 1 being best, 4 being worst) variables, for all horses with Pulse at least 130.
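For reference, a hedged sketch (assuming the tidyverse is loaded):
horse.df %>%
  arrange(Surgery, desc(Pulse)) %>%
  filter(Pulse >= 130) %>%
  select(Surgery, Pulse, Temp_Rect, Ab_Dist)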
4e (3 points). For each Pain level (1 being best, 5 being worst), compute the average Pulse (and make sure to exclude NA values in this calculation), then report the Pain and average Pulse in decreasing order of average Pulse.
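For reference, a hedged sketch (you may also want to drop rows with a missing Pain value, depending on how you interpret the prompt):
horse.df %>%
  group_by(Pain) %>%
  summarize(Avg_Pulse = mean(Pulse, na.rm=TRUE)) %>%
  arrange(desc(Avg_Pulse))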
4f (4 points). For horses with Pain at least 3 and irregular Feces (1 means regular, 2 through 4 mean irregular), produce a plot of Temp_Rect versus Pulse, then report a cross-table of Ab_Dist versus Outcome. (By this, we mean a table that counts the number of horses with each combination of Ab_Dist and Outcome levels. To check your work, there should be 11 horses with Ab_Dist equal to 1 and Outcome equal to 1.) Hint: use the “tee” operator %T>%.
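For reference, a hedged sketch; note that %T>% comes from the magrittr package (library(magrittr)) and lets the plot happen as a side effect while the filtered data continues down the pipe:
horse.df %>%
  filter(Pain >= 3, Feces %in% 2:4) %T>%
  { plot(.$Pulse, .$Temp_Rect, xlab="Pulse", ylab="Temp_Rect") } %>%
  { table(.$Ab_Dist, .$Outcome) }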
Challenge (10 points). A test data set, on 68 additional horses, is available at https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.test. Using the original data set from Q4a (call this the training data set), fit a suitable model to predict Ab_Dist (abdominal distention) from the other variables. You should treat Ab_Dist as a categorical variable, not as a continuous variable. Use your fitted model to make predictions on the test data set, and report the percentage of correctly classified horses (the percentage of horses in the test set for which your model predicted the correct level of Ab_Dist). Compare this to the percentage of correctly classified horses from the “naive” model, which simply predicts the most common level of Ab_Dist observed in the training set. Does your model do significantly better? Note: missingness here is a challenge to deal with. If you were to simply remove all horses in the training set that had any missing data, then you would be left with 6 total (recall, you started with 300) … so this is not a good idea. A better idea is to replace all missing values by the mean of that variable (this is a simple form of imputation).
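For reference, a hedged sketch of simple mean imputation in base R (one possible approach; typically you would impute the predictors, not the response you are trying to predict):
impute.means = function(df) {
  for (j in 1:ncol(df)) {
    df[is.na(df[, j]), j] = mean(df[, j], na.rm=TRUE)  # Replace NAs by the column mean
  }
  return(df)
}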