Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday 10pm, this week.
This week’s agenda: exploratory data analysis, cleaning data, fitting linear models, and using associated utility functions.
Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). Reading it into our R session:
pros.df =
read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df, 3)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
1a. Define pros.df.subset to be the subset of observations (rows) of the prostate data set such that the lcp measurement is greater than the minimum value (the minimum value happens to be log(0.25), but you should not hardcode this value and should work it out from the data). As in lecture, plot histograms of all of the variables in pros.df.subset. Comment on any differences you see between these distributions and the ones in lecture.
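A sketch of one possible approach; the loop over columns mirrors the lecture code (the name pros.df.subset is required by the question, the rest is illustrative).

pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp), ]  # no hardcoding

# One histogram per variable, as in lecture
for (j in 1:ncol(pros.df.subset)) {
  hist(pros.df.subset[,j], xlab=colnames(pros.df.subset)[j],
       main=paste("Histogram of", colnames(pros.df.subset)[j]))
}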
1b. Also as in lecture, compute and display correlations between all pairs of variables in pros.df.subset. Report the two highest correlations between pairs of (distinct) variables, and also report the names of the associated variables. Are these different from the answers that were computed on the full data set?
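One possible sketch, using cor() and then scanning the upper triangle so that each distinct pair is counted once (the intermediate names pros.cor and ind are our own):

pros.cor = cor(pros.df.subset)
round(pros.cor, 3)

pros.cor[lower.tri(pros.cor, diag=TRUE)] = 0   # keep each pair once
ind = arrayInd(order(pros.cor, decreasing=TRUE)[1:2], dim(pros.cor))
cbind(rownames(pros.cor)[ind[,1]], colnames(pros.cor)[ind[,2]])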
1c. Compute, using lm(), a linear regression model of lpsa (log PSA score) on lcavol (log cancer volume). Do this twice: once with the full data set, pros.df, and once with the subsetted data, pros.df.subset. Save the results as pros.lm and pros.subset.lm, respectively. Using coef(), display the coefficients (intercept and slope) from each linear regression. Are they different?
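For example:

pros.lm = lm(lpsa ~ lcavol, data=pros.df)
pros.subset.lm = lm(lpsa ~ lcavol, data=pros.df.subset)
coef(pros.lm)
coef(pros.subset.lm)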
1d. Let’s produce a visualization to help us figure out how different these regression lines really are. Plot lpsa versus lcavol, using the full set of observations, in pros.df. Label the axes appropriately. Then, mark the observations in pros.df.subset by small filled red circles. Add a thick black line to your plot, displaying the fitted regression line from pros.lm. Add a thick red line, displaying the fitted regression line from pros.subset.lm. Add a legend that explains the color coding.
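A sketch of one way to build this plot (colors, sizes, and legend placement are a matter of taste):

plot(pros.df$lcavol, pros.df$lpsa,
     xlab="Log cancer volume", ylab="Log PSA score")
points(pros.df.subset$lcavol, pros.df.subset$lpsa,
       pch=19, cex=0.8, col="red")         # observations in the subset
abline(pros.lm, lwd=3, col="black")        # fit on the full data
abline(pros.subset.lm, lwd=3, col="red")   # fit on the subset
legend("bottomright", lwd=3, col=c("black","red"),
       legend=c("Fit on full data", "Fit on subset"))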
1e. Compute again a linear regression of lpsa on lcavol, but now on two different subsets of the data: the first consisting of patients with SVI, and the second consisting of patients without SVI. Display the resulting coefficients (intercept and slope) from each model, and produce a plot just like the one in the last question, to visualize the two regression lines on top of the data. Do these two regression lines differ, and in what way?
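One possible sketch, using the subset argument of lm() (the names pros.svi.lm and pros.nosvi.lm are our own):

pros.svi.lm = lm(lpsa ~ lcavol, data=pros.df, subset=(svi==1))
pros.nosvi.lm = lm(lpsa ~ lcavol, data=pros.df, subset=(svi==0))
coef(pros.svi.lm)
coef(pros.nosvi.lm)

plot(pros.df$lcavol, pros.df$lpsa,
     xlab="Log cancer volume", ylab="Log PSA score")
points(pros.df$lcavol[pros.df$svi==1], pros.df$lpsa[pros.df$svi==1],
       pch=19, cex=0.8, col="red")         # patients with SVI
abline(pros.svi.lm, lwd=3, col="red")      # fit on patients with SVI
abline(pros.nosvi.lm, lwd=3, col="black")  # fit on patients without SVI
legend("bottomright", lwd=3, col=c("red","black"),
       legend=c("SVI", "No SVI"))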
There are now over 1,000 confirmed planets outside of our solar system. They have been discovered through a variety of methods, with each method providing access to different information about the planet. Many were discovered by NASA’s Kepler space telescope, which observes the “transit” of a planet in front of its host star. In these problems you will use data from the NASA Exoplanet Archive to investigate some of the properties of these exoplanets. (You don’t have to do anything yet, this was just by way of background.)
2a. A data table of dimension 1892 x 10 on exoplanets (planets outside of our solar system, i.e., planets that orbit a star other than our sun) is up at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/exoplanets.csv. Load this data table into your R session with read.csv() and save the resulting data frame as exo.df. (Hint: the first 13 lines of the linked file just explain the nature of the data, so you can use an appropriate setting for the skip argument in read.csv().) Check that exo.df has the right dimensions, and display its first 3 rows.
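For example, assuming the header row immediately follows the 13 descriptive lines:

exo.df = read.csv(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/exoplanets.csv",
  skip=13)
dim(exo.df)      # should be 1892 x 10
head(exo.df, 3)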
2b. The column pl_discmethod of exo.df documents the method by which the planet was discovered. How many planets were discovered by the “Transit” method? How many were discovered by the “Radial Velocity” method? How many different methods of discovery are there in total, and what is the most common? Least common?
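A sketch using table() (the name method.tab is our own):

method.tab = table(exo.df$pl_discmethod)
method.tab["Transit"]           # planets found by the Transit method
method.tab["Radial Velocity"]   # planets found by Radial Velocity
length(method.tab)              # number of distinct methods
names(which.max(method.tab))    # most common method
names(which.min(method.tab))    # least common method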
2c. The last 6 columns of the exo.df data frame, pl_pnum, pl_orbper, pl_orbsmax, pl_massj, pl_msinij, and st_mass, are all numeric variables. Define a matrix exo.mat that contains these variables as columns. Display the first 3 rows of exo.mat.
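For example:

exo.mat = as.matrix(exo.df[, c("pl_pnum", "pl_orbper", "pl_orbsmax",
                               "pl_massj", "pl_msinij", "st_mass")])
head(exo.mat, 3)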
2d. As we can see, at least one of the columns, pl_massj, has many NA values in it. How many missing values are present in each of the 6 columns of exo.mat? (Hint: you can do this easily with vectorization; you shouldn’t use any loops.)
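For example, in one vectorized line:

colSums(is.na(exo.mat))   # number of NAs in each column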
2e. Define ind.clean to be the vector of row indices corresponding to the “clean” rows in exo.mat, i.e., rows for which there are no NAs among the 6 variables. (Hint: again, you can do this easily with vectorization; you shouldn’t use any loops.) Use ind.clean to define exo.mat.clean, a new matrix containing the corresponding clean rows of exo.mat. How many rows are left in exo.mat.clean?
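For example:

ind.clean = which(rowSums(is.na(exo.mat)) == 0)  # rows with no NAs
exo.mat.clean = exo.mat[ind.clean, ]
nrow(exo.mat.clean)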
2f. Yikes! You should have seen that, because there are so many missing values (NA values) in exo.mat, we only have 7 rows with complete observations! This is far too little data. Because of this, we’re going to restrict our attention to the variables pl_orbper, st_mass, and pl_orbsmax. Redefine exo.mat to contain just these 3 variables as columns. Then repeat the previous part, i.e., define ind.clean to be the vector of row indices corresponding to the “clean” rows in exo.mat, and define exo.mat.clean accordingly. Now, how many rows are left in exo.mat.clean?
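A sketch, repeating the previous part on the 3 retained columns:

exo.mat = exo.mat[, c("pl_orbper", "st_mass", "pl_orbsmax")]
ind.clean = which(rowSums(is.na(exo.mat)) == 0)
exo.mat.clean = exo.mat[ind.clean, ]
nrow(exo.mat.clean)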
3a. Compute histograms of each of the variables in exo.mat.clean. Set the titles and label the x-axes appropriately (indicating the variable being considered). What do you notice about these distributions?
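For example, looping over columns as before:

for (j in 1:ncol(exo.mat.clean)) {
  hist(exo.mat.clean[,j], xlab=colnames(exo.mat.clean)[j],
       main=paste("Histogram of", colnames(exo.mat.clean)[j]))
}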
3b. Apply a log transformation to the variables in exo.mat.clean, saving the resulting matrix as exo.mat.clean.log. Name the columns of exo.mat.clean.log “pl_orbper_log”, “st_mass_log”, and “pl_orbsmax_log”, respectively, to remind yourself that these variables are log transformed. Recompute histograms as in the last question. Now what do you notice about these distributions?
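A sketch, assuming all remaining values are positive so that the log is defined:

exo.mat.clean.log = log(exo.mat.clean)
colnames(exo.mat.clean.log) = c("pl_orbper_log", "st_mass_log",
                                "pl_orbsmax_log")
# then recompute the histograms exactly as in the last question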
3c. Plot the relationships between pairs of variables in exo.mat.clean.log with pairs(). What do you notice?
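For example:

pairs(exo.mat.clean.log)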
For our exoplanet data set, the orbital period \(T\) is found in the variable pl_orbper, the mass of the host star \(M\) in the variable st_mass, and the semi-major axis \(a\) in the variable pl_orbsmax. Kepler’s third law states that (when the mass \(M\) of the host star is much greater than the mass \(m\) of the planet), the orbital period \(T\) satisfies:
\[ T^2 \approx \frac{4\pi^2}{GM}a^3. \]
Above, \(G\) is Newton’s constant. (You don’t have to do anything yet, this was just by way of background.)
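Taking logs of both sides puts the law in a form suited to linear regression, which will be useful in Question 4: \[
\log T \approx \frac{1}{2}\log\frac{4\pi^2}{G} - \frac{1}{2}\log M + \frac{3}{2}\log a.
\] So a regression of \(\log(T)\) on \(\log(M)\) and \(\log(a)\) should give slopes close to \(-1/2\) and \(3/2\), respectively.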
4a. We are going to consider only the observations in exo.mat.clean.log for which the mass of the host star is between 0.9 and 1.1 (on the log scale, between \(\log(0.9) \approx -0.105\) and \(\log(1.1) \approx 0.095\)), inclusive. Define exo.reg.data to be the corresponding matrix. Check that it has 439 rows. It will help for what follows to convert exo.reg.data to be a data frame, so do that as well, and check that it still has the right number of rows.
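A sketch of one approach (the name ind is our own):

ind = exo.mat.clean.log[,"st_mass_log"] >= log(0.9) &
  exo.mat.clean.log[,"st_mass_log"] <= log(1.1)
exo.reg.data = exo.mat.clean.log[ind, ]
nrow(exo.reg.data)                    # should be 439
exo.reg.data = as.data.frame(exo.reg.data)
nrow(exo.reg.data)                    # still 439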
4b. Perform a linear regression of a response variable \(\log(T)\) onto predictor variables \(\log(M)\) and \(\log(a)\), using only planets for which the host star mass is between 0.9 and 1.1, i.e., the data in exo.reg.data. Save the result as exo.lm, and save the coefficients as exo.coef. What values do you get for the coefficients of the predictors \(\log(M)\) and \(\log(a)\)? Does this match what you would expect, given Kepler’s third law (displayed above)?
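For example:

exo.lm = lm(pl_orbper_log ~ st_mass_log + pl_orbsmax_log, data=exo.reg.data)
exo.coef = coef(exo.lm)
exo.coef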
4c. Call summary() on your regression object exo.lm and display the results. Do the estimated coefficients appear significantly different from zero, based on their p-values? Challenge: use the summary() object to answer the following question. Are the theoretical values for the coefficients of \(\log(M)\) and \(\log(a)\) from Kepler’s third law within 2 standard errors of the estimated ones?
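A sketch for the challenge part, using the coefficient table stored in the summary object (the names exo.sum, est, and se are our own; the theoretical slopes \(-1/2\) and \(3/2\) come from the log form of Kepler’s third law above):

exo.sum = summary(exo.lm)
exo.sum

est = exo.sum$coefficients[2:3, "Estimate"]    # slopes for log(M), log(a)
se  = exo.sum$coefficients[2:3, "Std. Error"]
abs(est - c(-1/2, 3/2)) <= 2*se                # within 2 standard errors?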
Challenge. What value do you get for the intercept in your regression model, and does this match what you would expect, given Kepler’s third law?
4d. Using fitted(), retrieve the fitted values from your regression object exo.lm. Mathematically, this is giving you: \[
\beta_0 + \beta_1 \log(M) + \beta_2 \log(a)
\] for all the values of log mass \(\log(M)\) and log semi-major axis \(\log(a)\) in your data set, where \(\beta_0,\beta_1,\beta_2\) are the estimated regression coefficients. Thus you can also compute these fitted values from the coefficients stored in exo.coef. Show that the two approaches give the same results.
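For example (exo.fitted and exo.fitted.manual are our own names):

exo.fitted = fitted(exo.lm)
exo.fitted.manual = exo.coef[1] + exo.coef[2]*exo.reg.data$st_mass_log +
  exo.coef[3]*exo.reg.data$pl_orbsmax_log
max(abs(exo.fitted - exo.fitted.manual))   # essentially zero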
4e. Similarly, using residuals(), retrieve the residuals from your regression object exo.lm. Mathematically, this is giving you: \[
\log(T) - (\beta_0 + \beta_1 \log(M) + \beta_2 \log(a))
\] for all the values of log orbital period \(\log(T)\), log mass \(\log(M)\), and log semi-major axis \(\log(a)\) in your data set, where again \(\beta_0,\beta_1,\beta_2\) are the estimated regression coefficients. Thus you can also compute these residuals from the coefficients in exo.coef. Show that the two approaches give the same results.
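For example, reusing exo.fitted.manual from the previous sketch:

exo.res = residuals(exo.lm)
exo.res.manual = exo.reg.data$pl_orbper_log - exo.fitted.manual
max(abs(exo.res - exo.res.manual))   # again essentially zero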
4f. Compute the mean of the residuals as computed in the last question (found using either approach). Is it close to 0? Challenge: this is not a coincidence, but a fundamental property of linear regression. Can you explain why this is happening? A clue: rerun the linear regression model, but without an intercept this time; what do you find with the mean of the residuals now?
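A sketch covering both parts (exo.lm.noint is our own name; the “- 1” in the formula drops the intercept):

mean(residuals(exo.lm))         # essentially zero
exo.lm.noint = lm(pl_orbper_log ~ st_mass_log + pl_orbsmax_log - 1,
                  data=exo.reg.data)
mean(residuals(exo.lm.noint))   # no longer zero, in general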
Challenge. Compute the average values of log mass \(\log(M)\) and log semi-major axis \(\log(a)\) over your data set, saving these as st_mass_log_ave and pl_orbsmax_log_ave, respectively. Using predict(), predict from your linear model the log orbital period when the log mass is st_mass_log_ave and the log semi-major axis is pl_orbsmax_log_ave. Mathematically, this is: \[
\beta_0 + \beta_1 \mathrm{avg}(\log(M)) + \beta_2 \mathrm{avg}(\log(a))
\] where \(\mathrm{avg}(\log(M))\) is the average log mass and \(\mathrm{avg}(\log(a))\) is the average log semi-major axis, and again \(\beta_0,\beta_1,\beta_2\) are the estimated regression coefficients. Thus you can also compute this predicted value from the coefficients in exo.coef. Show that the two approaches give the same results. Finally, compare the predicted value (from either approach) to the average log orbital period \(\log(T)\). Are they the same? This is not a coincidence, but again a fundamental property of linear regression. Can you explain why this is happening?
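A sketch of both approaches (pred and pred.manual are our own names):

st_mass_log_ave = mean(exo.reg.data$st_mass_log)
pl_orbsmax_log_ave = mean(exo.reg.data$pl_orbsmax_log)

pred = predict(exo.lm, newdata=data.frame(st_mass_log=st_mass_log_ave,
                                          pl_orbsmax_log=pl_orbsmax_log_ave))
pred.manual = exo.coef[1] + exo.coef[2]*st_mass_log_ave +
  exo.coef[3]*pl_orbsmax_log_ave
c(pred, pred.manual, mean(exo.reg.data$pl_orbper_log))   # all three agree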