Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 9 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 9 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 8. This document contains 15 of the 45 total points for Homework 9.

Reading in, exploring exoplanets

There are now over 1,000 confirmed planets outside of our solar system. They have been discovered through a variety of methods, with each method providing access to different information about the planet. Many were discovered by NASA’s Kepler space telescope, which observes the “transit” of a planet in front of its host star. In these problems you will use data from the NASA Exoplanet Archive to investigate some of the properties of these exoplanets. (You don’t have to do anything yet, this was just by way of background.)
A data table of dimension 1892 x 10 on exoplanets (planets outside of our solar system, i.e., which orbit anothet star, other than our sun) is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/exoplanets.csv. Load this data table into your R session with read.csv() and save the resulting data frame as exo.df. (Hint: the first 13 lines of the linked filed just explain the nature of the data, so you can use an appropriate setting for the skip argument in read.csv().) Check that exo.df has the right dimensions, and display its first 5 rows.
The column pl_discmethod of exo.df documents the method by which the planet was discovered. How many planets were discovered by the “Transit” method? How many were discovered by the “Radial Velocity” method? How method different methods of discovery are there total, and what is the most common? Least common?
The last 6 columns of the exo.df data frame, pl_pnum,pl_orbper, pl_orbsmax, pl_massj, pl_msinij, and st_mass, are all numeric variables. Define a matrix exo.mat that contains these variables as columns. Display the first 5 rows of exo.mat.
As we can see, at least one of the columns, pl_massj, has many NA values in it. How many missing values are present in each of the 6 columns of exo.mat? (Hint: you can do this easily with vectorization, you shouldn’t use any loops.)
Define ind.clean to be the vector of row indices corresponding to the “clean” rows in exo.mat, i.e., rows for which there are no NAs among the 6 variables. (Hint: again, you can do this easily with vectorization, you shouldn’t use any loops.) Use ind.clean to define exo.mat.clean, a new matrix containing the corresponding clean rows of exo.mat. How many rows are left in exo.mat.clean?
Yikes! You should have seen that, because there are so many missing values (NA values) in exo.mat, we only have 7 rows with complete observations! This is far too little data. Because of this, we’re going to restrict our attention to the variables pl_orbper, st_mass, and pl_orbsmax. Redefine exo.mat to contain just these 3 variables as columns. Then repeat the previous part, i.e., define ind.clean to be the vector of row indices corresponding to the “clean” rows in exo.mat, and define exo.mat.clean accordingly. Now, how many rows are left in exo.mat.clean?
Compute histograms of each of the variables in exo.mat.clean. Set the titles and label the x-axes appropriately (indicating the variable being considered.) What do you notice about these distributions?
Apply a log transformation to the variables in exo.mat.clean, saving the resulting matrix as exo.mat.clean.log. Name the columns of exo.mat.clean.log to be “pl_orbper_log”, “st_mass_log”, and “pl_orbsmax_log”, respectively, to remind yourself that these variables are log transformed. Recompute histograms as in the last question. Now what do you notice about these distributions?
Display all correlations between pairs of variables in exo.mat.clean.log. Report, programmatically (not just by eye), the most correlated pair, and the value of this correlation.
Plot the relationships between pairs of variables in exo.mat.clean.log with pairs(). What do you notice?

Linear regression modeling

For our exoplanet data set, the orbital period \(T\) is found in the variable pl_orbper, and the mass of the host star \(M\) in the variable st_mass, and the semi-major axis \(a\) in the variable pl_orbsmax. Kepler’s third law states that (when the mass \(M\) of the host star is much greater than the mass \(m\) of the planet), the orbital period \(T\) satisfies:

\[T^2 \approx \frac{4\pi^2}{GM}a^3.\]

Above, \(G\) is Newton’s constant. (You don’t have to do anthing yet, this was just by way of background.)
We are going to consider only the observations in exo.mat.clean.log for which the mass of the host star is between 0.9 and 1.1 (on the log scale, between \(\log(0.9) \approx -0.105\) and \(\log(1.1) \approx 0.095\)), inclusive. Define exo.reg.data to be the corresponding matrix. Check that it has 439 rows. It will help for what follows to convert exo.reg.data to be a data frame, so do that as well, and check that it still has the right number of rows.
Perform a linear regression of a response variable \(\log(T)\) onto predictor variables \(\log(M)\) and \(\log(a)\), using only planets for which the host star mass is between 0.9 and 1.1, i.e., the data in exo.reg.data. (Hint: use lm(), and study the last slide of the mini-lecture “Exploratory Data Analysis”, or skip ahead to the mini-lecture “Linear Models”, if you need to.) What values do you get for the coefficients of the predictors \(\log(M)\) and \(\log(a)\)? (Hint: use coef().) Does this match what you would expect, given Kepler’s third law (displayed above)?

Hw9 Bonus. What value do you get for the intercept, and does this match what you would expect, given Kepler’s third law (displayed above)?

What are the standard errors associated with the coefficients of \(\log(M)\) and \(\log(a)\)? (Hint: use summary().)

Hw9 Q1 (5 points). Write a function called exo.reg(), that has just one input exo.reg.data.cur, performs a regression of the of the log orbital period, \(\log(T)\), onto the log host star mass and log semi-major axis, \(\log(M)\) and \(\log(a)\), using the data in exo.reg.data.cur, and returns a vector of the 3 regression coefficients (intercept, coefficient for \(\log(M)\), and coefficient for \(\log(a)\)). Verify, that when run with exo.reg.data.cur=exo.reg.data, which should be the default, it gives the same coefficients that you computed previously.

Hw9 Q2 (10 points). Compute jackknife estimates of the standard errors of the estimated regression coefficients of \(\log(M)\) and \(\log(a)\), from the exo.reg.data data set. Your code here should use either a for() loop, or one of the apply() functions, and should involve repeatedly calling the function exo.reg() defined in the last question. (Hint: you may wish to revisit Lab 7w as a refresh on jackknife estimates of standard errors.) How do your estimates here compare to this read off of the summary of the linear model object, reported in the lab above?

Lab 10m: Exploratory Data Analysis

Statistical Computing, 36-350

Monday October 31, 2016

Reading in, exploring exoplanets

Linear regression modeling