We continue examining the diffusion of tetracycline among doctors in Illinois in the early 1950s, building on our work in labs 7 and 8. You will need the data sets ckm_nodes.csv and ckm_network.dat from the labs.
Clean the data to eliminate doctors for whom we have no adoption-date information, as in the labs. Only use this cleaned data in the rest of the assignment.
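As a rough illustration of the cleaning step, here is a minimal Python sketch on made-up toy data. It assumes a missing adoption date is marked by `None` and a doctor who never adopted by infinity. Both conventions, and all the values below, are assumptions for illustration and are not taken from the real files. The point it shows is that a dropped doctor has to disappear from the node data and from both the rows and the columns of the network.

```python
import math

# Toy stand-ins for ckm_nodes.csv and ckm_network.dat (hypothetical values).
# Assumption: None marks a missing adoption date; math.inf marks "never adopted".
adoption_date = [1, None, 3, math.inf, None]
network = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

# Positions of the doctors whose adoption date is known.
keep = [i for i, d in enumerate(adoption_date) if d is not None]

# Drop the same doctors from the node data and from both dimensions of the network.
clean_dates = [adoption_date[i] for i in keep]
clean_network = [[network[i][j] for j in keep] for i in keep]
```

Note that doctors who never adopted are kept: only doctors with no adoption-date information are removed.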
Create a new data frame that records, for every doctor and every month, whether that doctor began prescribing tetracycline that month, whether they had adopted tetracycline before that month, the number of their contacts who began prescribing strictly before that month, and the number of their contacts who began prescribing in that month or earlier. Explain why the data frame should have 6 columns and 2125 rows. Try not to use any loops.
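The shape of such a table can be sketched in Python on a toy example. Everything here is hypothetical: three doctors, a four-month window, made-up adoption dates, and column names chosen for illustration. The table has one row per (doctor, month) pair, built with comprehensions rather than explicit loops, so the number of rows is the number of doctors times the number of months.

```python
import math
from itertools import product

# Hypothetical cleaned data: adoption month per doctor (math.inf = never adopted)
# and a symmetric 0/1 contact matrix. Real values would come from the cleaned files.
adoption_date = [1, 3, math.inf]
network = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
months = range(1, 5)  # toy study window of 4 months


def contacts_adopting(doctor, month, strict):
    """Count contacts of `doctor` who adopted before `month` (strict) or by `month`."""
    return sum(
        1
        for linked, date in zip(network[doctor], adoption_date)
        if linked and (date < month if strict else date <= month)
    )


# One record per (doctor, month) pair, with six fields each.
rows = [
    {
        "doctor": d,
        "month": m,
        "began_this_month": adoption_date[d] == m,
        "adopted_before": adoption_date[d] < m,
        "contacts_before": contacts_adopting(d, m, strict=True),
        "contacts_by_now": contacts_adopting(d, m, strict=False),
    }
    for d, m in product(range(len(adoption_date)), months)
]
```

The two contact counts differ only in whether the comparison with the current month is strict, which is why a single helper with a `strict` flag is enough in this sketch.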
For quibblers, pedants, and idle hands itching for work to do: The \(p_k\) values from problem 3 aren’t all equally precise, because they come from different numbers of observations. Also, if each doctor with \(k\) adoptee contacts is independently deciding whether or not to adopt with probability \(p_k\), then the variance in the number of adoptees will depend on \(p_k\). Say that the actual proportion who decide to adopt is \(\hat{p}_k\). A little probability (exercise!) shows that in this situation, \(\mathbb{E}[\hat{p}_k] = p_k\), but that \(\mathrm{Var}[\hat{p}_k] = p_k(1-p_k)/n_k\), where \(n_k\) is the number of doctors in that situation. (We estimate probabilities more precisely when they’re really extreme [close to 0 or 1], and/or we have lots of observations.) We can estimate that variance as \(\hat{V}_k = \hat{p}_k(1-\hat{p}_k)/n_k\). Find the \(\hat{V}_k\), and then re-do the estimation in (4a) and (4b) where the squared error for \(p_k\) is divided by \(\hat{V}_k\). How much do the parameter estimates change? How much do the plotted curves in (4c) change?
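To make the weighting concrete, here is a small Python sketch using invented counts. The numbers are not from the tetracycline data, and the straight-line model \(p_k = a + bk\) is assumed only for illustration; substitute whichever models you fit in problem 4. It computes \(\hat{p}_k\) and \(\hat{V}_k\), defines a squared error in which each term is divided by \(\hat{V}_k\), and minimizes it in closed form for the assumed line.

```python
# Hypothetical counts, not from the real data:
# n[k] = number of doctors observed with k adopting contacts,
# adopters[k] = how many of those doctors adopted.
n = {0: 200, 1: 150, 2: 80, 3: 20}
adopters = {0: 10, 1: 12, 2: 8, 3: 3}

# Observed proportions and their estimated variances p(1-p)/n.
p_hat = {k: adopters[k] / n[k] for k in n}
v_hat = {k: p_hat[k] * (1 - p_hat[k]) / n[k] for k in n}


def weighted_sse(a, b):
    """Squared error for the assumed line a + b*k, each term divided by V-hat_k."""
    return sum((p_hat[k] - (a + b * k)) ** 2 / v_hat[k] for k in n)


def fit_weighted_line():
    """Closed-form weighted least squares for the assumed line, weights 1 / V-hat_k."""
    w = {k: 1 / v_hat[k] for k in n}
    total = sum(w.values())
    k_bar = sum(w[k] * k for k in n) / total
    p_bar = sum(w[k] * p_hat[k] for k in n) / total
    slope = sum(w[k] * (k - k_bar) * (p_hat[k] - p_bar) for k in n) / sum(
        w[k] * (k - k_bar) ** 2 for k in n
    )
    intercept = p_bar - slope * k_bar
    return intercept, slope


a_hat, b_hat = fit_weighted_line()
```

One practical caveat: if some \(\hat{p}_k\) is exactly 0 or 1, then \(\hat{V}_k = 0\) and the division fails, so those values of \(k\) need separate handling. The toy counts above avoid that case.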