Name:
Andrew ID:
Collaborated with:
On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas, by Sunday 10pm, this week.
all.equal that you obtain the same data frame (Hint: if it’s not the same, look at the options in write.csv).
df = data.frame(x = 1:10, y = rnorm(10))
df), read it back into R, and check using all.equal that you obtain the same data frame.
df = data.frame(x = c("1", "2", "3"), stringsAsFactors = FALSE)
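As a rough illustration of the write-then-read round trip described above, the sketch below uses a made-up data frame and a temporary file. The names `toy`, `path` and `toy_back` are placeholders for this example only, and `row.names = FALSE` is just one of the `write.csv()` options that can affect the comparison; other data frames may need different options.

```r
# A hypothetical toy data frame (not one of the data frames from the questions)
toy <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))

# Write to a temporary file; by default write.csv() also writes the row
# names as an extra first column, so that is switched off here
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# Read the file back and compare it with the original
toy_back <- read.csv(path)
same <- isTRUE(all.equal(toy, toy_back))
print(same)
```

When `all.equal()` does not return `TRUE`, it returns a character vector describing the differences, which is usually a good pointer to the option that needs changing.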
One useful application of reading and writing data is to save progress in long-running programs. We provide the run_simulation() function below, which inputs a numeric variable initial representing the first number in the sequence, and outputs the number of iterations needed to reach 1. Modify run_simulation() to save the value of current and iter at the end of the body of the while() loop. That is, you should be saving after every iteration of the loop. Here, current is a numeric variable that represents the current number in the sequence being calculated and iter is the number of iterations this function has computed (i.e., the current length of the sequence). Use save() to write current and iter to a file associated with the value of initial (Hint: Use the string representation of initial as a suffix or prefix to the file path.) This file should be overwritten for each iteration (so that it only stores the most recent values of current and iter).
Check that run_simulation(837799) == 524 and display the contents of the associated file by calling load() and then printing current and iter.
next_simulation = function(past) {
  # One step of the sequence: halve even values, map odd values to 3 * past + 1
  if (past %% 2 == 0) {
    return(past / 2)
  } else {
    return(3 * past + 1)
  }
}
run_simulation <- function(initial) {
  current = initial
  iter = 0
  # Count the steps taken until the sequence reaches 1
  while (current != 1) {
    iter = iter + 1
    current = next_simulation(current)
  }
  return(iter)
}
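For readers who have not used `save()` and `load()` before, here is a minimal sketch of how the pair behaves. The variables `a`, `b` and `id`, and the file name prefix `demo_`, are placeholders invented for this example; they are not part of the assignment.

```r
a <- 10
b <- 3

# Build a file path from a number, using its string representation
id <- 42
path <- file.path(tempdir(), paste0("demo_", as.character(id), ".RData"))

# save() stores the named objects; saving again to the same path overwrites it
save(a, b, file = path)
a <- 11
save(a, b, file = path)

# Remove the objects, then restore them under their original names
rm(a, b)
loaded <- load(path)  # load() returns the names of the restored objects
print(loaded)
print(a)
print(b)
```

Note that `load()` puts the objects back under the names they were saved with, so the file only ever reflects the most recent call to `save()`.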
Challenge. This simulation is based on the Collatz sequence. The associated Collatz Conjecture proposes that every sequence defined by repeatedly evaluating next_simulation() will always reach the value 1. If you can provide a full proof of the conjecture you can get an A+ in the course. (You might want to read the link if you are interested…) For normal challenge points, write a program to determine the longest sequence (defined as the iterations required to hit 1) starting with numbers below n. Run your program for inputs n = 100, n = 1000, and n = 1000000. (Hint: You probably want to read about “Memoization” if you try this.)
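To make the memoization hint a little more concrete, the sketch below shows one possible way to cache sequence lengths so that each starting value reuses earlier work. The function name `longest_collatz` and its return format are assumptions made for this illustration, and the code is written for clarity rather than speed, so it may be slow for very large `n`.

```r
# Sketch only: longest sequence among starting values below n (n >= 2),
# using a table of already-computed lengths (memoization).
longest_collatz <- function(n) {
  # steps[k] holds the number of iterations needed to go from k to 1
  steps <- rep(NA_integer_, n - 1)
  steps[1] <- 0L
  for (start in seq_len(n - 1)) {
    if (!is.na(steps[start])) next
    # Walk forward until reaching a value whose length is already cached
    path <- numeric(0)
    current <- start
    while (current >= n || is.na(steps[current])) {
      path <- c(path, current)
      current <- if (current %% 2 == 0) current / 2 else 3 * current + 1
    }
    # path[1] is length(path) steps away from the cached value, path[2] is
    # one step closer, and so on
    lens <- steps[current] + rev(seq_along(path))
    keep <- path < n  # only values below n fit in the table
    steps[path[keep]] <- lens[keep]
  }
  list(start = which.max(steps), iterations = max(steps))
}

res <- longest_collatz(100)
print(res)
```

The design choice here is a plain vector indexed by the starting value; values that climb above `n` during a walk are simply not cached.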
1d. After writing a function to save our results, we now want to write a function that resumes computing our results. Write a function resume_simulation(initial) which loads the progress (if any) from the file associated with initial and resumes the simulation from the loaded value of current and iter. (Note: this function should be pretty similar to run_simulation().) Then, modify the run_simulation() function by adding a quit argument, which should be a positive numeric. This should make run_simulation() artificially quit the simulation early after quit iterations. In this case, run_simulation() should return NA. Set quit=-1 by default, and write the function so quit=-1 means the function will never quit the simulation early. Check that run_simulation(9, quit=3) followed by resume_simulation(9) gives the same answer as run_simulation(9).
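The general "stop early, then pick up from a checkpoint" pattern can be sketched on a toy counting loop, deliberately unrelated to the simulation above. Everything here (`ckpt`, `count_to`, `stop_after`) is a hypothetical name chosen for the illustration, not a required interface.

```r
# Hypothetical checkpoint file for a toy counting loop
ckpt <- tempfile(fileext = ".RData")

count_to <- function(target, stop_after = -1) {
  # Start fresh, or pick up from the checkpoint if one exists
  i <- 0
  if (file.exists(ckpt)) load(ckpt)  # restores i inside this function
  done <- 0
  while (i < target) {
    i <- i + 1
    done <- done + 1
    save(i, file = ckpt)
    # A non-positive stop_after means "never stop early"
    if (stop_after > 0 && done >= stop_after) return(NA)
  }
  i
}

first <- count_to(5, stop_after = 2)  # stops early
second <- count_to(5)                 # resumes from the saved value of i
print(first)
print(second)
```

Because `load()` is called inside the function, the restored object lands in the function's own environment rather than in the global workspace.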
2a. The Social Security Administration, among other things, maintains a list of the most popular baby names. Load the file located at the URL http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/PA.txt into R as a data frame pa.names with variable names State, Gender, Year, Name and Count. This is a fun dataset to browse: for instance you can see the name “Elvis” suddenly jumped in popularity in the mid 1950s. For those interested, we obtained this data from https://www.ssa.gov/oact/babynames/state/namesbystate.zip. Print the first three rows of the data frame.
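As a small sketch of reading a header-less text file while supplying column names, the example below parses two invented lines rather than the real file. It assumes a comma-separated layout with no header row, which you should confirm by inspecting the actual file; `raw_lines` and `toy.names` are placeholder names and the values are not real data.

```r
# Two made-up lines in an assumed comma-separated layout (not real data)
raw_lines <- c("PA,F,1910,Mary,100", "PA,M,1910,John,50")

# col.names supplies the variable names; colClasses pins the column types
# (otherwise a column containing only "F" could be read as logical FALSE)
toy.names <- read.csv(text = raw_lines, header = FALSE,
                      col.names = c("State", "Gender", "Year", "Name", "Count"),
                      colClasses = c("character", "character", "integer",
                                     "character", "integer"))
head(toy.names, 3)
```

The same arguments can be passed when the first argument is a file path or URL instead of `text`.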
2b. The current data frame is ordered by year; create a data frame pa.names.by.count that is ordered by decreasing count. Break ties with the alphabetical order of the names (“A” before “B”). (Hint: check the documentation of order() to figure out how to break ties.) Print the first three and last three rows.
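To illustrate how `order()` handles ties, here is a minimal sketch on an invented four-row data frame. The data frame `toy` is hypothetical, and negating the numeric key is only one of several ways to get a decreasing sort.

```r
toy <- data.frame(Name = c("Bea", "Al", "Cy", "Di"),
                  Count = c(5, 5, 9, 1),
                  stringsAsFactors = FALSE)

# Negating the numeric key sorts it in decreasing order; the second key is
# only consulted when the first key is tied
toy.sorted <- toy[order(-toy$Count, toy$Name), ]
print(toy.sorted)
```

Here the two rows with a count of 5 are tied on the first key, so their relative position is decided by the names.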
2c. Write a function to verify that pa.names.by.count is correctly ordered (including tie breaking). (Hint: use the is.unsorted() function). Your function should take in a data frame with at least two columns named Count and Name, and should verify that Count is in decreasing order and that Name is in alphabetical order for rows with the same value of Count. Test that your function works correctly on two toy data frames of at most 6 rows where one data frame is correctly ordered (i.e., should return TRUE when passed into your function) and the other is not (i.e., should return FALSE). You will need to construct these toy data frames yourself. Then use your function to verify that pa.names.by.count is correctly ordered.
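A quick primer on `is.unsorted()` may help here. The sketch below only shows how the function behaves on small vectors (the variable names are placeholders); it is not a full solution to the question.

```r
# is.unsorted() returns FALSE when a vector is in non-decreasing order
inc <- is.unsorted(c(1, 2, 2, 3))

# A decreasing vector counts as "unsorted" ...
dec <- is.unsorted(c(3, 2, 1))

# ... so one way to test for decreasing order is to negate the values first
dec.neg <- is.unsorted(-c(3, 2, 2, 1))

# It also works on character vectors; strictly = TRUE treats ties as unsorted
strict <- is.unsorted(c("a", "b", "b"), strictly = TRUE)

print(c(inc, dec, dec.neg, strict))
```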
3a. Load the file at the URL http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/NC.txt as nc.names using the same variable names as pa.names. Count how many names nc.names has in common with pa.names. Similar to pa.names, make sure the variables in nc.names are called State, Gender, Year, Name and Count. Print the first three rows of nc.names.
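Counting shared values between two vectors can be sketched with `intersect()`, which ignores duplicates. The vectors `a` and `b` below are invented for the example.

```r
a <- c("Mary", "John", "Mary", "Ann")
b <- c("John", "Ann", "Ann", "Lee")

# intersect() returns each shared value once, however often it is repeated
common <- intersect(a, b)
n.common <- length(common)
print(common)
print(n.common)
```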
3b. Merge the two files to create a data frame manual.merged which contains columns for counts in each state. The resulting data frame should have columns Name, Gender, Year, PA Counts, NC Counts. If a name does not appear in one of the data frames, make the count in the merged data frame under the appropriate column equal to zero. Do not use the merge() function. Print the first three and last three rows of the merged data frame. You do not need to write this as a function. (Hint: you might want to follow a similar strategy as what we did in the lab when we manually merged the winning male and female sprinters based on the country and year.)
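One possible way to combine two tables by hand is to look up each key with `match()`. The sketch below does this for two invented data frames with a single key column; the names `left`, `right`, `all.keys` and `combined` are hypothetical, and this is not necessarily the strategy used in the lab.

```r
left  <- data.frame(key = c("a", "b"), n.left = c(1, 2),
                    stringsAsFactors = FALSE)
right <- data.frame(key = c("b", "c"), n.right = c(5, 7),
                    stringsAsFactors = FALSE)

# All keys that appear in either table
all.keys <- union(left$key, right$key)

# match() gives the row of each key in a table, or NA if the key is absent
combined <- data.frame(key = all.keys,
                       n.left = left$n.left[match(all.keys, left$key)],
                       n.right = right$n.right[match(all.keys, right$key)],
                       stringsAsFactors = FALSE)

# Absent keys show up as NA; replace them with zero
combined$n.left[is.na(combined$n.left)] <- 0
combined$n.right[is.na(combined$n.right)] <- 0
print(combined)
```

When rows are identified by several columns, a single key can be built first, for instance by pasting those columns together with `paste()`.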
3c. Verify that your answer to 3b is correct by using merge() to create merge.merged. Check that merge.merged is equivalent (up to ordering of the rows) to manual.merged using all.equal (with some reordering).
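As a hedged sketch of an outer join with `merge()` followed by a row-order-insensitive comparison, consider the invented data frames below. The names `left`, `right`, `merged` and `other` are placeholders, and resetting the row names is one assumption about what "some reordering" may involve.

```r
left  <- data.frame(key = c("a", "b"), n.left = c(1, 2),
                    stringsAsFactors = FALSE)
right <- data.frame(key = c("b", "c"), n.right = c(5, 7),
                    stringsAsFactors = FALSE)

# all = TRUE keeps keys found in either table; missing counts become NA
merged <- merge(left, right, by = "key", all = TRUE)
merged$n.left[is.na(merged$n.left)] <- 0
merged$n.right[is.na(merged$n.right)] <- 0

# The same content with rows in a different order
other <- data.frame(key = c("c", "a", "b"), n.left = c(0, 1, 2),
                    n.right = c(7, 0, 5), stringsAsFactors = FALSE)

# Sort both the same way and reset row names before comparing
other.sorted <- other[order(other$key), ]
rownames(other.sorted) <- NULL
same <- isTRUE(all.equal(merged, other.sorted))
print(same)
```

Row names are part of what `all.equal()` compares, which is why they are reset after reordering.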