General Comments:

Overall the section does a decent job of explaining the steps for the
SPEW algorithm.  I get the impression that the code is well designed,
and made with an eye towards the future and easy expansion and reuse.
The biggest problem throughout this section is that you don't take
credit for all of the work you have done!  There were a few things
that confused me, which I describe below, but most of my suggestions
are to take more credit for your work.


Specific Comments:

Paragraph 1:
I think you can combine the first two sentences:

"In the previous section we showed that we require population counts,
geographies, and microdata to generate synthetic ecosystems."


Paragraph 2:
In general in this paragraph it feels like you say SPEW does
something, and then say that it happens to solve a problem.  I think
it would read better and be more convincing if instead you framed it
in terms of:  We designed SPEW to do something so that it solves this
problem.

I think you could begin by just saying what exactly the purpose of
SPEW is:

"We created the R package SPEW to provide a general engine for
generation of synthetic ecosystems from the three data sources
described above."

I think beginning with what you did adds emphasis to it, and calls your
accomplishment to the reader's attention.

Then you can list some of the achievements of SPEW:

1) We designed SPEW to read data in a standard format, and this format
helps us understand precisely what our data sources must look like to
generate a synthetic ecosystem.
2) We also designed SPEW to be general enough to create synthetic
ecosystems for any set of data sources.
3) This standardization and generality make it easier to obtain
reliable results and extend the functionality to new methods.

I think changing this framework allows you to take credit for the
great work that you have done!

Figure 1:  You never reference this figure in the text.


Paragraph 3: I wonder if you could remove this paragraph and just
mention where the code is available when you introduce the package at
the beginning of paragraph 2.  All of the things you say about the
benefits of having the code available online are definitely true, but
I like the way Brian said that it's better if you let the reader
realize how awesome it is without having to tell them.

Top of Page 6:

spew should probably be SPEW for consistency purposes.

I think a collection of mutually exclusive sets whose union is the
full set is called a partition.  I wonder if you could define this
here, and use it to make talking about your division into separate
regions easier.
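
To make the definition concrete, here is a minimal sketch of what I
mean by a partition.  It is written in Python rather than R, the tract
labels are made up, and is_partition is a hypothetical helper of my
own, not anything from SPEW.

```python
from itertools import combinations


def is_partition(regions, full_set):
    """True if regions are non-empty, pairwise disjoint, and cover full_set."""
    sets = [set(r) for r in regions]
    if any(len(s) == 0 for s in sets):
        return False
    # Mutually exclusive: no two regions share an element.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(sets, 2))
    # Exhaustive: the union of the regions is the full set.
    covers = set().union(*sets) == set(full_set)
    return disjoint and covers


# Hypothetical tract labels, for illustration only.
all_tracts = {"t1", "t2", "t3", "t4"}
good = [{"t1", "t2"}, {"t3"}, {"t4"}]
overlapping = [{"t1", "t2"}, {"t2", "t3"}, {"t4"}]
incomplete = [{"t1"}, {"t3"}]

print(is_partition(good, all_tracts))         # True
print(is_partition(overlapping, all_tracts))  # False
print(is_partition(incomplete, all_tracts))   # False
```

Stating the two conditions (disjoint and exhaustive) once would let you
say "each region in the partition" for the rest of the section.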

Page 6, paragraph 2:

You talk a lot about PUMS and PUMA, and the similarity of these two
acronyms really confused me.  Maybe it's impossible to not use both
acronyms, but maybe you can try to minimize going back and forth by
rearranging sentences so that they are all about PUMS, then all about
PUMA in this paragraph.

I was a bit confused about how the PUMAs and the tracts interact.  I
think the idea is that the region ID is the PUMA, and within each PUMA
there are tract numbers; the tract numbers alone are not unique, but
the combination of PUMA and tract number is.  If there is this
hierarchy of labeling, I think pointing that out would be useful.
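
If my reading of the labeling is right, it amounts to something like
the sketch below.  The IDs are invented and this reflects only my
guess at the hierarchy, not the actual data format.

```python
from collections import Counter

# Hypothetical (PUMA, tract) labels, invented for illustration.
regions = [
    ("puma_A", 1),
    ("puma_A", 2),
    ("puma_B", 1),  # tract number 1 is reused inside a different PUMA
    ("puma_B", 2),
]

# Count how often each tract number appears when the PUMA is ignored.
tract_counts = Counter(tract for _, tract in regions)
tract_alone_is_unique = all(n == 1 for n in tract_counts.values())

# The (PUMA, tract) pair should identify a region unambiguously.
pair_is_unique = len(set(regions)) == len(regions)

print(tract_alone_is_unique)  # False
print(pair_is_unique)         # True
```

A one-sentence statement of this in the text would have cleared up my
confusion.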


Pseudocode:

for loop line 3: For the pseudocode you say you attach people to
households. Are the people also simulated at some point?

for loop line 4: I wonder if you could name the other data.  Something
like: "Add supplementary variables (e.g., schools, workplaces)".  Then
in the output you can say Synthetic Households, People, and
Supplementary Variables.

The ... at the end of the output makes me think that the output is not
fixed.  I understand that it will be, but it seemed strange to me when
I first saw it.

Last paragraph of page 6:  Again, use SPEW instead of spew for
consistency.

Since you haven't used IPF in this section yet, maybe you can expand
it out here.

My takeaway from this paragraph is that you are currently using simple
random sampling as a way to get a working version of the code
running, but ideally you could use better methods in the future.  I'm
not sure, but I'm guessing the code is designed so that changing the
random sampling step is relatively easy.  If this is the case, you
should say so!  It's another example of how modular your code is and I
think calling attention to that would be great.
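
To illustrate the kind of modularity I am guessing at, here is a
minimal Python sketch in which the sampling step is a swappable
argument.  All of the names (generate_households, the "weight" field,
and so on) are hypothetical, sampling with replacement is my
assumption, and none of this is meant to describe SPEW's actual
interface.

```python
import random


def simple_random_sample(records, n, rng):
    """Draw n records uniformly at random, with replacement (an assumption)."""
    return [rng.choice(records) for _ in range(n)]


def weighted_sample(records, n, rng):
    """Alternative sampler: probability proportional to a 'weight' field."""
    weights = [r["weight"] for r in records]
    return rng.choices(records, weights=weights, k=n)


def generate_households(microdata, count, sampler=simple_random_sample, seed=0):
    """Generate `count` synthetic households using whichever sampler is passed."""
    rng = random.Random(seed)
    return sampler(microdata, count, rng)


# Hypothetical microdata records, for illustration only.
microdata = [
    {"id": "h1", "weight": 2.0},
    {"id": "h2", "weight": 1.0},
    {"id": "h3", "weight": 0.0},
]

uniform = generate_households(microdata, 5)
weighted = generate_households(microdata, 5, sampler=weighted_sample)
print(len(uniform), len(weighted))  # 5 5
```

If the real code has a seam like the sampler argument, then moving
from simple random sampling to a better method is a one-line change
for the user, and that is worth saying explicitly.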


Section 4.2:

Paragraph 2:  I don't know if you need to tell the reader the specific
breakdown of nodes/processors/cores.  I think just saying you can have
1536 processes running in parallel is sufficient.

Paragraph 3:  While it is true that your code is "embarrassingly
parallel", I don't know that you need to tell the reader this!  Also,
you haven't actually defined embarrassingly parallel, so the reader may
only take away that the task was "embarrassingly" easy.  I don't
actually think this is the case.

As I understand it, "embarrassingly parallel" essentially means that
the sampling process for each tract/PUMA is independent.  I think you
could rewrite this paragraph to say:
1)  the sampling processes are independent (maybe say intuitively why
this is the case as well)
2)  You can take advantage of this independence to run the sampling
processes in parallel.
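
The two points above can be sketched as follows.  This is a toy Python
example with made-up region IDs and records; it uses a thread pool
only to keep the sketch short and portable, whereas a real run would
presumably use separate processes or cluster jobs.  Each region gets
its own seeded random number generator, which is my assumption about
how independence would be enforced.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_region(task):
    """Sample households for one region using only that region's inputs."""
    region_id, records, count = task
    rng = random.Random(region_id)  # per-region seed, no shared state
    return region_id, [rng.choice(records) for _ in range(count)]


# Hypothetical regions: (region ID, microdata records, population count).
tasks = [
    ("puma_A/1", ["h1", "h2", "h3"], 4),
    ("puma_A/2", ["h4", "h5"], 3),
    ("puma_B/1", ["h6", "h7", "h8"], 5),
]

# Run the regions one after another.
serial = dict(map(sample_region, tasks))

# Run the same regions concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = dict(pool.map(sample_region, tasks))

# Because no region depends on another, the results are identical.
print(serial == parallel)  # True
```

The point for the reader is that no region needs anything from any
other region, so the order and placement of the work do not matter.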

While this task may be easier than other parallelization tasks, you
should still take credit for doing it!