Cosma Shalizi > 36-757 (Fall 2010)
Some Advice on Process for ADA, or, Riding the Big Hairy Research Project
ADA is supposed to take a large part of your thought and energy for the next
year or so. It is (most likely) bigger and more complicated than anything
you've done before. Staying on top of a piece of work like this is harder than
staying on top of more limited assignments. Many graduate programs teach this
only through attrition,
i.e., people who don't figure it out on their own leave the program. Here,
however, are some pointers, based on what's worked for me and my friends. Your
mileage will vary.
(Some of this will carry over to your dissertation; some of it is even more
general advice for scholarly practice.)
Have a plan
Right now, what you should do for ADA is probably quite vague; the only
clear thing might be that it will be a lot of work. This makes it hard to
know what to do. Clarifying what you should be doing, and developing a
plan, can help lift vagueness-induced paralysis. Don't worry about getting
the plan exactly right at first. ("Truth comes more readily out of error
than out of confusion".)
- Figure out what your data is (you've all already done this)
- Figure out what your investigator's scientific problem is (they've told
you this in what they think is adequate detail)
- Translate the scientific problem into a statistical problem — i.e.,
what analysis you can do to the data which would solve the real-world
problem.
- Figure out what a solution to the statistical problem would look like,
how you'd know that you've solved it.
- Identify the major components of the solution, say 2 to 4 big pieces which,
fitted together, would solve your problem. These will probably themselves
be kind of vague, so —
- Recurse: take those components as statistical problems themselves and
break them down into simpler sub-problems. Keep going until you have
turned everything into sub-sub-(sub-)problems which are simple enough
that you have an idea of how to attack each one of them. It can help
to draw this out as a tree, of problems branching into sub-problems.
(If you can't figure out how to break up a part of the problem, don't
worry about that right now; just get at least one part of the tree
down in enough detail that you can begin.)
You have now reduced your problem to a collection of sub-problems
which are small enough to attack; list them and get started.
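The tree of problems and sub-problems described above can also be kept in a few lines of code instead of on paper. What follows is only a minimal sketch: the `Problem` class, the `leaves` function, and every problem name in the example plan are hypothetical, made up for illustration. The leaves of the tree are the pieces small enough to start on.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Problem:
    """One node in the plan: a problem and the sub-problems it breaks into."""
    name: str
    subproblems: list[Problem] = field(default_factory=list)


def leaves(problem: Problem) -> list[str]:
    """Return the names of the problems that are not broken down any further."""
    if not problem.subproblems:
        return [problem.name]
    todo = []
    for sub in problem.subproblems:
        todo.extend(leaves(sub))  # recurse into each branch, left to right
    return todo


# Hypothetical plan; every name here is invented for the example.
plan = Problem("Does the treatment change the outcome?", [
    Problem("Clean the data", [
        Problem("Check for missing values"),
        Problem("Plot each variable"),
    ]),
    Problem("Fit a baseline model"),
    Problem("Check the model's fit"),
])

for task in leaves(plan):
    print("-", task)
```

A branch you can't yet break down simply stays a leaf with a vague name; that is a reminder to come back to it, not a reason to delay starting on the others.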
Write as you go
Remember that ADA culminates in producing a paper which should be good
enough to submit to a journal. Typically, people produce a report in
a last-minute burst of writing, but this usually leads to both stress and
poor writing. It is better to start now and keep working on the writing
as you go. You will produce a better piece of work with less stress, and
your ideas will be clearer.
You have likely had the experience of going to a teacher or class-mate
for help with a problem, and finding that as you explained what you needed
help with, you realized what you needed to do. Writing as you go means
explaining what you are doing to yourself, which has many of the same
benefits.
- Write a draft right now
- You've just worked out a detailed description of what your problem is,
what a solution would look like, and your plan for getting from the
problem to the solution. Turn this into a draft report, with
paragraphs and complete sentences and so forth. Leave place-holders for
the stuff you've still got to do.
- If, in writing your draft, you realize that your plan is not as clear
as you thought it was, go back and revise the plan.
- Update your draft regularly
- Writing a draft now and not touching it until April is still better than
not starting writing until April. But better still is the slow steady
application of time to the writing. You might try setting aside, say, 30
minutes once a week to revise your draft — incorporate what you have learned
and done that week, revise the prose, make or improve figures, etc. Even if
you feel like you haven't done enough new work that week to spend 30 minutes
writing it up, you can certainly spend the time improving your write-up.
- Integrating writing and coding
- Lots of what will be in your report (figures, tables, etc.) will be the
result of computational analysis. While documenting what you did to get your
results is a good practice anyway (see below), it can be especially useful to
embed your analysis code into your write-up, so that the associations are
clear. The easy way, which is what I do, is to simply paste the relevant parts
of your R (or whatever) session/code into the LaTeX file of your document, and
then comment out the code. Sweave is a more elaborate system which will
actually re-run your R code, re-doing the
analysis live. This provides a stronger guarantee that you are writing about
what you actually did, but it is more work and more run-time.
- (Most of you are probably better programmers than I am, but on the
off-chance that you're not, you might find this helpful.)
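The "easy way" described above, pasting code into the LaTeX file and commenting it out, can be done by hand, but a small helper makes it mechanical. This is a sketch only: the function name `comment_out_for_latex` and the R fragment are hypothetical, and it assumes nothing beyond the fact that LaTeX ignores everything after a `%` on a line.

```python
def comment_out_for_latex(code: str) -> str:
    """Prefix every line of `code` with '% ' so that LaTeX ignores it.

    Blank lines become a bare '%', so the pasted block does not
    introduce a paragraph break into the document.
    """
    return "\n".join("% " + line if line else "%" for line in code.splitlines())


# Hypothetical fragment of an R session; the variable names are made up.
r_code = """fit <- lm(y ~ x, data = df)

summary(fit)"""

print(comment_out_for_latex(r_code))
```

The printed block can be pasted next to the figure or table it produced, so the association between result and code stays visible in the source of the write-up.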
Track your work in writing
Your memory is small, unreliable and fleeting, the project is large and
complicated, distractions are numerous and contextual cues are weak.
You will come back to things after a month, sometimes a day, and not
remember what you did, or how, or even perhaps why. Fortunately,
the wonderful technology of "writing" will remember for you.
- Write down everything you do
- As much as possible, try to write down everything you did for the project;
how exactly you did it; and save the output. I recommend buying a
physical notebook, which you can keep on your person, and use for
tracking work by hand, as well as maintaining an electronic copy of
what you are doing — either by updating your report, or by keeping
some sort of project log file.
- The standard isn't quite "Notes or it didn't happen", but it's close.
- Write down what you want to do
- Lots of ideas will occur to you for things which might be useful or
interesting to do in connection with the project. Write them down,
again either in the notebook or in a special file of ideas for the
project; revisit those periodically to see if any are worth
pursuing. Writing them down means that you don't have to devote
attention to remembering them!
- Consider a version-control system
- There are lots of free software systems for tracking revisions to
multiple files associated with projects (CVS, SVN/Subversion, the
unfortunately named Git, etc., etc.). These are mostly designed
for software, and have features to make it easier for people to work
on small parts of big projects without getting in each other's way,
but they also make it easy to keep track of revisions to your
documents (and code!), compare changes, and roll back some or all
of your work if it turns out to have been a blind alley. I use Git,
but it's probably not very important which one you use.
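For the electronic project log file mentioned above, something as simple as appending timestamped entries to a text file is enough. The sketch below is one possible shape for it, not a prescribed format: the function `log_entry`, its `what` / `how` / `output` fields, and the file names in the usage example are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def log_entry(log_path: Path, what: str, how: str = "", output: str = "") -> str:
    """Append one timestamped entry to the log file and return the text written."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    lines = [f"[{stamp}] {what}"]
    if how:
        lines.append(f"    how: {how}")        # the exact command or steps used
    if output:
        lines.append(f"    output: {output}")  # where the saved output lives
    entry = "\n".join(lines) + "\n"
    with log_path.open("a", encoding="utf-8") as f:  # "a" appends, never overwrites
        f.write(entry)
    return entry


# Hypothetical usage; the script and file names are invented for the example.
log_entry(
    Path("project_log.txt"),
    what="Refit baseline regression without outliers",
    how="Rscript fit_baseline.R",
    output="results/baseline_no_outliers.csv",
)
```

Because the log is a plain text file, it can sit under the same version control as the code and the draft, and be searched with ordinary tools.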
Revise your plan
There is, or should be, a feedback between your plan and your work. Your
goals constrain what you do, but as you do your work, you learn more about
the problem, and this knowledge tells you about what you can and should do.
In other words, you ought to revise your goals in light of your actions.
Many people have trouble with the idea of deviating from the initial plan,
or wonder what good a revisable plan could be. But the plan was something
you made up to help yourself, a guess about how to guide yourself to a
place where you could better see which way you should go, not a fixed
path you must follow without swerving or meet your doom.
Set aside time periodically to revisit your plan, see what has been
accomplished, what no longer makes sense, and what needs to be changed.
I suggest doing this less often than updating your draft, say every
two weeks or once a month (perhaps around the weeks when you'll present
in class).
Reading
- Get a starting point
- Get initial references from the faculty members you're working with, both
on the scientific problem and on the statistical aspects. Ask
them for reviews, introductions, background, etc., as well as any
specific contributions you are building on, paralleling, debunking, etc.
Try to read broadly and soak in information. You are supposed to be
a collaborator in a process of scientific investigation, not a human
interface to R, so you need to really understand your problem domain
and what is already known about it.
- Work references backwards
- As you read, note what's cited in connection
with topics which puzzle or interest you. Write down the references,
track them down, and start reading the ones which look promising. Also,
ask your advisers about what to read on those topics. Recurse.
- Work references forward
- Use the citation databases to see what's been done which builds on stuff
you've read and found interesting; scan it to see which parts of it might
be useful to you. Now would be a good time to check out Mathematical Reviews,
if you haven't already. It exclusively publishes short summaries and critiques
of papers in other journals and books, explaining their contribution and their
connection to the rest of the literature, and providing, on the website, really
outstanding linkage. It's most useful for work which is of mathematical
interest (as the name suggests), so better for say theoretical statistics
than really applied papers, but definitely worth exploring.
- Expect to have to keep reading
- You should identify the journals which
publish relevant work — they're the ones you're seeing the most
citations to — and start getting their tables of contents in electronic
form, if you're not already. (Basically all journals now offer alerting
e-mails, and most have RSS feeds, if you use those.) Likewise, sign up
for alerts from arxiv.org, the preprint
server. (Actually, you should plan to post your papers, once they are
written, to arxiv.org.) Spend a little time keeping up with
these. (It's easy to spend too much, which brings me to the next
point.)
- Filter ruthlessly
- Almost all academic work you run across will be
completely irrelevant (and/or hopelessly bad). You need to develop skill
in triage: stopping with the title (and/or authors); stopping with the
abstract; stopping with the introduction; stopping with the conclusion;
stopping with a skim over the paper. Only a tiny fraction of papers will
be worth reading in depth.
- Track references
- Develop a system for keeping track of bibliographic information,
both for things you have read and might use, and for things which look
like they might be interesting one day.
- The very old-school approach is hand-written (or typed) 3x5 notecards.
There are probably people who still use this.
- The pretty old-school approach is subject-matter files on your
computer with the bibliographic information. I do this, because
I started in 1994, I'm very practiced with it, and it'd be too much
work to change.
- There are now many bibliographic database systems to help you keep
track of citation data, and in many cases to store PDFs of the
papers. Zotero in particular has many fans among people I trust;
I haven't used it.
- Alternatively, there are plenty of on-line bookmarking services. I use
delicious; citeulike, however, is especially adapted to citation management.
The nice thing
about them, in addition to portability and the possibility of sharing
with co-workers, is that they let you use tagging as a very flexible
categorization scheme — I have stuff tagged as, say, relevant to
this project, that class, and such-and-such subject-matter topics all
at once.
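Tagging, as described above, amounts to letting one reference belong to many categories at once and then looking references up by any combination of them. Here is a minimal sketch of that idea; the `TagIndex` class and the tag names are hypothetical, and real bookmarking or citation tools will have their own interfaces.

```python
from collections import defaultdict


class TagIndex:
    """Keep, for each tag, the set of references that carry it."""

    def __init__(self) -> None:
        self._by_tag = defaultdict(set)

    def add(self, reference: str, *tags: str) -> None:
        """File one reference under any number of tags."""
        for tag in tags:
            self._by_tag[tag].add(reference)

    def tagged(self, *tags: str) -> set:
        """Return the references that carry all of the given tags."""
        if not tags:
            return set()
        sets = [self._by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*sets)


# The two works are the ones recommended in this document; the tags are made up.
index = TagIndex()
index.add("Simon, The Sciences of the Artificial", "ada-project", "design")
index.add("Healy, Choosing Your Workflow Applications",
          "ada-project", "workflow", "teaching")

print(sorted(index.tagged("ada-project")))
print(sorted(index.tagged("ada-project", "workflow")))
```

Unlike a folder hierarchy, nothing forces a reference into a single place: the same paper can be found under the project, the class, and the subject.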
Reading More
Kieran Healy's Choosing Your Workflow Applications is worth reading, but
requires two caveats: some of the advice is targeted at social scientists
rather than statisticians, and under no
circumstances should you ever write a paper in Word. (Really, nothing should
ever be written in Word, but the larger battle was lost long
ago, alas.)
I also strongly, strongly, strongly recommend reading Herbert Simon's The
Sciences of the Artificial; the connection should be clear by the
time you're done.