Automated Culling of Data from the Internet

The Oxford English Dictionary includes this definition of cull: To gather, pick, pluck (flowers, fruit, etc.). Here I describe how to pick the fruits found in databases on the Internet to collect data for statistical analysis.

The basic concept is that we can use a variety of techniques to automate any manual process you might use to obtain information from the Internet. The specific case of interest is one where you fill in a form on a web page and receive the information of interest on a new web page. This tutorial assumes that you are using a Unix (or Linux) operating system. It may be possible to perform the same steps on a Windows/DOS system, but I probably can't help you with the details.
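For forms submitted with the GET method, "filling in the form" amounts to building a URL whose query string carries the field values, so the whole manual step can be reproduced by constructing that URL in a script. Here is a minimal sketch; the site www.example.com and the field names symbol and range are hypothetical, chosen only to illustrate the pattern:

```shell
# Hypothetical GET form: two fields become name=value pairs in the
# query string, joined by "&" after a "?".
base='http://www.example.com/quote'   # hypothetical form target
symbol='IBM'
range='5d'
url="${base}?symbol=${symbol}&range=${range}"
echo "$url"
```

A non-interactive fetcher such as wget or lynx -dump can then retrieve that URL instead of a browser; values containing spaces or punctuation would first need URL-encoding.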

Special thanks to Paula Pfleiger for introducing me to this whole concept.

The basic steps are:

  Step 1: Decode how and where the information that you manually request is passed.
  Step 2: Create list(s) of the inputs you would otherwise have to type manually to get all the data you want.
  Step 3: Write scripts to automate the data collection procedure.
  Step 4: Extract the data from the retrieved web pages.

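Steps 2 through 4 can be sketched in a short shell script. Everything below is hypothetical: the site www.example.com/zip2geo, the zip field name, and the HTML layout of the result page are stand-ins, and the here-document simulates the pages a real fetch would have saved:

```shell
# Step 2: build the input list (a few hypothetical zip codes).
printf '15213\n15232\n15217\n' > ziplist.txt

# Step 3: loop over the list.  In real use, the body would fetch each
# result page non-interactively, e.g.:
#   wget -q -O "$zip.html" "http://www.example.com/zip2geo?zip=$zip"
# Here we fabricate a result page so the sketch runs offline.
while read zip; do
  cat > "$zip.html" <<EOF
<tr><td>Zip</td><td>$zip</td></tr>
<tr><td>Latitude</td><td>40.44</td></tr>
EOF
done < ziplist.txt

# Step 4: extract the data field from each saved page with sed.
for f in 15213.html 15232.html 15217.html; do
  sed -n 's:.*<td>Latitude</td><td>\([^<]*\)</td>.*:\1:p' "$f"
done
```

The extraction step is just pattern matching on the saved HTML; grep, sed, awk, or Perl all work, and the right pattern depends entirely on how the particular site lays out its result page.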
Some examples include collecting stock prices from a financial database, collecting water source information from the EPA, compiling a price list from a list of ISBN numbers, finding latitude and longitude from a list of zip codes, etc. The only limitation is your imagination and ingenuity (plus, occasionally, limits set by database owners on the number of queries, e.g., at www.555-1212.com).


Comments: hseltman@stat.cmu.edu
