

Automated Culling of Data from the Internet

Step 1: Decoding your manual request

The examples are in order of increasing complexity.
Chemicals example     Baseball example     Zip code example

The first step is to find out what request, when sent across the Internet, returns the data you want for some sample input. This request takes the form of a URL (uniform resource locator), often including a special suffix following a question mark.

Example 1: Chemicals

If you look through the NIST web site at www.nist.gov you will find Search for Species Data by Chemical Name (opens in new window). Here there is a box labelled "Enter a chemical species name". If you enter "methane" and press "Search", you will see a new page with all of the chemical data on methane. Now, find where your browser displays the URL (you may need to turn this on in the View menu). The URL is http://webbook.nist.gov/cgi/cbook.cgi?Name=methane&Units=SI . This is the easy type of database to access, because you can see the input information ("methane") in the URL.
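If you would rather see the pieces of the request spelled out, here is a minimal sketch in Python (chosen only for illustration; it is not part of the tools used on these pages) that splits the URL above into the program address and the suffix after the question mark. The variable names are my own.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://webbook.nist.gov/cgi/cbook.cgi?Name=methane&Units=SI"

parts = urlsplit(url)

# The address of the program that answers the request.
base = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after the question mark, decoded into name/value pairs.
query = parse_qs(parts.query)

print(base)
print(query)
```

The decoded pairs show the input ("methane") under the name "Name", which is exactly the part we will want to vary later.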

As you might expect, you can enter a new URL directly (not using the Name box), e.g., http://webbook.nist.gov/cgi/cbook.cgi?Name=ethane&Units=SI to pull up ethane's information. We can also alter the search, e.g., by selecting the "calorie-based" units and checking both "Ion energetics" and "Mass spectrum". The resulting URL is http://webbook.nist.gov/cgi/cbook.cgi?Name=methane&Units=CAL&cMS=on&cIE=on , which makes it easy to see how to modify the details of the request.
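The URL construction just described can be sketched as a small Python function. The function and argument names are hypothetical, and the pairing of cMS with "Mass spectrum" and cIE with "Ion energetics" is my reading of the abbreviations in the URL above, not something the site documents here.

```python
from urllib.parse import urlencode

BASE = "http://webbook.nist.gov/cgi/cbook.cgi"

def webbook_url(name, units="SI", mass_spectrum=False, ion_energetics=False):
    """Build a request URL for one chemical name (illustrative helper)."""
    params = [("Name", name), ("Units", units)]
    # Assumed mapping: each checkbox adds "<code>=on" to the request.
    if mass_spectrum:
        params.append(("cMS", "on"))
    if ion_energetics:
        params.append(("cIE", "on"))
    return BASE + "?" + urlencode(params)

print(webbook_url("ethane"))
print(webbook_url("methane", units="CAL", mass_spectrum=True, ion_energetics=True))
```

Using urlencode rather than pasting strings together means a name containing spaces or punctuation is escaped for you.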

Before proceeding, it is a good idea to check what happens when you enter faulty input data. Try entering a bogus chemical name, and you will see that the result is a web page that starts "Name not Found". If we want to build a robust data culling system, we should include error checking that looks for "Name not Found" as the error result.
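A check of this kind might look like the following Python sketch. The function name is hypothetical, and the two sample pages are short stand-ins I made up for the test, not real output from the site. The check looks for the phrase anywhere in the page, since the returned file may carry HTML tags before the visible text.

```python
ERROR_MARKER = "Name not Found"

def name_was_found(page_text):
    """Return False when the returned page reports an unknown chemical name."""
    return ERROR_MARKER not in page_text

# Stand-in pages for illustration only.
good_page = "<html><body><h1>Methane</h1></body></html>"
bad_page = "<html><body><h1>Name not Found</h1></body></html>"

print(name_was_found(good_page))
print(name_was_found(bad_page))
```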

You may now continue on to look at more complex examples of step 1, or you may want to continue right on to chemicals example, step 2.


Example 2: Baseball

If you look through the major league baseball site at www.mlb.com , you will come to Historical Stats (opens in new window). Here there is a box labelled "PLAYER LOCATOR". If you enter "Maris" and press "GO", you will see a new page with Roger Maris's statistics. Find where your browser displays the URL (you may need to turn this on in the View menu). The URL is http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp?playerLocator=Maris . This is the easy type of database to access, because you can see the input information ("Maris") in the URL. As you might expect, you can enter a new URL directly (not using the Player Locator box), e.g., http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp?playerLocator=ruthven to pull up Dick Ruthven's stats.

But be careful! If you enter http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp?playerLocator=ruth you get a list of players with last names starting with "ruth", and then you must click on the one you want. If you click on "Babe Ruth", you get his stats under the URL http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_individual_stats_player.jsp?playerID=121578. To see what is going on, choose View/PageSource from your browser window while looking at the player list form. You will find a line in that file that says <td class="a9"> <a href="mlb_individual_stats_player.jsp?playerID=121578" class="a9primary">Babe Ruth</a></td>. The information needed to find Babe Ruth's stats is here, but this turns out to be a rather complicated example where we need to parse the result, see if it is of the stats form or of the player list form, then, if it is the player list form, find the matching player and reconstruct the appropriate URL.
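To make the "find the matching player and reconstruct the URL" part concrete, here is a rough Python sketch. It assumes that every player link in the list form carries "playerID=" in its href, as the one line quoted above does; the class and function names are my own. Deciding whether a returned page is the stats form or the list form would need a look at real pages, so here the function simply returns None when it finds no matching link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PlayerLinkParser(HTMLParser):
    """Collect (player name, href) pairs for links that carry a playerID."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "playerID=" in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def find_player_url(page_html, page_url, wanted_name):
    """Return the full stats URL for wanted_name, or None if no link matches."""
    parser = PlayerLinkParser()
    parser.feed(page_html)
    for name, href in parser.links:
        if name.lower() == wanted_name.lower():
            # The href is relative, so resolve it against the list page's URL.
            return urljoin(page_url, href)
    return None

LIST_URL = ("http://www.mlb.com/NASApp/mlb/mlb/stats_historical/"
            "mlb_player_locator_results.jsp?playerLocator=ruth")
sample = ('<td class="a9"> <a href="mlb_individual_stats_player.jsp?playerID=121578" '
          'class="a9primary">Babe Ruth</a></td>')

print(find_player_url(sample, LIST_URL, "Babe Ruth"))
```

Run on the line quoted from the page source, this rebuilds the same playerID URL that the browser shows after clicking on "Babe Ruth".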

Jump to baseball example, step 2


Example 3: Zip codes

http://www.usps.gov/ncsc/lookups/lookup_zip+4.html (which opens in a new window) is the zip code lookup page for the US Postal Service. To look up the zip code for Carnegie Mellon University, enter "Carnegie Mellon University" for the Firm, "5000 Forbes Avenue" for the Delivery Address, "Pittsburgh" for the City, and "PA" for the State. When you click "Process", a new web page with the zip+4 code appears. Its URL is "http://www.usps.gov/cgi-bin/zip4/zip4inq2", and since this does not include the input data, it cannot be the whole request we need to send to the Internet.

This case is more complicated (technically because the web page uses the "post" method instead of the "get" method, so the input data travels in the body of the request rather than in the URL). You could learn all the details of forms processing and figure out how to simulate a call from this page, but here is a much easier method: Download the web page that makes the request using File/SaveAs. Edit the file by finding the method="post" line, which is part of the HTML "form" command. (Don't worry if you don't understand all of the HTML codes in the file.) Write down the action= string (which is /cgi-bin/zip4/zip4inq2 for this example), and change it to point to the full URL for the "readpost" program that you set up in your cgi directory. For non-CMU statistics people, here is readpost.c that you can use to create readpost. For CMU stat people, see me for the location of an already set up version.
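If you prefer not to edit the saved file by hand, the same change can be sketched in Python. This is only an illustration under some assumptions: the action value is quoted, the first method="post" form is the one you want, and the readpost address shown (www.example.edu) is a made-up placeholder for wherever you installed readpost. The sample form is a cut-down stand-in, not the real USPS page.

```python
import re

FORM_TAG = re.compile(r"""<form\b[^>]*\bmethod\s*=\s*["']?post["']?[^>]*>""",
                      re.IGNORECASE)
ACTION = re.compile(r"""\baction\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)

def redirect_form(html_text, readpost_url):
    """Return (original_action, edited_html) for the first method="post" form."""
    form = FORM_TAG.search(html_text)
    if form is None:
        raise ValueError("no form with method=post found")
    tag = form.group(0)
    action = ACTION.search(tag)
    if action is None:
        raise ValueError("the form tag has no quoted action= string")
    original = action.group(2)
    # Swap only the action value, leaving the rest of the tag untouched.
    new_tag = tag[:action.start(2)] + readpost_url + tag[action.end(2):]
    edited = html_text[:form.start()] + new_tag + html_text[form.end():]
    return original, edited

# Cut-down stand-in for the saved lookup page.
saved = ('<html><body><form method="post" action="/cgi-bin/zip4/zip4inq2">'
         '<input name="Firm"></form></body></html>')
original, edited = redirect_form(saved, "http://www.example.edu/cgi-bin/readpost")

print(original)
print(edited)
```

The function hands back the original action string as well, since you will need it later to build the real request.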

Now that you have your own version of the lookup form that posts the request to the special "readpost" web page, load this new version, e.g. by putting it into your public_html directory and entering its URL, or by using the appropriate "file:///" URL. Next, enter some sample information, submit the form, and you will see a string showing what request you sent. For this example, the string is Firm=Carnegie+Mellon+University&Urbanization=&Delivery+Address=5000+Forebes+Ave&City=Pittsburgh&State=PA&Zip+Code=&Submit=Process.

Now you can practice submitting your request directly, rather than through the form on the USPS web site. The command: echo "Firm=Carnegie+Mellon+University&Urbanization=&Delivery+Address=5000+Forebes+Ave&City=Pittsburgh&State=PA&Zip+Code=&Submit=Process" | lynx -dump -post_data "http://www.usps.gov/cgi-bin/zip4/zip4inq2" will show the web page with the CMU zip+4 result (15213-3890) on your terminal. (Or you can append "> jnk.html" to save it to a file.) Note that the URL in this Unix command comes from combining the main web site of the zip+4 input form with the "action" part of the HTML "form" command that does the "post".
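The combining step in the note above is the usual rule for resolving a relative address against the page it came from. As a small Python illustration (the variable names are mine):

```python
from urllib.parse import urljoin

form_page = "http://www.usps.gov/ncsc/lookups/lookup_zip+4.html"
action = "/cgi-bin/zip4/zip4inq2"

# An action that starts with "/" replaces the whole path of the form page.
post_url = urljoin(form_page, action)
print(post_url)  # http://www.usps.gov/cgi-bin/zip4/zip4inq2
```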

If you find it simpler, you can store the request string in a file, e.g. called "reqstr", and then use the command lynx -dump -post_data "http://www.usps.gov/cgi-bin/zip4/zip4inq2" < reqstr to get your web page result.
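For comparison, a rough Python counterpart of this lynx command is sketched below. It only builds the request object; the lines that would actually send it are left as a comment because they need network access and a server that still answers at this address, which I have not verified. The function name is hypothetical, and "reqstr" is the file name used above.

```python
import urllib.request

POST_URL = "http://www.usps.gov/cgi-bin/zip4/zip4inq2"

def build_post_request(request_file, url=POST_URL):
    """Read a stored request string and wrap it in a POST request object."""
    with open(request_file, "r", encoding="ascii") as fh:
        body = fh.read().strip()
    # Supplying data makes urllib send the request with the POST method.
    return urllib.request.Request(url, data=body.encode("ascii"))

# To send it and read the resulting page (not run here):
#   with urllib.request.urlopen(build_post_request("reqstr")) as resp:
#       page = resp.read().decode("latin-1")
```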

After playing around, I find that the only simplification allowed is to drop "&Submit=Process". Also note that within a field, we will need to replace spaces with plus signs.
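The plus-sign substitution does not have to be done by hand. As a sketch, Python's urlencode builds a request string of this shape from the form inputs given at the start of this example, with "Submit" left out as noted above. The field names are the ones seen in the captured request; the variable names are mine.

```python
from urllib.parse import urlencode

# Field names as they appear in the captured request string.
fields = [
    ("Firm", "Carnegie Mellon University"),
    ("Urbanization", ""),
    ("Delivery Address", "5000 Forbes Avenue"),
    ("City", "Pittsburgh"),
    ("State", "PA"),
    ("Zip Code", ""),
]

# urlencode replaces spaces with plus signs in both names and values.
body = urlencode(fields)
print(body)
```

A list of pairs is used instead of a dictionary only to make the field order explicit; empty fields are kept, since dropping them is not one of the simplifications found to work.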

Jump to zip code example, step 2


Continue on to step 2

