The first step is to find out what request, when sent across the Internet, returns the data you want for some sample input. This request takes the form of a URL (uniform resource locator), often including a special suffix (the query string) following a question mark.
As you might expect, you can enter a new URL directly (not using the Name box), e.g., http://webbook.nist.gov/cgi/cbook.cgi?Name=ethane&Units=SI to pull up ethane's information. We can alter the search, e.g., by checking the "calorie-based" units and checking both "Ion energetics" and "Mass spectrum". The resulting URL for a methane search, http://webbook.nist.gov/cgi/cbook.cgi?Name=methane&Units=CAL&cMS=on&cIE=on, makes it easy to see how to modify the details of the request.
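The same kind of URL can be assembled in a program. Here is a minimal Python sketch, assuming only the parameter names visible in the two URLs above (Name, Units, cMS, cIE). The helper name webbook_url is made up for illustration and is not part of any NIST interface.

```python
from urllib.parse import urlencode

BASE_URL = "http://webbook.nist.gov/cgi/cbook.cgi"

def webbook_url(name, units="SI", options=()):
    """Build a WebBook request URL (hypothetical helper, for illustration)."""
    params = [("Name", name), ("Units", units)]
    # Each checked box shows up in the URL as <option>=on, e.g. cMS=on.
    params += [(option, "on") for option in options]
    return BASE_URL + "?" + urlencode(params)

print(webbook_url("ethane"))
print(webbook_url("methane", units="CAL", options=("cMS", "cIE")))
```

The two printed URLs should match the ones shown above, which is a quick check that the pattern has been understood correctly.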
Before proceeding, it is a good idea to check what happens when you enter faulty input data. Try entering a bogus chemical name, and you will see that the result is a web page that starts "Name not Found". If we want to build a robust data culling system, we should include error checking that looks for "Name not Found" as the error result.
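One way to do this error check is sketched below in Python. It assumes the only error signal is the text "Name not Found" somewhere in the returned page; the function names are made up for illustration, and the real page may need a stricter test.

```python
from urllib.request import urlopen

ERROR_MARKER = "Name not Found"

def is_error_page(page_text):
    """Return True if the page contains the error text seen for bogus names."""
    return ERROR_MARKER in page_text

def fetch_checked(url):
    """Fetch a page and raise LookupError if it is the error page."""
    with urlopen(url) as response:
        page_text = response.read().decode("utf-8", errors="replace")
    if is_error_page(page_text):
        raise LookupError("server reported: " + ERROR_MARKER)
    return page_text

# Made-up page fragments, only to show the check itself:
print(is_error_page("<title>Name not Found</title>"))
print(is_error_page("<h1>Ethane</h1>"))
```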
You may now continue on to look at more complex examples of step 1, or you may want to skip ahead to the chemicals example, step 2.
But be careful! If you enter http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp?playerLocator=ruth you get a list of players with last names starting with "ruth", and then you must click on the one you want. If you click on "Babe Ruth", you get his stats under the URL http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_individual_stats_player.jsp?playerID=121578. To see what is going on, choose View/PageSource from your browser window while looking at the player list form. You will find a line in that file that says <td class="a9"> <a href="mlb_individual_stats_player.jsp?playerID=121578" class="a9primary">Babe Ruth</a></td>. The information needed to find Babe Ruth's stats is here, but this turns out to be a rather complicated example where we need to parse the result, see if it is of the stats form or of the player list form, then, if it is the player list form, find the matching player and reconstruct the appropriate URL.
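The last part of that procedure (find the matching player in the list and reconstruct the URL) can be sketched in Python as follows. This is only an illustration: it assumes every player link looks like the line quoted above, with "playerID=" in the href, and the class and function names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

LIST_URL = ("http://www.mlb.com/NASApp/mlb/mlb/stats_historical/"
            "mlb_player_locator_results.jsp?playerLocator=ruth")

class PlayerLinkParser(HTMLParser):
    """Collect (player name, href) pairs from links that carry a playerID."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and "playerID=" in href:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def player_url(page_html, player_name, page_url=LIST_URL):
    """Return the full stats URL for player_name, or None if not listed."""
    parser = PlayerLinkParser()
    parser.feed(page_html)
    for name, href in parser.links:
        if name == player_name:
            # The href is relative, so resolve it against the list page URL.
            return urljoin(page_url, href)
    return None

sample = ('<td class="a9"> <a href="mlb_individual_stats_player.jsp?playerID=121578" '
          'class="a9primary">Babe Ruth</a></td>')
print(player_url(sample, "Babe Ruth"))
```

A result of None means no matching link was found, which a fuller program could treat as a hint that the page is not the player list form.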
Jump to baseball example, step 2
This case is more complicated (technically because the web page uses the "post" method instead of the "get" method). You could learn all the details of forms processing and figure out how to simulate a call from this page, but here is a much easier method: Download the web page that makes the request using File/SaveAs. Edit the file by finding the method="post" line, which is part of the HTML "form" command. (Don't worry if you don't understand all of the HTML codes in the file.) Write down the action= string (which is /cgi-bin/zip4/zip4inq2 for this example), and change it to point to the full URL for the "readpost" program that you set up in your cgi directory. For non-CMU statistics people, here is readpost.c that you can use to create readpost. For CMU stat people, see me for the location of an already set up version.
Now that you have your own version of the lookup form that posts the request to the special "readpost" web page, load this new version, e.g. by putting it into your public_html directory and entering its URL, or by using the appropriate "file:///" URL. Next, enter some sample information, submit the form, and you will see a string showing what request you sent. For this example, the string is Firm=Carnegie+Mellon+University&Urbanization=&Delivery+Address=5000+Forebes+Ave&City=Pittsburgh&State=PA&Zip+Code=&Submit=Process.
Now you can practice submitting your request directly, rather than through the form on the USPS web site. The command: echo "Firm=Carnegie+Mellon+University&Urbanization=&Delivery+Address=5000+Forebes+Ave&City=Pittsburgh&State=PA&Zip+Code=&Submit=Process" | lynx -dump -post_data "http://www.usps.gov/cgi-bin/zip4/zip4inq2" will show the web page with the CMU zip+4 result (15213-3890) on your terminal. (Or you can append "> jnk.html" to save it to a file.) Note that the URL in this Unix command comes from combining the main web site of the zip+4 input form with the "action" part of the HTML "form" command that does the "post".
If you find it simpler, you can store the request string in a file, e.g. called "reqstr", and then use the command lynx -dump -post_data "http://www.usps.gov/cgi-bin/zip4/zip4inq2" < reqstr to get your web page result.
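If you would rather send the post from a program than from lynx, here is a rough Python equivalent using the standard urllib module. It is a sketch only: the URL and request string are the ones from the text and may no longer be answered by the server, and the helper names build_post and submit are made up.

```python
from urllib.request import Request, urlopen

POST_URL = "http://www.usps.gov/cgi-bin/zip4/zip4inq2"
REQUEST_STRING = ("Firm=Carnegie+Mellon+University&Urbanization="
                  "&Delivery+Address=5000+Forebes+Ave&City=Pittsburgh"
                  "&State=PA&Zip+Code=&Submit=Process")

def build_post(url, request_string):
    """Prepare a POST request carrying the already-encoded request string."""
    # Supplying data makes urllib use POST instead of GET; urllib also adds
    # the usual form content type when none is given.
    return Request(url, data=request_string.encode("ascii"))

def submit(request):
    """Send the request and return the resulting web page as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

request = build_post(POST_URL, REQUEST_STRING)
print(request.get_method(), request.full_url)
# page = submit(request)   # uncomment to actually contact the server
```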
After playing around, I find that the only simplification allowed is to drop "&Submit=Process". Also note that within a field, we will need to replace spaces with plus signs.
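The space-to-plus replacement does not have to be done by hand. In Python, for instance, urllib.parse.urlencode does it for both field names and values. In the sketch below, the field names are read off the request string shown earlier; the list layout and the helper name request_string are my own choices for illustration.

```python
from urllib.parse import urlencode

# Field names and sample values as they appear on the form, spaces included.
# "&Submit=Process" is left out, since it can be dropped.
FIELDS = [
    ("Firm", "Carnegie Mellon University"),
    ("Urbanization", ""),
    ("Delivery Address", "5000 Forebes Ave"),
    ("City", "Pittsburgh"),
    ("State", "PA"),
    ("Zip Code", ""),
]

def request_string(fields):
    """Encode (name, value) pairs, turning spaces into plus signs."""
    return urlencode(fields)

print(request_string(FIELDS))
```

A list of pairs is used rather than a dictionary so that the field order is written out explicitly, in the same order the form sends it.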
Jump to zip code example, step 2