Automated Internet Culling: Step 3

Automated Culling of Data from the Internet: Writing automation scripts

Chemicals example Baseball example Zip code example

To automate your database requests, you need to create a script file that sends the appropriate requests, one at a time to the internet, and then prints the results or saves them to a file or files. The key to the whole approach of this tutorial is the use of the program "lynx" with the "-source" option. Lynx is essentially a minimal web browser that includes a single URL option. (One nice alternative is to use R instead of a scripting language and to use its url() function instead of Lynx. A complete example for getting census tracts from addresses is here.)

The first type of script used here is the "csh shell script". If you have some programming experience, but little or none with csh scripts, you can still probably follow and mimic what I do. You can get help with "man csh" on your unix system. One place to see the man page for csh is here . Additional scripting help may be found in links to umd . A good debugging hint is to add -v or -x or -xv to the end of the first line of your script to turn on command echoing, e.g. #!/bin/csh -xv.

For more complicated problems, I use perl scripts. Help on perl can be found at perl.com , and a good tutorial is Robert's . You can also get help with the unix command "perldoc". This is on-line at perldoc.perl.org.

If you can get it setup on your system, there is a great series of perl modules for WWW access and parsing including LWP, HTML::Parser, HTML::LinkExtor, and HTML::TokeParser available at www.cpan.org . With these, you make direct connections to the internet, rather than using "lynx". These are not set up on our system, so I don't discuss them here, but check out the book Data Munging with Perl .

Example 1: Chemicals

The csh script getcheminfo demonstrates the simplest possible approach to data culling. First download the script and use "chmod u+x getcheminfo" to make it executable. Be sure you also have the data input list, chemicals.input in the same directory. Then run "getcheminfo chemicals.input" to download all of the chemical information. The script is:

#!/bin/csh
# This script reads chemical names from the file in it's argument list,
# requests information on these chemicals from webbook.nist.gov using lynx,
# and saves the results to coresponding html files.

foreach chemical (`cat $1`)
  echo Looking up $chemical
  lynx -source "http://webbook.nist.gov/cgi/cbook.cgi?Name=$chemical&Units=SI" > $chemical.html
end

The first line specifies use of the csh shell. Since "cat $1" lists the contents of the first runstring argument, the "foreach" statement succesively assigns the lines in "chemicals.input", i.e. the chemical names, to the variable "chemical". First, the "echo" statement is used to print the chemical name to the terminal, so we can monitor the progress of the script. Then for each chemical, the appropriate URL is passed to the "lynx" web browser by substituting the chemical name into the "Name=" portion of the URL suffix. Finally the ">" redirects the output (the results web page) to the file whose name is constructed by concatenating ".html" onto the chemical name. You can view the files in your web browser using a URL similar to "file:///home/WEIRD/hseltman/mychemdir/methane.html".

This approach, "csh/foreach", works well for many problems. Because of limitations of the capablilities of the csh shell, we'll need to switch to "perl" for more complex problems.

Of course, in most cases we need to further extract information from the results web pages, and we probably don't want to store all the pages before doing that. This cleanup step is Step 4 . But for now a simple example of a cleanup step may be found in the getcheminfo2 script. It uses the unix command "grep" to remove only the line containing the word "Registry" from the result web page, and it uses ">>" to append the results from each chemical to a single output file.

#!/bin/csh
# This script reads chemical names from the file in it's argument list,
# requests information on these chemicals from webbook.nist.gov using lynx,
# and saves the results to the second name it it's argument.
# E.g. use getcheminfo2 infile outfile

set tmp = tmpOKtoDelete
echo Chemical CAS\# > $2
foreach chemical (`cat $1`)
  echo Looking up $chemical
  lynx -source "http://webbook.nist.gov/cgi/cbook.cgi?Name=$chemical&Units=SI" > $tmp
  set CAS = `cat $tmp | grep Registry`
  echo $chemical $CAS >> $2
end
rm $tmp

This script includes the use of the backwards single quotes around the "cat/grep" command combination. The backwards single quotes execute the unix command inside and return the output of the command, in this case putting it into the variable "CAS". If you run this script by typing "getcheminfo2 chemicals.input chemresult" at the unix prompt, you will see that we still would like more cleanup, but this is a big improvement over having to store and read individual html files.

Go on to step 4

Example 2: Baseball

To get files of all the statistics of our list of 1960 Pirates, we need to work harder than in the chemicals example. As mentioned in step 1 , for player with a unique last name (and whose last names are not a substring in another player's last name), a command like http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp?playerLocator=Maris gets us the results we want. But the big problem is that when there are several players with the same last name, we get a "player list" instead of results. When we click on Babe Ruth from the Ruth "player list" we get http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_individual_stats_player.jsp?playerID=121578. When we view the Page Source for the "player list" web page, we find a line in that file that says <td class="a9"> <a href="mlb_individual_stats_player.jsp?playerID=121578" class="a9primary">Babe Ruth</a></td>. Also note, that "Hitting Stats" is present when we get directly to a players page, but not for the "player list". So our task is, when we don't see "Hitting Stats", search the page for the line with the full name of the player, extract the "playerID", and then construct a URL for "lynx" of the form http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_individual_stats_player.jsp?playerID=121578.

Another nicety is to correctly handle a non-exisitent (or misspelled) player by looking for "No information on the player was found." or by noticing that our player is not on the "player list".

Here is getbaseball , a csh script that can get the appropriate baseball web pages:

#!/bin/csh
# Find baseball stats
# First argument is a player list file (First Last format), second is a year

if ( XX$1 == XX || XX$2 == XX ) then
  echo use getbaseball PlayerFile Year
  exit
endif

set mainURL = "http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp"
set altURL = "http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_individual_stats_player.jsp"
set tmp = tmpDeleteThis
set players = `cat $1`
set final = $#players
set index = 1

while ( $index < $final )
  set first = $players[$index]
  @ index++
  set last = $players[$index]
  @ index++

  lynx -source "$mainURL?playerLocator=$last" > $tmp

  # Use grep to return any lines with "No information"
  if ( XX`cat $tmp | grep "No information"` != XX ) then
    echo No info on $first $last

  # Use grep to return any lines with "Hitting Stats"
  else if ( XX`cat $tmp | grep "Hitting Stats"` != XX ) then
    # A unique match was found if there is a "Hitting Stats" section.
    # Save under players name
    mv $tmp $first$last.html

  # Last possibility is multiple players with this last name.
  else
    # Need to find player ID number that matches first and last names.
    # Use grep to find the line with the playerID on the line with the full name.
    set ID = `grep "$first $last" $tmp | sed -e 's/^.*ID=//' -e 's/" cl.*$//'`
    if ( XX$ID == XX ) then
      echo No info on $first $last
    else
      # Slightly different URL is needed with ID rather than last name.
      # Save results under players name
      lynx -source "$altURL?playerID=$ID" > $first$last.html
    endif
  endif
end

If you download getbaseball and players.dat , make "getbaseball" executable with "chmod u+x getbaseball", and run "getbaseball players.dat 1960" you will get html files for the 1960 seasons for each of the players in the file, e.g. HankAaron.html. One way to view these is to enter a URL like file:///home/WEIRD/hseltman/mybaseballdir/HankAaron.html in your browser. In step 4, I show how to clean up the files and make nice output.

There are a variety of fairly complex issues involved in getting this script to work. First, we can't use "foreach" because it trys to handle the first names separately from the last names. The solution is to load the "players.dat" file into the "players" variable using "set players = `cat $1`" and then to refer to the individual words in this variable using square brackets and an index. The special syntax "$#players" tells us how many words are in "players"; this is twice the number of players because we have listed first and last names for each. The while loop cycles through the players. Since we increment the index twice (with @ index++) per iteration, we loop once per player. The first "lynx" line constructs the full URL for the first attempt to find the player (by last name). Be sure to use the quotation marks as shown. The result, and entire web page, is stored in the file "tmpDeleteThis". (There are some better ways to name temporary files, but this isn't too bad.)

We use "grep" in the form of if ( XX`cat $tmp | grep "No information"` != XX ) to see if a particular string is in the html file . Here "!=" means "not equal", and this inequality will be true if and only if the "grep" result is non-blank, i.e. the file "tmpDeleteThis" does contain "No information". If so, we just let the user know that the player is not on file.

The "else if" clause uses "grep" to check for "Hitting Stats". If the string is found, we just use "mv" to rename the file tmpDeleteThis, which contains the whole results web page, to a file name made up of the player's first and last name followed by ".html".

The "else" clause handles neither "No information" nor "Hitting Stats". if our analysis of how the MLB site works is correct, this means that "tmpDeleteThis" contains a web page designed for selecting a player from a list of players. The "set ID =" command uses "sed" to extract the "playerID" from the html line containing the player's full name by substituting nothing for the text before and after the ID. If the ID is blank, the player was not on file, and we just report this conclusion. Otherwise we exectute a second "lynx" command in the alterate format to ask for the information on the player with the playerID we have found, and store this web page into an appropriately named file.

If you go on to Step 4 you will see what I think is an easier alternative to "sed" and "grep", and you will see two ways to get nice output rather than a bunch of html files.

Example 3: Zip codes

To get the zip+4 codes for the firms in our firms.dat file, we need to

read pairs of lines from firms.dat

replace spaces with plus signs

create a "post string" containing the firm's name, address, city and state

send the data to "lynx" in the post format

and process the result. I don't know how do read the pairs of lines using a csh script, so the code below is a perl script. Also, we don't seem to have access to the direct internet accces part of perl, so this code still uses "lynx" to access the web pages.

Here is the perl script getzip.pl :

#!/usr/bin/perl -w
use strict;

my $address_file = "firms.dat";
my $URL = "http://www.usps.gov/cgi-bin/zip4/zip4inq2";

my $PS1 = "\"Firm=";
my $PS2 = "\&Urbanization=\&Delivery+Address=";
my $PS3 = "\&City=Pittsburgh\&State=PA\&Zip+Code=\"";

open (INPUT, "<$address_file") || (die "Can't open address file");
 
# for each line in the address list 
while (my $firm = <INPUT>) {
    chomp $firm;
    my $xfirm = $firm;
    $xfirm =~ s/\s/\+/g;
    my $address=<INPUT>;
    chomp $address;
    $address =~ s/\s/\+/g;
    my $poststring = $PS1 . $xfirm . $PS2 . $address . $PS3;
    my @html_source = `echo $poststring | lynx -dump -post_data \"$URL\"`;
    print $firm, "\n", grep(/PITTSBURGH/,@html_source);
}

close(INPUT);

The three PS# variables are concatenated together with the current data in the "my poststring =" line to create the message that the zip+4 form would have sent to the URL if we were doing manual input. The "my $firm = <INPUT>" statement reads a line from the firms.dat file and stores it into the variable "$firm". (In perl, "scalar variables", as opposed to arrays or hashes, are always prefixed by the dollar sign.) The "chomp" removes a carriage return, if one exists at the end of a string. A similar pair of lines reads the address. The command echo $poststring | lynx -dump -post_data \"$URL\" is sent to unix because it is in single backwards quotes. The return value, the entire results web page, is stored, line by line, in the array "@html_source" because in perl the at sign indicates an array. The perl "grep" command takes an array of strings and returns only those strings that match the specified criterion, which in this case is "contains PITTSBURGH".

Go on to step 4 for the zip code example

Continue on to step 4

Return to main culling page

Send comments/suggestions/corrections to: