Automated Internet Culling: Step 4

Automated Culling of Data from the Internet

Step 4: Cleanup

Chemicals example Baseball example Zip code example

The automation procedure described in steps 1 through 3 will generally produce either many files from which data can be gleaned, or a single or a few files that contain extraneous information. This step describes some techniques for cleaning up these files. In practice the commands from steps 3 and 4 are usually combined into a single script file.

Example 1: Chemicals

Return to chemical example, step 3

The final version of Step 3 for the chemicals example produced a file called chemresult which looks like this:

Chemical CAS#
methane <li><strong>CAS Registry Number:</strong> 74-82-8</li>
methanol <li><strong>CAS Registry Number:</strong> 67-56-1</li>
ethane <li><strong>CAS Registry Number:</strong> 74-84-0</li>
ethanol <li><strong>CAS Registry Number:</strong> 64-17-5</li>
propane <li><strong>CAS Registry Number:</strong> 74-98-6</li>
propanol <li><strong>CAS Registry Number:</strong> 71-23-8</li>
butane <li><strong>CAS Registry Number:</strong> 106-97-8</li>
butanol <li><strong>CAS Registry Number:</strong> 71-36-3</li>
pentane <li><strong>CAS Registry Number:</strong> 109-66-0</li>
pentanol <li><strong>CAS Registry Number:</strong> 71-41-0</li>

There are many ways to clean up a file like this, beyond the manual method of making changes in a text editor. Perhaps the simplest is to use "sed", the stream editor. A command like "sed -e 's|<li><strong>CAS Registry Number:</strong> ||' -e 's|</li>||' chemresult > chem.dat" will use two substitute commands to replace specific text with nothing producing the desired clean output:

Chemical CAS#
methane 74-82-8
methanol 67-56-1
ethane 74-84-0
ethanol 64-17-5
propane 74-98-6
propanol 71-23-8
butane 106-97-8
butanol 71-36-3
pentane 109-66-0
pentanol 71-41-0

An even better solution is to combine steps 3 and 4 using getcheminfo3 which is shown here:

#!/bin/csh
# This script reads chemical names from the file named in the first position of its
# argument list, requests information on these chemicals from webbook.nist.gov using lynx,
# and saves the results to the file named in the second position of its argument list.
# E.g. use getcheminfo2 infile outfile

set tmp = tmpOKtoDelete
echo Chemical CAS\# > $2
foreach chemical (`cat $1`)
  echo Looking up $chemical
  lynx -source "http://webbook.nist.gov/cgi/cbook.cgi?Name=$chemical&Units=SI" | grep Registry > $tmp
  echo $chemical `sed -e 's|<li><strong>CAS Registry Number:</strong> ||' -e 's|</li>||' $tmp` >> $2
end
rm $tmp

This is a great version of our CAS number finder! The lynx output is piped through "grep", and the single line containing the word "Registry" is redirected into the temporary file. Then the chemical's name and the result of the complex command inside the back-single-quotes are appended as a new line in the output file using echo and ">>". The complex command uses "sed" to read its input from the temporary file, and to write as output the line with the offending strings of extraneous matter replaced by nothing.

Choose another example from step 1

Example 2: Baseball

Return to baseball example, step 3

To clean up the baseball files, I will demonstrate two useful programs I wrote and which are freely available for your use. The files scanhtml.c and scanhtml2.c are c program files which can be compiled to produce executable files that are very useful for cleaning up culled data. Intel/Linux executables, scanhtml and scanhtml2 are available here.

You can try scanhtml on any plain text file. The program uses "standard input", so you will usually pipe the text file into scanhtml. Output goes to standard output, so you can see the results during testing or redirect it to a file or assign it directly to a variable in a csh script. The required arguments are a string of text that identifies the "index" line of the file and a "skip count". E.g. if you run cat players.dat | scanhtml Bill 0 the output will be the first baseball player that has the name Bill, i.e. the great Bill Mazeroski. Try to be as specific as possible, because the first match can be anywhere in the text, so e.g. cat players.dat | scanhtml Dan 0 matches "Lindy McDaniel". Also note that the search is case sensitive.

The return from scanhtml when there is no match is "NA", e.g. try echo abcdef | scanhtml "ghi" 0 at the unix command prompt.

The skip argument is used to force the return value to be 1 or more lines after the index line. E.g., cat players.dat | scanhtml Bill 1 returns "Mike McCormick" and cat players.dat | scanhtml Bill 5 returns "Vada Pinson".

The scanhtml program has optional arguments to force the return value to be a substring of the main return value. The third argument specifies the beginning of the substring and the fourth specifies the end. If a third, but no fourth argument is specified, the substring continues to the end of the line.

Both the substring starting and ending arguments may be a string of characters called a "match string". Be sure to use double quotes is there is an embedded space. Also, you must put a backslash in front of any characters like "<", ">", "&" or "|" if the full argument is not in quotes. For specification of a substring using a match string, the result will NOT include the match string.

An alternative to a match string, either for the start or end of the substring or both, is to use numeric character positions, preceeded by a dash. E.g. echo abcdef | scanhtml ef 0 -2 -4 returns "bcd" because that is the substring starting at the 2nd character and ending at the 4th.

So here's how we use "scanhtml" to make nice clean output for the baseball example. First, here is getbaseball2 which makes nice screen output.

#!/bin/csh
# Find baseball stats
# First argument is a player list file (First Last format), second is a year

if ( XX$1 == XX || XX$2 == XX ) then
  echo use getbaseball PlayerFile Year
  exit
endif

set mainURL = "http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_player_locator_results.jsp"
set altURL = "http://www.mlb.com/NASApp/mlb/mlb/stats_historical/mlb_individual_stats_player.jsp"
set tmp = tmpDeleteThis
set players = `cat $1`
set final = $#players
set index = 1

while ( $index < $final )
  set first = $players[$index]
  @ index++
  set last = $players[$index]
  @ index++

  lynx -source "$mainURL?playerLocator=$last" > $tmp

  # Use scanhtml to return the first line with "No information" or "NA" if no match (unlisted player)
  if ( `cat $tmp | scanhtml "No information" 0` != "NA" ) then
    echo No info on $first $last

  # Use scanhtml to return the first line with "Hitting Stats" or "NA" if no match
  else if ( `cat $tmp | scanhtml "Hitting Stats" 0` != "NA" ) then
    # A unique match was found if there is a "Hitting Stats" section.
    # Need to find first ">year<" after "Hitting Stats" because pitchers have Hitting Stats second.
    # Use scanhtml2 to find 1st through 5th lines after the match from n-th character
    #   up to just before the "<".
    set team = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 1 -67 \<`
    if ( "$team" == "NA" ) then
      echo $first $last did not play in $2
    else
      set games = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 2 -50 \<`
      set atbat = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 3 -50 \<`
      set runs = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 4 -50 \<`
      set hits = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 5 -50 \<`
      echo $first $last G=$games B=$atbat R=$runs H=$hits $team
    endif

  # Last possibility is multiple players with this last name.
  else
    # Need to find player ID number that matches first and last names.
    # Use scanhtml to find the player ID from character 127 to just before the quote mark
    # on the line with the full name.
    set ID = `cat $tmp | scanhtml "$first $last" 0 -127 \"`
    if ( "$ID" == "NA" ) then
      echo No info on $first $last
    else
      # Slightly different URL is needed with ID rather than last name.
      lynx -source "$altURL?playerID=$ID" > $tmp
      # Now it is the same as for a unique match.

      # But did the player hit this year?
      set team = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 1 -67 \<`
      if ( "$team" == "NA" ) then
        echo $first $last did not play in $2
      else
        set games = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 2 -50 \<`
        set atbat = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 3 -50 \<`
        set runs = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 4 -50 \<`
        set hits = `cat $tmp | scanhtml2 "Hitting Stats" \>$2\< 5 -50 \<`
        echo $first $last G=$games B=$atbat R=$runs H=$hits $team
      endif
    endif
  endif
end

Note that we first use "scanhtml" (with no optional substring arguments) instead of grep to detect which kind of initial match we have (unique, multiple or none). Then, for the multiple match case, we use "scanhtml" with the optional arguments to get the substring starting at character 127 and going up through the quotation mark. This is the playerID we are looking for, and we use it to call lynx again and get our specific player. Finally, we use "scanhtml2" to find the desired year of play after the words "Hitting Stats", and we go 2 through 5 lines forward to retrieve the specific information we want, which starts at character 50 and ends with the "<". Of course, this whole script will stop working if mlb.com changes the format of their web site!

Finally, if we want to send our output to a file, e.g. readable in S-plus, we make the minor modifications found in getbaseball3 , and execute "getbaseball3 players.dat 1960 1960stats.dat".

Example 3: Zip codes

Return to zip code example, step 3

The zip code output from step 3 is not too bad. Here is a small improvement called getzip2.pl .

#!/usr/bin/perl -w
use strict;

my $address_file = "firms.dat";
my $URL = "http://www.usps.gov/cgi-bin/zip4/zip4inq2";

my $PS1 = "\"Firm=";
my $PS2 = "\&Urbanization=\&Delivery+Address=";
my $PS3 = "\&City=Pittsburgh\&State=PA\&Zip+Code=\"";

open (INPUT, "<$address_file") || (die "Can't open address file");
 
# for each line in the address list 
while (my $firm = ) {
    chomp $firm;
    my $xfirm = $firm;
    $xfirm =~ s/\s/\+/g;
    my $address=;
    chomp $address;
    $address =~ s/\s/\+/g;
    my $poststring = $PS1 . $xfirm . $PS2 . $address . $PS3;
    my @html_source = `echo $poststring | lynx -dump -post_data \"$URL\"`;
    print $firm, " ", htmlnum(\@html_source, "PITTSBURGH", 0), "\n";
}

close(INPUT);



sub htmlnum {
    my ($intext, $find, $skip) = @_;
    my @text = @$intext;
    my ($index, $cnt, $line);
    for ($index=0; $index<=$#text; $index++) {
        $line=$text[$index];
        if (($line =~ m/$find/)>0) {
            $line = $text[$index+$skip];
            $line =~ m/([-\d].+)/;
            return $1;
        } 
    }
    return undef;
}

The only change is that the final print statement makes a call to the "html_num" subroutine which return only the zipcode.

The word "sub" is used to declare a subroutine in perl. Perl subroutines have no explicit argument list, rather they use the "@_" array, which must contain only scalars. Therefore we pass a reference (indicated by the backslash in front of @html_source) to an array of strings (the whole web page) as the first argument. The second argument is the string to search for. Third argument is how many lines to skip after finding the second argument. The first line of the subroutine assigns names to the passed arguments. The second line de-references the reference to the web page text. The third line declares some local variables.

The loop in the subroutine looks are each line until the first one containing the $find string is found. Then "$line = $text[$index+$skip];" pulls out the correct line, based on $skip. Next the "regular expression" "([-\d].+)" specifies looking for any sequence of digits and dashes. Look here for some help on regular expressions. Finally, "$1" is returned, since that indicates whatever matched to the first set of parentheses in the most recent regular expression. In our case the zip+4 code is returned.

Return to main culling page

Send comments/suggestions/corrections to: