We describe Scout, a multithreaded robot infrastructure for Web-based
information retrieval tasks. Scout implements HTTP communications,
document caching, and simple HTML parsing, and can be extended with
programs called rules that perform arbitrary data-processing
tasks on the documents collected. We describe and demonstrate two
simple proof-of-concept applications built using Scout: one that
collects and extracts basic information about a university from its
home page, and one that converts javadoc
documentation
to data structures that can be categorized and manipulated in
interesting ways. We conclude with some observations about the
performance of the sample applications and discuss some future
applications that might be built using Scout.
Scout is a general-purpose Web robot that can be arbitrarily extended with procedures called rules that implement data-processing techniques specialized to the text or data formats retrieved. Initialized with a set of rules and a list of URLs, Scout collects the documents associated with the URLs and provides them to the rules for processing. Rules might parse natural language, interpret markup structure, or analyze binary-data formats.
Rules build data structures called results and store them in a globally accessible table. At any time during a Scout session, a rule may look up results previously generated by itself or any other rule. This facility gives rules a multiple-document memory and allows processing logic to be broken into easily managed functional components. Any rule may also append discovered URLs to Scout's search queue or connect to an external database to export results to structured records. When a session terminates, the results table is stored to disk for postprocessing or to seed a later session.
Scout is a multithreaded Java application structured after the readers-writer model. The Scout thread, the writer, removes URLs from a search queue and requests the associated documents from the Web servers on which they reside using the HyperText Transfer Protocol (HTTP). Successfully collected documents and their HTTP headers are stored in a shared buffer where rules executing as concurrent threads may access them. To maintain synchronization, each rule is required to access the document exactly once, though a rule may at any time choose to release the document without performing any work on it. Scout and the rule threads also synchronize on the URL queue so that Scout can differentiate an empty queue from one that is waiting on a rule to produce a URL.
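The synchronization just described can be illustrated with a minimal sketch of such a shared buffer, built on Java's monitor primitives. This is not Scout's actual DocBuffer code; the class and method names here are assumptions chosen for illustration.

```java
// Hypothetical sketch of the shared document buffer described above.
// One writer (the Scout thread) fills the buffer; every rule thread
// must read each buffered document exactly once before the next fill
// may proceed.
public class DocBuffer {
    private Object document;          // the buffered document (null when empty)
    private long sequenceNumber = -1; // identifies the current document
    private int pendingReaders;       // rules that have not yet read it
    private final int ruleCount;
    private boolean closed;

    public DocBuffer(int ruleCount) { this.ruleCount = ruleCount; }

    // Writer side: block until all rules have seen the previous document.
    public synchronized void fill(Object doc) throws InterruptedException {
        while (pendingReaders > 0) wait();
        document = doc;
        sequenceNumber++;
        pendingReaders = ruleCount;
        notifyAll();
    }

    // Reader side: block until a document newer than lastSeen is buffered,
    // or return null if the buffer is closed and nothing new will arrive.
    public synchronized Object acquire(long lastSeen) throws InterruptedException {
        while (sequenceNumber <= lastSeen && !closed) wait();
        if (closed && sequenceNumber <= lastSeen) return null;
        pendingReaders--;
        notifyAll();
        return document;
    }

    public synchronized void close() { closed = true; notifyAll(); }
}
```

A rule that chooses to release a document without processing it would still call acquire, which is what keeps the writer's count of pending readers consistent.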
The Scout thread avoids collecting redundant documents that could lead to cycling, and caches documents to minimize network traffic. It implements the Robots Exclusion Protocol and can be configured to stall for a specified interval between successive accesses to the same server to reduce remote server load.
Since HTML is the most common format of Web pages, Scout attempts to parse each document as HTML before buffering it, and stores the tags and text separately if the parse is successful. This preprocessing step permits rules to specialize in markup or text processing. As necessary, tags may be mapped back into their positions in the text, or tags and text may be recombined into a normalized HTML document.
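A rough sketch of this kind of tag/text separation, assuming a simple regular-expression scan rather than Scout's actual parser, follows. Recording each tag's offset into the extracted text is what allows tags to be mapped back into position later.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not Scout's parser) of separating an HTML
// document into its tags and its text. Each tag keeps its offset into
// the text stream so it can later be mapped back into position.
public class TagTextSplitter {
    public static class Tag {
        public final String markup;   // the tag itself, e.g. "<b>"
        public final int textOffset;  // position in the extracted text
        Tag(String markup, int textOffset) {
            this.markup = markup;
            this.textOffset = textOffset;
        }
    }

    // Appends every tag found in html to tags and returns the bare text.
    public static String split(String html, List<Tag> tags) {
        Matcher m = Pattern.compile("<[^>]*>").matcher(html);
        StringBuilder text = new StringBuilder();
        int last = 0;
        while (m.find()) {
            text.append(html, last, m.start());       // text before the tag
            tags.add(new Tag(m.group(), text.length()));
            last = m.end();
        }
        text.append(html.substring(last));            // trailing text
        return text.toString();
    }
}
```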
Throughout a Scout session, both the Scout thread and the rules write detailed activity records to a log file. The last few lines of this log can optionally be monitored in a graphical window.
Rules are implemented by extending a base class Rule.class
that
provides standard interactions with Scout, including thread synchronization
and result recording. To perform useful work, subclasses of Rule must override
one method, processDoc()
. This method is called once by the Rule
parent class for each document buffered by the Scout thread.
When the processDoc()
method is called, the following conditions
exist:
- the document is buffered in scout.doc (class Scout.Document) as HTTP headers and body, the latter separated into tags and text if the source was successfully parsed as HTML
- sequenceNumber contains a value that all threads will consistently use to refer to the buffered document
- results (class java.lang.Vector) is initialized empty and ready to receive results from the rule
- scout.ruleResults (class Scout.Results) contains a table of all results previously generated by all running rules
The processDoc()
method performs task-specific work by
examining the buffered document, previous entries in the results table, or both.
The rule may generate any legal Java class objects as results. These are stored
in the local results vector, which is automatically entered into the shared
table when processDoc()
returns. If the rule does not generate any
results, the empty vector is stored to maintain the structure of the results
table and differentiate results that will never be produced from those that
simply have not yet been.
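The rule interface described above might be sketched as follows. The stub Document and Rule classes stand in for Scout's real classes, and their field and method names are assumptions for illustration only.

```java
import java.util.Vector;

// Simplified sketch of the rule interface described above. The stub
// Document and Rule classes stand in for Scout's real classes; their
// names and fields are assumptions for illustration only.
class Document {
    final String text; // the text component of the buffered document
    Document(String text) { this.text = text; }
}

abstract class Rule {
    protected Document doc;                             // the buffered document
    protected Vector<Object> results = new Vector<>();  // per-document results

    // Scout's Rule class calls this once for each buffered document;
    // the results vector is entered into the shared table on return.
    public abstract void processDoc();
}

// A rule that records every line of the document beginning with "Phone:".
class PhoneLineRule extends Rule {
    public void processDoc() {
        for (String line : doc.text.split("\n")) {
            if (line.startsWith("Phone:")) {
                results.addElement(line.substring("Phone:".length()).trim());
            }
        }
        // If nothing matched, results stays empty: the empty vector is
        // still stored, so results that will never be produced can be
        // told apart from those that simply have not yet been.
    }
}
```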
In addition to the java.* packages that are included in Sun's Java Development Kit (JDK), two packages from third-party vendors were used. Ronald Tschalär's HTTPClient [1], provided under the GNU General Public License, provides a more developed interface to Web connections than those given in the java.net package. Pat 1.0 [2] by Steven R. Brandt provides Perl-style regular-expression syntax and is free to educational institutions.
The Hashlookup class that Scout uses to track visited URLs is from Praveen Devulapalli's 1997 Master's Project [3]. This class uses a constant-sized bit map and fast hashing methods for probabilistically equating strings.
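The idea behind such a constant-sized bit map can be sketched as follows. This is not Devulapalli's Hashlookup code; it is an illustrative Bloom-filter-style structure in which two hash functions reduce, but cannot eliminate, the chance of falsely equating two distinct URLs. The structure may occasionally claim an unseen URL was visited, but never the reverse.

```java
import java.util.BitSet;

// Illustrative sketch of a constant-sized bit map for probabilistically
// tracking visited URLs, in the spirit of the Hashlookup class described
// above (not the actual code). Membership is probabilistic: false
// positives are possible, false negatives are not.
public class VisitedSet {
    private final BitSet bits;
    private final int size;

    public VisitedSet(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap, independent-ish string hashes mapped into the bit map.
    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i) + 7;
        return Math.floorMod(h, size);
    }

    // Returns true if the URL was (probably) seen before, and marks it seen.
    public boolean checkAndAdd(String url) {
        boolean seen = bits.get(h1(url)) && bits.get(h2(url));
        bits.set(h1(url));
        bits.set(h2(url));
        return seen;
    }
}
```

The appeal for a robot is that memory use is fixed regardless of how many URLs are encountered, at the cost of occasionally skipping a page that was never actually visited.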
Scout implements the
Robots Exclusion Protocol [4] developed by Martijn
Koster. This protocol is a standard by which Web servers instruct robots not to
access certain paths. The standard relies on the cooperation of robots not to
go where they are not wanted, and is implemented by including a simple text
file named robots.txt
in the server's Web root directory. The
robots.txt
file lists where particular agents are forbidden to
navigate. A simple, commented example follows.
# sample robots.txt file
# The * wildcard means "match any" in either User-agent or Disallow
# lines

# forbid a known rude robot to access anything
User-agent: RudeBot 6.66
Disallow: *

# forbid all other robots to access the cgi-bin and images directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
# end robots.txt
Scout's first request to any server will be for the robots.txt
file. If the server provides such a file and it contains a User-agent field
that includes Scout, any prohibitions for this field will be merged with a
default set of excluded paths and file types and honored for all subsequent
accesses in the current session. If no robots.txt
file is returned
or no User-agent field can be found that applies to Scout, then only the
defaults are used.
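A minimal sketch of the exclusion test follows, assuming the merged Disallow entries are matched as path prefixes with * forbidding everything; Scout's actual matching logic may differ.

```java
import java.util.List;

// Hypothetical sketch of the exclusion check described above: merged
// Disallow entries are treated as path prefixes, and "*" forbids all
// access for the matching User-agent.
public class Exclusions {
    private final List<String> disallowed;

    public Exclusions(List<String> disallowed) {
        this.disallowed = disallowed;
    }

    // Returns true if the robot may fetch the given path.
    public boolean permits(String path) {
        for (String prefix : disallowed) {
            if (prefix.equals("*") || path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```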
This feature can be disabled for searching on well-known hosts but should always be used for searches on unfamiliar servers.
Scout runs on any Java Virtual Machine compatible with Java version 1.1.4 or later. For this example, we assume a generic Unix environment and Sun's Java bytecode interpreter. We also assume that the environment has been configured so that all required class libraries are available.
In addition to the core Scout package and a collection of rule classes, two files are needed to run Scout: an initialization file to set Scout's run-time parameters and an HTML file with a section of markup called a template. The template contains the information needed to load and initialize named rules and can also set named values in a table within Scout so common data can be readily accessed by a number of rules. The format of the initialization file and the markup for templates will be detailed later. For the following example, it is sufficient to know their purposes.
Our example concerns collecting a university name, acronym, and phone
number from a home page at http://www.nku.edu/. The initialization data
and template are contained in the files NKU.ini
and
university.html
respectively. The template associates three
instances of a regular expression matching rule, RegExpRule.class, with
the runtime names UniversityName, Acronym, and PhoneNumber. A fourth rule,
BreadthFirstSearch.class, instantiated under the name BFS, queues URLs
found in the home page. Scout does not search beyond the first page
collected, but the queue is serialized with the rest of the program's
state when it halts, and could be used to resume the session later.
Scout is invoked from the command prompt as shown. It dumps a summary of the configuration file parameters, detailed in Appendix I, to standard output before beginning Web crawling, during which it is silent:
% java Scout.Scout NKU.ini university.html
Configuration File: NKU.ini
[EXTRACTOR]
EntityFile: /a/al/u/al-d7/csgrad/borchers/classes/SGMLKit/entities.txt
[SCOUT]
CacheDir: NKUCache
LogFile: NKU.log
MaxCacheFiles: 256
MaxURLs: 1
NetDelay: 2000
PersistFile: NKU.dat
RequestRobotsFile: true
RestrictDomain: null
RestrictHost: www.nku.edu
SearchCache: true
SearchWeb: true
StartURL: http://www.nku.edu/
UseGUI: true
After running for a few seconds, Scout exits, leaving the following log in the file NKU.log. Editorial comments, indicated by italic text, have been added throughout. Log messages from specific rules are preceded by the names assigned to the instances invoked by the template.
The first section of the log shows how Scout processes the template, the first four directives in which set variables in an internal hash table to regular expressions for matching area codes, phone numbers, capitalized words, and acronyms.
Scout.setvar - Set runtime variable USAREACODE=(\(\d{3}\))|(\d{3})
Scout.setvar - Set runtime variable USPHONENUMBER=\d{3}-\d{4}
Scout.setvar - Set runtime variable CAPWORD=[A-Z][A-Za-z]*
Scout.setvar - Set runtime variable ACRONYM=[A-Z][A-Z]+
Next, the template calls for the rules to be loaded. The three instances of RegExpRule identify themselves as their constructors execute. Each rule reports the pattern it will match. What is not evident here is that the template parameterized RegExpRule with references to the internally stored variables listed above. For example, the PhoneNumber rule was parameterized with 0 or 1 occurrences of USAREACODE, followed by an optional space, followed by a USPHONENUMBER. This was expanded by Scout into the complex expression shown. Such variable parameters are discussed in Appendix II, which details the template syntax.
UniversityName.RegExpRule - ready to search on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
PhoneNumber.RegExpRule - ready to search on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
Acronym.RegExpRule - ready to search on pattern ([A-Z][A-Z]+)
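The variable expansion that produced these patterns might be sketched as a simple substitution pass over the rule's parameter string. The ${NAME} reference syntax used here is an assumption for illustration; Scout's actual template syntax is detailed in Appendix II.

```java
import java.util.Map;

// Hypothetical sketch of expanding named runtime variables into a
// rule's pattern before the regular expression is compiled. The
// ${NAME} reference syntax is assumed for illustration.
public class VarExpander {
    public static String expand(String pattern, Map<String, String> vars) {
        for (Map.Entry<String, String> e : vars.entrySet()) {
            // Replace every literal occurrence of ${KEY} with its value.
            pattern = pattern.replace("${" + e.getKey() + "}", e.getValue());
        }
        return pattern;
    }
}
```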
Scout reports that four rules were initialized and dumps a simple view of the state of each.
Scout.Scout - Loaded 4 rules
Scout.Scout - BFS - {type=D, parse=void, value=null, validate=true, name=BFS, rule=Scout.BreadthFirstSearch}
Scout.Scout - UniversityName - {type=D, parse=String, squeezedoc=true, value=null, validate=true, trim=true, pattern=(University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University), name=UniversityName, rule=Scout.RegExpRule}
Scout.Scout - PhoneNumber - {type=D, parse=string, value=null, validate=true, trim=true, pattern=((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4}), name=PhoneNumber, squeezematch=true, rule=Scout.RegExpRule}
Scout.Scout - Acronym - {type=D, value=null, validate=true, trim=true, pattern=([A-Z][A-Z]+), name=Acronym, rule=Scout.RegExpRule}
Scout looks for a cache of previously collected documents and any previously serialized state data as indicated by the CacheDir and PersistFile fields in the initialization file. In our case the robot is running for the first time, so neither the cache nor the state data exists.
Scout.restoreState - Initialized cache of 0 objects
Scout.restoreState - No state data found. Creating new objects
The session proper begins here as Scout removes the first URL from the queue. For a fresh session, this URL will be the one indicated by the StartURL parameter in the initialization file.
Scout.run - Scout started at Mon Nov 30 21:34:02 EST 1998
The search engine and all rules execute as separate threads that synchronize using the DocBuffer and URLQueue objects, so their log entries are in a nondeterministic order. In the next few lines, the run methods of the four rules start while Scout requests the robot-exclusion data from www.nku.edu and mixes the seven excluded paths returned from the host with its own internally excluded paths and file types. The next-to-last line shows a rule trying to retrieve the first document. Since no such document yet exists, the DocBuffer object forces the rule's thread to wait.
BFS.run - starting
URLQueue.removeFront - returning http://www.nku.edu/
UniversityName.run - starting
Nobots.getHostExclusions - Getting exclusions for host www.nku.edu
Nobots.loadExclusions - Read 7 path exclusions for host www.nku.edu
PhoneNumber.run - starting
Nobots.getHostExclusions - Stored 10 excluded paths and 19 excluded types for www.nku.edu
Acronym.run - starting
Next, Scout requests the first (root) URL from the www.nku.edu host while two of the remaining rules become blocked waiting to access the document. Once the Web server returns the document, Scout parses it into text and tag components. Though no log entry is written to indicate it, the document is also cached at this point.
Scout.getDocument - Requesting URL http://www.nku.edu/ from cache
Scout.getDocument - URL not cached - hitting the Web now
Scout.getDocument - Separated tags and text
DocBuffer.fill - Buffered 2273 bytes of text and 311 tags
Now that the document is available, the rules are unblocked and allowed to access it. The URL queue poses a producer-consumer problem with multiple producers, the rules, and one consumer, Scout. Scout will be blocked from exiting on an empty queue if any rule is still running with the potential to add to the queue.
UniversityName.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.run - acquired document http://www.nku.edu/ [0]
Acronym.run - acquired document http://www.nku.edu/ [0]
BFS.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.processDoc - processing document http://www.nku.edu/ [0]
UniversityName.processDoc - processing document http://www.nku.edu/ [0]
Acronym.processDoc - processing document http://www.nku.edu/ [0]
BFS - Extracting links from URL http://www.nku.edu/
PhoneNumber.processDoc - searching on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
UniversityName.processDoc - searching on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
Acronym.processDoc - searching on pattern ([A-Z][A-Z]+)
Next, the rules finish the document and report their results. BFS does not report any results, but announces that it has enqueued 34 discovered URLs to Scout's search queue.
Acronym.run - finished document http://www.nku.edu/ [0] in 0 minutes 0 seconds
Results.put - Storing 8 results for rule Acronym, document 0
BFS - Enqueued 34 URLs
BFS.run - finished document http://www.nku.edu/ [0] in 0 minutes 1 seconds
Results.put - Storing 0 results for rule BFS, document 0
UniversityName.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 3 results for rule UniversityName, document 0
If the configuration calls for collection of more than one URL, Scout will wait on the URL queue until either one of the rules produces a URL or all of the rules exit without producing one. Since the configuration parameter MaxURLs calls for only one URL to be collected, Scout does not wait on the queue. Instead it begins its shutdown procedure. Meanwhile, the remaining rule threads finish their tasks and exit.
Scout.run - URL queue exhausted. Shutting down buffer and exiting...
DocBuffer.close() - closing
Acronym.run - finished
BFS.run - finished
UniversityName.run - finished
PhoneNumber.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 2 results for rule PhoneNumber, document 0
PhoneNumber.run - finished
Scout provides a summary of the results reported by each rule listed according to their names and subscripted by document and result numbers.
Acronym[0,0]: MM
Acronym[0,1]: MM
Acronym[0,2]: NKU
Acronym[0,3]: JD
Acronym[0,4]: MBA
Acronym[0,5]: NCAA
Acronym[0,6]: KY
Acronym[0,7]: NKU
PhoneNumber[0,0]: (606) 572-5220
PhoneNumber[0,1]: 637-9948
UniversityName[0,0]: Northern Kentucky University
UniversityName[0,1]: NKU Northern Kentucky University
UniversityName[0,2]: Other News Northern Kentucky University
Lastly, some statistics for the session are recorded and the program exits.
Scout.logResults - URL Stats: discovered = 0 requested = 1 expanded = 1 ignored = 0 failed = 0 error = 0
CacheManager.save - Saving cache information
Scout.run - Finished after 0 minutes 9 seconds
Java provides a source-tagging syntax and a utility, javadoc
,
for automating HTML-formatted documentation of classes and packages.
Because the documents produced by javadoc
have a predictable markup
structure and familiar subject matter in a limited domain, we built a test
application, JavaDoc
, to traverse this document space and
produce results describing the Java packages and classes. We now describe
JavaDoc
and our early experience using it.
(The application name JavaDoc differs from the utility name javadoc only in case; we preserve this distinction throughout.)
We were interested in describing the Java containment hierarchy of packages,
their classes and interfaces, and their fields and methods. We were also
interested in making field and method inheritance explicit in our object
descriptions, which it is not in the javadoc
-produced HTML.
JavaDoc
contains a class for describing Java objects and
primitives and three rules to process a portion of the set of
javadoc
documents that shipped with Sun's JDK version 1.1.4.
We also built a stand-alone postprocessor and interactive query tool to
index and look up lists of interfaces and classes by the names of fields
or methods they contain.
The subset of the documentation we concentrated on consists of a file that lists the packages contained in a Java distribution, package-index files that describe the interface and class contents of each of these packages, and a large set of files describing the interfaces and classes themselves. In designing our rules, we were careful to approach these documents as an HTML corpus without considering the tags or engine that produced them.
Our first rule, PackageListRule
, reads the package-list
file and appends the URLs of the package-index files referenced to Scout's
search queue without producing any results. The second rule,
PackageIndexRule
, reads the package-index documents
and builds a hash table describing the packages. This table contains entries
giving the package name, and lists of the interfaces, classes, exceptions and
errors it contains. For each package member, class or interface, the URL of
the file documenting it is added to the search queue. The third rule,
ClassDocRule
, processes the class and interface
descriptions and builds a description of each one as a vector of
JavaDocObject
instances.
The JavaDocObject
class represents a Java class, interface,
variable, or method. It stores the URL of the javadoc
-generated
source file, along with the name of the object, information about its type,
scope, containment and parentage. If the object is a class or interface,
the JavaDocObject
stores its description and a list of any
interfaces it implements. If the object is a method, the
JavaDocObject
stores a list of its parameter types. As mentioned
previously, a vector of JavaDocObject
s describes a class or
interface where the first element represents the class itself and subsequent
elements represent its variables and methods.
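Based on the fields just described, a JavaDocObject record might look roughly like the following sketch. The field names are assumptions for illustration; the actual class is not reproduced here.

```java
import java.util.List;

// Sketch of the kind of record a JavaDocObject might hold, based on
// the fields described above. Field names are illustrative assumptions.
public class JavaDocObject {
    public String sourceUrl;     // URL of the javadoc-generated source file
    public String name;          // name of the documented object
    public String kind;          // "class", "interface", "variable", or "method"
    public String type;          // declared type or return type
    public String scope;         // e.g. "public static"
    public String container;     // enclosing package, class, or interface
    public String parent;        // superclass, if any

    // Classes and interfaces only:
    public String description;
    public List<String> implementedInterfaces;

    // Methods only:
    public List<String> parameterTypes;
}
```

A class or interface is then described by a vector of these records, the first element representing the class itself and the rest its variables and methods.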
We ran JavaDoc
in two phases, both on a 166 MHz Pentium machine
with 64 MB RAM. In the first phase, we collected the documents by initializing
Scout with PackageListRule
and PackageIndexRule
and
seeding the URL queue with the URL of the package list file on the UK Computer
Science Department's Web server. The session ran on a 28.8 modem connection to
a commercial ISP in the Lexington area, with Scout configured to
stall 2 seconds between requests. Scout collected 500 documents (1 package list,
22 package indices, and 477 class or interface description files) in just under
63 minutes. Scout spent at least 55 minutes of this time in networking and
other overhead unrelated to the rules, as no rule recorded a full second of
processing time per document.
In the second phase, the rules from the first session were used again, with
the ClassDocRule
added to build the vectors describing the 477
class and interface descriptions. This time, operating on the cached copies
of the documents, the session completed in 12 minutes, 2 seconds. Again,
the maximum time reported by a rule in processing a document was under 1
second. The results generated by the JavaDoc
application can
be seen at the end of the log output from the
second-phase session.
Java is a fully object-oriented language, but javadoc
does not
explicitly document inherited members in subclasses. In order to easily
answer the question "what classes or interfaces implement a method
named Y?", we built a small Java program BuildMethodIndex
to extract the ClassDocRule
results from the table saved by
Scout and compute a representation of the classes and interfaces that
makes inherited members explicit by adding references to the ancestral member
in the descendant object's vector. It then builds a hash table using member
names as keys and vectors of the object names that implement the key member
as values. We then built an interactive application QueryMethodIndex
that reads member names at a prompt and displays the list of objects that
contain a member of that name.
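The core of the index computation can be sketched as follows, assuming inherited members have already been made explicit in each class's member list. The names here are illustrative, not BuildMethodIndex's actual code.

```java
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.Vector;

// Illustrative sketch of the index BuildMethodIndex computes: member
// names key vectors of the classes or interfaces that implement them.
public class MethodIndexBuilder {
    // memberLists maps each class or interface name to the names of its
    // members, with inherited members already made explicit.
    public static Hashtable<String, Vector<String>> build(
            Map<String, List<String>> memberLists) {
        Hashtable<String, Vector<String>> index = new Hashtable<>();
        for (Map.Entry<String, List<String>> e : memberLists.entrySet()) {
            for (String member : e.getValue()) {
                index.computeIfAbsent(member, k -> new Vector<>())
                     .add(e.getKey());
            }
        }
        return index;
    }
}
```

A query tool like QueryMethodIndex then only needs to deserialize this table and look up the typed member name.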
Assume that the file classdoc.dat contains the saved state of the second-phase JavaDoc session, in which the class descriptions were produced by an instance of ClassDocRule identified in the template as ClassDocRule, and that we want to build the method index in the file method.index. Following the annotation conventions previously established, the postprocessing then proceeds as follows:
% java JavaDoc.BuildMethodIndex classdoc.dat ClassDocRule method.index
Once this process completes, we can run the interactive
QueryMethodIndex
program.
% java JavaDoc.QueryMethodIndex method.index
Because the methods table is a large and complex structure and Java is fairly slow to reconstruct it from the version serialized to disk, the application prints a message while it loads and another when it is ready.
wait... ready!
The simplest result to demonstrate is a negative one.
? MAX
Not Found MAX
Next, we query for a variable member that is contained in more than one class.
? MAX_VALUE
[java.lang.Long, java.lang.Character, java.lang.Float, java.lang.Double, java.lang.Integer, java.lang.Short, java.lang.Byte]
Now, we query for a method member.
? contains
Found contains [java.awt.Panel, java.awt.FileDialog, java.awt.TextField, java.awt.Choice, java.applet.Applet, java.util.Vector, java.awt.List, java.util.Hashtable, java.security.Provider, java.awt.Container, java.awt.Polygon, java.awt.Button, java.awt.TextComponent, java.awt.Dialog, java.awt.Label, java.awt.Component, java.awt.Window, java.awt.Canvas, java.awt.ScrollPane, java.util.Stack, java.awt.TextArea, java.awt.Frame, java.awt.Checkbox, java.awt.Rectangle, java.util.Properties, java.awt.Scrollbar]
A query on a member that is contained in java.lang.Object
, the
ultimate parent of all Java objects, would result in all 477 classes
being listed.
We have described Scout, a Java infrastructure for building and running
Web-based information retrieval applications. We have also described
two proof-of-concept applications, one for extracting basic information
from university home pages and another for converting
javadoc
-produced documentation into objects representing
the classes and interfaces.
As proof-of-concept tests, both examples were encouraging, but serious questions regarding performance remain to be studied. In these applications, the vast majority of execution time was spent in networking, other input/output operations, and thread management. It will be interesting to apply performance metrics to determine exactly how this time is being spent. It will also be interesting to run rules on a larger cached collection to get a better idea of how performance scales with the cache size and to run applications that deal with multiple web servers so that Scout need not be artificially slowed to limit the individual server load.
Currently, the set of rules for a session is loaded once at program
start, and all the threads that will ever run in the session run throughout.
It would be useful for rules to be able to dynamically add other rules to
the set and to remove themselves from the set if and when they have
fulfilled their purpose. The JavaDoc
example is a case in
point where the sets of documents handled by each rule are disjoint, yet
all rules remain active throughout the session and must touch each document.
Rules that run only when the session completes could do much of the work
that presently requires a postprocessing phase. Such a class of rules could
easily be created by adding an attribute to the template syntax and making
some minor modifications to the Scout thread's code.
The simple FIFO queue of URLs has also begun to show weaknesses. Currently, Scout idles while stalling between successive requests to the same host. The ability to look ahead in the queue for a URL located on a different server would allow computation to progress. A priority queue that can be modified by the rules would also be useful in many circumstances, although it raises interesting questions about how much authority the rules should have to tamper with central data structures.
To date, Scout has been strictly an HTTP application, but there is no reason why it should not be extended to support other protocols such as FTP, gopher, finger, or any other service that can be referenced by a URL. Support for other protocols will require substantial rethinking of much of Scout's design, though, so there are no short-term plans to begin this generalization effort.
We have just begun to explore uses for Scout, and have considered several applications that might be implemented with it.
There are certainly many other applications for Scout that we have not yet imagined. We believe that Scout may prove particularly useful for XML applications, which can use the existing tag parser without modification. We hope that other programmers will adapt Scout to their purposes, creating new classes of robots and improving how all of us, as users, navigate the vast maze of information available on the World Wide Web.