Scout
An Infrastructure for Web-Based Information Retrieval
Anthony L. Borchers - borchers@cs.engr.uky.edu
Department of Computer Science, University of Kentucky

Abstract

We describe Scout, a multithreaded robot infrastructure for Web-based information retrieval tasks. Scout implements HTTP communications, document caching, and simple HTML parsing, and can be extended with programs called rules that perform arbitrary data-processing tasks on the documents collected. We describe and demonstrate two simple proof-of-concept applications built using Scout: one that collects and extracts basic information about a university from its home page, and one that converts javadoc documentation to data structures that can be categorized and manipulated in interesting ways. We conclude with some observations about the performance of the sample applications and discuss some future applications that might be built using Scout.

Contents

Introduction
Scout Overview
Rules
Third-Party Code Used in Scout
Robots Exclusion Protocol
Scout Usage Example
Running Scout
Log Output
An Application for Processing Javadoc
The JavaDoc Application
Collection and Processing
Postprocessing
Conclusions
Future Applications
References

Introduction

Scout

Rules

Scout Overview

Scout Thread

Rule Threads

When processDoc() is called

  1. A previously unseen document is buffered in scout.doc
  2. The integer variable sequenceNumber identifies the document
  3. The local vector results is initialized empty, ready to receive results from the rule
  4. The object scout.ruleResults contains a table of previously generated results
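
The state listed above can be illustrated with a minimal sketch. Only the names processDoc(), scout.doc, sequenceNumber, results, and scout.ruleResults come from the slides; the classes ScoutStub, RuleSketch, and CapWordRule, the handle() method, the result-key format, and the use of List in place of Vector are assumptions made for illustration and are not Scout's actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the shared Scout object.
// Only the field names doc and ruleResults come from the slides.
class ScoutStub {
    String doc = "";                                          // text of the current document
    Map<String, List<String>> ruleResults = new HashMap<>();  // results from earlier calls
}

// Hypothetical base class: sets up the state described above, then calls processDoc().
abstract class RuleSketch {
    protected final ScoutStub scout;
    protected final String name;
    protected int sequenceNumber;    // identifies the document being processed
    protected List<String> results;  // empty on entry to processDoc()

    RuleSketch(String name, ScoutStub scout) {
        this.name = name;
        this.scout = scout;
    }

    // Assumed driver step: one call per previously unseen document.
    final void handle(int seq) {
        sequenceNumber = seq;
        results = new ArrayList<>();
        processDoc();
        // The key format is an assumption, loosely modeled on the log output.
        scout.ruleResults.put(name + "[" + seq + "]", results);
    }

    protected abstract void processDoc();
}

// Example rule: collects every capitalized word in the buffered document.
class CapWordRule extends RuleSketch {
    CapWordRule(ScoutStub scout) {
        super("CapWord", scout);
    }

    @Override
    protected void processDoc() {
        for (String word : scout.doc.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                results.add(word);
            }
        }
    }
}
```

In this sketch a rule author writes only processDoc(); buffering, sequencing, and result storage stay in the base class.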

The overridden processDoc()

Third-Party Code

  1. HTTPClient by Ronald Tschalär
  2. Pat 1.0 by Steven R. Brandt
  3. Hashlookup.class by Praveen Devulapalli

Robots Exclusion Protocol

Sample robots.txt File

# forbid a known rude robot to access anything
User-agent: RudeBot 6.66
Disallow: *

# forbid all other robots to access the cgi-bin
# and images directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
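
As a rough illustration of how a robot might honor a file like this, the sketch below collects the Disallow paths that apply to all robots and checks a request path against them by prefix. It is not Scout's Nobots implementation; the class RobotsSketch and its methods are hypothetical, and the parser is simplified (it handles only the "User-agent: *" record and one User-agent line per record).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Scout.
class RobotsSketch {

    // Returns the Disallow paths found in records whose User-agent is "*".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> paths = new ArrayList<>();
        boolean applies = false;
        for (String raw : robotsTxt.split("\n")) {
            // Strip comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                applies = value.equals("*");
            } else if (field.equalsIgnoreCase("Disallow") && applies && !value.isEmpty()) {
                paths.add(value);
            }
        }
        return paths;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sample = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n";
        List<String> rules = disallowedForAll(sample);
        System.out.println(isAllowed("/cgi-bin/search", rules));
        System.out.println(isAllowed("/index.html", rules));
    }
}
```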

Scout's Implementation of the Protocol

Example

Assume

Example

Collecting a university name, acronym, and phone number from http://www.nku.edu/

Running Scout

% java Scout.Scout NKU.ini university.html

Configuration File: NKU.ini

[EXTRACTOR]
EntityFile: /a/al/u/al-d7/csgrad/borchers/classes/SGMLKit/entities.txt

[SCOUT]
CacheDir: NKUCache
LogFile: NKU.log
MaxCacheFiles: 256
MaxURLs: 1
NetDelay: 2000
PersistFile: NKU.dat
RequestRobotsFile: true
RestrictDomain: null
RestrictHost: www.nku.edu
SearchCache: true
SearchWeb: true
StartURL: http://www.nku.edu/
UseGUI: true
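
A file in this layout (bracketed section names followed by "Key: value" lines) could be read with a few lines of code. The sketch below is an assumption about how such a file might be parsed, not Scout's own loader; the class IniSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the sectioned "Key: value" layout shown above.
class IniSketch {

    // Maps each section name to its key/value pairs, preserving file order.
    static Map<String, Map<String, String>> parse(String text) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                current = new LinkedHashMap<>();
                sections.put(line.substring(1, line.length() - 1), current);
            } else if (current != null) {
                // Split on the first colon only, so URL values stay intact.
                int colon = line.indexOf(':');
                if (colon > 0) {
                    current.put(line.substring(0, colon).trim(),
                                line.substring(colon + 1).trim());
                }
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String sample = "[SCOUT]\nNetDelay: 2000\nStartURL: http://www.nku.edu/\n";
        Map<String, String> scout = parse(sample).get("SCOUT");
        System.out.println(scout.get("StartURL"));
        System.out.println(Integer.parseInt(scout.get("NetDelay")));
    }
}
```

Values are kept as strings; a caller converts them as needed, for example treating NetDelay as an integer.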

Log Output Excerpts

Scout.setvar - Set runtime variable USAREACODE=(\(\d{3}\))|(\d{3})
Scout.setvar - Set runtime variable USPHONENUMBER=\d{3}-\d{4}
Scout.setvar - Set runtime variable CAPWORD=[A-Z][A-Za-z]*
Scout.setvar - Set runtime variable ACRONYM=[A-Z][A-Z]+
UniversityName.RegExpRule - ready to search on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
PhoneNumber.RegExpRule - ready to search on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
Acronym.RegExpRule - ready to search on pattern ([A-Z][A-Z]+)
Scout.Scout - Loaded 4 rules
Scout.Scout - BFS - {type=D, parse=void, value=null, validate=true, name=BFS, rule=Scout.BreadthFirstSearch}
Scout.Scout - UniversityName - {type=D, parse=String, squeezedoc=true, value=null, validate=true, trim=true, pattern=(University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University), name=UniversityName, rule=Scout.RegExpRule}
Scout.Scout - PhoneNumber - {type=D, parse=string, value=null, validate=true, trim=true, pattern=((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4}), name=PhoneNumber, squeezematch=true, rule=Scout.RegExpRule}
Scout.Scout - Acronym - {type=D, value=null, validate=true, trim=true, pattern=([A-Z][A-Z]+), name=Acronym, rule=Scout.RegExpRule}
Scout.restoreState - Initialized cache of 0 objects
Scout.restoreState - No state data found. Creating new objects
Scout.run - Scout started at Mon Nov 30 21:34:02 EST 1998
BFS.run - starting
URLQueue.removeFront - returning http://www.nku.edu/
UniversityName.run - starting
Nobots.getHostExclusions - Getting exclusions for host www.nku.edu
Nobots.loadExclusions - Read 7 path exclusions for host www.nku.edu
PhoneNumber.run - starting
Nobots.getHostExclusions - Stored 10 excluded paths and 19 excluded types for www.nku.edu
Acronym.run - starting
Scout.getDocument - Requesting URL http://www.nku.edu/ from cache
Scout.getDocument - URL not cached - hitting the Web now
Scout.getDocument - Separated tags and text
DocBuffer.fill - Buffered 2273 bytes of text and 311 tags

Log Output Excerpts, More

UniversityName.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.run - acquired document http://www.nku.edu/ [0]
Acronym.run - acquired document http://www.nku.edu/ [0]
BFS.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.processDoc - processing document http://www.nku.edu/ [0]
UniversityName.processDoc - processing document http://www.nku.edu/ [0]
Acronym.processDoc - processing document http://www.nku.edu/ [0]
BFS - Extracting links from URL http://www.nku.edu/
PhoneNumber.processDoc - searching on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
UniversityName.processDoc - searching on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
Acronym.processDoc - searching on pattern ([A-Z][A-Z]+)
Acronym.run - finished document http://www.nku.edu/ [0] in 0 minutes 0 seconds
Results.put - Storing 8 results for rule Acronym, document 0
BFS - Enqueued 34 URLs
BFS.run - finished document http://www.nku.edu/ [0] in 0 minutes 1 seconds
Results.put - Storing 0 results for rule BFS, document 0
UniversityName.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 3 results for rule UniversityName, document 0
Scout.run - URL queue exhausted. Shutting down buffer and exiting...
DocBuffer.close() - closing
Acronym.run - finished
BFS.run - finished
UniversityName.run - finished
PhoneNumber.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 2 results for rule PhoneNumber, document 0
PhoneNumber.run - finished

Log Output Excerpts, Concludes

Acronym[0,0]: MM
Acronym[0,1]: MM
Acronym[0,2]: NKU
Acronym[0,3]: JD
Acronym[0,4]: MBA
Acronym[0,5]: NCAA
Acronym[0,6]: KY
Acronym[0,7]: NKU
PhoneNumber[0,0]: (606) 572-5220
PhoneNumber[0,1]: 637-9948
UniversityName[0,0]: Northern Kentucky University
UniversityName[0,1]: NKU Northern Kentucky University
UniversityName[0,2]: Other News Northern Kentucky University
Scout.logResults - URL Stats: discovered = 0 requested = 1 expanded = 1 ignored = 0 failed = 0 error = 0
CacheManager.save - Saving cache information
Scout.run - Finished after 0 minutes 9 seconds
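
The patterns in the log are assembled from the runtime variables USAREACODE, USPHONENUMBER, CAPWORD, and ACRONYM. The sketch below rebuilds the same three patterns and applies them to sample text. It uses the standard java.util.regex package rather than the Pat library that Scout uses, so matching details may differ; the class PatternSketch and the method findAll are hypothetical, and the trim() call stands in for the trim=true option seen in the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demonstration using java.util.regex, not Scout's Pat-based RegExpRule.
class PatternSketch {
    // Runtime variables as they appear in the log.
    static final String USAREACODE = "(\\(\\d{3}\\))|(\\d{3})";
    static final String USPHONENUMBER = "\\d{3}-\\d{4}";
    static final String CAPWORD = "[A-Z][A-Za-z]*";
    static final String ACRONYM = "[A-Z][A-Z]+";

    // Composite patterns, equal to the expanded patterns in the log.
    static final String PHONE =
        "(" + USAREACODE + "){0,1}(\\s){0,1}(" + USPHONENUMBER + ")";
    static final String UNIVERSITY =
        "(University\\s+of\\s+(" + CAPWORD + "\\s+)+)|((" + CAPWORD + "\\s+)+University)";

    // Returns every match of regex in text, trimmed of surrounding whitespace.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group().trim());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll(PHONE, "Call (606) 572-5220 or 637-9948"));
        System.out.println(findAll("(" + ACRONYM + ")", "NKU offers an MBA"));
        System.out.println(findAll(UNIVERSITY, "NKU Northern Kentucky University"));
    }
}
```

Because the capitalized-word group repeats greedily, any run of capitalized words immediately before "University" is absorbed into the match, which is consistent with results such as UniversityName[0,1] and UniversityName[0,2] above.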

An Application for Processing Javadoc

What is Javadoc?

Javadoc Files

The JavaDoc Application

Files Processed

JavaDoc Consists of

JavaDocObject

JavaDoc Rules

PackageListRule
  • Reads package-list file and queues the URLs of the package-index files
  • Produces no results
PackageIndexRule
  • Reads the package-index documents and builds a table describing the packages
  • Queues URLs for each package member, class or interface
ClassDocRule
  • Processes the class and interface descriptions
  • Produces a description of each one as a vector of JavaDocObjects

Collection and Processing

    Statistics - Phase 1

    Statistics - Phase 2

    Postprocessing

    Example

    % java JavaDoc.BuildMethodIndex classdoc.dat ClassDocRule method.index

    % java JavaDoc.QueryMethodIndex method.index

    wait...
    ready!
    ? MAX
    Not Found MAX
    ? MAX_VALUE
    [java.lang.Long, java.lang.Character, java.lang.Float, java.lang.Double, java.lang.Integer, java.lang.Short, java.lang.Byte]
    ? contains
    Found contains [java.awt.Panel, java.awt.FileDialog, java.awt.TextField, java.awt.Choice, java.applet.Applet, java.util.Vector, java.awt.List, java.util.Hashtable, java.security.Provider, java.awt.Container, java.awt.Polygon, java.awt.Button, java.awt.TextComponent, java.awt.Dialog, java.awt.Label, java.awt.Component, java.awt.Window, java.awt.Canvas, java.awt.ScrollPane, java.util.Stack, java.awt.TextArea, java.awt.Frame, java.awt.Checkbox, java.awt.Rectangle, java.util.Properties, java.awt.Scrollbar]

    A query on a member contained in java.lang.Object, the ultimate parent of all Java objects, would result in all 477 classes being listed!
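
The query session above behaves like a lookup in an inverted index from member name to the classes that contain it. The sketch below shows one way such an index could be built and queried. It is not the code of JavaDoc.BuildMethodIndex or JavaDoc.QueryMethodIndex; the class MethodIndexSketch, its methods, and the input map are hypothetical, and inherited members are not modeled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical inverted index: member name -> names of classes containing it.
class MethodIndexSketch {

    // Inverts a map of class name -> member names.
    static Map<String, SortedSet<String>> build(Map<String, List<String>> membersByClass) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : membersByClass.entrySet()) {
            for (String member : entry.getValue()) {
                index.computeIfAbsent(member, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    // Formats an answer in the style of the session above.
    static String query(Map<String, SortedSet<String>> index, String member) {
        SortedSet<String> classes = index.get(member);
        return classes == null
            ? "Not Found " + member
            : "Found " + member + " " + classes;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("java.util.Vector", Arrays.asList("contains", "size"));
        input.put("java.util.Hashtable", Arrays.asList("contains", "get"));
        Map<String, SortedSet<String>> index = build(input);
        System.out.println(query(index, "contains"));
        System.out.println(query(index, "MAX"));
    }
}
```

A lookup is exact, so a partial name such as MAX finds nothing even when MAX_VALUE is indexed, as in the session above.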

    Conclusions

    Future Applications

    Email-Address Extractor
    A simple rule could identify and extract email addresses from Web pages
    Meta-Search Engine
    Seeded with URL queries to several search engines, Scout could collect responses and filter and rank returned links
    Proxy Servers
    A rule providing a server socket collects and relays documents
    1. Look-ahead: extracts links and caches them in advance
    2. Content-filtering: Analyzes document content and selectively relays it to the user
    Comparative-Shopping Agents
    A set of rules to collect and sort pages describing products of interest and identify the vendor offering the best prices

    Thank You!

    Salutations to my ever-helpful committee!

    Raphael A. Finkel, Chair
    Victor W. Marek
    Miroslaw Truszczynski

    and, of course,

    Carol Hannahs
    Who mentioned my graderbot to
    Joe Oldham
    Who introduced me to the committee that turned a Perl script designed to let me avoid reading student Web pages into a Master's Project

    References

    1. Tschalär, Ronald. HTTPClient, 1998, http://www.innovation.ch/java/HTTPClient/
    2. Brandt, Steven R. Regular Expressions in Java, 1998, http://javaregex.com/
    3. Devulapalli, Praveen. A Web-crawling engine to discover email addresses, 1997, Master's Project Report, University of Kentucky Department of Computer Science
    4. Koster, Martijn. A Standard for Robot Exclusion, 1994, http://info.webcrawler.com/mak/projects/robots/norobots.html