Appendix II - Template Syntax

Scout reads runtime command directives and Rule initialization information from a section of HTML-like tags. These tags can make up a file of their own or be embedded in another HTML file, since it is standard practice for browsers to ignore tags they do not understand. In this way it is possible for an HTML document to contain a template describing how Scout might process its content into rule results. This appendix describes the template tagging syntax and shows the templates used in the example applications detailed in the paper, Scout --- An Infrastructure for Web-Based Information Retrieval. The template syntax we define here is only partially supported by Scout. Some tag attributes are, at present, meaningful only as annotations to the template author and user. These are clearly differentiated in the discussion that follows.

Template Syntax

General Structure

All template tags begin with the generic identifier SCOUT, with all tag-specific semantics being conveyed through attributes. A template begins with a tag <SCOUT TSTART NAME=X>, where X is a name for the template. The NAME attribute is not currently used by Scout, but is included to allow future implementations to recognize and use more than one template per session, and produce different result sets for each. The template ends with a corresponding tag <SCOUT TEND>. Between these two tags, Scout will interpret any tags with the SCOUT identifier as meaningful to the template. Other HTML tags within the template or SCOUT tags outside of it are ignored.

Between the TSTART and TEND tags, the required attribute TYPE is the most important, and the one that Scout evaluates first to determine how to process the rest of the tag. The TYPE attribute may take one of three values: C for "Command", D for "Data", or I for "Ignore". C and D type tags are discussed below. TYPE=I tags are simply an alternative to commenting the tag out using a standard HTML comment: Scout passes over any template tag with this type value.
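The structure described above can be sketched in a few lines of Python. This is only an illustration, not Scout's actual parser: the names TAG_RE, ATTR_RE, parse_attrs, and template_tags are hypothetical, and the sketch assumes that attribute values never contain a ">" character and that quoted values use double quotes.

```python
import re

# Hypothetical names; Scout's real parser is not shown in this appendix.
TAG_RE = re.compile(r"<SCOUT\b([^>]*)>", re.IGNORECASE)
# An attribute is a bare flag (TSTART) or NAME=value, quoted or unquoted.
ATTR_RE = re.compile(r'([A-Za-z_]+)(?:\s*=\s*("[^"]*"|[^\s>]+))?')


def parse_attrs(body):
    """Turn the inside of a SCOUT tag into a dict of attributes."""
    attrs = {}
    for name, value in ATTR_RE.findall(body):
        if value.startswith('"'):
            value = value[1:-1]          # drop surrounding quotes
        attrs[name.upper()] = value if value else True
    return attrs


def template_tags(text):
    """Yield C and D tags found between TSTART and TEND.

    SCOUT tags outside the template, non-SCOUT tags, and TYPE=I tags
    are all passed over.
    """
    inside = False
    for match in TAG_RE.finditer(text):
        attrs = parse_attrs(match.group(1))
        if "TSTART" in attrs:
            inside = True
        elif "TEND" in attrs:
            inside = False
        elif inside and attrs.get("TYPE") in ("C", "D"):
            yield attrs
```

Because TYPE is inspected before anything else, the rest of each tag can be handed to a command handler or a rule constructor without further inspection here.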

Command Invocations

A TYPE=C tag tells Scout to execute an internal command. The NAME attribute supplies the name of the command to execute, and the remaining attributes are specific to the command named. Only one command is presently implemented, SETVAR, which instructs Scout to set an internal value in a hash table using the assignment expression in the VALUE attribute. Values set in this manner are readable by all rules in a Scout session. For example, the following tag instructs Scout to set the key FOO to the value BAR.

<SCOUT TYPE=C NAME=SETVAR VALUE="FOO=BAR">
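As a rough sketch of what SETVAR does with its VALUE attribute, the assignment expression can be split into a key and a value and stored in a dictionary. The function name setvar is hypothetical, and splitting at the first "=" is an assumption: the appendix does not say how Scout treats a value that itself contains "=".

```python
def setvar(variables, value_attr):
    """Store KEY=VALUE from a SETVAR tag in the shared variable table.

    Assumption: the expression is split at the first '=' only, so the
    value part may itself contain '=' characters.
    """
    key, sep, value = value_attr.partition("=")
    if not sep:
        raise ValueError("SETVAR expects KEY=VALUE, got %r" % value_attr)
    variables[key.strip()] = value
    return variables
```

Called as setvar(table, "FOO=BAR"), this leaves table["FOO"] set to "BAR".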

Rule Initializations

A TYPE=D tag identifies result data and the rule for extracting it. For this type of tag, the NAME attribute supplies a name for the runtime rule thread. Note that multiple instances of the same rule-derived class can execute in the same session with different names and for different purposes. The VALUE attribute is currently unsupported for a TYPE=D tag, but is intended to seed a rule's results with initial data, or to provide named data for which no rule exists. The RULE attribute specifies what rule class to load. The PARSE attribute is an annotation to indicate what type of result objects the rule produces. It is not currently used by Scout.

Any other attributes in TYPE=D tags are specific to Rules, and are ignored by Scout. One attribute, VALIDATE, is recognized by the Rule base class. If this attribute is present, the rule will require that Scout successfully parse each document before invoking the processDoc() method. Documents which cannot be thus parsed are ignored by the rule.
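The relationship between a TYPE=D tag, the rule class it names, and the VALIDATE attribute might look roughly like the following. Everything here is a hypothetical stand-in: the Rule class only imitates the behavior described above, CountingRule and the RULES registry are invented for illustration, and the parsed_ok flag stands for whatever signal Scout uses to report a successful parse.

```python
class Rule:
    """Hypothetical stand-in for Scout's Rule base class."""

    def __init__(self, name, attrs):
        self.name = name                       # from the NAME attribute
        self.validate = "VALIDATE" in attrs    # presence alone enables it
        self.results = []

    def offer(self, doc, parsed_ok):
        # With VALIDATE set, documents that failed to parse are ignored.
        if self.validate and not parsed_ok:
            return False
        self.process_doc(doc)
        return True

    def process_doc(self, doc):
        raise NotImplementedError


class CountingRule(Rule):
    """Invented example rule: records the length of each document."""

    def process_doc(self, doc):
        self.results.append(len(doc))


# Invented registry standing in for loading a class named by RULE.
RULES = {"Demo.CountingRule": CountingRule}


def init_rule(attrs):
    """Build a named rule instance from the attributes of a TYPE=D tag."""
    return RULES[attrs["RULE"]](attrs["NAME"], attrs)
```

Two instances of the same class created under different names keep separate results, which mirrors the note that one rule class can serve several purposes in a session.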

As an example, consider the following tag. It constructs an instance of Scout.RegExpRule, a rule that extracts strings matching a regular expression from document texts. The rule thread will run under the name EmailAddress, and match strings of the form indicated by the RegExpRule-specific attribute PATTERN. Note that the VALIDATE attribute is not supplied, as there is no reason to restrict a search for email addresses to HTML documents.

<SCOUT TYPE="D" NAME="EmailAddress" RULE="Scout.RegExpRule" PATTERN="[^\s]+@[^\s]+\.(com|net|edu)">
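To see what the PATTERN above would match, the following sketch applies it to a line of sample text with Python's re module. The function find_matches and the sample addresses are invented for illustration, and Scout's own regular expression engine may differ from Python's in details.

```python
import re

# The PATTERN attribute value from the tag above.
EMAIL_PATTERN = r"[^\s]+@[^\s]+\.(com|net|edu)"


def find_matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


sample = "Contact alice@example.edu or bob@mail.example.com today."
matches = find_matches(EMAIL_PATTERN, sample)
```

Since the pattern only requires non-whitespace on either side of the "@", punctuation adjacent to an address would be captured along with it.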

VAR Parameters

Attribute values used as rule parameters are not restricted to literal strings. Values can be interpolated into the parameters from Scout's global variable hash table by the rule constructor. To perform such interpolation, the attribute string must contain the key to be interpolated between the slashes in a string of the form VAR/Key/. For example, if we had three rules that used the email address matching regular expression above, we could use the following set of template tags to set the value and initialize the rules:

<SCOUT TYPE="C" NAME="SETVAR" VALUE="EMAIL=[^\s]+@[^\s]+\.(com|net|edu)">
<SCOUT TYPE="D" NAME="EmailAddressRule1" RULE="Rule1Class" PATTERN="VAR/EMAIL/">
<SCOUT TYPE="D" NAME="EmailAddressRule2" RULE="Rule2Class" PATTERN="VAR/EMAIL/">
<SCOUT TYPE="D" NAME="EmailAddressRule3" RULE="Rule3Class" PATTERN="VAR/EMAIL/">
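The interpolation step can be approximated with a single regular expression substitution. This is a minimal sketch rather than Scout's implementation: the name interpolate is hypothetical, keys are assumed not to contain "/", and raising KeyError for an unknown key is a choice made here, since the appendix does not describe what Scout does in that case.

```python
import re

# Matches VAR/Key/ and captures Key (assumed to contain no '/').
VAR_RE = re.compile(r"VAR/([^/]+)/")


def interpolate(param, variables):
    """Replace each VAR/Key/ in param with variables[Key].

    A function is used as the replacement so that backslashes in the
    stored value are inserted literally rather than treated as escapes.
    """
    return VAR_RE.sub(lambda m: variables[m.group(1)], param)
```

A parameter may mix literal text and several references, so a value such as "(VAR/CAPWORD/\s+)+University" expands in place around the surrounding pattern text.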

Template Examples

The following examples show the templates used in the sample applications described in the paper on Scout. They have been annotated with italicized comments to illustrate various features. Recall that the VALUE and PARSE attributes, when present, are only annotations to the human template user.

Template File: EduTemplate.html

The first four lines after the TSTART tag show four variables being set in Scout's internal hash. These correspond one to one to the first entries displayed in the sample log file example.

<SCOUT TSTART NAME="School">
<SCOUT TYPE=C NAME=SETVAR VALUE="USAREACODE=(\(\d{3}\))|(\d{3})">
<SCOUT TYPE=C NAME=SETVAR VALUE="USPHONENUMBER=\d{3}-\d{4}">
<SCOUT TYPE=C NAME=SETVAR VALUE="CAPWORD=[A-Z][A-Za-z]*">
<SCOUT TYPE=C NAME=SETVAR VALUE="ACRONYM=[A-Z][A-Z]+">

The first thread initialized, BFS, is for breadth-first search using the Scout.BreadthFirstSearch rule. The VALIDATE attribute is set, as valid HTML tags with anchors to other documents are the target of search.

<SCOUT TYPE=D NAME=BFS VALUE=null PARSE=void RULE=Scout.BreadthFirstSearch VALIDATE=true>

The remaining rules are all instances of the Scout.RegExpRule class, each charged with a specific data-extraction task. Each uses a VAR parameter to interpolate values set by the C tags above. Each rule also requires valid HTML via the VALIDATE attribute. Additionally, three new attributes meaningful to the Scout.RegExpRule class are demonstrated: SQUEEZEDOC, TRIM, and SQUEEZEMATCH. SQUEEZEDOC tells the rule to compress all whitespace into a single space character before applying the search pattern. TRIM and SQUEEZEMATCH respectively tell the rule to remove leading and trailing whitespace and compress interior whitespace in results.

<SCOUT TYPE=D NAME="UniversityName" VALUE=null PARSE="String" RULE=Scout.RegExpRule PATTERN="(University\s+of\s+(VAR/CAPWORD/\s+)+)|((VAR/CAPWORD/\s+)+University)" SQUEEZEDOC TRIM VALIDATE>
<SCOUT TYPE=D NAME=PhoneNumber VALUE=null PARSE=string RULE=Scout.RegExpRule PATTERN=(VAR/USAREACODE/){0,1}(\s){0,1}(VAR/USPHONENUMBER/) SQUEEZEMATCH TRIM VALIDATE>
<SCOUT TYPE=D NAME=Acronym VALUE=null RULE=Scout.RegExpRule PATTERN=(VAR/ACRONYM/) TRIM VALIDATE>

<SCOUT TEND>
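The effect of SQUEEZEDOC, TRIM, and SQUEEZEMATCH on the UniversityName rule can be illustrated with the sketch below. It is not the Scout.RegExpRule source: the function run_regexp_rule and the sample text are invented, the pattern is written with CAPWORD already interpolated by hand, and applying TRIM before SQUEEZEMATCH is an assumed ordering.

```python
import re


def run_regexp_rule(pattern, doc, squeezedoc=False, trim=False,
                    squeezematch=False):
    """Apply pattern to doc with the whitespace options described above."""
    if squeezedoc:
        # Compress every run of whitespace in the document to one space.
        doc = re.sub(r"\s+", " ", doc)
    results = []
    for match in re.finditer(pattern, doc):
        text = match.group(0)
        if trim:
            text = text.strip()                 # leading/trailing whitespace
        if squeezematch:
            text = re.sub(r"\s+", " ", text)    # interior whitespace
        results.append(text)
    return results


# UniversityName pattern with VAR/CAPWORD/ replaced by [A-Z][A-Za-z]*.
UNIVERSITY = (r"(University\s+of\s+([A-Z][A-Za-z]*\s+)+)"
              r"|(([A-Z][A-Za-z]*\s+)+University)")

sample = "Welcome to the University  of\n New   Mexico today"
names = run_regexp_rule(UNIVERSITY, sample, squeezedoc=True, trim=True)
```

Because the pattern ends in a whitespace-consuming group, the raw match carries trailing whitespace, which is why the template pairs it with TRIM.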

Template File: ClassDoc.html

This template is quite simple: it loads the three rules that traverse the javadoc-produced HTML. None of these rules is parameterized. For a discussion of what the rules do, see An Application for Processing Javadoc in the main paper.

<SCOUT TSTART NAME="ClassDoc">
<SCOUT TYPE=D NAME=PackageListRule VALUE=null PARSE=Object RULE=JavaDoc.PackageListRule>
<SCOUT TYPE=D NAME=PackageIndexRule VALUE=null PARSE=Object RULE=JavaDoc.PackageIndexRule>
<SCOUT TYPE=D NAME=ClassDocRule VALUE=null PARSE=Object RULE=JavaDoc.ClassDocRule>
<SCOUT TEND>