HTML Parser

The .html parser is a generic parsing capability built into iPAM. It is used by several agents including getAltavista, getExciteByKeword, getExciteNewstracker, getOnePage, and getRootPlusReferences (see the list of agents). The parser is contained in the Java package org.mitre.pam.getter.search.parse.

This guide is divided into two sections, Tokens and Public Methods. The Tokens section lists all of the tokens generated by the parser, and the Public Methods section lists all of the available public methods. If a method takes an integer token kind as an argument, the value must be one of the tokens defined in the Tokens section, unless otherwise noted.

Tokens

All of the following are recognized as tokens by the parser. A token is matched only when the parser is in a particular state. The state names are listed below.

DEFAULT


TAG


ATTLIST


ATTRVAL



Public Methods

boolean SkipToToken(int tokenKind)

Moves the pointer to the first token found with kind = tokenKind. The tokenKind argument may be any of the constants defined above. If the token is not found, the pointer is at the end of the input stream upon return.

boolean SkipToAfterToken(int tokenKind)

Same as SkipToToken except the token specified is consumed.
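
The difference between stopping on a token and consuming it can be sketched in a few lines. This is only an illustration, not the iPAM implementation: the class TokenCursorSketch, its array of token kinds, and the position() and atEnd() helpers are hypothetical, and the boolean result is assumed to mean "token found".

```java
// Hypothetical stand-in for the parser's token pointer; not the iPAM code.
final class TokenCursorSketch {
    private final int[] kinds; // token kinds in input order
    private int pos = 0;       // index of the current token; kinds.length means end of input

    TokenCursorSketch(int... kinds) {
        this.kinds = kinds.clone();
    }

    // Stops ON the first token of the requested kind, or at end of input.
    // Assumption: true means the token was found.
    boolean skipToToken(int tokenKind) {
        while (pos < kinds.length && kinds[pos] != tokenKind) {
            pos++;
        }
        return pos < kinds.length;
    }

    // Same search, but the matching token is also consumed.
    boolean skipToAfterToken(int tokenKind) {
        boolean found = skipToToken(tokenKind);
        if (found) {
            pos++;
        }
        return found;
    }

    int position() {
        return pos;
    }

    boolean atEnd() {
        return pos == kinds.length;
    }
}
```

With the kinds 1, 2, 3, 2 as input, skipToToken(2) leaves the sketch at index 1, while skipToAfterToken(2) on a fresh cursor leaves it at index 2.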

boolean SkipToOpenTag(int tag)

Looks for the html tag specified by tag and moves the current token pointer to the first token in the html tag ("<"). The tag must be one of the tokens defined in the TAG state above (A through UNKNOWN). If the tag is not found, the pointer is located at the end of the input stream upon return. An open html tag is something like <b>. SkipToOpenTag(html.B) looks for "<b".

boolean SkipToAfterOpenTag(int tag)

Same as SkipToOpenTag above, except that the whole html tag is consumed.

boolean SkipToCloseTag(int tag)

Looks for a closing html tag and moves the token pointer to the "</" token if found. If it is not found, the token pointer is at the end of the html page. SkipToCloseTag(html.B) looks for "</b".

boolean SkipToAfterCloseTag(int tag)

Same as SkipToCloseTag except the entire html tag is consumed.

boolean GetUntilToken(int tokenKind)

Returns a String containing everything (including html tags) before the first occurrence of a token with kind = tokenKind. The pointer points to the token when finished. If the token is not found, null is returned and the pointer is at the end of the input stream.

Hashtable ProcessParameters()

If the pointer is in an html tag, it returns a Hashtable containing the parameters and their values. An empty Hashtable is returned if no parameters are found or the pointer is not in an html tag.
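
As a rough illustration of the result shape, the sketch below builds a Hashtable of parameter names and values from the text of a single tag. It is an assumption-laden stand-in, not the iPAM code: the class TagParametersSketch is hypothetical, it takes the tag text as a String instead of reading from the parser's pointer, and the regular expression covers only double-quoted, single-quoted, and unquoted values.

```java
import java.util.Hashtable;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real ProcessParameters() works on the token stream.
final class TagParametersSketch {
    // name = "value" | 'value' | unquoted-value
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][A-Za-z0-9_:-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    static Hashtable<String, String> processParameters(String tagText) {
        Hashtable<String, String> params = new Hashtable<>();
        // Stand-in for "the pointer is not in an html tag": return the empty table.
        if (tagText == null || !tagText.startsWith("<")) {
            return params;
        }
        Matcher m = ATTR.matcher(tagText);
        while (m.find()) {
            // Exactly one of groups 2 to 4 is non-null for each match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            params.put(m.group(1), value);
        }
        return params;
    }
}
```

For the tag <a href="http://example.com/" target=_blank>, the sketch yields two entries, href and target. A tag without parameters, such as <b>, yields an empty Hashtable.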

boolean SkipToEndOfTag()

If the pointer is inside a tag, the pointer is moved to the token immediately following the ">" of the tag.

String GetUntilNextTag()

If the pointer is not in a tag, all of the text from the current pointer until a "<" is found is returned as a String. Null is returned if the pointer is inside a tag.

boolean SkipToString(String str)

This method searches the text outside of html tags for the String str. If the string is found, the pointer is moved to the first token in the string. If the string is not found, the pointer is moved to the end of the input stream and false is returned.

boolean SkipToAfterString(String str)

Same as SkipToString except the string is consumed.

boolean SkipToTag(String tag)

This method looks for a full html tag specified by a String, and moves the pointer to the first token ("<") of the tag if found. If the tag is not found, the pointer is moved to the end of the input stream. For example, SkipToTag("<b>") will look for the next <b> tag. It returns true if found, false otherwise.

boolean SkipToAfterTag(String tag)

Same as SkipToTag except the tag is consumed if found.
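
The behavior of these two methods can be approximated on a raw String. The following is a minimal sketch under stated assumptions, not the parser itself: TagCursorSketch and its rest() helper are hypothetical, a character offset stands in for the token pointer, and tag matching is assumed to ignore case.

```java
// Hypothetical character-offset model of SkipToTag and SkipToAfterTag.
final class TagCursorSketch {
    private final String html;
    private int pos = 0; // character offset standing in for the token pointer

    TagCursorSketch(String html) {
        this.html = html;
    }

    // Finds the literal tag text at or after pos, ignoring case (an assumption).
    private int find(String tag) {
        for (int i = pos; i + tag.length() <= html.length(); i++) {
            if (html.regionMatches(true, i, tag, 0, tag.length())) {
                return i;
            }
        }
        return -1;
    }

    // Moves to the "<" of the tag, or to the end of the input if it is absent.
    boolean skipToTag(String tag) {
        int at = find(tag);
        pos = (at >= 0) ? at : html.length();
        return at >= 0;
    }

    // Same search, but the whole tag is consumed when found.
    boolean skipToAfterTag(String tag) {
        boolean found = skipToTag(tag);
        if (found) {
            pos += tag.length();
        }
        return found;
    }

    // Remaining input from the pointer onward, for inspection.
    String rest() {
        return html.substring(pos);
    }
}
```

On the input "Hits: <B>42</B> found", skipToTag("<b>") leaves "<B>42</B> found" as the remaining input, and skipToAfterTag("<b>") on a fresh cursor leaves "42</B> found".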

String GetUntilString(String str)

Looks for the specified string outside of html tags and returns all of the text as a String, starting from the first character after the last html tag found.

String GetUntilTag(String tag)

Returns all of the text including html tags starting from the current pointer until the html tag specified as a String is found. It returns null if not found.

String GetUntilOpenTagOfType(int tagKind)

Returns all the text, including html tags, until an open html tag (<tag) of the specified type is found. The pointer is moved to the "<" token upon return. If the tag is not found, it returns null and the pointer is at the end of the input stream. The int tagKind must be one of the TAG constants defined above.

String GetUntilClosedTagOfType(int tagKind)

Similar to GetUntilOpenTagOfType(int tagKind) except that it looks for a closed tag (</tag). The pointer is moved to "</" if successful. Again, tagKind must be one of the TAG constants defined above.
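
The two "until tag of type" methods differ only in whether they stop at "<tag" or "</tag". The sketch below shows that distinction on a plain String. It is hypothetical throughout: UntilTagOfTypeSketch is not part of iPAM, a tag name String stands in for the int tagKind constant, the starting offset is passed in explicitly instead of being tracked by a pointer, and case-insensitive matching is assumed.

```java
// Hypothetical sketch of GetUntilOpenTagOfType / GetUntilClosedTagOfType.
final class UntilTagOfTypeSketch {
    // Returns the text from 'from' up to the next "<name" (open) or "</name"
    // (closed) tag, or null when no such tag follows.
    static String getUntil(String html, int from, String name, boolean closed) {
        String prefix = (closed ? "</" : "<") + name;
        for (int i = from; i + prefix.length() <= html.length(); i++) {
            if (!html.regionMatches(true, i, prefix, 0, prefix.length())) {
                continue;
            }
            int end = i + prefix.length();
            // Reject longer tag names, such as <br> when looking for <b>.
            if (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) {
                continue;
            }
            return html.substring(from, i);
        }
        return null;
    }
}
```

For the input "<p>Total <br><b>42</b> hits" and the name "b", the open-tag search from offset 0 returns "<p>Total <br>", and the closed-tag search from offset 0 returns "<p>Total <br><b>42". In the real parser the pointer would then rest on the "<" or "</" token, which this sketch leaves to the caller.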

Revised 12/1/98