Generic HTMLParser User's Guide

This guide is divided into two sections, Tokens and Public Methods. The Tokens section lists all of the tokens generated by the parser, and the Public Methods section lists all of the available public methods. If a method takes an integer token kind as an argument, the argument must be one of the tokens defined in the Tokens section, unless otherwise noted.

Tokens

All of the following are recognized as tokens by the parser. A token is matched only in particular states. Each state name appears on its own line above the tokens that are matched in that state.

DEFAULT
 
Token:		Value:
STAGO		"<"   - when matched go to State TAG
ETAGO		"</"  - when matched go to State TAG
ANY		matches any character except the above two tokens - state is retained

TAG
A		"a"		
ADDRESS		"address"			
APPLET		"applet"			
AREA		"area"
B		"b"				
BASE		"base"				
BASEFONT	"basefont"			
BIG		"big"				
BLOCKQUOTE	"blockquote"			
BODY		"body"				
BR		"br"				
CAPTION		"caption"			
CENTER		"center"			
CITE		"cite"				
CODE		"code"				
DD		"dd"				
DFN		"dfn"				
DIR		"dir"				
DIV		"div"				
DL		"dl"				
DT		"dt"				
EM		"em"				
FONT		"font"				
FORM		"form"				
H1		"h1"				
H2		"h2"				
H3		"h3"				
H4		"h4"				
H5		"h5"				
H6		"h6"				
HEAD		"head"				
HR		"hr"				
HTML		"html"				
I		"i"				
IMG		"img"				
INPUT		"input"				
ISINDEX		"isindex"			
KBD		"kbd"				
LI		"li"				
LINK		"link"				
MAP		"map"				
MENU		"menu"				
META		"meta"				
NOBR		"nobr"                          
OL		"ol"
OPTION		"option"			
P		"p"				
PARAM		"param"				
PRE		"pre"				
PROMPT		"prompt"			
SAMP		"samp"				
SCRIPT		"script"			
SELECT		"select"			
SMALL		"small"				
STRIKE		"strike"			
STRONG		"strong"			
STYLE		"style"				
SUB		"sub"				
SUP		"sup"				
TABLE		"table"				
TD		"td"				
TEXTAREA	"textarea"			
TH		"th"				
TITLE		"title"				
TR		"tr"				
TT		"tt"				
U		"u"				
UL		"ul"				
VAR		"var"				
UNKNOWN		matches any word not matched above.  This is for unknown tag types.    

ATTLIST
TAGC		">"  - when matched go to state DEFAULT
A_EQ		"="  - when matched go to state ATTRVAL
A_NAME		#ALPHA ( #ALPHANUM )*	- matches a word.
WHITESPACE    matches anything not already matched by the three above tokens.

The following are used by the parser to determine token A_NAME.  
#ALPHA	["a"-"z","A"-"Z","_","-","."]
#NUM		["0"-"9"]	
#ALPHANUM	#ALPHA | #NUM	
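
As a rough illustration, the A_NAME definition above can be transcribed into a java.util.regex pattern. The class and method names below (AttrNameSketch, isAttrName) are hypothetical and are not part of the parser; the pattern only mirrors the #ALPHA and #ALPHANUM character sets listed above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser: mirrors the A_NAME definition.
class AttrNameSketch {
    // First character: #ALPHA (a letter, "_", "-" or ".").
    // Remaining characters: #ALPHANUM (#ALPHA or a digit).
    static final Pattern A_NAME = Pattern.compile("[a-zA-Z_.-][a-zA-Z0-9_.-]*");

    // True when the whole string is a single A_NAME word.
    static boolean isAttrName(String s) {
        return A_NAME.matcher(s).matches();
    }
}
```

Under this transcription a word such as http-equiv is accepted, while a word that starts with a digit is not, because #NUM is only allowed after the first character.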

ATTRVAL
CDATA		This matches a word and changes to state ATTLIST.
		The regular expression for the word is as follows:

		"'" ( ~["'"] )* "'"
		|	"\"" ( ~["\""] )* "\""
		| ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
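
The three alternatives of the CDATA expression above can be sketched as a java.util.regex pattern to see which attribute values they accept. This is only an approximation for experimenting; AttrValueSketch and isCdata are hypothetical names and do not exist in the parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser: mirrors the CDATA expression.
class AttrValueSketch {
    static final Pattern CDATA = Pattern.compile(
          "'[^']*'"                  // single-quoted value
        + "|\"[^\"]*\""              // double-quoted value
        + "|[^>\"' \\t\\n\\r]+");    // unquoted run without ">", quotes or whitespace

    // True when the whole string is a single CDATA word.
    static boolean isCdata(String s) {
        return CDATA.matcher(s).matches();
    }
}
```

Note that a quoted value may contain ">" or whitespace, whereas an unquoted value ends at the first such character.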

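The state changes noted in the token lists can be summarised as a small transition function. The following is a minimal sketch, not the parser's own code: the enum and method names are hypothetical, TAG_NAME stands in for any of the TAG tokens (A through UNKNOWN), and the move from TAG to ATTLIST is an assumption, since this guide does not say which state follows a tag name.

```java
// Hypothetical model of the lexical states; not the parser's own code.
class LexStateSketch {
    enum State { DEFAULT, TAG, ATTLIST, ATTRVAL }
    enum Kind { STAGO, ETAGO, ANY, TAG_NAME, TAGC, A_EQ, A_NAME, WHITESPACE, CDATA }

    // Returns the state in effect after a token of kind k is matched in state s.
    static State next(State s, Kind k) {
        switch (s) {
            case DEFAULT:
                // "<" and "</" enter TAG; ANY retains the state.
                return (k == Kind.STAGO || k == Kind.ETAGO) ? State.TAG : State.DEFAULT;
            case TAG:
                // ASSUMPTION: the guide does not state this transition.
                return State.ATTLIST;
            case ATTLIST:
                if (k == Kind.TAGC) return State.DEFAULT;  // ">"
                if (k == Kind.A_EQ) return State.ATTRVAL;  // "="
                return State.ATTLIST;                      // A_NAME, WHITESPACE
            case ATTRVAL:
                return State.ATTLIST;                      // CDATA returns to ATTLIST
            default:
                throw new IllegalArgumentException("unknown state: " + s);
        }
    }
}
```

Feeding the kinds for <a href="x"> through next (STAGO, TAG_NAME, WHITESPACE, A_NAME, A_EQ, CDATA, TAGC) ends back in DEFAULT under these assumptions.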

Public Methods

boolean SkipToToken(int tokenKind)
Moves the pointer to the first token whose kind equals tokenKind. The tokenKind argument may be any of the token constants defined above. If the token is not found, the pointer is at the end of the input stream upon return.
boolean SkipToAfterToken(int tokenKind)
Same as SkipToToken except the token specified is consumed.
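
The difference between the "SkipTo" and "SkipToAfter" variants is only where the pointer rests after a match. The stand-in below illustrates this over a plain array of token kinds. It is a hypothetical model, not the parser's implementation, and it assumes the boolean result means "found", which this guide states explicitly only for SkipToTag.

```java
// Hypothetical stand-in for the token pointer; not the parser's implementation.
class SkipSketch {
    final int[] kinds;  // token kinds in input order
    int pos = 0;        // index of the current token; kinds.length means end of input

    SkipSketch(int... kinds) {
        this.kinds = kinds;
    }

    // Stops ON the first token of the given kind, or at the end of the input.
    // ASSUMPTION: returns true when the token was found.
    boolean skipToToken(int kind) {
        while (pos < kinds.length && kinds[pos] != kind) {
            pos++;
        }
        return pos < kinds.length;
    }

    // Same search, but the matched token is consumed as well.
    boolean skipToAfterToken(int kind) {
        boolean found = skipToToken(kind);
        if (found) {
            pos++;
        }
        return found;
    }
}
```

With the kinds 5, 7, 9 as input, skipToToken(7) leaves pos at index 1 (on the token), while skipToAfterToken(7) leaves it at index 2 (just past the token). A search for a kind that never occurs leaves pos at 3, the end of the input.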
boolean SkipToOpenTag(int tag)
Looks for the html tag specified by tag and moves the current token pointer to the first token of that html tag ("<"). The tag must be one of the tokens defined in the TAG state above (A through UNKNOWN). If the tag is not found, the pointer is at the end of the input stream upon return. An open html tag is something like <b>. For example, SkipToOpenTag(html.B) looks for "<b".
boolean SkipToAfterOpenTag(int tag)
Same as SkipToOpenTag above, except that the whole html tag is consumed.
boolean SkipToCloseTag(int tag)
Looks for a closing html tag and moves the token pointer to the "</" token if found. If it is not found, the token pointer is at the end of the html page. For example, SkipToCloseTag(html.B) looks for "</b".
boolean SkipToAfterCloseTag(int tag)
Same as SkipToCloseTag except the entire html tag is consumed.
String GetUntilToken(int tokenKind)
Returns a String containing everything (including html tags) before the first occurrence of a token with kind = tokenKind. The pointer points to that token when finished. If the token is not found, null is returned and the pointer is at the end of the input stream.
String GetParameter(String param)
If the current token pointer is inside an html tag, the method looks for a parameter with name = param and returns its value. The parameter and its value are consumed if found. If the parameter is not found, the method returns null and the pointer points to the closing ">". For example, with <a href="...">, GetParameter("href") will return the value in quotes.
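
To make the lookup concrete, here is a minimal sketch that finds a parameter in the text of a single tag, using the A_NAME and CDATA shapes from the Tokens section. It is not the parser's implementation: ParamSketch is a hypothetical name, the sketch works on a String rather than on the token stream, it does not model pointer movement, and it tolerates whitespace around "=" as a simplification. It returns the value exactly as written, including any quotes; whether the real method strips them is not stated in this guide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the lookup done by GetParameter; not the parser's code.
class ParamSketch {
    // group(1): A_NAME, group(2): CDATA, as defined in the Tokens section.
    static final Pattern ATTR = Pattern.compile(
        "([a-zA-Z_.-][a-zA-Z0-9_.-]*)\\s*=\\s*('[^']*'|\"[^\"]*\"|[^>\"' \\t\\n\\r]+)");

    // Returns the raw value of the named parameter, or null if it is absent.
    // The name is compared exactly; the guide does not say whether the
    // real method ignores case.
    static String getParameter(String tagText, String param) {
        Matcher m = ATTR.matcher(tagText);
        while (m.find()) {
            if (m.group(1).equals(param)) {
                return m.group(2);
            }
        }
        return null;
    }
}
```

For the tag text <a href="index.html" target=_top>, looking up href yields "index.html" with its quotes, target yields _top, and any other name yields null.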
boolean SkipToEndOfTag()
If the pointer is inside a tag, the pointer is moved to the token immediately following the ">" of the tag.
String GetUntilNextTag()
If the pointer is not in a tag, all of the text from the current pointer until a "<" is found is returned in a String. Null is returned if the pointer is inside a tag.
boolean SkipToString(String str)
This method looks at the text outside of html tags, searching for the String str. If the string is found, the pointer is moved to the first token in the string. If the string is not found, the pointer is moved to the end of the input stream and false is returned.
boolean SkipToAfterString(String str)
Same as SkipToString except the string is consumed.
boolean SkipToTag(String tag)
This method looks for a full html tag specified by a String and moves the pointer to the first token ("<") of the tag if found. If not found, the pointer is moved to the end of the input stream. For example, SkipToTag("<b>") will look for the next <b> tag. It returns true if found, false otherwise.
boolean SkipToAfterTag(String tag)
Same as SkipToTag except the tag is consumed if found.
String GetUntilString(String str)
Looks for the specified string outside of html tags and returns all of the text, as a String, starting from the first character after the last html tag found.
String GetUntilTag(String tag)
Returns all of the text including html tags starting from the current pointer until the html tag specified as a String is found. It returns null if not found.
String GetUntilOpenTagOfType(int tagKind)
Returns all the text, including html tags, until an open html tag (<tag) of the specified type is found. The pointer is moved to the "<" token upon return. If the tag is not found, it returns null and the pointer is at the end of the input stream. The int tagKind must be one of the TAG constants defined above.
String GetUntilClosedTagOfType(int tagKind)
Similar to GetUntilOpenTagOfType(int tagKind), except that it looks for a closed tag (</tag). The pointer is moved to "</" if successful. Again, tagKind must be one of the TAG constants defined above.