by zcorpan » Sat Nov 05, 2011 9:57 pm
An HTML tokenizer is the first stage of an HTML parser. To parse HTML, you first tokenize the input stream (a sequence of bytes or characters) into a sequence of tokens, where each token is a run of text, a start tag, an end tag, a doctype, or a comment. The HTML tree builder then consumes that sequence of tokens and builds a DOM tree from it.
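To make the two stages concrete, here is a rough sketch in Python. It is not how a real HTML tokenizer works (the actual algorithm is a character-by-character state machine with many error-recovery rules); it only illustrates the token kinds named above and the hand-off to a tree builder. The names `Token`, `tokenize` and `build_tree` are made up for this example, attributes are ignored, and the tree builder assumes every start tag has a matching end tag.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # "doctype", "comment", "start_tag", "end_tag" or "text"
    data: str  # doctype name, comment body, tag name or text content


# Simplified pattern (an assumption for this sketch, not the spec's algorithm).
_MARKUP_RE = re.compile(
    r"<!--(?P<comment>.*?)-->"
    r"|<!DOCTYPE\s+(?P<doctype>[^>\s]+)[^>]*>"
    r"|</(?P<end>[A-Za-z][^\s/>]*)[^>]*>"
    r"|<(?P<start>[A-Za-z][^\s/>]*)[^>]*>",
    re.DOTALL | re.IGNORECASE,
)


def tokenize(html: str) -> Iterator[Token]:
    """Stage 1: turn a string of characters into a sequence of tokens."""
    pos = 0
    for m in _MARKUP_RE.finditer(html):
        if m.start() > pos:
            # Anything between two pieces of markup is text.
            yield Token("text", html[pos:m.start()])
        if m.group("comment") is not None:
            yield Token("comment", m.group("comment"))
        elif m.group("doctype") is not None:
            yield Token("doctype", m.group("doctype").lower())
        elif m.group("end") is not None:
            yield Token("end_tag", m.group("end").lower())
        else:
            yield Token("start_tag", m.group("start").lower())
        pos = m.end()
    if pos < len(html):
        yield Token("text", html[pos:])


def build_tree(tokens) -> dict:
    """Stage 2: turn the tokens into a nested tree (a stand-in for a DOM)."""
    root = {"name": "#document", "children": []}
    stack = [root]  # stack of currently open elements
    for tok in tokens:
        if tok.kind == "start_tag":
            node = {"name": tok.data, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif tok.kind == "end_tag":
            # Only close the current element if the names match.
            if len(stack) > 1 and stack[-1]["name"] == tok.data:
                stack.pop()
        elif tok.kind == "text":
            stack[-1]["children"].append(tok.data)
        # doctype and comment tokens are dropped in this sketch
    return root


if __name__ == "__main__":
    for token in tokenize("<!DOCTYPE html><p>Hi<!-- c --></p>"):
        print(token)
    print(build_tree(tokenize("<p>Hi <b>there</b></p>")))
```

The point of the split is that the tree builder never looks at raw characters; it only sees tokens, so all the decisions about what counts as a tag, a comment or text are made in one place.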