I've been working on an HTML parser in PHP based on the HTML5 specification. I started it because I wanted a parser that could handle small pieces of HTML, like comments on a weblog or messages on a forum such as this one. Thanks to the specification and html5lib, I was able to get something up and running.
The script, which I dubbed PH5P (the PHP abbreviation was already taken), is written as two classes: a tokenizer class (HTML5) and a tree constructor class (HTML5TreeConstructer). The tokenizer processes the input character by character and sends the resulting tokens to the tree constructor. There are a few cases where the specification isn't followed, because its parsing algorithm expects an entire document, whereas this parser is only meant for small pieces of HTML, such as comments. On a side note, the script requires PHP 5 to run.
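To give a rough idea of how a tokenizer can hand tokens to a tree constructor, here is a minimal sketch. It is not the PH5P source: the class names SketchTokenizer and SketchTreeBuilder, the emitToken() method and the token array layout are all made up for illustration, and the sketch ignores attributes, comments, entities and almost everything else the specification requires.

```php
<?php
// Illustrative only: SketchTokenizer and SketchTreeBuilder are hypothetical
// names, not the HTML5 / HTML5TreeConstructer classes from PH5P.

class SketchTreeBuilder
{
    public $root;             // root node of the parsed fragment
    private $open = array();  // stack of currently open elements

    public function __construct()
    {
        $this->root = $this->node('#fragment');
        $this->open[] = $this->root;
    }

    private function node($name)
    {
        $n = new stdClass();
        $n->name = $name;
        $n->children = array();
        return $n;
    }

    // Receives one token at a time from the tokenizer.
    public function emitToken(array $token)
    {
        $current = $this->open[count($this->open) - 1];
        if ($token['type'] === 'start') {
            $el = $this->node($token['name']);
            $current->children[] = $el;
            $this->open[] = $el;
        } elseif ($token['type'] === 'end') {
            // Close everything up to the matching element; ignore stray end tags.
            for ($i = count($this->open) - 1; $i > 0; $i--) {
                if ($this->open[$i]->name === $token['name']) {
                    $this->open = array_slice($this->open, 0, $i);
                    break;
                }
            }
        } else {
            $current->children[] = $token['data']; // character token
        }
    }
}

class SketchTokenizer
{
    private $tree;

    public function __construct(SketchTreeBuilder $tree)
    {
        $this->tree = $tree;
    }

    // Walks the input character by character and emits tokens.
    public function parse($html)
    {
        $len = strlen($html);
        $text = '';
        for ($i = 0; $i < $len; $i++) {
            $char = $html[$i];
            $close = ($char === '<') ? strpos($html, '>', $i) : false;
            if ($close === false) {
                $text .= $char; // plain character, or a '<' with no closing '>'
                continue;
            }
            $this->flushText($text);
            $tag = substr($html, $i + 1, $close - $i - 1);
            if ($tag !== '' && $tag[0] === '/') {
                $name = strtolower(substr($tag, 1));
                $this->tree->emitToken(array('type' => 'end', 'name' => $name));
            } else {
                $name = strtolower($tag);
                $this->tree->emitToken(array('type' => 'start', 'name' => $name));
            }
            $i = $close; // skip past the tag
        }
        $this->flushText($text);
    }

    // Emits any buffered text as a single character token.
    private function flushText(&$text)
    {
        if ($text !== '') {
            $this->tree->emitToken(array('type' => 'character', 'data' => $text));
            $text = '';
        }
    }
}

// Usage: parse a small fragment, the kind a forum comment might contain.
$tree = new SketchTreeBuilder();
$tokenizer = new SketchTokenizer($tree);
$tokenizer->parse('Hello <b>bold <i>world</i></b>!');
```

After this runs, $tree->root holds the text "Hello ", a b element containing "bold " and an i element, and the trailing "!". The tokenizer knows nothing about nesting; it only reports what it sees, and the tree builder decides where each token goes.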
In the future I'm planning to process the character tokens as well. This will, for instance, allow text to be wrapped in P elements when needed. For now I'd be happy just to have the parser work properly, so I'd appreciate it if you could test it and post any feedback in this topic.
http://jero.net/lab/ph5p/