Comment tokenizer algorithm

The existing comment-tracking systems that I know of just aren't enough. CoComment is buggy and fails to properly parse out comments for a number of blogs, and is missing a number of important features. (Float unread to top of list, for example, or track a URL without visiting it.) Co.mments.com has a much nicer interface and tracks better, but lacks a Firefox extension and some of the advanced features of CoComment.

Rather than simply whining about the lack of excellent trackers, I want to help the existing ones improve. Here I present most of algorithm to parse out comments from an unfamiliar blog template.

Our intrepid user, Bob, has just submitted a comment. The handy firefox extension has raised a sliding alert box to ask if the comment should be tracked, and Bob has clicked the "Track it" option. (This prevents unwanted tracking without requiring him to click a bookmarklet every time.) However, the tracking site isn't familiar with this site or blogging software. Here's what he wrote:

<blockquote>I’m trapped in an example factory!</blockquote>
Ha ha ha, you’re <font color="red">funny</font>, Alice.

Because the font tag is deprecated, the comment system has stripped it out. This leads us to the first challenge -- finding the user's comment. After all, the commenting system may change straight quotes to curly quotes, remove or fix HTML, or convert Markdown (or other) syntax to HTML. As a result, tracker software cannot simply request the blog post and search for the comment text directly.

Retrieve

Before we can find the comment in the page, we need to get the page. The task of determining the URL for the comments, the URL for the permalink, and the URL for the comment form is very nontrivial, and not within the scope of this essay. (Perhaps another time.) In any case, our tracker's crawler retrieves the blog entry and comments, the body of which is displayed below. I'm working with a complicated example -- no separate container for each comment, user starts with a quote from another commenter (or the post), and no integration with our particular tracking system.

<div id="blogpost">
	Ceci n'est pas une blog post.
</div>
<div id="comments">
	<span class="metadata">
		Alice says:
	</span>
	Help! I’m trapped in an example factory!
	<hr />
	<span class="metadata">
		Bob says:
	</span>
	<blockquote>
		I’m trapped in an example factory!
	</blockquote>
	Ha ha ha, you’re funny, Alice.
</div>

Tokenize

Since the text Bob submitted and the HTML the blogging software produced are substantially different, we'll reduce each to its lowest common denominator. Bob's comment is stripped of any HTML tags, then parsed into lowercase alphanumeric tokens:

i, m, trapped, in, an, example, factory, ha, ha, ha, you, re, funny, alice

The blog post needs more processing. First it is parsed into a DOM tree (seen above), and then each text node is tokenized as before. Our goal is to find the deepest node that contains the same tokens in the same order.

<div id="blogpost">
	ceci, n, est, pas, une, blog, post
</div>
<div id="comments">
	<span class="metadata">
		alice, says
	</span>
	help, i, m, trapped, in, an, example, factory
	<hr />
	<span class="metadata">
		bob, says
	</span>
	<blockquote>
		i, m, trapped, in, an, example, factory
	</blockquote>
	ha, ha, ha, you, re, funny, alice
</div>

Traverse and Match

Notice how the beginning of Bob's token array ("i, m, trapped, in, an...") occurs twice within the page. Quoting will generally only cause duplication at the beginning of comments, but you can't take that for granted. In any case, you'll need to write a traverser that recursively descends the DOM and stops on the first instance of the complete token array.

Localize

Here, the goal is to find an CSS/XPath-style selector for the containing block, the begin-delimiter, and the end-delimiter. The containing block is already known, and Bob's tokens are spread across a number of its children. The begin-delimiter is the previous-sibling of the first of these children, and the end-delimiter is the next-sibling of the last of them. Symbolically, here's what the parser sees:

Containing block
html[1] > body[1] > div[@id='comments']
Begin-delimiter
span[@class='metadata'][1]
End-delimiter
-end-of-container-

Triangulate

There is still not enough information -- we need to have at least two comments analyzed. There are two options:

  • Wait around and hope that a second user of our tracker will leave a comment, or
  • Ask [the user] for help

If there isn't enough information to automatically extract the structure, ask the user to select the text of a different comment, if available, or select the "no other comments" button.

Now Carol comments, and either she is a user of the tracker or a user identifies her comment for the system. There is finally enough information. Bob's comment, according to the parser:

Containing block
html[1] > body[1] > div[@id='comments']
Begin-delimiter
span[@class='metadata'][2]
End-delimiter
hr[following-sibling::span[@class='metadata'][3]] (stops there because the next sibling is Carol's comment)

Carol's comment:

Containing block
html[1] > body[1] > div[@id='comments']
Begin-delimiter
span[@class='metadata'][3]
End-delimiter
-end-of-container-

Sequence matching

Find the maximal-length XPath starting sequence that matches for each marker. A couple of notes:

  • The sequences A/B/C and A/B/D are matched as A/B.
  • The sequences A/B[1]/C and A/B[2]/C are matched as A/B[n]/C. (Must correlate to comment number, though.)
  • The end delimiter is too variable here to have a common start-string, but notice that it is always the last comment that has "end-of-container". Ignoring that, as the parser can easily be instructed, the end-delimiter can now be properly matched.

From this information, the tracker can conclude that comment n will be a child of (or collection of children of) the block html[1] > body[1] > div[@id='comments'], will begin right after the child span[@class='metadata'][n], and will end with either hr[following-sibling::span[@class='metadata'][n+1]] or the end of the container.

Conclusion

With these XPath queries stored in a database of sites, comment tracking should become much more powerful and accurate. A small amount of user assistance helps the tracker with the edge cases, but the parser does the majority of the work.

This algorithm is up for grabs. I'd love to see it in use.


Responses: 2 so far

  1. Cairnarvon says:

    Most blogging software nowadays provides comment RSS feeds too, which should simplify things a bit.

  2. Tim McCormack says:

    Certainly! But if you want your tracker to capture nearly everything, it had better be able to capture from blogs that don't support comment feeds.