rel=canonical

Matt Cutts recently posted on the topic of canonicalizing URLs. He strongly recommends consistency in usage of www vs. non-www, / vs. /index.php, etc. (Roger Johansson has two lines of .htaccess code that will solve the www issue.) I'd like to try a more client-side approach, just for the fun of it.

Let's have an example. Looking at the flat-panel TVs at Crutchfield.com, I see that there are multiple pages of results. Going to page 2 appends &showall=0&pg=2 to the URL. As is common across the internet, clicking on the link back to page 1 changes those appended parameters to &showall=0&pg=1. Now we have a page absolutely identical to the original, but with a different URL. This is known as the canonicalization problem for search engines -- how to determine which URL is most likely the "more correct" one.
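
To make the duplication concrete, here is a rough Python sketch. It assumes that showall=0 and pg=1 are the defaults, so a URL carrying them serves the same page as one without them. The example.com URLs and the name strip_default_params are made up for illustration; they are not Crutchfield's real URLs or anything a search engine is known to run.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: a parameter carrying its default value does not change the page.
DEFAULT_PARAMS = {"showall": "0", "pg": "1"}

def strip_default_params(url, defaults=DEFAULT_PARAMS):
    """Return url without any query parameter whose value equals its default."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if defaults.get(key) != value
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical URLs standing in for the listing pages described above.
first_visit = "http://www.example.com/tvs?cat=flat-panel"
back_to_page_one = "http://www.example.com/tvs?cat=flat-panel&showall=0&pg=1"
page_two = "http://www.example.com/tvs?cat=flat-panel&showall=0&pg=2"

# The two spellings of page 1 collapse to one URL; page 2 stays distinct.
print(strip_default_params(back_to_page_one) == first_visit)
print(strip_default_params(page_two))
```

A search engine has no way of knowing those defaults from the outside, which is exactly why it has to guess which URL is the "more correct" one.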

Incidentally, don't confuse the canonical URL with the most standard view of a page. A set of results that is sorted based on a parameter in the URL should have its own canonical URL, different from that of the default view. In other words, the canonical URL should serve up the same data as any non-canonical (but valid) URL that points to it. The URL that shows all items together on one page may be more comprehensive, but it is not the canonical one.

Johansson's solution requires server-side code to redirect the user agent to the canonical URL. But I've been doing some reading about some of the first web pages, and how they were interlinked. At one time, rel=contents, rel=previous, and rel=next were considered indispensable and unlikely to disappear, and the likes of them have only recently begun to resurface. I was thinking of proposing the microformat rel=canonical. Here's how I envisioned it in use:

Page URL: http://brainonfire.net/index.php

Code:

...
<head>
	...
	<link rel="canonical" href="http://www.brainonfire.net/" />
	...
</head>
...

So, any search engine that comes across apparently duplicate pages could simply check the canonical links and use those as further hinting. Great idea, huh?
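
Here is a minimal sketch of how a crawler might read that hint, using only Python's standard html.parser module. The class and function names are hypothetical, and I'm assuming the crawler simply takes the first rel="canonical" link it finds; a real search engine would presumably be more careful.

```python
from html.parser import HTMLParser

class CanonicalLinkFinder(HTMLParser):
    """Collects the href of every <link> whose rel list includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here by default.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.hrefs.append(attributes["href"])

def find_canonical(html):
    """Return the first canonical href in the markup, or None if there is none."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.hrefs[0] if finder.hrefs else None

page = """
<html>
<head>
	<title>Brain on Fire</title>
	<link rel="canonical" href="http://www.brainonfire.net/" />
</head>
<body></body>
</html>
"""

print(find_canonical(page))  # http://www.brainonfire.net/
```

Two fetched pages that report the same canonical href could then be treated as one document.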

The potential problem I foresee with this is the use of the <link /> tag. The href is allowed to be an absolute, relative, fully specified, or any other type of URL, whereas a canonical address must be fully specified. This doesn't prevent the search engine from reading the data literally, but I feel it is the wrong "data type"; namely, path instead of URL.
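
A short sketch shows why a relative href is a weak hint. It uses urljoin from Python's standard library to resolve the href against the page's own URL, the way a user agent normally would; the helper name resolve_canonical is just for illustration.

```python
from urllib.parse import urljoin

def resolve_canonical(page_url, href):
    """Resolve a canonical href against the URL the page was fetched from."""
    return urljoin(page_url, href)

# A relative href inherits whatever host the page was fetched from,
# so the www question is left unanswered.
print(resolve_canonical("http://www.brainonfire.net/index.php", "/"))
# http://www.brainonfire.net/
print(resolve_canonical("http://brainonfire.net/index.php", "/"))
# http://brainonfire.net/

# A fully specified href gives the same answer no matter where it is read.
print(resolve_canonical("http://brainonfire.net/index.php", "http://www.brainonfire.net/"))
# http://www.brainonfire.net/
```

Only the fully specified form settles both the host and the path.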

A possible solution to my semantic unease would be the use of a <meta /> tag instead. Meta tags 1) don't imply fetching, 2) don't have an implied data type, and 3) still contain meta information.

What other ways can you think of to indicate the canonical address?


Responses: 2 so far

  1. Xaprb says:

    There's already rel=permalink right?

    Why does the canonical address have to be fully specified? URI resolution and the document's BASE (explicit or implied) should take care of that. It shouldn't matter whether it's fully specified.

    Well done :-) Thanks especially for reminding me of the next/prev/contents stuff. I used to put that in all my sites, but I've never seen a user agent other than Opera and Lynx that pays attention to them, so I kinda lost my drive to do it.

  2. Tim McCormack says:

    Xaprb: Good point about rel=permalink, though I think that's more proper for dynamically-placed content, like a blog entry on the blog's front page.

    The canonical address has to be fully specified so that nothing is assumed on the user agent's part. If I put <link rel="canonical" href="/" />, and the current URL is http://www.brainonfire.net/, then the canonical URL would be http://www.brainonfire.net/. If the current URL were http://brainonfire.net/ (with no www), then that would be the canonical URL. Of course, if you used Johansson's solution to take care of the server/subdomain aspect, the canonical URL would not need to be fully specified anymore.

    I was reminded of the next/prev/contents stuff when I was reading Tim Berners-Lee's blog -- yes, he has one now! He linked to an article on the first web browser, WorldWideWeb. Very interesting article -- apparently, it was a read-write browser. "Web 2.0" is really more of a rediscovery.