Matt Cutts recently posted on the topic of canonicalizing URLs. He strongly recommends consistent usage of www vs. non-www, / vs. /index.php, etc. (Roger Johansson has two lines of .htaccess code that will solve the www issue.) I'd like to try a more client-side approach, just for the fun of it.
Let's have an example. Looking at the flat-panel TVs at Crutchfield.com, I see that there are multiple pages of results. Going to page 2 appends &showall=0&pg=2 to the URL. As is common across the internet, clicking on the link back to page 1 changes those appended parameters to &showall=0&pg=1. Now we have a page absolutely identical to the original, but with a different URL. This is known as the canonicalization problem for search engines: how to determine which URL is most likely the "more correct" one.
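To make the duplicate concrete, here is a minimal Python sketch of one way a crawler might collapse such URLs. It assumes that showall=0 and pg=1 are the defaults, so that including them changes nothing about the page. The function name, the defaults table, and the example.com URL are all hypothetical, not anything Crutchfield or a search engine is known to use.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed defaults: values that yield the same page as leaving the parameter out.
DEFAULT_PARAMS = {"showall": "0", "pg": "1"}

def strip_default_params(url, defaults=DEFAULT_PARAMS):
    """Return url with any query parameter removed when it equals its assumed default."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if defaults.get(key) != value
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical listing URL; the page-1 link and the original now compare equal.
original = "http://www.example.com/list?c=tv"
back_to_page_one = "http://www.example.com/list?c=tv&showall=0&pg=1"
same_page = strip_default_params(back_to_page_one) == original
```

The catch is that the crawler has to guess the defaults for every site, which is exactly why a hint from the page itself would be more reliable.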
Incidentally, don't confuse the canonical URL with the most standard view of a page. A set of results that is sorted based on a parameter in the URL should have its own canonical URL, different from that of the default view. In other words, the canonical URL should serve up the same data as any non-canonical (but valid) one. The URL that shows all items together on one page may be more comprehensive, but it is not the canonical one.
Johansson's solution requires server-side code to redirect the user agent to the canonical URL. But I've been doing some reading about some of the first web pages, and how they were interlinked. At one time, link relations such as rel=next were considered indispensable and unlikely to disappear, and the likes of them have only recently begun to resurface. I was thinking of proposing the microformat rel=canonical. Here's how I envisioned it in use:
Page URL: http://brainonfire.net/index.php
Code:
...
<head>
...
<link rel="canonical" href="http://www.brainonfire.net/" />
...
</head>
...
So, any search engine that comes across apparently duplicate pages could simply check the canonical links and use them as a further hint. Great idea, huh?
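As a rough illustration of the crawler side, the Python sketch below pulls the first rel=canonical link out of each page and groups page URLs by it. The class and function names are made up for this example, and a real search engine would surely weigh the hint against other signals rather than trust it blindly.

```python
from collections import defaultdict
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link> whose rel includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed here by HTMLParser as well.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]

def find_canonical(html):
    """Return the declared canonical href, or None if the page declares none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical

def group_by_canonical(pages):
    """Group page URLs by declared canonical URL, falling back to the page's own URL."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[find_canonical(html) or url].append(url)
    return dict(groups)

head = '<html><head><link rel="canonical" href="http://www.brainonfire.net/" /></head></html>'
pages = {
    "http://brainonfire.net/index.php": head,
    "http://www.brainonfire.net/": head,
}
groups = group_by_canonical(pages)
# Both URLs end up under "http://www.brainonfire.net/".
```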
The potential problem I foresee with this is the use of the <link /> tag. The href is allowed to be an absolute, relative, fully-specified, or any other type of URL, whereas a canonical address must be fully specified. This doesn't prevent the search engine from reading the data literally, but I feel it is the wrong "data type"; namely, path instead of URL.
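Here is a small Python sketch of why a relative href is awkward in this role. It resolves the href against the page's own URL, as a user agent normally would, so a relative value simply inherits whichever host the page happened to be fetched from. The function names are hypothetical.

```python
from urllib.parse import urljoin, urlsplit

def is_fully_specified(href):
    """True only when href carries both a scheme and a host."""
    parts = urlsplit(href)
    return bool(parts.scheme and parts.netloc)

def resolve_canonical(page_url, href):
    """Resolve href the usual way: relative values are joined onto the page URL."""
    return href if is_fully_specified(href) else urljoin(page_url, href)

page = "http://brainonfire.net/index.php"

# A fully-specified href says exactly what the author means.
explicit = resolve_canonical(page, "http://www.brainonfire.net/")

# A relative href inherits the non-www host of the page it sits on,
# so the same markup names a different "canonical" URL on each duplicate.
inherited = resolve_canonical(page, "/")
```

In this sketch, explicit comes out as the www address while inherited comes out as http://brainonfire.net/, which defeats the purpose of the hint.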
A possible solution to my semantic unease would be the use of a <meta /> tag instead. Meta tags 1) don't imply fetching, 2) don't have an implied data type, and 3) still contain meta information.
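For comparison, a meta-based reader might look like the Python sketch below. The attribute name "canonical" is purely an assumption for illustration, since no such meta name is defined anywhere that I know of. The content is treated as an opaque string and accepted only when it is fully specified. Nothing is ever resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class CanonicalMetaParser(HTMLParser):
    """Reads <meta name="canonical" content="..."> (a hypothetical meta name)."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.canonical is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "canonical":
            return
        value = (attr.get("content") or "").strip()
        parts = urlsplit(value)
        # Accept only fully-specified addresses; relative values are ignored.
        if parts.scheme and parts.netloc:
            self.canonical = value

def canonical_from_meta(html):
    """Return the fully-specified canonical address from the meta tag, or None."""
    parser = CanonicalMetaParser()
    parser.feed(html)
    return parser.canonical

good = canonical_from_meta('<head><meta name="canonical" content="http://www.brainonfire.net/" /></head>')
bad = canonical_from_meta('<head><meta name="canonical" content="/" /></head>')
```

Here good holds the www address and bad is None, since a bare path is not a usable canonical address.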
What other ways can you think of to indicate the canonical address?