When looking at a listing of links on, have you noticed that some people use next to no tags while others have 10 or even 20? I used to get annoyed at people who (in my mind) insufficiently tagged their posts, but I’ve been reconsidering my position on these core-taggers. I think they may paradoxically improve the relevance of search results.

Warning: Poorly-collected thoughts ahead. Caveat lector.

When you post a link, you are given the option of entering a space-delimited list of tags to describe that link. Even though all the tags have different degrees of actual relevance to the link, the system treats them as equally relevant. This is a type of boolean indexing: each tag is either fully present or fully absent.

All the tags a person uses for a post hold the same weight. If I tag this entry as “blog article post tagging analysis longtail emergent graph”, then ‘’, ‘analysis’, and ‘graph’ receive equal weight. Naturally, more people will use ‘’ than ‘graph’ — that’s how the head-tail distribution emerges. But for a given post, the system will treat ‘graph’ and ‘’ as equally good descriptors of the link.
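To make the equal-weight idea concrete, here is a minimal sketch of how per-link tag counts could be tallied. The function name `tag_counts` and the sample posts are my own illustration, not the site's actual implementation. Each post contributes exactly one vote per tag it contains, whether or not the tagger thought that tag was central.

```python
from collections import Counter

def tag_counts(posts):
    """Count how many posts applied each tag to a single link.

    `posts` is a list of space-delimited tag strings, one per user.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter()
    for tag_string in posts:
        # A tag is either present or absent on a post, so duplicates
        # within one post are collapsed before counting.
        counts.update(set(tag_string.split()))
    return counts

# Made-up sample: one fringe tagger and two core taggers on the same link.
posts = [
    "blog article post tagging analysis longtail emergent graph",
    "tagging analysis",
    "tagging blog",
]
counts = tag_counts(posts)
print(counts.most_common())
```

Within the first post, 'graph' counts exactly as much as 'tagging'. Any difference between them shows up only once several people's posts are added together.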

Fig. 1: Results of fringe-tagging (descending bar graph with low, roughly equal-height bars)

If only one person tags a link, each term will have the same value. There will only be a tail: terms that only a few people have used (a.k.a. fringe tags). Now imagine that everyone is a fringe tagger. The graph for a given link will be quite long and quite flat, possessing no distinct head (Fig. 1). No overall theme will emerge for each link, leaving the search results filled with junk matches. There will still be a head of sorts (composed of the intersection of people’s tag choices), but it will be much broader and will have no internal definition, since each tagger will likely use all of the terms composing the head.

Fig. 2: Results of core-tagging (descending bar graph with sharp dropoff and short tail)

So, we can see that core taggers are extremely important in a tagsonomy. They provide definition and body to the tagscape, isolating the few most important terms. Disagreements between core taggers make for a slightly more diverse head, but they will likely agree on the main terms. Unfortunately, with only core taggers, the tail disappears and tagging becomes nothing but cross-categorization (Fig. 2). All the tangentially-related terms fall by the wayside in favor of the most obvious ones, and niche links go unfound, rendering the system useless to the tail-searchers.
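The shapes described here can be sketched with a toy model. Everything in it is an assumption made for illustration: the tag vocabulary, the rule that core taggers pick two terms, the rule that fringe taggers use the three head terms plus two rotating tail terms, and the population sizes. It is not real data, just a way to see how the counts fall out for all-fringe, all-core, and mixed populations.

```python
from collections import Counter

# Hypothetical tags for one link, ordered from most to least obvious.
VOCAB = ["tagging", "analysis", "blog", "longtail",
         "emergent", "graph", "article", "post"]
HEAD, TAIL = VOCAB[:3], VOCAB[3:]

def tags_for(style, i):
    """Return the tags chosen by the i-th tagger of a given style."""
    if style == "core":
        # Core taggers agree on the top term and split on the second.
        return [VOCAB[0], VOCAB[1 + i % 2]]
    # Fringe taggers use every head term plus two rotating tail terms.
    return HEAD + [TAIL[i % len(TAIL)], TAIL[(i + 1) % len(TAIL)]]

def profile(styles):
    """Tag counts for one link, sorted from most to least used."""
    counts, seen = Counter(), Counter()
    for style in styles:
        counts.update(tags_for(style, seen[style]))
        seen[style] += 1
    return sorted(counts.values(), reverse=True)

all_fringe = profile(["fringe"] * 10)
all_core = profile(["core"] * 10)
mixed = profile(["core"] * 6 + ["fringe"] * 4)

print(all_fringe)  # [10, 10, 10, 4, 4, 4, 4, 4]
print(all_core)    # [10, 5, 5]
print(mixed)       # [10, 7, 7, 2, 2, 2, 1, 1]
```

Under these assumptions the all-fringe head is broad and undifferentiated, the all-core profile has no tail at all, and only the mixed population gives both a defined head and a tail.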

Fig. 3: Results of mixed tagging (descending bar graph with longer dropoff and asymptotic tail)

I’ve really come to respect the way diverse tagging styles are necessary for the nice distributions we see on (Fig. 3). Have you noticed anything about how diversity affects and similar systems?

  1. Tim McCormack says:

    A little pre-emptive criticism of this post:

    It strikes me as I look back over this post that I was slightly less than 'eloquent', rather more in the domain of 'redundant', 'grasping', and 'rambling'. I'd like to see someone take the idea and build a more coherent viewpoint.

    It might also be neat to use real data for those graphs, instead of drawing them to fit the model...

