why no one likes internet statistics sites

Alexa just changed their ranking system; TechCrunch has the details. (The Alexa page, in typical clueless fashion, is not a permalink, so I won’t bother linking to it.) This is good news, but it still doesn’t fix the problem. Why?

Alexa, Compete, Quantcast, and Comscore all have measures of “Rankings” or “Popularity” that place sites relative to others. These rankings are, in general, a more-or-less accurate relative description of how a site is doing.

The key word here, though, is relative, and, more specifically, relative to how the service in question measures other sites. Comparing a Quantcast ranking to a Compete ranking to an Alexa ranking for any given site is useless, and no one even attempts that.

Rankings for each site are widely understood to be relative within its own index, and no one has a problem with that. So far so good.

The real problems start when we look at their measures of Visitors (Alexa calls that “Reach,” as far as I know) and Visits. Some news publication may take a Comscore measurement and say “such and such a site has 1,000,000 visits,” inevitably prompting discussion of whether it’s true or not and (generally) silent fuming on the part of the site that sees different data in its logs, every day.

This is the problem. The use of heavily overloaded terms like “Visitors,” “Visits,” “Pageviews,” or even “Reach” causes the confusion. None of these sites claim that they have the ultimate truth at their disposal, but by using common terms, that is exactly what they imply. After all, everyone knows what a ‘Visitor’ is, right? Well, maybe. Maybe everyone does know what a Visitor or a Visit is, but no one agrees on the definition.
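To make the point concrete, here is a minimal sketch with made-up log records (all names and addresses are hypothetical) showing that the very same traffic yields different “unique visitor” counts depending on which definition you pick:

```python
# Hypothetical log records: three people sharing one internet-cafe IP,
# plus one person who shows up from two different IPs (home and mobile).
records = [
    {"ip": "203.0.113.5", "cookie": "a1"},   # cafe, person A
    {"ip": "203.0.113.5", "cookie": "b2"},   # same cafe IP, person B
    {"ip": "203.0.113.5", "cookie": "c3"},   # same cafe IP, person C
    {"ip": "198.51.100.7", "cookie": "d4"},  # person D at home
    {"ip": "198.51.100.9", "cookie": "d4"},  # person D again, on mobile
]

# Three plausible definitions of "unique visitor", three answers:
by_ip = len({r["ip"] for r in records})                   # unique IPs
by_cookie = len({r["cookie"] for r in records})           # unique cookies
by_pair = len({(r["ip"], r["cookie"]) for r in records})  # IP+cookie pairs

print(by_ip, by_cookie, by_pair)  # 3 4 5
```

Same five log lines, and the “visitor” count is 3, 4, or 5 depending on a definition nobody has agreed on.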

Even if everyone did agree on the definition, the numbers would still miss data unless you’re looking at the logfiles for the service itself. Why? Several reasons, but the three key ones are:

  1. The services extrapolate traffic based on measures they choose. Try a site that has low traffic, and the services give up. “Not enough data.” Yep.

  2. Not only that, the services extrapolate based on previously filtered data. The raw log data for a year for any of the top 1,000 sites can be counted in the petabytes, if not exabytes. None of these services have enough compute power or storage at their disposal to process that, clearly. So they are pre-filtering information, which is then extrapolated. The prefiltering presumably eliminates bots. What if a new bot shows up? How do they count it?

  3. Domain mapping. Consider Ning. Suppose we agreed on what a pageview or a visit is. Would Alexa or Compete get the right numbers? No, because they key on domains to decide what counts as traffic to a service. In our case (as in many others) the service lets you domain-map your site, which means these services think the traffic is going somewhere else, even though it’s going to Ning. That’s why I can assert without hesitation that the visits/visitors traffic reported by these services misses a good portion of the traffic Ning handles. When on top of that you add uncertainty as to what a visit is, and what a visitor is (is it an IP? A cookie? A combination? What about internet cafes? And so on), the actual absolute value for those things is pretty much meaningless.
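The domain-mapping problem can be sketched in a few lines. The hostnames and the mapping table below are hypothetical; the point is that a service keying on the domains it observes splits one platform’s traffic into unrelated buckets:

```python
from collections import Counter

# Hypothetical hits: one Ning network on its default domain, and one
# domain-mapped Ning network served from a custom domain.
hits = [
    "mynetwork.ning.com",
    "mynetwork.ning.com",
    "coolband.example.com",  # a domain-mapped Ning network
    "coolband.example.com",
    "coolband.example.com",
]

# What a measurement service sees: traffic keyed by observed domain.
by_domain = Counter(hits)

# The mapping only the platform itself knows about (hypothetical):
mapped_domains = {"coolband.example.com": "ning"}

def platform_for(host):
    """Attribute a hostname to the platform that actually served it."""
    if host.endswith(".ning.com") or mapped_domains.get(host) == "ning":
        return "ning"
    return host

# What the platform's own logs show: all five hits belong together.
by_platform = Counter(platform_for(h) for h in hits)

print(by_domain["mynetwork.ning.com"])  # 2 hits credited by the service
print(by_platform["ning"])              # 5 hits in the platform's logs
```

The service, lacking the mapping table, credits 2 hits to one domain and 3 to another it has never heard of; the platform’s logs show 5 hits to the same service.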

Now, what the services do track correctly in a lot of cases is trends, especially over several months (again, in the case of Ning the fact that they miss domain-mapped networks is a big problem, but the data we have for non-domain mapped networks shows similar, and I say again, similar shapes and trajectories).

My take is that if these services stopped calling what they measure “visits” or “visitors” and just called it some sort of generic “[servicename] traffic measure,” they would get a lot more respect. They deserve it: they do provide a valuable service, and they should get credit for that.