As the British Library announces plans to archive UK-based websites, a new analysis has revealed that huge numbers of live sites are never visited.
Analysing visits to several million websites during the last quarter of 2009 for its State of the Web report (registration required), cloud security startup Zscaler created a Hilbert curve-generated ‘heatmap' of active and inactive IPv4 sites from real customer data.
As expected, the grid that emerged showed clusters of active sites as white dots and a large volume of reserved or non-routed addresses in gray, but it was the sea of dark that loomed largest of all.
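For readers curious how a Hilbert curve turns a one-dimensional address space into a two-dimensional grid, here is a minimal sketch in Python. The report does not describe Zscaler's actual method, so this is an illustration under stated assumptions: the function names `hilbert_d2xy` and `ip_to_cell`, and the choice of grouping addresses by a fixed prefix length, are hypothetical.

```python
import ipaddress


def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order-wide grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so consecutive distances stay adjacent.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ip_to_cell(address, prefix_bits=24):
    """Return the grid cell for the block containing an IPv4 address.

    Assumption: addresses are grouped by their leading prefix_bits, which
    must be even so that the blocks fill a square grid exactly.
    """
    if prefix_bits % 2 or not 0 < prefix_bits <= 32:
        raise ValueError("prefix_bits must be an even number from 2 to 32")
    block = int(ipaddress.IPv4Address(address)) >> (32 - prefix_bits)
    return hilbert_d2xy(prefix_bits // 2, block)
```

With `prefix_bits=8` the whole IPv4 space fits on a 16 by 16 grid, and `ip_to_cell("0.0.0.0", 8)` lands in cell `(0, 0)`. Because numerically neighbouring blocks always fall in adjacent cells, contiguous address ranges appear as contiguous patches, which is one reason this layout is popular for maps of address space.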
In the three months of the analysis, vast numbers of sites were not visited at all. On the assumption that Zscaler's customers are typical of Internet users more generally, these are the Internet's lost continent: sites nobody ever visits, or visits so infrequently that it doesn't register.
"It's a fascinating view which exposes just how vast the Internet truly is. Even when analyzing traffic from millions of users over the course of three months, it can be seen that much of the Internet remains untouched," say the authors.
Commentators often refer to the ‘dark side of the web', meaning the criminal and unsavoury parts of the Internet few normally look closely at, but what Zscaler has turned up on its map is dark in a more literal sense. Nobody looks at these sites or, if they do, it is incredibly hard to detect from the US cloud.
Some of this ‘unlit space' could, of course, be non-English speaking domains beyond the ken of Zscaler's customer base, which raises the possibility that there are several 'long dark tails' on the Internet, depending on the point from which you measure the phenomenon.
Part of the explanation for what does not get visited in Zscaler's report might also lie in what does.
According to the company, even half a decade ago the web was just that, a space defined by HTML files. Although many persist in seeing the web in this way, the file types moving across its servers have changed markedly. Now, more than half of such files are JPEGs or GIFs, with HTML files accounting for only 0.57 percent of files.
Popular domains also dominate the Internet, hoovering up more and more of people's attention. Liveperson, Google, doubleclick (the web ad distribution network), Yahoo, Facebook, and a clutch of less well known but structurally important web domains took a large percentage of all web visits, a sign that the web is becoming more concentrated on fewer locations. This is the part of the Internet that is growing.
Tellingly, a similar story of concentration is seen in terms of malware hosts, though with considerable fluctuations. Depending on the particular type of scam being looked at, huge numbers of malicious URLs emanate from a very small number of hosts. Whether botnets, phishing websites, or malware servers, there is usually a single mega-source, one or two large sources, and a large number of sources with extremely small shares.