The British Library has launched an online archive of UK websites, based on IBM systems.
The UK Web Archive will run on the IBM BigSheets system, which is based on the Apache Hadoop Java framework, and promises to process large amounts of data “quickly and efficiently”.
It aims to let users easily extract information from unstructured data by publishing searchable data feeds, pie charts and tag clouds. It will also help British Library archivists to extract, transform, annotate and analyse web pages.
Six thousand websites already have their pages stored, in collections including ‘2010 General Election’, which holds the websites of MPs, and ‘Credit Crunch’, a collection initiated in 2008 which holds records of high street retail chains that have gone bust in the recession.
The archive was unveiled this morning by Margaret Hodge MP, minister for culture and tourism, alongside the British Library’s chief executive, Dame Lynne Brindley. The project is intended to demonstrate the importance and value of the nation’s digital memory.
Brindley said the website would aim to create a record of the “major cultural and social issues being discussed online”, and to “avoid the creation of a digital black hole in the nation’s memory”.
But she lamented the current legal framework, which required copyright permission to archive even free websites, and which would allow only one percent of all free UK websites to be collected by 2011 at current rates. There are eight million sites in the UK web domain.