Data Portals Become Fashionable: Time to Worry?

Yesterday I mentioned Nigel Shadbolt, who has played a leading role in the opening up of government data in the UK. By chance, I've just come across a report [.pdf] he wrote for the EU about doing much the same, but on a larger scale. Curiously, it is dated 15 December 2010, but this is the first I've seen of it. Either it's been buried deep within the Brussels system, or I've been remiss in catching it. Either way, it's still well worth reading.

Here's the thinking that lies behind it:

A number of EU Member States (e.g. in the UK) and elsewhere (the US) have developed or are now in the process of creating open data catalogues and portals of Public Sector Information (PSI). This approach is also beginning to be implemented in a limited number of municipalities and regions (e.g. Catalunya, Piemonte). These initiatives are being undertaken by the public sector and are providing access to a broad and substantial set of government data, making data accessible in electronic 'raw' formats (e.g. XLS, CSV) allowing for immediate re-use. These initiatives have attracted widespread interest and are highlighting how public data can be made more accessible and available for reuse. The setting up of an EU portal aggregating national portals (UK, FR, ES, etc.) could become an EU flagship initiative allowing governments, companies and citizens to easily find, understand and re-use data created and maintained by the European institutions and Member States. This report is a scoping paper intended to offer guidance as to what the initiative could offer, the services it might provide, as well as the problems to be surmounted and a preliminary tentative project plan to implement a working system.

So the basic plan is to create an EU-wide data portal along the lines of the national ones that already exist.

Seven major benefits are outlined, including:

transparency – simply put the provision of detailed information relating to European common interests such as taxation, spending, education, transport, energy, environment, crime, health etc. enables citizens to be better informed and to be able to make comparisons with and between states.

Related to this is accountability – information in these sectors holds the providers of public services accountable – from spending on infrastructure to the timeliness of transport, from death rates in hospitals to employability of graduates.

These can then lead to:

evidence-based policy – this provides EU institutions and Member States with the ability to base their policy decisions on empirical data – data that is open to public scrutiny.

Open data can also lead to social value - "the public good to which the data can be put" - economic value, via "new services that can generate real economic returns", and efficiencies, for example from better procurement, or by cutting the cost of publishing data in non-machine readable formats.

Finally, Shadbolt notes that "there is growing evidence that open data can improve data quality... crowd sourcing improvement is possible once data is openly published and a feedback means is provided." He also points out that there are a number of additional potential benefits that a pan-European portal would provide, including a "race to the top", whereby friendly competition between states would lead to more open data being made available, and greater interoperability across processes.

The portal itself would offer two basic functions:

firstly the locator services that use metadata to catalogue and discover relevant content and secondly records creation that support the creation and deposition of data assets into the portal.
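To make those two functions a little more concrete, here is a minimal sketch in Python of what a catalogue with a "records creation" side and a "locator" side might look like. It is purely illustrative: the DatasetRecord fields, the Catalogue class, its deposit and locate methods and the three sample datasets are all my own invention, not anything specified in the report, and a real portal would sit on a proper search engine rather than an in-memory dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str
    title: str
    publisher: str
    keywords: set[str] = field(default_factory=set)
    formats: set[str] = field(default_factory=set)


class Catalogue:
    """Toy in-memory catalogue offering the two basic portal functions."""

    def __init__(self) -> None:
        self._records: dict[str, DatasetRecord] = {}

    def deposit(self, record: DatasetRecord) -> None:
        """Records creation: add (or overwrite) a dataset's metadata entry."""
        self._records[record.identifier] = record

    def locate(self, term: str) -> list[DatasetRecord]:
        """Locator service: match a term against titles and keywords."""
        needle = term.lower()
        hits = [
            r for r in self._records.values()
            if needle in r.title.lower()
            or needle in {k.lower() for k in r.keywords}
        ]
        # Sort by identifier so results come back in a stable order.
        return sorted(hits, key=lambda r: r.identifier)


# Made-up sample data, for illustration only.
catalogue = Catalogue()
catalogue.deposit(DatasetRecord(
    "uk-road-traffic", "Road traffic counts", "UK", {"transport"}, {"CSV"}))
catalogue.deposit(DatasetRecord(
    "es-rail-timetables", "Rail timetables", "ES", {"transport", "rail"}, {"XLS"}))
catalogue.deposit(DatasetRecord(
    "fr-hospital-stats", "Hospital statistics", "FR", {"health"}, {"CSV"}))

for record in catalogue.locate("transport"):
    print(record.identifier, sorted(record.formats))
```

The point of the sketch is simply that both functions hang off the same metadata: whatever is captured when a dataset is deposited is exactly what the locator can later search on.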

The rest of the report then runs through some of the more technical aspects (although at a fairly general, and thus approachable, level) - things like schema designs ("a common semantic basis for meaningful searching"), linked data (including Berners-Lee's "five star" classification) and what might be a reasonable time-frame for setting up such a system.
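For readers who haven't met it, Berners-Lee's scheme awards one star for data published on the Web under an open licence, a second for machine-readable structured data, a third for using a non-proprietary format, a fourth for using URIs to identify things, and a fifth for linking to other people's data. Each star presupposes the ones before it. Here is a rough sketch of that cumulative logic in Python; the function and parameter names are mine, chosen purely for illustration, and the examples are simplifications.

```python
def star_rating(
    open_licence: bool,
    machine_readable: bool,
    non_proprietary_format: bool,
    uses_uris: bool,
    linked_to_other_data: bool,
) -> int:
    """Count stars cumulatively: stop at the first criterion that fails."""
    criteria = [
        open_licence,
        machine_readable,
        non_proprietary_format,
        uses_uris,
        linked_to_other_data,
    ]
    stars = 0
    for satisfied in criteria:
        if not satisfied:
            break
        stars += 1
    return stars


# Simplified examples, assuming every file is published under an open licence:
scanned_pdf = star_rating(True, False, False, False, False)
xls_file = star_rating(True, True, False, False, False)
csv_file = star_rating(True, True, True, False, False)
print(scanned_pdf, xls_file, csv_file)
```

On this reading, the "raw" XLS and CSV files mentioned in the report's introduction would sit at two and three stars respectively, which gives some idea of how far most portals still have to go to reach linked data.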

There was also a section on "recommendation services", which I found particularly intriguing:

A range of recommendation services would be possible in a portal of this size and ambition. For example,

Similar users also downloaded... Amazon-style recommendations. Some users will have a longer and more interactive experience with the portal than others. The system can provide assistance to naïve users or newcomers by correlating their profiles with expert users using features of the resources they search for, records they view, and the datasets they download. This implicit collaboration of expert with inexperienced users will certainly help individuals to mine resources that are not readily available or ranked high on their search results. Recommendations of this type may be delivered in the form of "see what similar users are searching", or while the user is viewing a specific record.

Recommendations of similar datasets for locations nearby the user's specific geography. This can be employed on a per-record recommendation basis i.e. offer the recommendation while the user is viewing a specific record.

Popular and featured datasets present on the main page.

Supporting network formation of "people like me" is another recommendation method that is widely and successfully used.

These all address a crucial issue that is often overlooked. Making the data open is not enough: it must also be made as accessible as possible, to as many people as possible. Much of that depends on other factors – for example, whether people have computers and access to the Internet, but other aspects can be addressed by the design of the open data system using, for example, the techniques described above.
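As an illustration of how simple the first of those techniques can be, here is a minimal Python sketch of "similar users also downloaded". It is my own toy version, not the report's design: it treats two users as similar if their download histories overlap, and scores each unseen dataset by the size of that overlap. The function name, the user names and the download histories are all made up.

```python
from collections import Counter


def also_downloaded(downloads: dict, user: str, top_n: int = 3) -> list:
    """Suggest datasets downloaded by users whose history overlaps with `user`.

    `downloads` maps a user name to the set of dataset names they fetched.
    """
    mine = downloads.get(user, set())
    scores = Counter()
    for other, theirs in downloads.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # crude similarity: shared downloads
        if overlap == 0:
            continue
        for dataset in theirs - mine:  # only suggest what the user lacks
            scores[dataset] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [dataset for dataset, _ in ranked[:top_n]]


# Invented download histories, for illustration only.
history = {
    "alice": {"transport", "energy"},
    "bob": {"transport", "energy", "health"},
    "carol": {"transport", "crime"},
    "dave": {"education"},
}
suggestions = also_downloaded(history, "alice")
print(suggestions)  # ['health', 'crime']
```

Here bob shares two downloads with alice and carol only one, so bob's extra dataset is ranked first, while dave, who has nothing in common with her, contributes nothing. A real system would presumably weigh searches and record views as well, as the report suggests.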

Finally, the report is of great interest for the detailed information that it gives about the site. It will come as no surprise to readers of this blog that the entire infrastructure is built using open source software – Drupal for CMS, Apache Solr for search, and MySQL and PostgreSQL for the databases. The critical importance of open source in the context of open data initiatives is made clear by the following:

Total pilot project resource 30 person months – or 5 full-time equivalents over 6 months – equates to around 400 K€ of staff costs (full economic cost) and 100 K€ for other costs including system procurement until launch.

A modest ongoing cost of operation and support would be around 200 K€ per year – over five years equating to 1M€ and a total project cost of around 1.5M€.

The costs are only possible if the system is procured with the following characteristics: (i) based as open source, (ii) presented as a perpetual Beta (system under development) and (iii) adopt an open licence.
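The arithmetic behind those figures is easy to check. The short Python sketch below uses only the numbers quoted above, in thousands of euros; the variable names are mine, and the cost per person-month is my own derived figure rather than one the report states.

```python
# All money figures are in thousands of euros (K EUR), as quoted in the report.
PILOT_FTE = 5
PILOT_MONTHS = 6
PILOT_STAFF_COST = 400
PILOT_OTHER_COST = 100
ANNUAL_RUNNING_COST = 200
RUNNING_YEARS = 5

person_months = PILOT_FTE * PILOT_MONTHS                 # 30 person-months
pilot_total = PILOT_STAFF_COST + PILOT_OTHER_COST        # 500 K EUR
running_total = ANNUAL_RUNNING_COST * RUNNING_YEARS      # 1,000 K EUR = 1M EUR
project_total = pilot_total + running_total              # 1,500 K EUR = 1.5M EUR

# Derived, not stated in the report: staff cost per person-month.
cost_per_person_month = PILOT_STAFF_COST / person_months

print(f"Pilot: {pilot_total} K EUR over {person_months} person-months")
print(f"Running costs over {RUNNING_YEARS} years: {running_total} K EUR")
print(f"Total project: {project_total} K EUR")
print(f"Staff cost per person-month: about {cost_per_person_month:.1f} K EUR")
```

So the pilot accounts for a third of the five-year total, and the remaining two thirds is simply keeping the thing running.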

That's a useful reminder that open source is a natural fit for these new open data projects, not just in terms of licensing and philosophy, but also because it allows pilot systems to be set up very quickly and cheaply.

The only thing that concerns me slightly about this report is that it is further evidence of the resurgence of the "portal" concept. Those with good memories will recall how portals were all the rage at the end of the 1990s – just before the dotcom crash. Let's hope the current enthusiasm for data portals does not portend a similarly rough landing in the near future.

