Open Data: Fantastic, But Not Enough


In an unusual move for such a significant news item, the UK government announced over the weekend that they were ordering all government departments to embark on a voyage of transparency.

There were some very good ideas in the announcement, including a mandate to publish details of all ICT procurements. And there is no doubt that a mandate for open data is a fantastic move.

The letter from the Prime Minister was pretty clear:

'Given the importance of this agenda, the Deputy Prime Minister and I would be grateful if departments would take immediate action to meet this timetable for data transparency, and to ensure that any data published is made available in an open format so that it can be re-used by third parties. From July 2010, government departments and agencies should ensure that any information published includes the underlying data in an open standardised format.'

The reference to "an open standardised format" is very positive, but does have a slightly worrying overtone. It implies that the architect of this move envisages the publication of documents in something like OpenDocument Format (ODF). That's far better than the machine-hostile faxes, PDFs and proprietary documents that are common in many "transparency" situations, but it still means the data is aimed at human readers rather than computer programs.

We deserve better than that. We need applications that allow us to view and manipulate the data. To build an application, you need both the data and the syntax by which it is represented.

Why do we need applications? Document-level transparency is great - it powered the Daily Telegraph's investigation into Parliamentary expenses, after all. But as danah boyd pointed out recently, it means the data is always framed the way its originator intended, possibly concealing in plain sight truths that need highlighting.

With just documents, uncovering trends and correlations takes a lot of boring, repetitive work to build a data set, when what you actually want is analysis. Take a look at what you can achieve with DataMasher in the USA, for example. If you had to crawl through a load of documents to gather the data first, you would never be able to build such a useful tool.
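To make the contrast concrete, here is a minimal Python sketch of how little work the analysis takes once the data arrives as machine-readable rows rather than documents. The column names and every figure in it are invented placeholders for illustration, not real government data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV as a department might publish it.
# Every name and value here is a placeholder, not real data.
SAMPLE_CSV = """department,year,spend_gbp
Dept A,2009,100
Dept A,2010,150
Dept B,2009,200
Dept B,2010,180
"""


def total_by_department(csv_text):
    """Sum the spend_gbp column for each department."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["department"]] += int(row["spend_gbp"])
    return dict(totals)


def year_on_year_change(csv_text, department, start, end):
    """Return the change in spend for one department between two years."""
    by_year = {
        int(row["year"]): int(row["spend_gbp"])
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["department"] == department
    }
    return by_year[end] - by_year[start]


if __name__ == "__main__":
    print(total_by_department(SAMPLE_CSV))
    print(year_on_year_change(SAMPLE_CSV, "Dept A", 2009, 2010))
```

With documents, each of those rows would first have to be found and retyped by hand; with structured data, the trend is one function call away.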

There is hope. According to the Minister for the Cabinet Office Francis Maude, the whole activity is subject to the guidance of "a new Transparency Board, which will include experts, including perhaps the Government’s greatest critic when it comes to transparency, Tom Steinberg." Tom is the founder of mySociety, a non-profit that has done tremendous work opening up government data already, mostly by the hard scrabble of screen-scraping and crowdsourcing.

To get a taste of their work, try TheyWorkForYou, which gathers all the data you need to track your MP's work and then lets you subscribe to it via e-mail and RSS, and WriteToThem which makes writing to your Councillors, MP and MEPs trivially easy. I'm sure Tom will be a firm and loud advocate of machine-readable data.

That may not be enough, though. Looking through the existing catalogue of government data created by the previous government, there are plenty of data sets available. But they are in many cases only available for download, rather than through an API. That means either the applications that analyse the data will need to embed the data, or the end-user will have to download it. Neither is good.

For the "government as a platform" vision articulated by Tim O'Reilly to come true, we need the machine-readable data to be made available, live, through an API.
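As a rough sketch of the difference, this is roughly what an application would do against a live API instead of shipping a downloaded copy of the data. The endpoint URL, the query parameters and the JSON response shape are all assumptions made up for this example; no such government service is implied.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://data.example.org/api/spending"


def build_query_url(department, year):
    """Build the URL asking for one department's records for one year."""
    return BASE_URL + "?" + urlencode({"department": department, "year": year})


def fetch_records(department, year, opener=urlopen):
    """Fetch current records at request time and decode the JSON body.

    `opener` defaults to urllib's urlopen; it is a parameter so that the
    function can be exercised without network access.
    """
    with opener(build_query_url(department, year)) as response:
        return json.load(response)
```

Because the application asks for the data each time it runs, it never embeds a stale copy and the end-user never has to download anything.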

And even that may not be enough. The data and the syntax of its representation are only useful if you understand the semantics. How can that meaning be best represented? One way is to produce huge volumes of documentation, but a requirement to do that is costly and deadening.

The best way is to follow the example of the US Securities and Exchange Commission (SEC). Back in April, they made an amazingly insightful proposal that certain data disclosures must be accompanied by code (in Python) that interprets them. That way there could be no doubt how to interpret the data, and others would have sample code from which to derive their own analysis.
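To illustrate the general idea (this is not the SEC's actual code or format), here is a small Python sketch of a data release that carries its own interpretation. The field names, unit codes and status codes are all invented for the example; in a real release the publisher would define them.

```python
from dataclasses import dataclass

# Hypothetical raw record as published: terse field names and coded values.
RAW_RECORD = {"amt": "1250", "unit": "K", "st": "2"}

# Invented meanings for this sketch. Publishing tables like these as code,
# alongside the data, is what removes the ambiguity.
UNIT_MULTIPLIER = {"U": 1, "K": 1_000, "M": 1_000_000}
STATUS_MEANING = {"1": "proposed", "2": "approved", "3": "paid"}


@dataclass
class Payment:
    amount_gbp: int
    status: str


def interpret(raw):
    """Turn a raw published record into unambiguous values."""
    amount = int(raw["amt"]) * UNIT_MULTIPLIER[raw["unit"]]
    return Payment(amount_gbp=amount, status=STATUS_MEANING[raw["st"]])


if __name__ == "__main__":
    print(interpret(RAW_RECORD))
```

Nobody has to guess whether "1250" means pounds or thousands of pounds: the accompanying code says so, and anyone can run it.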

So my call to Francis Maude and his new Public Sector Transparency Board is for them to embrace the idea of mandating not just transparency but open data, and not just open data but open data accompanied by open source code to manipulate it.

That's the fastest path to letting sunlight into the dark corridors of power.
