Reading Shakespeare: the Next Act of Open Data

As readers of this blog will have noticed, much of the most innovative work in the field of openness is taking place in open data. One of the largest stores of data is held by government, and the argument for opening it up where possible is strong: after all, we, the public, paid for this data, so it is only right that we, the public, should have access to it.

That makes the Shakespeare review of public sector information (PSI) particularly important, since it represents an attempt to pull together all the different strands of open data in government and to draw up a coherent, over-arching strategy. Here's the review's excellent summary of why we need to do this:

The next phase of economic, scientific and social development has data as its core – the digital trace left by human activity that can be readily gathered, stored, combined and processed into usable material. This data, to optimise its value to society, must be open, shareable and, where practical, it should be free. The richest source of data is government, which accounts for the largest proportion of organised human activity (think health, education, transport, taxation, welfare, etc). Therefore Britain must focus intellectual attention and material resources on the task of fulfilling the potential of PSI. The benefits will be many including: transparency, accountability, improved efficiency, increased data quality, creation of social value, increased participation, increased economic value, improved communication, open innovation, and data linkage. Just imagine this applied to health, an area in which we are making significant advances. There is a significant amount of work ahead. For instance, at the moment health data comes through a variety of unconnected channels and into many different silos. It is hard for researchers to gain access to its full value. Advances in technology not only now allow us to collect data at source in real time, but also enable more practical linkage and accessibility. Establishing ways to effectively link data should become a priority, with special attention being paid to how medical practitioners can both access data themselves, and also contribute the data they have collected.

Here are the review's main recommendations:

A. Recognise in all we do that PSI, and the raw data that creates it, was derived from citizens, by their own authority, was paid for by them, and is therefore owned by them. It is not owned by employees of the government. All questions of what to do with it should be dealt with by the principle of getting the greatest value back to citizens, with input not just from experts but also citizens and markets. This should be obvious, but the fact that it needs to be constantly reaffirmed is illustrated by the way that even today, access to academic research that has been paid for by the public is deliberately denied to the public, and to many researchers, by commercial publishers, aided by university lethargy, and government reluctance to apply penalties; thereby obstructing scientific progress.

It's great to see that the first statement about PSI is that it belongs to the public, and that the basic principle is giving back value to citizens. It's also good to see the reference to open access, something that has been discussed here on Open Enterprise many times: the case for making the results of research paid for by the public freely available is inarguable.

B. Have a clear, visible, auditable plan for publishing data as quickly as possible, defined both by bottom-up market demand and by top-down strategic thinking, overcoming institutional and technical obstacles with a twin-track process which combines speed to market with improvement of quality: 1) an 'early even if imperfect' track that is very broad and very aggressively driven, and 2) a 'National Core Reference Data' high-quality track which begins immediately but narrowly; and then moving things from Track 1 to Track 2 as quickly as we can do reliably and to a high standard. 'Quickly' should be set out by government through publicly committed target dates.

I really like that framing: core reference data that is so important that it must be released as high-quality material, but everything else on a "good enough" basis, just to get the stuff out there. That deals with the frequently-raised objection that data isn't in a "fit state" to be released: if it's really important, then it must be made in a fit state; if it's not, then release it anyway.

C. Drive the implementation of the plan through a single channel more clearly-defined than the current multiplicity of boards, committees and organisations that are distributed both within and beyond departments and wider public sector bodies. It should be highly visible and accessible to influence from the data-community through open feedback mechanisms. 'Implementation' includes not only publishing but also processes to ensure that government transparently uses its own structured data to improve policy development and to measure progress.

Open feedback is crucial here: setting up a single channel will only work if it is responsive to people's needs, and that means listening to comments.

D. Invest in building capability for this new infrastructure. It is not enough to gather and publish data; it must be made useful. We lack data-scientists both within and outside of government, and not enough is being done in our education system at school and undergraduate level to foster statistical competence; we will feel these gaps more and more as the potential grows. Government is already committing resources to this; we should consider increasing this further, as the economic and social benefits quickly and demonstrably outstrip costs. Our research councils should seek to play a more strategic role, targeting investment on basic data-science and on inter-disciplinary academic/business projects and partnerships.

This is a really important idea: we need to start training a new generation of open data engineers (and maybe find a better word to describe them – any suggestions?) Data literacy will become an important skill for the future, and the sooner we start nurturing it, the better.

E. Ensure public trust in the confidentiality of individual case data without slowing the pace of maximising its economic and social value. Privacy is of the utmost importance, and so is citizen benefit. People must be able to feel confident about two things simultaneously: that the data they have supplied or that has been collected about them is made as useful as possible to themselves and the community; and that it will not be misused to their detriment. We lay out ways in which we think we can get as close as possible to this ideal.

This is, of course, absolutely right, but there is a slight irony here. While the Shakespeare review correctly emphasises the importance of respecting the privacy of citizens when their data is being used, the European Parliament looks likely to sell us down the river as far as data protection is concerned, largely because of unprecedented lobbying from US companies.

The most contentious issue for PSI is, of course, the idea that data currently provided by the trading funds – Companies House, Land Registry, the Met Office and Ordnance Survey – should be made available free. The Shakespeare review is certainly in favour of doing so:

The overarching aim of the Trading Funds should be to deliver maximum economic value from public data assets they provide and support, by working to open up the markets their data serves. This means they should work towards opening up all raw data components, under the Open Government Licence (OGL) for use and re-use.

That licence is pretty liberal – basically a kind of Creative Commons attribution licence, with a few minor differences. But making the data from the trading funds freely available would mean forgoing income; the review quantifies how much:

Deloitte were able to estimate the cost on Exchequer revenue of continuing to collect and disseminate Trading Funds' PSI in its current form, without charging for it, is in the order of £395 million on an annual basis. As government would no longer need to purchase the PSI itself, the direct loss to the Exchequer on an annual basis is in the order of £143 million. This figure may be lower still if there are efficiency savings to be made if fewer dedicated sales and marketing resources are required by Trading Funds. It seems a straightforward decision to invest £143m to make Trading Fund data widely available is a relatively small price to pay to leverage wider economic benefits far exceeding this by orders of magnitude.

As that makes clear, at the moment the UK government is paying the trading funds a couple of hundred million pounds for information that effectively it generates itself, so the actual lost revenue would be far less than some claim. Indeed, the figure of £143 million is small when you compare it to the £15 billion cost of a replacement for Trident, say – something that one hopes will never be used. But the knock-on benefits of liberating the trading funds' data are likely to be huge, judging by how much economic activity has been generated in the US, where such data is freely available.
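For anyone who wants to check the sums, here is a rough back-of-the-envelope sketch in Python. The £395 million and £143 million figures come from the review, and the £15 billion Trident figure is the one quoted above. Treating the gap between the two review figures as the amount government currently pays the trading funds is my own reading of the passage, not a number the review states, and the function names are purely illustrative.

```python
# Figures quoted from the Shakespeare review (Deloitte estimates), in millions of pounds.
GROSS_REVENUE_FORGONE_M = 395  # annual cost of providing trading fund PSI without charging
NET_EXCHEQUER_LOSS_M = 143     # annual direct loss once government no longer buys the PSI

# Figure quoted in this post, in millions of pounds.
TRIDENT_REPLACEMENT_M = 15_000


def implied_government_spend(gross_m: float, net_m: float) -> float:
    """Assumption: the gap between gross and net is what government pays the trading funds."""
    return gross_m - net_m


def share_of(part_m: float, whole_m: float) -> float:
    """Return part as a percentage of whole."""
    return 100 * part_m / whole_m


if __name__ == "__main__":
    spend = implied_government_spend(GROSS_REVENUE_FORGONE_M, NET_EXCHEQUER_LOSS_M)
    print(f"Implied government purchases: about £{spend:.0f} million a year")

    share = share_of(NET_EXCHEQUER_LOSS_M, TRIDENT_REPLACEMENT_M)
    print(f"Net loss as a share of the Trident figure: {share:.2f}%")
```

On those assumptions, the implied government spend comes out at roughly £252 million a year, which fits "a couple of hundred million pounds", and the net annual loss is a little under 1% of the Trident figure.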

All in all, then, the Shakespeare review is a valuable contribution to the open data debate, even if I wish it had been more self-confident in pushing for the zero-cost release of everything – including all trading fund data. Certainly, it seems that the open data revolution proceeds apace, and that's good news for open source and openness in general.

Follow me @glynmoody on Twitter, and on Google+