Open Source Genomics

There’s a revolution underway. It’s digital, but not in the computing sector. I’m referring to the world of genomics, which deals with the data that resides inside all living things: DNA. As most people know, DNA uses four chemical compounds – adenine, cytosine, guanine and thymine – to encode various structures, most notably proteins, which are represented by stretches of DNA called genes.

Those four chemical “letters” – A, C, G, T – form a quaternary information system rather than a binary one, although it’s trivial to convert between the two, since each letter can be written as two bits. That makes genomics an inherently digital domain, and therefore one that is ideally suited to computers for storage and analysis.
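To make that conversion concrete, here is a minimal Python sketch that packs DNA letters into bytes at two bits per letter and unpacks them again. The particular mapping (A=00, C=01, G=10, T=11) and the function names pack and unpack are my own illustrative choices, not a standard that sequencers or any particular tool are known to use.

```python
# Assumed mapping: which letter gets which 2-bit code is an arbitrary choice.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence: str) -> bytes:
    """Pack a DNA string into bytes, four letters per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking always reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` letters from the packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    # The length is passed in because padding bits are indistinguishable from "A".
    return "".join(bases[:length])


if __name__ == "__main__":
    dna = "GATTACA"
    packed = pack(dna)
    print(packed.hex(), unpack(packed, len(dna)))
```

The original length has to be stored alongside the packed bytes, because the padding in the last byte would otherwise decode as extra letters. Real sequence files also carry quality scores and ambiguous bases, which this sketch ignores.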

That’s been true for a while, but the costs of elucidating the DNA of an organism – “sequencing” it – have meant that it has only taken place in research laboratories. But costs for sequencing are dropping so rapidly – much faster than Moore’s Law – that it will soon be possible to sequence anybody’s complete DNA for a few hundred pounds, then tens of pounds, and finally for a vanishingly small amount. Our DNA contains all kinds of hints about our genetic make-up and our predisposition to certain diseases. Once sequencing costs fall to this level, there will be a huge move to make our genome the basic barcode of our lives, since it is not only incredibly informative, it is unique – even identical twins aren’t identical at the genetic level.

However, each human genome has around three billion chemical letters, and the raw output of sequencers runs to many tens of gigabytes for each genome. This means that hospitals, say, will need to be able to manage petabytes and more of genomic data, as quickly and as cheaply as possible. This makes the new world of genomic medicine a natural fit for open source, which scales well, and is much more economical than the alternatives. One new company hoping to exploit the natural strengths of open source in the field of medical genomics is Curoverse. Here’s how it introduces itself and its approach:

Next-generation sequencing is driving an explosion in big data that poses unique challenges for bioinformaticians, computational biologists, and the IT teams that support them. At Curoverse, we’re addressing those infrastructure challenges with a platform that makes storing, organizing, and processing these data faster, easier, and more affordable. (We’ll be launching in 2014.)

Curoverse is entirely built with open source software. At its core, Curoverse uses a free and open source system called Arvados that was first developed for the Harvard Personal Genome Project. Arvados is designed to address the unique data management, computation, and sharing requirements driven by genomic and biomedical data. You can learn more about the project and join the community at arvados.org.

During the last 10 years, web-scale businesses have produced an array of innovations in distributed computing, virtualization, file storage, and big-data processing. These innovations have only just begun to make their way into bioinformatics cores. At Curoverse, we’re dedicated to translating these technologies into products that address the unique requirements of the biomedical industry.

As that indicates, Curoverse is based on the existing open source project Arvados:

Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, developers to create genomic web applications and IT administrators to manage large-scale compute and storage genomic resources. The platform is designed to run on top of “cloud operating systems” such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and Xen+Debian/Ubuntu.

“Bioinformaticians” are just the people who use computers to analyse genomic data. The idea behind Arvados – and hence Curoverse – is to create a new and totally open platform on top of which bioinformaticians in hospitals and companies can develop and run genome-based applications.

Curoverse will adopt a classic open source business model: the software will be free, but Curoverse will offer support and service contracts. That might mean running a hospital’s genomic holdings on Curoverse’s premises, allowing access over the Internet; running on a public cloud like AWS; or taking charge of the hospital’s own systems on site.

Curoverse also hopes that extra layers of information will be added to systems running its code. For example, imaging data requires large storage capacities, as will sensor data once it is captured routinely and continuously. Curoverse says that the techniques applied to managing petabytes of genomic data can be applied to these other domains too.
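To give a feel for what “petabytes of genomic data” means in practice, here is a rough back-of-the-envelope calculation in Python. The three-billion-letter figure comes from earlier in this piece. The 50 gigabytes of raw output per genome is purely my assumption, standing in for “many tens of gigabytes”, and the function names are hypothetical, so treat the results as orders of magnitude, not forecasts.

```python
GENOME_LETTERS = 3_000_000_000   # "around three billion chemical letters"
BITS_PER_LETTER = 2              # four letters need two bits each
RAW_GB_PER_GENOME = 50           # ASSUMPTION: stand-in for "many tens of gigabytes"
BYTES_PER_GB = 10**9             # decimal gigabyte
BYTES_PER_PB = 10**15            # decimal petabyte


def packed_genome_bytes(letters: int = GENOME_LETTERS) -> int:
    """Bytes needed for the bare sequence at two bits per letter."""
    return letters * BITS_PER_LETTER // 8


def genomes_per_petabyte(raw_gb: int = RAW_GB_PER_GENOME) -> int:
    """How many genomes' worth of raw sequencer output fit in one petabyte."""
    return BYTES_PER_PB // (raw_gb * BYTES_PER_GB)


def petabytes_needed(genomes: int, raw_gb: int = RAW_GB_PER_GENOME) -> float:
    """Petabytes of raw output for a given number of sequenced genomes."""
    return genomes * raw_gb * BYTES_PER_GB / BYTES_PER_PB


if __name__ == "__main__":
    print(packed_genome_bytes() / 10**6, "MB for the bare sequence")
    print(genomes_per_petabyte(), "genomes of raw output per petabyte")
    print(petabytes_needed(100_000), "PB for 100,000 patients")
```

Under these assumptions, the bare sequence is tiny compared with the raw sequencer output, and a hospital system sequencing a hundred thousand patients would already need several petabytes for the raw data alone.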

Given the incredible advances in sequencing, and the corresponding fall in the cost of sequencing complete human genomes, it seems likely that digital DNA will soon form the foundation of future health systems (with plenty of tricky privacy and security issues that need to be resolved as a result). That means there will be a big market for the kind of system that Curoverse is offering.

The fact that it is open should give it a huge advantage, since its customers won’t be locked into one company’s proprietary offerings, but can choose from among alternative suppliers. The open nature of the platform should also encourage other companies and developers to build on top of it, driving innovation far faster than would be possible with proprietary systems.

Also worth bearing in mind is that Curoverse’s open platform will offer plenty of meaningful challenges for a new generation of coders. That’s critically important if open source is to continue to attract the best minds as it has done in the past. The next Richard Stallman or Linus Torvalds may well be a bioinformatician hacking on free software that analyses the digital code that is running within us all.

Follow me @glynmoody on Twitter or identi.ca, and glynmoody on Google