Anyone who read about the Panama Papers leak in April 2016 will have already seen the benefits of using a graph database. Graph database technology played a crucial role in enabling journalists to wade through immense datasets, quickly making connections between individuals, institutions and tax havens.
Graph databases are also increasingly being used by large organisations in a variety of sectors. Many of the world's big banks, for example, are starting to recognise that the connections and patterns within their highly complex data are as important as the data itself. Graph databases promise a way of discovering and leveraging these connections.
“Graph has always made sense, it is theoretically superior, but relational [database technology] has been so good it needed to catch up and now it is at the tipping point,” says Jarred McGinnis, managing consultant at semantic web company Ontotext.
“I think it's no longer esoteric, it is mature enough. There are enough examples of it working in the enterprise environment.”
Graph database definition
A graph database is a flavour of NoSQL database built upon graph theory, a branch of mathematics that models data points, known as nodes or vertices, and the connections between them, known as edges or relationships, as a 'graph'.
Where a traditional relational database stores data in rows and columns, NoSQL databases use other models suited to large sets of unstructured or semi-structured data. A graph database goes a step further by storing the connections between data points alongside the data itself, essentially building up a network of data.
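To make the idea of nodes and connections concrete, here is a minimal sketch in Python. It is not how any particular graph database stores data; the `Graph` class, the node names and the relationship labels are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy in-memory graph: nodes with properties, plus labelled directed edges."""
    nodes: dict = field(default_factory=dict)  # node id -> dict of properties
    edges: list = field(default_factory=list)  # (source, relationship, target)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, relationship, target):
        self.edges.append((source, relationship, target))

    def neighbours(self, node_id, relationship=None):
        """Return targets reachable from node_id in one hop, optionally by label."""
        return [
            target
            for source, label, target in self.edges
            if source == node_id and (relationship is None or label == relationship)
        ]


# Hypothetical data: a person, a company and a jurisdiction.
g = Graph()
g.add_node("alice", kind="person")
g.add_node("acme_ltd", kind="company")
g.add_node("panama", kind="jurisdiction")
g.add_edge("alice", "DIRECTOR_OF", "acme_ltd")
g.add_edge("acme_ltd", "REGISTERED_IN", "panama")

print(g.neighbours("alice"))                      # one hop out from alice
print(g.neighbours("acme_ltd", "REGISTERED_IN"))  # one hop, filtered by label
```

The point of the sketch is that the relationship is stored as data in its own right, so following a connection is a lookup rather than a join across tables.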
Billy Bosworth, CEO of open-source software vendor DataStax, summed up a graph database as: “Essentially a way of finding connections and patterns within data, rather than just querying a data set.”
He adds that a graph database gives businesses a “much simpler and faster understanding of the nature of deeply complex relationships in data”.
Emil Eifrem, co-founder of graph database vendor Neo Technology, started building graph databases fifteen years ago when he found that the relational database was working against him as he built content management systems for enterprise customers.
"The key reason it was working against us is because the information we were working with didn’t really fit," he says.
"The information was big and messy and interconnected and ever changing and evolving. It was a round peg and we tried to squeeze it into this square static hole." The result was the Neo4j graph database.
The history of graph databases
As with many technologies at the cutting edge of enterprise adoption, graph databases first gained traction in the hyperscale data centres of the tech giants.
Due to the nature of the internet and its web of interconnected data, the likes of Facebook and Google have been building their businesses upon some form of graph database for decades.
Mark Zuckerberg has referred to Facebook as your 'social graph' since he started building the social network upon the Open Graph Protocol a decade ago, repurposing the mathematical term for a specific computer science application.
So why is it just starting to enter the public lexicon now?
Eifrem from Neo says that there is greater awareness of how graph can address some of the limitations of object databases created in the 1970s.
“Object databases were a massive failure,” says Eifrem. “One of the key problems with object was they took a programming paradigm and turned that into a data model.
“We have learned that the half-life of data is much longer than that of software, so you need to make sure you decouple those so that, as your software gets displaced, your data lives on. That’s a key mistake that the object model did that we have avoided.”
Which industries stand to gain the most? DataStax CEO Bosworth thinks that any digital business will be able to benefit from graph databases: “The clustering happens around use cases,” he says. These include specific uses such as fraud detection, as well as personalising services or gaining a clearer view of customer behaviour.
“No matter the industry: finance needs it, healthcare, oil and gas, manufacturing, everybody is looking for the same thing,” he says.
One Neo4j customer is Royal Bank of Scotland (RBS), where Stelios Gerogiannakis is a senior software engineer.
RBS currently uses Neo4j in a dependency management system that tracks, in minute detail, the implications of changes within the RBS trading system.
“We wanted it to be able to record all of the dependencies in our artefact universe, and to have professional support,” says Gerogiannakis.
“We also wanted it to be [high] performance and easily embeddable for microservices.”
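As a rough illustration of what tracking change implications across a dependency graph can look like, the sketch below walks the graph to find everything affected by a change. It is not RBS's system; the artefact names and the `impacted_by` function are hypothetical.

```python
from collections import deque

# Hypothetical artefacts: each key depends on the artefacts in its list.
depends_on = {
    "trading-ui": ["pricing-service"],
    "pricing-service": ["market-data-lib"],
    "risk-report": ["market-data-lib"],
    "market-data-lib": [],
}


def impacted_by(changed, depends_on):
    """Return every artefact that directly or indirectly depends on `changed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for artefact, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(artefact)

    # Breadth-first walk over the inverted edges.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


impacted = impacted_by("market-data-lib", depends_on)
print(sorted(impacted))
```

In a relational schema the same question needs a recursive query or repeated self-joins; in a graph model it is a plain traversal.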
The bank is looking at the capabilities of graph for a number of things: “There is no one tool for all, so in your arsenal you need all the tools you can get,” says Gerogiannakis. “I know that it is being assessed right now for credit risk, because there is value in the connections.”
Similarly, when asked if ING is looking at graph for fraud protection, Gary Stewart, an architect at the Dutch bank, told ComputerworldUK that he is very interested in the technology, but added: “We want to be leading edge but don’t want to bleed. We are always a version or two behind [because of the regulation ecosystem.]”
DataStax's Bosworth also sees personalisation as a growing use case. He explains: “I think personalisation will be the flip side of anti-fraud. The same aspect that allows me to understand those connections is going to allow me to understand how to make your experience better.”
Eifrem from Neo says the use cases he sees most often are real-time recommendations, identity and access management, impact analysis for telecommunications, and fraud prevention and detection.
When it comes to real time recommendations, Eifrem explains: “It turns out all big retailers have stored your purchase history for decades. If you take a graph approach that becomes more than just a list of items. It becomes: Emil has bought these three items, James has also bought these three items, so what is the fourth thing that James has bought that Emil hasn’t, so that we can recommend it to him.”
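Eifrem's example can be sketched in a few lines of Python. This is a toy version of the idea rather than how a retailer or Neo4j implements it; the purchase data, the `recommend` function and the `min_shared` threshold are assumptions made for illustration.

```python
# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "Emil": {"tent", "sleeping bag", "camping stove"},
    "James": {"tent", "sleeping bag", "camping stove", "head torch"},
    "Ana": {"novel"},
}


def recommend(customer, purchases, min_shared=3):
    """Suggest items bought by customers who share at least `min_shared` purchases."""
    mine = purchases[customer]
    scores = {}
    for other, theirs in purchases.items():
        if other == customer:
            continue
        shared = mine & theirs
        if len(shared) >= min_shared:
            # Items the similar customer has that this customer lacks.
            for item in theirs - mine:
                scores[item] = scores.get(item, 0) + len(shared)
    # Highest score first, ties broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend("Emil", purchases))
```

Here James shares all three of Emil's purchases, so the one item James has that Emil lacks becomes the recommendation. Ana shares nothing and contributes nothing.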
Spotify’s popular Discover Weekly feature also graphs connections between users with similar music taste and fills in the gaps in the form of a playlist.
Case study: The Panama Papers
Another Neo4j customer, the International Consortium of Investigative Journalists (ICIJ), has been using the graph database, in tandem with software from visualisation firm Linkurious, to help report on leaked data about offshore tax havens in its Swiss Leaks and Panama Papers investigations.
Mar Cabra, head of the data and research unit at the ICIJ, says the Panama Papers was the biggest leak in journalism history, with 11.5 million documents and 2.6TB of data.
“The way our journalists would deal with this before was to print the paper and draw graphs on paper. We missed connections. So when we first went into business [with Neo4j] for the Swiss Leaks the first reaction of my reporters was ‘I didn’t know this connection’.
“This was very difficult to find in documents because our brains are not wired like that visually and it requires a lot of work if you have to do this by hand. In the Panama Papers that was even more interesting because it allowed us to see patterns we couldn’t find before.”
Graph database vendors
According to Forrester, the graph database vendor market is small but ripe for growth. In his market overview for graph databases, analyst Noel Yuhanna says: “Forrester expects the graph database market to have significant success in the coming years as organisations develop a digital insights architecture that relies on connected data.”
He identifies eight current vendors: the aforementioned DataStax, with its enterprise graph product (DSE), and Neo4j, as well as Oracle (the only big four vendor so far), Complexible, FlockDB, Franz, Objectivity and Orient Technologies.
Amazon Web Services (AWS) followed suit in late 2017 by releasing its own graph database-as-a-service under the name Neptune.
Neptune is a graph database engine optimised for storing billions of relationships and querying the graph with milliseconds of latency. Underpinned by S3 and replicated across at least three availability zones, Neptune should prove a popular graph DB option.
In typical AWS style, Neptune can be used without having to worry about maintenance, patching, backups and restores. Neptune supports fast failover, point-in-time recovery, multi-availability-zone deployments and up to 15 read replicas, which can scale query throughput to hundreds of thousands of queries per second.
Graph database skills
Being a new technology, graph comes with a new set of skill requirements. Jarred McGinnis lays out the challenge of graph data models for developers: “It is based on a JSON, a Java based graph language for programming. This is the gateway drug to work in semantic statements. It takes about the same effort of learning any database.
“It’s not easy but you can get very esoteric with this technology if you want to. The key will be making sure it’s as simple to start up as a relational database from Oracle. We’ve already convinced the hardcore techies, now it’s the rest of the world.”
Bosworth says that graph is such a step change from relational databases that it requires a new query language of its own: Gremlin.
“When you look at a graph query it looks nothing like SQL,” says Bosworth. “Turns out that when you look at a graph problem the Gremlin language is just much more intuitive because it has things like ‘out’ and ‘contains’ and ‘in’. You can write the language the way that you view the graph.”
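The style Bosworth describes, where a query reads as a sequence of steps across the graph, can be imitated in plain Python. The sketch below is not Gremlin and does not use any real graph library; the `Traversal` class, its `out` and `in_` methods and the sample data are hypothetical stand-ins that only mimic the shape of such a query.

```python
class Traversal:
    """Toy step-by-step traversal over (source, label, target) edges."""

    def __init__(self, edges, current):
        self.edges = edges
        self.current = set(current)  # nodes the traversal is currently "at"

    def out(self, label):
        """Follow outgoing edges with this label from the current nodes."""
        nxt = {t for s, l, t in self.edges if l == label and s in self.current}
        return Traversal(self.edges, nxt)

    def in_(self, label):
        """Follow incoming edges with this label back to their sources.

        The trailing underscore is needed because `in` is reserved in Python.
        """
        nxt = {s for s, l, t in self.edges if l == label and t in self.current}
        return Traversal(self.edges, nxt)

    def to_list(self):
        return sorted(self.current)


# Hypothetical purchase graph.
edges = [
    ("emil", "BOUGHT", "tent"),
    ("james", "BOUGHT", "tent"),
    ("james", "BOUGHT", "head torch"),
]

# "Start at emil, go out to what he bought, come back in to everyone who bought it."
buyers = Traversal(edges, ["emil"]).out("BOUGHT").in_("BOUGHT").to_list()
print(buyers)
```

Each chained call mirrors one hop across the graph, which is the sense in which the query follows the way the graph is drawn.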
Bosworth admits that the nature of the open source ecosystem means that there will be multiple languages out there but adds: “Where we do see the benefit of standardisation is that we don’t see the need for ten graph languages. We either have the graph database for you or we don’t.
“What I want is for the graph skill set to be in the market and that all boats rise. So if we can get part of a project like that to help standardise a graph language then that’s good for everybody.”
Much as relational databases took ten years to mature into a popular, commercially viable tool for businesses, graph appears to be rounding into maturity, and the big vendors and enterprise customers are starting to pay attention.
The use cases may be far narrower than those of a traditional relational database, but they apply to big data problems that are plaguing digital companies and the money men in financial services.
No one is going to build their entire big data strategy around a graph database, but it is another tool at the digital enterprise's disposal. Expect to hear a lot more about graph in the years to come.