Lloyds Banking Group has been working to bring its data pipeline and analytics up to real-time speeds, paving the way for customers to see transactions straight away and for better fraud detection internally.
With regulatory changes around open banking on the horizon, Lloyds is readying its data architecture to expose transactional data to customers as close to real time as possible.
Speaking at the DataWorks summit in Munich this week, Andrew McCall, chief engineer for big data at Lloyds Banking Group, said: "Certainly in Europe my personal view is that this is the new way of looking at customer data. It is not our data as a company, it belongs to the user and we are responsible for it."
McCall believes that getting to near real-time data processing within the bank "starts to open up lots of possibilities in terms of machine learning and how we can better serve customers and give them better insight into their own finances."
It will also be helpful in tackling fraud. He explained: "When you know about it now, something can be done about it. If you know about something that happened yesterday it is not as effective as an anti-fraud mechanism."
McCall said that the aim is "sub-second latency across our entire big data estate from source system to the consumer." Even with the number of layers that data has to pass through, by using Apache Storm the bank is seeing latency below 20 milliseconds.
What this could look like for consumers is instant spending notifications, like those seen with some of the more technologically agile challenger banks and their mobile apps.
This has required a big architectural change for the bank, as McCall's team has had to integrate legacy systems with a variety of modern techniques like change data capture (CDC) and Kafka connectors to achieve the required speed of operations. McCall admits that the ideal architecture would be a "Lambda architecture with some microservices that write and read data before dumping it into your system of insight."
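To make the change data capture idea concrete, here is a minimal sketch of replaying row-level change events against a downstream copy of a table. It is an illustration only, not a description of how Lloyds has built it: the event shape (`op`, `key`, `row`) and the function names `apply_change` and `replay` are assumptions made for this example, and a plain Python list stands in for a Kafka topic.

```python
from typing import Any, Dict, List

# Hypothetical change-event shape, assumed for this sketch:
# {"op": "insert" | "update" | "delete", "key": str, "row": {column: value}}
ChangeEvent = Dict[str, Any]
Replica = Dict[str, Dict[str, Any]]


def apply_change(replica: Replica, event: ChangeEvent) -> None:
    """Apply a single row-level change event to an in-memory replica."""
    op = event["op"]
    key = event["key"]
    if op == "insert":
        replica[key] = dict(event["row"])
    elif op == "update":
        # Merge only the changed columns into the existing row.
        replica.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op!r}")


def replay(events: List[ChangeEvent]) -> Replica:
    """Replay events in order; a list stands in for a message queue."""
    replica: Replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


# Made-up events for illustration.
events = [
    {"op": "insert", "key": "acc-1", "row": {"balance": 100}},
    {"op": "update", "key": "acc-1", "row": {"balance": 75}},
    {"op": "insert", "key": "acc-2", "row": {"balance": 10}},
    {"op": "delete", "key": "acc-2", "row": {}},
]
state = replay(events)
```

Because only the changes travel downstream, the copy can be kept current without re-reading the source system in bulk, which is what makes the approach attractive alongside legacy systems.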
In reality, though, Lloyds has to deal with complex legacy systems, which don't tend to offer the low latency McCall is looking for.
"You tend to have many sources, many consumers, lots of state in that system," he said. "Write APIs, queues, some batch feeds, schedule jobs, read APIs, queues outbound and batch feeds outbound."
The tools the bank uses for its real-time pipeline are HBase as a data store and to enrich data as it comes through, Hive for loading data for insight and analytics, Storm as a real-time processing engine, Spark for consuming and loading data into the various systems, and Kafka as the pipeline across the estate. All of this still runs on-premises at the bank.
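The enrichment step described above can be pictured with a short sketch. It uses none of the real tools: a dictionary stands in for a key-value store such as HBase, a Python generator stands in for the stream processor, and the field names (`account`, `created_at`, `segment`, `latency_ms`) are invented for this example.

```python
import time
from typing import Dict, Iterable, Iterator

# Stand-in for a key-value lookup store; the contents are made up.
CUSTOMER_LOOKUP: Dict[str, dict] = {
    "acc-1": {"segment": "retail"},
    "acc-2": {"segment": "business"},
}


def enrich(events: Iterable[dict], lookup: Dict[str, dict]) -> Iterator[dict]:
    """Attach reference data to each event and record how long it took
    to get from creation to this stage."""
    for event in events:
        extra = lookup.get(event["account"], {})
        enriched = {**event, **extra}
        # created_at is assumed to come from time.monotonic() at the source.
        enriched["latency_ms"] = (time.monotonic() - event["created_at"]) * 1000.0
        yield enriched


incoming = [
    {"account": "acc-1", "amount": 12.50, "created_at": time.monotonic()},
    {"account": "acc-9", "amount": 3.00, "created_at": time.monotonic()},
]
results = list(enrich(incoming, CUSTOMER_LOOKUP))
```

Stamping each event at the source and measuring again at each stage is one simple way to track whether an end-to-end latency target is being met.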
Lloyds implemented a caching layer for extra security to protect API data. McCall explained: "For our API, part of the reason we would want to put Cassandra or something like it in place would be because we don't necessarily want to expose some of the core systems to the rest of the world, which could be potentially dangerous, so we want some kind of caching layer."
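The idea of a caching layer that shields core systems can be sketched as a small read-through cache. This is a generic illustration under stated assumptions, not Lloyds' design: the class name `ReadThroughCache`, the `load_from_core` callback and the 30-second time-to-live are all hypothetical, and an in-memory dictionary stands in for a store such as Cassandra.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serve reads from a cache and only fall back to the core system
    on a miss or once an entry has expired."""

    def __init__(
        self,
        load_from_core: Callable[[str], dict],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_from_core
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the core system is not touched
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical core-system call that counts how often it is reached.
core_calls = []


def fetch_from_core(account_id: str) -> dict:
    core_calls.append(account_id)
    return {"account": account_id, "balance": 100}


cache = ReadThroughCache(fetch_from_core, ttl_seconds=30.0)
first = cache.get("acc-1")
second = cache.get("acc-1")  # served from the cache
```

External API traffic then lands on the cache rather than on the core system, which is the kind of separation McCall describes.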