When challenger bank Monzo began building its platform, the team decided to get up and running with the container orchestration platform Kubernetes "the hard way". The result is that the team now has deep visibility into outages and other problems, and Miles Bryant, platform engineer at Monzo, shared some observations from the bank at the recent Open Infrastructure Day event in London.
Finance is, of course, a heavily regulated industry - and at the same time customer expectations are extremely exacting. If people can't access their money, they tend to get upset.
A microservices architecture lends itself well to the needs of a modern bank, said Bryant, both in reliability and in security. The company has taken its microservices to an "extreme" - with over 1,000 microservices "all doing very small, individual tasks".
"This is working great for us because we can scale each part of the system independently," Bryant said. "For example, if our banking ledger starts becoming a bottleneck, we can scale that bit without affecting the rest of the system. We have many teams working on completely separate parts of the codebase, deploying these things independently, and we can also deploy multiple instances of each service."
The majority of the microservices run in Kubernetes, along with Kafka and Elasticsearch for some queuing and data storage, with Prometheus monitoring everything - and the open source Envoy Proxy as the service mesh. Outside of Kubernetes, the main database runs on Cassandra.
With more than a thousand microservices running at any given time, the team naturally wants to automate as much monitoring and operations within Kubernetes as possible.
Monzo also has a small data centre presence, and this is used to connect to legacy payment systems from the likes of MasterCard and the Faster Payments scheme for bank transfers. This is extremely sensitive information, and Kubernetes helps with security here by allowing the team to ensure only the services that need to access these data centres actually do.
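Monzo hasn't published its actual policies, but this kind of restriction is typically expressed with a Kubernetes NetworkPolicy. A minimal sketch, with hypothetical namespace, label, and CIDR values, might allow egress towards the data centre only from pods explicitly labelled as payment gateways:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress-to-datacentre   # hypothetical name
  namespace: payments                   # hypothetical namespace
spec:
  # Only pods carrying this label may reach the legacy payment networks
  podSelector:
    matchLabels:
      app: card-gateway                 # hypothetical label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24           # hypothetical data-centre range
```

Pods without the `app: card-gateway` label would then have no route to that address range, keeping the sensitive link off-limits to the other thousand-plus services.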
"Kubernetes helps us by running our services in containers," Bryant said. "We base our service builds off a special kind of scratch Docker container that contains absolutely nothing - because most of our services are written in Go, we can just build a fat binary, and deploy it in a scratch container.
"There's no kind of base operating system stuff, so the only attack vector really is via our code within the container."
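The build Bryant describes is commonly done with a multi-stage Dockerfile; a minimal sketch (the paths and service name here are hypothetical) compiles a statically linked Go binary and copies only that into a `scratch` image:

```dockerfile
# Build stage: compile a statically linked Go binary
# (CGO_ENABLED=0 avoids any libc dependency)
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /service ./cmd/service   # hypothetical path

# Final stage: scratch contains absolutely nothing but the binary,
# so there is no base OS to attack
FROM scratch
COPY --from=build /service /service
ENTRYPOINT ["/service"]
```

The resulting image has no shell, no package manager, and no system libraries - only the compiled service itself.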
Although building the microservices architecture from the start might seem like a daunting task, the end results have been positive for the bank.
"We have over 1,000 services, we're able to deploy these regularly, on a daily basis, down to individual teams," he said. "I think it was a pretty bold decision to go for building a microservices architecture from the very beginning. It adds a lot of overhead, it takes time away from building our products, [but] I think it's really paid off - we haven't needed to majorly re-architect."
Observability is crucial. Understanding the full infrastructure from the start enables the company to gain a top-level view but also to zoom in on individual components that may be causing trouble (as was the case with a big outage in 2017).
When a payment is made, the card terminal sends a request to the MasterCard network, and that network will send a request back to Monzo to ask if the payment should be allowed.
"We have a bunch of services internally, 40 or 50 involved in this kind of critical path, which have to do a wide range of things - so very basic things like checking card details are correct, matching the card details to someone's account, checking that person's balance to make sure they've actually got enough money to pay for something, run various fraud checks - and inserting a transaction onto the banking ledger to record the movement of the money."
All of that needs to be monitored closely to ensure that the latency of the services doesn't creep up - because if it does, MasterCard will start declining transactions. The team needs to be able to "very quickly" work out which of the 50-or-so services could be the culprit.
And when things go wrong, they can go very wrong.
Henning Jacobs at retailer Zalando maintains a GitHub repository of Kubernetes 'failure stories', where Monzo has the dubious honour of being featured.
Full details of that outage in 2017 were the subject of a talk at last year's KubeCon, called Anatomy of a Production Kubernetes Outage.
In short, the company's entire platform was down for an hour. The culprit in this case was a mixture of failed updates and bugs - an "interaction between an etcd upgrade - etcd is the database that backs Kubernetes - a bug in Linkerd, which we were using for our service mesh at the time, and a bug in the Kubernetes etcd client."
This was ultimately resolved by taking down the service mesh.
However, he added that these problems are not inherent to Kubernetes itself, and that once you know it, it makes it "much easier to build up these complex systems".
"It makes the potential for complex and non-deterministic interactions between components more likely," he said. "I don't say this to scare anyone off of using Kubernetes, it is a great tool. But I think it's important to be aware that things can go wrong."
He added that the team decided to "do Kubernetes the hard way" as laid out in Kelsey Hightower's tutorial - that is, manually, with no scripts, and without the help of a managed service provider.
"If you're not familiar with Hightower's excellent tutorial, we set up our clusters ourselves, and we generate certificates and install the components and do all the configuration," he said, adding that it was probably the team's only option at the time and that if they were building Monzo today, they would have looked into automated tooling, which has "gotten much better recently".
"I think one of the really great benefits we've got from doing Kubernetes the hard way is that we have a pretty deep understanding of how our stuff works, and how Kubernetes works, and how to operate it," he added. "This has proved really valuable, especially when fixing or investigating outages."
One such problem was a latency spike when the team deployed certain services - it wasn't an urgent problem, but it was annoying enough that it needed fixing. The team was able to trace it to a bad interaction between Linkerd and the Kubernetes endpoints API and fix it.
Bryant also had high praise for the Kubernetes community, whether that is from the big tech companies or individual contributors. He added that the bank has "a lot of confidence in the quality of the Kubernetes releases," and that his advice to new adopters would be to experiment with the platform as a learning exercise. "At the very least, you'll hopefully get an appreciation for how much you're paying for GKE [Google Kubernetes Engine]," he said.