The Financial Times' content platform team has reduced its AWS server costs by 80 percent while moving to a much more stable tech infrastructure by adopting containers, but the move wasn't free of challenges.
The FT was a reasonably early adopter of Docker, choosing it as its container platform in mid-2015.
Speaking at KubeCon in Copenhagen this week, Sarah Wells, technical director for operations and reliability at the FT, said: “It [Docker] was pretty leading edge at the time, though we had to do lots of work to glue components together.”
There is a limited number of infrastructure changes you can implement at one time, Wells warned, suggesting that organisations should think of themselves as having a finite number of 'innovation tokens' that they have to use wisely.
For example, the team had to build their own container orchestration platform due to a lack of off-the-shelf options. “That definitely meant we were spending some innovation tokens,” Wells said.
The team then introduced container orchestration platform Kubernetes in 2017 to help manage its large existing containerised tech stack, consisting of 150 microservices. Adopting this architecture allowed them to move from 12 releases a year to 2,200.
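To put those release figures in perspective, here is a back-of-the-envelope sketch. Only the two release counts come from Wells; the function name and the choice to spread releases evenly over 52 weeks or 365 calendar days are illustrative assumptions. It works out to a roughly 180-fold increase, or a move from one release a month to about six a day.

```python
# Release counts as reported in the article; everything else is derived.
RELEASES_BEFORE = 12     # releases per year before the move
RELEASES_AFTER = 2_200   # releases per year after the move


def release_cadence(releases_per_year: int) -> dict[str, float]:
    """Spread a yearly release count evenly over months, weeks and days.

    Assumption for illustration: releases are evenly distributed over
    12 months, 52 weeks and 365 calendar days (not working days).
    """
    return {
        "per_month": releases_per_year / 12,
        "per_week": releases_per_year / 52,
        "per_day": releases_per_year / 365,
    }


speedup = RELEASES_AFTER / RELEASES_BEFORE
before = release_cadence(RELEASES_BEFORE)
after = release_cadence(RELEASES_AFTER)

print(f"Increase in release frequency: about {speedup:.0f}x")
print(f"Before: {before['per_month']:.1f} releases a month")
print(f"After:  {after['per_week']:.1f} a week, {after['per_day']:.1f} a day")
```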
Then there were the savings on top. Wells said that “moving the entire stack to just eight large VMs achieved an 80 percent cost reduction.”
However, “the migration [to Kubernetes] was not a small thing to attempt,” as Wells put it.
The team did it because, on balance, the benefits outweighed the risks. Their in-house platform posed a number of challenges around supportability and documentation.
Firstly, as it was built by the FT team, there was “nowhere else to go for help”. Secondly, “the container work was done by a bunch of people who pretty much all left in 2016. You're supporting stuff you don't understand, and you can't ask anyone,” she said.
This is where the team saw the advantages of adopting an off-the-shelf product – or as Wells dubbed it, 'boring technology'.
“Technology usually starts off as something for experts – custom builds. But then people build them up into products, and finally it becomes a commodity. A great example is electricity. We don't build power stations, we just plug in. Computing is the same. We've moved from data centre to private cloud to cloud providers,” she said.
Before picking a container orchestration platform, the team decided upon two metrics for success: the amount of time spent keeping the tech cluster healthy and the number of sarcastic comments on Slack.
“Boring tech isn't bad. It allows you to focus on the stuff that differentiates you from others. Building new innovative tech is fine, but by late 2016 tools were maturing and we thought we'd take advantage. Why maintain something yourself when it's available as a product, unless it's core to your business? We're not a cluster orchestration company, we're a news business,” she added.
The team opted for Kubernetes for two reasons. Firstly, they preferred it to the other platforms they had assessed, and secondly, it appeared to be an emerging standard the industry was (and has continued to be) converging on.
Also, because Kubernetes is open source, there is a huge community with plenty of opportunities for sharing documentation and learning from others.
The main difficulty associated with adopting Kubernetes was that the FT had to move from its in-house platform to Kubernetes while running both in parallel.
“It was a major challenge. It was a bit like trying to switch horses in the middle of a roaring river,” she said.
“At the start of migration we had 150 services live, and lots of other work going on. We had to migrate without affecting any of that,” she added.
There were some significant costs as a result: financial, in terms of temporarily inflated costs for AWS and for the logging tool, but also in terms of the migration taking up more of the team's time than expected.
It was difficult to keep both stacks in sync, to deploy changes across both, and to test everything twice, Wells said.
However, the migration went fairly smoothly, and the platform is now a lot more stable.
“We've had three production incidents since, they didn't affect anyone though, it was just some nodes going down. We've had just two out-of-hours calls, but in both cases Kubernetes recovered the system before the developers could even open their laptop,” she said.
“We now have a much more stable, cheaper system, and a much happier team working on functional things rather than just supporting the platform,” she added.