Why Capital One completely redesigned its Chef stack on AWS

Chef Conf 2018

The bank is running Chef at true enterprise scale and shared some lessons on how to run the automation software in the cloud


US bank Capital One is on its third iteration of the Chef automation platform, a redesign that has made its software development pipelines more resilient.

The bank, which is a big proponent of devops, runs Chef Automate at true enterprise scale, with more than 15,000 nodes across dev, QA and production environments, all hosted on AWS.

Speaking to a packed room at Chef Conf in Chicago this week, John Casanova, master software engineer at Capital One, explained that when he joined four years ago, “my first task was to get Chef out of three data centres and on to AWS. We did that in 2015 and that architecture had limitations, so we redesigned it over a year and finished that last year and released this year.”

Naturally, Capital One is highly regulated, so compliance is key to any software development done within the organisation.

For example, the bank runs a process it calls rehydration as part of its OS patches and security updates.

The bank uses a tool called Cloud Custodian “that will shut you down if you are out of compliance, so there is a bit of pressure,” Casanova said. “Rehydration is a big factor in how we approach building our stacks as we have to relaunch them every two months.”
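As a rough illustration of the rehydration rule Casanova describes, the sketch below flags instances that have been running longer than the relaunch window. It is not Cloud Custodian's API or Capital One's implementation: the function names, the instance names and the 60-day approximation of "every two months" are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "every two months" is approximated here as 60 days.
MAX_AGE = timedelta(days=60)


def needs_rehydration(launched_at, now, max_age=MAX_AGE):
    """Return True when an instance has been running longer than max_age."""
    return now - launched_at > max_age


def stale_instances(launch_times, now):
    """Return the sorted names of instances that are due for a relaunch."""
    return sorted(
        name
        for name, launched_at in launch_times.items()
        if needs_rehydration(launched_at, now)
    )


# Hypothetical fleet: instance names and launch dates are invented.
now = datetime(2018, 6, 1, tzinfo=timezone.utc)
fleet = {
    "chef-frontend-1": datetime(2018, 3, 15, tzinfo=timezone.utc),  # older than 60 days
    "chef-frontend-2": datetime(2018, 5, 20, tzinfo=timezone.utc),  # recently launched
}
print(stale_instances(fleet, now))
```

In practice a compliance tool would act on the flagged instances; here the list is only printed.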

The original Chef stack, which was actually five stacks (one for dev, two for QA and two for production), ran on AWS and had a long list of limitations, including random failovers and a lack of scalability caused by multiple backends accessing the same Amazon Elastic Block Store (EBS) volume.

The old architecture also prevented Capital One from reaching the gold resiliency tier or running the newer AWS instance types, which would save it money.

Since the redesign, the new Chef architecture “gives us no downtime in rehydration and allows us to move away from random failovers,” Casanova explained.

Now his team of three has only two stacks to maintain across three availability zones, and auto scaling has removed the earlier scalability limits. Casanova has also been able to make some cost savings by shifting to Amazon’s better, cheaper EC2 C5 instances.

“Now we have a regionally agnostic configuration, so if you lose the East those nodes can just go down the West side and continue checking in like nothing happens,” Casanova said. During a recent regional failover simulation everything went through the West region for a full week, “and customers didn’t really notice anything, with no real lag time going through West,” Casanova explained.
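The idea of nodes carrying on through another region can be sketched as a simple ordered fallback. This is only an illustration of the concept: the talk does not say how Capital One routes traffic (it may well be handled at the DNS or load balancer level), and the region labels, URLs and health-check callback below are hypothetical.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the URL of the first healthy region, in order of preference."""
    for region, url in endpoints:
        if is_healthy(region):
            return url
    raise RuntimeError("no healthy region available")


# Hypothetical endpoints, listed in order of preference.
ENDPOINTS = [
    ("east", "https://chef-east.example"),
    ("west", "https://chef-west.example"),
]

# Simulate losing the East region: only West reports healthy.
print(pick_endpoint(ENDPOINTS, lambda region: region == "west"))
```

From the node's point of view the check-in succeeds either way, which is the behaviour Casanova describes.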

During the Q&A, Casanova also said that the UK Capital One team is eyeing this Chef architecture too, although with different AWS regions.


Capital One can often be found speaking at industry events like this following an enterprise-wide shift towards a devops culture and heavy adoption of Amazon Web Services (AWS) for cloud infrastructure as it pursues the lofty aim of being an all-cloud bank.

As Tapabrata Pal, senior director of agile at Capital One, told Computerworld UK last year: “About 40 percent of our production workload is on AWS now. At 40 percent we are larger than Netflix on the AWS footprint, it's huge. We are not running a hybrid model - our focus is everything on cloud. There are data centre-based applications being transformed, re-written, re-engineered, thrown out, to get to cloud.”
