Massive Danish consumer website Trustpilot contains roughly 60 million user-generated reviews spanning 230,000 companies - and more than 3 billion reviews are read on the site in a single month.
As for almost every business with a digital product, downtime can have major repercussions for Trustpilot's reputation and user experience. And with the enormous digital footfall at the firm, Trustpilot needed visibility into its architecture to stop alerts from piling up and drowning the team.
As a result of this scale, the firm has embarked on a bold initiative to go serverless where possible, meaning it essentially outsources the provisioning of infrastructure to public cloud providers like Amazon Web Services (AWS).
The firm started this process in 2016, and by late 2018 the company had almost completely shifted to a serverless model running on infrastructure-as-code principles. This includes a major drive to use containers and to avoid virtual servers, which it considers "legacy" tech, wherever possible. Amazon's EC2 instances are still in use, but on the way out.
As a result, developer teams can now have infrastructure for services up and running in 15 minutes or less.
This is something that "allows the team to really experiment," Morten Reinholdt Boelskifte, site reliability engineer manager at Trustpilot, tells Computerworld UK. The company simplified the deployment process through its own platform so that teams can spin up new services without having to run through all the default steps in AWS or Google Cloud Platform, which for "some people can be a blocker".
"We kind of eliminated the whole thing," Boelskifte adds. "They can just go in, and set it up and have everything running. In terms of compliance we have a lot of automation happening around the services."
What ties everything together is Trustpilot's use of PagerDuty, which provides visibility into the company's full technology stack - and crucially lets teams know if they're about to break anything by experimenting too much.
"PagerDuty is very central to more or less everything we do, as they are the ones we rely on when we experiment maybe a bit too hard," Boelskifte says. "They are a central role in our entire pipeline of when we set up new stuff, but also when we deploy, and when we put out new features and new versions.
"When we experiment you do get more alerts - and alerts tend to happen when people interfere with the system. When we're interfering maybe a bit too much, PagerDuty alerts us - but it also helps us keep on top of alert fatigue, which can be a real pain if you allow it to grow on the teams."
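As a rough illustration of how a pipeline might raise an alert without letting repeats pile up, the sketch below builds a "trigger" event in the shape used by PagerDuty's Events API v2, where events sharing a `dedup_key` are grouped into one incident rather than opening a new one each time. This is not Trustpilot's actual setup: the function name, the service and check names and the routing key are all hypothetical, and the event is only built, not sent.

```python
import hashlib
import json

# Severity values accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, service, check, summary, severity="error"):
    """Build (but do not send) an Events API v2 style 'trigger' event.

    `service` and `check` are hypothetical labels chosen for this sketch.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")

    # The same service and check always produce the same dedup_key, so a
    # repeated alert for one problem maps onto one incident.
    dedup_key = hashlib.sha256(f"{service}:{check}".encode()).hexdigest()[:32]

    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": service,
            "severity": severity,
        },
    }


if __name__ == "__main__":
    # Placeholder values only; a real routing key comes from a PagerDuty integration.
    event = build_trigger_event(
        "HYPOTHETICAL_ROUTING_KEY",
        "reviews-api",
        "p99-latency",
        "p99 latency above threshold after deploy",
    )
    print(json.dumps(event, indent=2))
```

Deriving the key from what is failing, rather than from when it failed, is one simple way to keep a noisy experiment from turning into a flood of separate pages.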
Next, although nothing is set in stone, the company is going to run a planning session for Q2/Q3, where Boelskifte expects an even stronger shift towards infrastructure-as-code, to the degree that it can be self-served within repositories.
He hopes that, on the self-service platform, teams will be able to spin up services quickly using architecture patterns that are easy to follow. They should also soon be able to run a service, head into the code and tweak it manually - practically live.
"It is probably still going to be with code reviews happening on the SRE team, especially during the start," Boelskifte says. "But I think that's one of the things that we will be doing."