Blizzard Entertainment has turned to OpenStack's Senlin clustering service to help autoscale the infrastructure behind its massively popular first-person shooter Overwatch, achieving a 40 percent reduction in its virtual machine (VM) usage as a result.
Overwatch now accounts for more than $1 billion in revenue for the game studio and is an e-sports phenomenon. Blizzard runs its multiplayer games, including World of Warcraft and Diablo, on virtual game servers hosted on OpenStack private cloud infrastructure across 11 global data centres.
Speaking during a panel session at the Open Infrastructure Summit this week, Blizzard cloud engineer Duc Truong and senior cloud software engineer Jude Cross provided an overview of how Senlin helps the game server clusters run with minimal impact on Overwatch's 40 million registered players, who expect low latency - and where a delay of seconds can easily turn the tide in-game.
How they did it
Blizzard began autoscaling its game servers with OpenStack at the start of this year. But the company's relationship with OpenStack actually goes back to 2012, when it began using the open source infrastructure platform to host its private cloud. The company is running mostly on Rocky, but a few services run on Pike, while Senlin is on the most recent OpenStack release, Stein.
"Autoscaling in the public cloud, the benefits are very obvious," said Truong. "You use less instances and your costs are reduced. But in a private cloud like Blizzard, as you can imagine, we have finite capacity - and without autoscaling, that capacity gets divided up with different games.
"What happens is that each game allocates a set number of virtual machines that will be able to handle the peak traffic for that game, during any time of the lifetime of that game, and that includes your content releases or seasonal events."
In non-peak times, the load is spread among those VMs, meaning each VM is essentially under-utilised. With autoscaling enabled, the load is packed more tightly into existing VMs during non-peak times, and the VMs that are not being used translate into free capacity - which can then be redistributed to support other games.
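The capacity arithmetic behind this can be sketched in a few lines. The function names and figures below are illustrative assumptions, not Blizzard's actual sizing model: the idea is simply that packing load into the minimum number of VMs, rather than holding a peak-sized allocation, returns the difference to the shared pool.

```python
import math

def required_vms(total_load: float, per_vm_capacity: float) -> int:
    """VMs needed to serve total_load when each VM handles per_vm_capacity."""
    return max(1, math.ceil(total_load / per_vm_capacity))

def freed_vms(peak_allocation: int, total_load: float, per_vm_capacity: float) -> int:
    """VMs released back to the shared pool during off-peak hours."""
    return peak_allocation - required_vms(total_load, per_vm_capacity)

# Hypothetical example: a game sized for peak at 100 VMs whose off-peak
# load fits on 35 of them frees 65 VMs for other games.
```

Without autoscaling, those 65 VMs would sit mostly idle, reserved against a peak that only occurs during content releases or seasonal events.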
Another benefit is that autoscaling lets game development teams perform deployments more easily - for example, if a new map is being released, the team can spin up a cluster of game servers that feature the new content, where it can be tested. When the release is ready to go live, traffic gets switched over from the old cluster to the new, and the original cluster is slowly scaled down in a process called draining.
And if there's an unpredictable spike in traffic - for instance, a Twitch streamer logging on and drawing a wave of users trying to join in - servers can be added as needed, then removed as the traffic decreases. Additionally, because capacity is constantly being scaled up or down, the lifetime of these VMs is much shorter than that of traditional game servers (which could be up and running for weeks, months, or longer), reducing the bugs associated with long-running servers, such as memory or connection leaks.
However, because individual gaming sessions can last for hours, the draining of players might take a long time before it is complete. So the team implemented lifecycle hooks, using Senlin, to drain nodes before the clusters are deleted.
Lifecycle hooks allow for pausing the creation or deletion of an instance as well as visibility into the major events in an instance's life cycle.
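A drain-before-delete handler of this kind can be sketched as follows. This is a minimal illustration, not Blizzard's implementation: the callback names (`stop_matchmaking`, `player_count`, `delete`) are hypothetical stand-ins for whatever the lifecycle hook triggers.

```python
def drain_then_delete(node, stop_matchmaking, player_count, delete, max_polls=480):
    """Sketch of a lifecycle-hook handler: stop routing new players to the
    node, wait for existing sessions to finish, then complete the paused
    deletion. Returns False if the node never fully drains."""
    stop_matchmaking(node)            # no new sessions land on this node
    for _ in range(max_polls):        # in practice, poll on an interval
        if player_count(node) == 0:   # all sessions have drained
            delete(node)              # complete the deletion Senlin paused
            return True
    return False                      # timed out; needs operator attention
```

Because sessions can run for hours, the poll budget matters: the hook keeps the deletion paused for as long as the drain takes, rather than killing live games.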
"The gist of it is that with any sort of interruption in the game, whether that's even one second, can dramatically impact a player's experience - so you'd be the difference whether or not you secure a kill or capture an objective," said Cross. "You really need to maintain the state of the server when you're doing any sort of changes to the server infrastructure."
The team had originally used Amazon Web Services (AWS) autoscaling in its public cloud instances since 2017, and considered switching to [OpenStack orchestration engine] Heat, but ultimately opted for Senlin because of the support its lifecycle hooks afford for draining servers.
Senlin also closely matched the AWS autoscaling that was already underway at the company, but in addition, offered a REST API for scaling clusters - as well as the ability to write Gophercloud and Terraform extensions to support those clusters.
Traditional autoscaling relies on metrics from virtual machines like CPU and memory usage in order to make scaling decisions. Game servers reserve all the RAM on a VM, said Cross, and the CPU utilisation can spike depending on physics calculations or whether or not something is happening within the servers.
"Those aren't really reliable metrics to use for game servers or for getting people off a game server," he added. "So we let the game servers calculate their definition of load independently, which allowed us to expose load to the autoscaling service - and the autoscaling service was able to make scaling decisions based off of that."
Autoscaling at Blizzard, then, consists of Senlin for creating and managing the clusters of game servers, running as a Docker container on the control plane; Zaqar, also run as a Docker container, to deliver lifecycle hook messages; and the Blizzard autoscaling service itself, a custom application written in Go that runs as a VM sidecar to a cluster. As well as working with the OpenStack private cloud, the service plays nicely with AWS if additional resource is required.
The autoscaling service determines the load on a game's servers and asks Senlin to scale up or down. When scaling up, Senlin creates a server via Nova, OpenStack's VM provisioning service. When scaling down, the Blizzard service talks to Senlin, which marks the node to be targeted; the service triggers that node to drain, and once all the players have left the server, Senlin confirms it is safe to delete and requests that Nova kill it.
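Senlin models scaling as an action POSTed to a cluster resource in its REST API. The sketch below assumes a generic HTTP client object (the `senlin.post` shape is illustrative, not a real SDK), but the path and body follow Senlin's documented cluster-action format.

```python
def scale_cluster(senlin, cluster_id, direction, count=1):
    """Request a Senlin cluster action; Senlin in turn asks Nova to create
    or delete the underlying game-server VMs. `senlin` is any client
    exposing a .post(path, json=...) method (an assumed interface)."""
    if direction not in ("scale_out", "scale_in"):
        raise ValueError(f"unknown direction: {direction}")
    return senlin.post(f"/v1/clusters/{cluster_id}/actions",
                       json={direction: {"count": count}})
```

In Blizzard's setup the equivalent calls are made via Gophercloud from the Go sidecar, with the scale-in path gated behind the drain described above.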
There were, of course, challenges - not least database inefficiencies, as well as action list and cluster list operations lagging when many clusters were up and running.
To address this, Blizzard's cloud team updated its database models and stripped out unnecessary DB calls - an improvement Cross puts at an astonishing 1,000 percent.
"That might sound like I'm speaking hyperbole, I'm not," said Cross. "It was 1,000 percent."
The team also encountered its fair share of database locking, leading to database failures when the team was running a high number of simultaneous scaling requests. Failure checks also happened asynchronously, leading to problems with the Blizzard autoscaling service - with multiple scale-up or scale-down requests sitting in a queue, obscuring the visibility of the state of the clusters. They got around this by exposing conflicting actions to the API via Senlin, so if a bad request came in, the team would "fail fast" - the API telling them that they were already scaling up or down, for instance.
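The fail-fast behaviour can be illustrated with a few lines. This is a simplified sketch, not Senlin's actual conflict handling: a request against a cluster with an action already in flight is rejected immediately, rather than queued where it would obscure the cluster's real state.

```python
class ConflictError(Exception):
    """Raised when a scaling action is already in flight for a cluster."""

def request_scale(cluster_id, direction, in_flight):
    """Fail fast instead of queueing: if an action is already running on
    the cluster, reject the new request so the caller sees true state.
    `in_flight` maps cluster id -> the action currently running."""
    if cluster_id in in_flight:
        raise ConflictError(
            f"{in_flight[cluster_id]} already in progress on {cluster_id}")
    in_flight[cluster_id] = direction
    return "accepted"
```

The caller gets an immediate, unambiguous error ("already scaling up") instead of a silently growing queue of conflicting requests.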
But ultimately the introduction of Senlin with the autoscaling service helped lead to a 40 percent reduction in Blizzard's VM footprint across its private cloud environment.
Truong and Cross say they're very active in the #senlin OpenStack IRC channel, and invited interested parties who want to learn more, or to contribute upstream, to join them there.