The BBC has been able to increase the amount of content it publishes to its hugely popular iPlayer video on-demand service by 100 percent since shifting to the cloud and adopting a microservices architecture.
The iPlayer service publishes 10,000 hours of video and audio media a week, and no longer has to prioritise what is and is not added to the service because of technical constraints.
During a highly technical session at AWS re:Invent in Las Vegas this week, lead architect at the BBC Stephen Goodwin explained how iPlayer was ported to the AWS public cloud infrastructure in 2012, and how the platform has developed since.
Before shifting to the cloud and breaking up its monolithic iPlayer application into microservices, the BBC was running iPlayer on physical systems in its own data centres.
This was adequate at the time, but started to strain when the BBC added mobile, tablet and HD viewing options for on-demand and live content, as well as bringing the iPlayer Radio service into iPlayer and allowing downloads.
"The systems didn't have the resources to do what they needed," Goodwin explained. "They ran out of CPU, disk throughput and storage." The resulting system was far from stable and highly inflexible.
For example, when the local news goes out at 6:30pm, BBC1 essentially splits into 19 regional channels running concurrently. Under the old hardware-based system it would take 16 to 17 hours to process those 19 videos, Goodwin said. Now they simply spin up additional EC2 instances to process that content and add it to iPlayer.
Worse still, the BBC didn't have the processing capability to publish everything it was broadcasting. The workaround at the time was to hire people to decide each week which content should be prioritised for upload to the service.
Shifting to AWS SQS and SNS
At the heart of the old system was a database which listed all of the jobs that needed to be done - essentially content to be uploaded to the service. "This was OK in a steady state," Goodwin said, "but if there was an influx of work to be done, or there was a fault, and work backed up, this table filled up."
This caused a vicious circle: the database became harder to query as it filled up, which slowed down every downstream service that called it. So the BBC couldn't process the jobs quickly enough, yet the table kept getting bigger.
To get around this, developers would dump the contents of this database into a text file and slowly feed it back in to ensure the system could cope. "It was not ideal using a database as a queue," Goodwin said.
Now the BBC has microservices at its heart, all underpinned by a robust event queuing and messaging framework. All of this is built on Amazon Web Services (AWS) tooling, which should come as no surprise considering where Goodwin was speaking this week.
This extends to the messaging and queuing workflow the BBC has put in place using Amazon Simple Queue Service (SQS) and Simple Notification Service (SNS).
In short, the new workflow sees the jobs that were previously loaded into that database being put into an SQS queue instead. These jobs are passed along as messages from microservice to microservice until the content eventually becomes available on iPlayer.
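As a rough illustration of that hand-off, the sketch below passes a job through a short chain of stages, with a queue sitting between each pair. It is a toy model only: the stage names (ingest, transcode, publish) and the job fields are hypothetical, the queues are in-memory Python deques rather than SQS, and each stage drains its inbox in turn rather than running concurrently.

```python
from collections import deque


# Hypothetical stages; each returns a new job record with its own step appended.
def ingest(job):
    return {**job, "steps": job["steps"] + ["ingest"]}


def transcode(job):
    return {**job, "steps": job["steps"] + ["transcode"]}


def publish(job):
    return {**job, "steps": job["steps"] + ["publish"]}


def run_chain(stages, jobs):
    """Pass every job through each stage in order, with a queue between stages."""
    # One inbox for the first stage, plus one outbox per stage.
    queues = [deque(jobs)] + [deque() for _ in stages]
    for index, stage in enumerate(stages):
        inbox, outbox = queues[index], queues[index + 1]
        while inbox:
            # Take the oldest message, do the work, hand the result to the next queue.
            outbox.append(stage(inbox.popleft()))
    return list(queues[-1])


jobs = [{"programme": "Sherlock", "steps": []}]
finished = run_chain([ingest, transcode, publish], jobs)
print(finished[0]["steps"])  # ['ingest', 'transcode', 'publish']
```

Because each stage only knows about its own inbox and outbox, a stage can be added or replaced without changing the others.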
"In reality the system is made up of 30 separate microservices, with messages passing along a daisy chain of services to go through the workflow," Goodwin helpfully explained.
If there is a rush of content to go on iPlayer, the messages queue and the system uses auto-scaling to increase the number of instances it is running. Messages are then essentially hidden from each instance to avoid work being duplicated. "So I can scale to 30 instances and be confident they will work independently of each other," Goodwin said.
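The "hidden" behaviour can be pictured with a small in-memory model. This is a sketch that assumes the mechanism is the usual queue visibility timeout: once one worker receives a message, it stays invisible to other workers for a set period, and reappears only if it has not been deleted by then. The class and method names are illustrative and are not the SQS API.

```python
import itertools


class VisibilityQueue:
    """Toy in-memory queue that mimics a visibility timeout (not the real SQS API)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}      # message id -> body
        self._hidden_until = {}  # message id -> time at which it becomes visible again
        self._ids = itertools.count(1)

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._hidden_until[msg_id] = 0.0  # visible immediately
        return msg_id

    def receive(self, now):
        """Return (id, body) for the oldest visible message and hide it, or None."""
        for msg_id, body in self._messages.items():
            if self._hidden_until[msg_id] <= now:
                self._hidden_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Remove a message once its work is finished."""
        self._messages.pop(msg_id, None)
        self._hidden_until.pop(msg_id, None)


queue = VisibilityQueue(visibility_timeout=30.0)
queue.send("transcode episode A")
queue.send("transcode episode B")

first = queue.receive(now=0.0)   # worker 1 takes episode A
second = queue.receive(now=1.0)  # worker 2 cannot see A, so it takes episode B
third = queue.receive(now=2.0)   # both messages are hidden, so nothing is returned
```

If a worker finishes, it deletes its message; if it crashes instead, the message becomes visible again once the timeout passes and another worker can pick it up.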
This also allows the BBC to be more confident that a message won't get lost. As Goodwin put it: "We use this style as messages are a unit of work we want to get done and queuing systems are good at promising the integrity of those messages.
"This is important as each unit of work is something like adding Sherlock to iPlayer and people would get upset if that didn't happen."
There are a few caveats though. Each message has a predefined timeout of 30 seconds, so that no single message can become a bottleneck.
The system also caps the number of processing attempts at five. If a message keeps causing a problem, it is dropped out of the queue rather than hammering the system over and over again, and the operations team can resolve the underlying issue before putting it back in.
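One way to express those two limits on an SQS queue is through its attributes. The sketch below assumes that the 30-second timeout corresponds to the queue's visibility timeout and that failed messages are moved to a dead-letter queue via a redrive policy; the talk did not confirm either detail. The helper name, queue names and ARN are all hypothetical.

```python
import json


def build_queue_attributes(dead_letter_queue_arn, visibility_timeout_s=30, max_receive_count=5):
    """Build an SQS Attributes mapping with a visibility timeout and a redrive policy."""
    return {
        # How long a received message stays hidden from other consumers.
        "VisibilityTimeout": str(visibility_timeout_s),
        # After max_receive_count receives, SQS moves the message to the dead-letter queue.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dead_letter_queue_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


# Hypothetical ARN, for illustration only.
attributes = build_queue_attributes("arn:aws:sqs:eu-west-1:123456789012:example-dead-letter")

# With boto3 installed and AWS credentials configured, the mapping could be used as:
#   boto3.client("sqs").create_queue(QueueName="example-work", Attributes=attributes)
```

Keeping problem messages in a separate queue means they can be inspected and replayed later without blocking healthy work.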
This is especially important as the BBC implements a Chaos Monkey methodology to introduce issues into its system to test its resilience. So if some infrastructure is taken down by Chaos Monkey, they can be confident that the piece of work will still get done.
The new system also incorporates robust monitoring and debugging capabilities, using tools from Splunk on top of AWS, to track messages through the system and spot issues with each microservice, without having to work through 300 individual log files.
"This makes it much easier to see what is going on and how many programmes we can process an hour, for example," Goodwin said.
Benefits of switching to microservices
All in all, the switch to microservices has been an unqualified success for the BBC, with the added scalability and resiliency being the obvious benefits.
It also means the system is more easily extensible, with Goodwin giving the example of the BBC bringing its previously separate radio services into iPlayer. "It's easy to merge additional capability in," he said.
This also allows for faster deployment, as the iPlayer team has shifted to continuous delivery alongside the microservices architecture. "We now have the tooling to implement a change in 15 minutes, with 32 deployments going live each week across a development team of 30 people," Goodwin said.
All of that eventually means happier audiences.
"If we have the rights, then a programme is going online," Goodwin said, with the organisation seeing a 100 percent increase in the amount of content available on iPlayer immediately upon switching to this architecture.