It is time for cloud computing providers to consider whether eliminating humans from their infrastructure operations really makes sense.
Automation is not necessarily the answer in an industry where the promise of low-cost utility computing is undermined by stories of downtime in the cloud.
We’ve seen Amazon struggle with its elastic cloud. We’ve seen Google Apps hit by mail sync issues. These companies may have revolutionised IT, but they could still learn a useful lesson or two from the past.
Good operations don’t necessarily depend on more automation, but rather on providing humans with the best possible data to make the right decisions, and clear warnings against the wrong ones. Otherwise you end up with bad data copied over good data, and worse.
It’s been a little while now since Amazon’s EC2 service let down whole swathes of its customers, and it is probably a very good time to reflect on the fact that, while this was hugely damaging for their users, it was also disappointing for the industry itself. Like many of us, it got me thinking, and reminded me of the importance of DR, backup, failover and the many other procedures for keeping our systems running and our data intact. So, for me, the Amazon EC2 fiasco was a big wake-up call.
Let’s be frank here (no pun intended): Amazon’s EC2 looks a good cloud offering. They have hosted operations in different regions across the world, and each region has multiple datacentres. Their service offering is for 99.95% uptime, which allows for roughly 4.4 hours of downtime per year. In this case they obviously blew their SLA out of the water.
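The downtime budget implied by an uptime SLA falls straight out of the percentage. A minimal sketch (the function name is mine, not an AWS calculation):

```python
# Convert an uptime SLA percentage into the downtime it permits
# over a given period, in hours.
def allowed_downtime_hours(uptime_pct: float, period_hours: float) -> float:
    return period_hours * (1 - uptime_pct / 100)

# A 99.95% yearly SLA, for instance, allows about 4.4 hours of
# downtime per year -- or only about 22 minutes in any one month.
print(round(allowed_downtime_hours(99.95, 365 * 24), 2))  # 4.38
print(round(allowed_downtime_hours(99.95, 30 * 24), 2))   # 0.36
```

A 30-hour outage blows through that yearly budget nearly seven times over.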
The problem, as most of us know by now, was fundamentally a network change, coupled with human error and an automated failover in their EBS cluster. I won’t go into the details, as that’s been done before; you can read Amazon’s version of events here.
It’s important to be aware that these things can happen. The reason it made front-page news was that the service was down for so long (about 30 hours in total). Not good, as an outage of this magnitude was extremely costly for many of its customers.
The truth, however, is that the human error and over-aggressive automated solutions are a huge part of the problem. Some years ago I worked with the then Compaq (née Digital) Disaster Tolerant Solutions (DTCS) team, who had developed very sophisticated failsafe systems for making the right decisions in a split-site, live-live computing and data replication environment.
Today this is owned by HP and remains focused on very quick recovery and minimal business downtime. DTCS developed many techniques and theories, but amongst the most important were ensuring that automated recovery procedures could be blocked pending human confirmation, and that humans were steered away from doing the wrong thing by being given the best possible information.
The humans were given appropriate options, so they could not perform arbitrary operations, but they had the ultimate say. I suspect that the cloud industry still has much to learn from DTCS’ processes.
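That principle can be sketched in a few lines. This is a hypothetical illustration of the pattern, not DTCS’ actual implementation: automation may propose a recovery action with its supporting evidence, but nothing executes until a human confirms it, and the human can only choose from a constrained menu of permitted operations.

```python
from dataclasses import dataclass, field

# Hypothetical human-in-the-loop recovery gate: automation PROPOSES,
# a human CONFIRMS, and only a whitelisted set of actions can ever run.

@dataclass
class RecoveryGate:
    # The only operations a human may authorise; anything else is refused.
    allowed_actions: set = field(default_factory=lambda: {
        "failover_to_site_b", "resync_from_primary", "hold_and_monitor"})
    pending: list = field(default_factory=list)

    def propose(self, action: str, evidence: str) -> None:
        """Automation queues a recommendation plus the data behind it.
        Nothing is executed at this point."""
        self.pending.append((action, evidence))

    def confirm(self, action: str) -> str:
        """A human approves one proposed action from the permitted set."""
        if action not in self.allowed_actions:
            return f"REFUSED: '{action}' is not a permitted recovery operation"
        if action not in (a for a, _ in self.pending):
            return f"REFUSED: '{action}' was never proposed by automation"
        self.pending = [(a, e) for a, e in self.pending if a != action]
        return f"EXECUTING: {action}"

gate = RecoveryGate()
gate.propose("failover_to_site_b", "primary EBS cluster unreachable for 120s")
print(gate.confirm("wipe_primary"))        # refused: not a permitted operation
print(gate.confirm("failover_to_site_b"))  # runs only after human sign-off
```

The point is not the code itself but the division of labour it encodes: the machine supplies the evidence and the options, and the human supplies the judgement.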
Amazon have stated that the trigger for this event was a network configuration change, and that in future they will audit their change process and increase automation to prevent a similar issue recurring. That’s fine, provided the automation makes the right decisions and those decisions are clear-cut.
The cloud industry is shiny and new, and probably quite full of itself. But as this incident shows, it can still learn something from the people who have been there and done it!
Posted by Frank Puranik
Frank is Product Director at iTrinegy. With more than 30 years in the computing industry, he is an expert in the performance issues of applications across the world’s most complex networks.