The Met Office is embracing what it calls ‘Chaos Days’ -- where it purposefully introduces failures into a clone of its cloud environment -- as a way to test its newly formed CloudOps team and the resiliency of its cloud infrastructure.
The Met Office is the UK's national weather service and provides forecasts to consumers and the private sector, as well as data science around the issue of climate change. This makes it a highly data-intensive organisation, so a shift towards the cloud is an understandable move.
Richard Bevan, head of operational technology at the Met Office, told Computerworld UK that the organisation has traditionally run an on-premises delivery model, but is increasingly shifting to the public cloud. He said: "Our strategy is to have an in-house cloud capability and we have developed a CloudOps team over the past twelve months to do so."
The Met Office has been working with consultancy Cloudreach to build out its CloudOps team and to aid its adoption of AWS infrastructure. The first app to move to a cloud delivery model is the set of media-facing APIs that supply weather information to web and mobile apps.
One practice Cloudreach promotes amongst its clients is "Chaos Day", where the CloudOps team is encouraged to break parts of its own infrastructure. After spinning up a clone AWS cloud environment, the team spends the day breaking small parts of the system and investigating what went wrong and how to fix it. This doubles up as a training exercise for the team, as well as giving them an insight into gaps in their knowledge and documentation.
In promoting this exercise, Cloudreach took inspiration from Chaos Monkey, an open source software tool developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services (AWS) infrastructure. The Met Office is currently working with AWS, but Bevan said that it isn't committed to a single vendor.
James Wells, a systems developer at Cloudreach, helped the Met Office run its first Chaos Day last month. He said: "We wanted to see what we don't know and iron that out." He explained that the aim is to "discover problems you may not have seen before. You see the documentation and gaps where you need to improve".
Wells has some tips for any CloudOps teams looking to run their own Chaos Day, and said it's important to pitch the difficulty of the challenges right to keep staff engaged.
His advice is:
- Know your team. If your team is mainly networking specialists, it’s going to be easy for them to find networking problems. If you’ve got a mix, do a range of things so everyone gets a chance to share their knowledge.
- Be careful! In the cloud you can create copies of environments to test these things with. So spin one up. Don’t risk your production data if you don’t need to.
- Make backups that you can quickly restore from. You might break something you didn’t intend to. Make sure you have a rollback and restore plan for every ‘breakage’ you make, so that you can fix any unintended consequences quickly!
- Start simple -- in real breakages or accidental changes, simple stuff happens as well. As you see how the team responds, you can increase the difficulty, break multiple things at once, etc.
- Don’t be tempted to be too clever too early. Remember, the goal is find out areas for improvement, not to defeat your CloudOps team!
- Timebox the breakages - typically about 30-45 minutes per breakage keeps people engaged without losing focus.
- Audit tools such as AWS CloudTrail can be your undoing with a clever team looking for changes. You can avoid this somewhat by using different users, or by having something such as a Lambda function or cron on an instance trigger the changes. However, ultimately you’ll probably have to restrict your teams from jumping straight to CloudTrail or it will get pretty boring fast!
- Try and present your problems to the CloudOps team as users would - an email with screenshots, error messages etc.
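The tips above can be sketched as a small script. This is a minimal, illustrative example of the "scripted breakage with a rollback plan" idea, not anything the Met Office or Cloudreach has published: each breakage pairs an apply step with a rollback step, and a runner picks one at random and remembers it so the clone environment can be restored quickly. The `Breakage` and `ChaosRunner` names and the in-memory "environment" dictionary are assumptions for illustration; in practice the apply/rollback callables would wrap real cloud API calls triggered from a Lambda function or cron job, as suggested above.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Breakage:
    """One planned fault, with a matching undo step."""
    name: str
    apply: Callable[[Dict[str, str]], None]
    rollback: Callable[[Dict[str, str]], None]

@dataclass
class ChaosRunner:
    env: Dict[str, str]
    applied: List[Breakage] = field(default_factory=list)

    def trigger(self, breakages: List[Breakage], rng: random.Random) -> Breakage:
        # Pick one breakage at random, apply it, and remember it for rollback.
        chosen = rng.choice(breakages)
        chosen.apply(self.env)
        self.applied.append(chosen)
        return chosen

    def restore(self) -> None:
        # Undo every applied breakage in reverse order.
        while self.applied:
            self.applied.pop().rollback(self.env)

# A toy clone "environment" with one setting the team must diagnose.
env = {"api_endpoint": "weather-api.example", "dns": "ok"}

break_dns = Breakage(
    name="break-dns",
    apply=lambda e: e.update(dns="broken"),
    rollback=lambda e: e.update(dns="ok"),
)

runner = ChaosRunner(env)
fault = runner.trigger([break_dns], random.Random(0))
print(f"injected: {fault.name}, dns is now {env['dns']}")
runner.restore()
print(f"after restore, dns is {env['dns']}")
```

Keeping every breakage paired with its rollback from the start is what makes the "make backups you can quickly restore from" tip workable: an unintended consequence can be unwound by replaying the rollbacks in reverse order.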