How management failings led to RBS IT catastrophe

The £56 million fine received by the Royal Bank of Scotland (RBS) today brought an end to the lengthy investigation into the bank’s high profile IT failure during the summer of 2012.

But despite widespread condemnation of the bank's legacy systems following the outage, the Financial Conduct Authority (FCA) concluded that risk management and control failings, rather than a lack of investment in technology, were to blame.

“The incident was not the result of insufficient investment in IT generally or in its IT infrastructure,” the FCA said in its report, pointing to the bank's £1 billion annual spend on systems, its 'up-to-date' software and mainframe technology less than five years old.

“Rather, the underlying cause of the incident was weakness in the group’s IT risk management and in its IT controls...which failed to formally identify and actively manage IT risks and to implement prudent testing controls.”

Upgrade issues – what caused the outage?

The July 2012 outage created havoc for RBS, Natwest and Ulster Bank customers. Access to millions of accounts was blocked, and ATMs, payments, transfers and digital banking services were disrupted. Companies were unable to meet payroll commitments. For some customers, the problems continued for three weeks.

According to the FCA’s findings, the root cause of the outage was an issue upgrading batch processing systems that are used to process payments and settle accounts each evening.

Problems began when the bank’s central IT team moved to a new version of CA7, the batch scheduler used for the Natwest and Ulster Bank systems, because ongoing support for the existing software was ending. While the team had successfully tested the new version, it had not tested the patched version that it actually implemented.
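
The gap between the build that was tested and the patched build that was actually deployed is the kind of mismatch a release pipeline can check mechanically. The sketch below is purely illustrative and assumes a hypothetical build registry; the function names and the `record_test_pass` helper are inventions, not anything described in the FCA report.

```python
import hashlib
from pathlib import Path

# Hypothetical registry mapping artifact checksums to approving test runs.
# In a real estate this would live in a build system or CMDB, not a dict.
APPROVED_BUILDS: dict[str, str] = {}


def checksum(artifact: Path) -> str:
    """Return the SHA-256 digest of a build artifact."""
    return hashlib.sha256(artifact.read_bytes()).hexdigest()


def record_test_pass(artifact: Path, test_run_id: str) -> None:
    """Register an artifact as having passed the full test suite."""
    APPROVED_BUILDS[checksum(artifact)] = test_run_id


def verify_before_deploy(artifact: Path) -> str:
    """Refuse to deploy anything other than the exact build that was tested.

    A patched binary, even one patch level away from the tested version,
    hashes differently and is rejected.
    """
    test_run = APPROVED_BUILDS.get(checksum(artifact))
    if test_run is None:
        raise RuntimeError(
            f"{artifact.name} does not match any tested build; "
            "the tested and deployed versions differ."
        )
    return test_run
```

With a control of this shape, the patched CA7 package would have been flagged as untested before it reached production rather than after.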

When the update went live, it initially disrupted Ulster Bank and Natwest systems, prompting a decision to uninstall the upgrade and revert to the previous version.

However, the bank was unaware that the new, patched version of the upgraded software was incompatible with the original software.
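
A backout carries the same risk as the original change if the version being restored cannot work with what the new version has already done. As a hedged illustration only (the version labels and compatibility table are invented, and this is not a description of RBS's actual tooling), a rollback guard might look something like this:

```python
# Illustrative rollback guard: refuse to back out an upgrade unless the
# target version is recorded as compatible with the state the running
# version has already written. Version labels here are invented.
COMPATIBLE_ROLLBACKS = {
    "scheduler-new-patched": {"scheduler-new"},  # patch can revert to its base
    "scheduler-new": {"scheduler-old"},          # base can revert to the old release
}


def safe_to_roll_back(current: str, target: str) -> bool:
    """Return True only if reverting from `current` to `target` is recorded as safe."""
    return target in COMPATIBLE_ROLLBACKS.get(current, set())


if __name__ == "__main__":
    # The untested, patched release was never verified against the old
    # software, so a direct backout is blocked rather than attempted.
    assert not safe_to_roll_back("scheduler-new-patched", "scheduler-old")
    assert safe_to_roll_back("scheduler-new-patched", "scheduler-new")
```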

As a result, more and more batch processing ‘jobs’ went unprocessed, creating a backlog that could not be cleared before the next daily batch load arrived. RBS batch systems were also affected because of interdependencies between the two systems.
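
The arithmetic behind that backlog is simple but unforgiving: each night's unprocessed jobs are added to the next day's load, so once the scheduler stops completing runs the queue grows faster than a normal processing window can drain it. The rough simulation below uses invented job counts and capacities purely to illustrate the dynamic; it is not based on RBS volumes.

```python
def simulate_backlog(daily_jobs: int, nightly_capacity: int,
                     outage_nights: int, total_nights: int) -> list[int]:
    """Track the unprocessed-job backlog night by night.

    During the outage no jobs complete; afterwards at most
    `nightly_capacity` jobs clear per night while `daily_jobs`
    new ones keep arriving. All figures are illustrative.
    """
    backlog = 0
    history = []
    for night in range(total_nights):
        backlog += daily_jobs                  # new day's batch load arrives
        processed = 0 if night < outage_nights else nightly_capacity
        backlog = max(0, backlog - processed)  # clear what the window allows
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Three failed nightly runs followed by recovery at 120% of daily volume:
    # the backlog still takes a couple of weeks of nights to drain, which is
    # why customers can feel the knock-on effects long after the fix.
    print(simulate_backlog(daily_jobs=100, nightly_capacity=120,
                           outage_nights=3, total_nights=21))
```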

From there, the failure to process accounts left vital services blocked for customers, resulting in 70,000 customer complaints and over £70 million paid out in compensation.

Managing risk – the underlying cause

According to the FCA, while the software upgrade was the trigger for the incident, the real problem was the lack of underlying management processes in place to prevent such a large-scale outage.

This was the case for its technology services division, but also for the wider group, which sets operational risk policies that IT teams are required to follow.

According to the FCA report, RBS risk management policies were “limited in scope because [the] focus was on business continuity and should have included a much greater focus on IT resilience”.

The FCA claims that the technology services division’s processes were inadequate, with problems including inaccurate records of changes to systems, inconsistent procedures across the division, and an incomplete view of IT risk.

This had a direct impact on the events that led to the batch processing software upgrade, the FCA said, as the department “did not sufficiently identify, understand or mitigate the risk of a batch scheduler failure”.

Furthermore, measures to reduce the risk of outages and to limit their effects, such as separating batch processing systems, had not been put in place.

“[The technology services risk function’s] culture was ineffective insofar as it was based on a past history of reacting and responding to incidents, rather than forward looking identification of risk,” the report said.

In addition, there was a lack of 'substantial' experience, with over half of the technology services risk team having been appointed within two years of the incident.
