The £56 million fine received by the Royal Bank of Scotland (RBS) today brought an end to the lengthy investigation into the bank's high-profile IT failure during the summer of 2012.

But despite widespread condemnation of the bank's legacy systems following the outage, the Financial Conduct Authority (FCA) has claimed that risk management and control failings were to blame for the outage – rather than a lack of investment in technology.

“The incident was not the result of insufficient investment in IT generally or in its IT infrastructure,” the FCA said in its report, pointing to the bank's £1 billion annual spend on systems, 'up-to-date' software and mainframe technology under five years old.

“Rather, the underlying cause of the incident was weakness in the group’s IT risk management and in its IT controls...which failed to formally identify and actively manage IT risks and to implement prudent testing controls.”

Upgrade issues – what caused the outage?

The July 2012 outage created havoc for RBS, Natwest and Ulster Bank customers. Access to millions of accounts, ATMs, payments, transfers and digital banking services was blocked. Companies were unable to meet payroll commitments. For some, the problems continued for three weeks.

According to the FCA’s findings, the root cause of the outage was an issue upgrading batch processing systems that are used to process payments and settle accounts each evening.

Problems began when the bank's central IT team moved to a new version of CA7 for its Natwest and Ulster Bank batch scheduler system, due to a lack of ongoing support for the software. While staff had successfully tested the new software, they had not tested the patched version that was actually implemented.
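The gap the FCA identified – testing one build but releasing a different, patched one – is a general software risk rather than anything specific to CA7. As a purely illustrative sketch (the file names and digest below are hypothetical, not taken from RBS's tooling), a release step can refuse to deploy any artifact that is not byte-identical to the build that passed testing:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def assert_deploying_tested_build(artifact: Path, tested_digest: str) -> None:
    """Block deployment of any artifact that does not match the build that
    passed testing (for example a later patched build that skipped the test run)."""
    actual = sha256_of(artifact)
    if actual != tested_digest:
        raise RuntimeError(
            f"Deploy blocked: {artifact.name} digest {actual[:12]} does not match "
            f"the tested build ({tested_digest[:12]}). Re-test the patched build first."
        )

# Hypothetical usage – the package name and digest are invented:
# assert_deploying_tested_build(Path("scheduler_upgrade.pkg"), "3b2a9c0f1d7e...")
```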

When the update went live, it initially disrupted Ulster Bank and Natwest systems, prompting a decision to uninstall the upgrade and roll back to the previous version.

However, the bank was unaware that the new, patched version of the upgraded software was incompatible with the original software.

As a result, more and more batch processing ‘jobs’ were left unprocessed, creating a backlog that could not be cleared before the next daily batch load arrived. RBS batch systems were also affected due to interdependencies between the two systems.
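A toy model makes the arithmetic of that backlog clear – the figures below are invented for illustration and are not RBS's real volumes – showing how a few nights of stalled processing can take weeks to work off when the overnight scheduler has only modest spare capacity:

```python
# Illustrative only: a toy model of how a stalled overnight batch compounds.
DAILY_LOAD = 100_000       # jobs arriving each day (invented figure)
CAPACITY = 110_000         # jobs the scheduler can clear per night (invented figure)
OUTAGE_NIGHTS = {1, 2, 3}  # nights on which the scheduler processes nothing

backlog = 0
for night in range(1, 61):
    backlog += DAILY_LOAD
    processed = 0 if night in OUTAGE_NIGHTS else min(backlog, CAPACITY)
    backlog -= processed
    print(f"night {night:2d}: processed {processed:6d}, backlog {backlog:6d}")
    if backlog == 0 and night > max(OUTAGE_NIGHTS):
        print(f"backlog cleared after {night} nights")
        break
```

With only 10 per cent headroom over the daily load, a three-night stall in this toy model takes around a month of clean runs to clear – consistent with the weeks of disruption some customers experienced.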

From there, the failure to process accounts left vital services blocked for customers, resulting in 70,000 customer complaints and over £70 million handed out in compensation.

Managing risk – the underlying cause

According to the FCA, while the software upgrade was the trigger for the incident, the real problem was the lack of underlying management processes in place to prevent such a large-scale outage.

This was the case not only for its technology services division but also for the wider group, which sets the operational risk policies that IT teams are required to follow.

According to the FCA report, RBS risk management policies were “limited in scope because [the] focus was on business continuity and should have included a much greater focus on IT resilience”.

The FCA claims that the technology services division’s processes were inadequate, with problems including inaccurate records of changes to systems, inconsistent procedures across the division, and an incomplete view of IT risk.

This had a direct impact on the events that led to the batch processing software upgrade, the FCA said, as the department “did not sufficiently identify, understand or mitigate the risk of a batch scheduler failure”.

Furthermore, measures to reduce the risk of outages and minimise their effects – such as separating batch processing systems – had not been put in place.

“[The technology services risk function’s] culture was ineffective insofar as it was based on a past history of reacting and responding to incidents, rather than forward looking identification of risk,” the report said.

In addition, there was a lack of ‘substantial’ experience, with over half of the technology services risk team appointed within two years of the incident.

However, the FCA also said that the wider business failed to monitor risks around IT – a function central to the overall running of the bank. This was partly due to a lack of IT knowledge among group management, as well as other factors, such as incomplete audits of IT – including mainframe systems – in the preceding 12 months. Responsibility for managing these risks also fell to the board, which did not properly review group-wide governance policies.

Such policies were "limited in scope" because they "addressed recovering from a single low probability but high impact event" such as the total loss of a data centre, rather than smaller but more probable disruptions like software failure.

The aftermath

Following the outage, regulators in the UK and Ireland began investigations into the issue, with the Central Bank of Ireland fining Ulster Bank last week for failing to ensure the stability of its infrastructure, which had been outsourced to RBS in 2005. The FCA also subsequently launched a wider investigation into the robustness of IT systems used by all UK banks.

Meanwhile, RBS CEO Ross McEwan pledged to invest £750 million over three years to improve the resilience of its systems. This included remedial action to simplify its legacy estate and attempt to prevent further occurrences. For example, in May it completed the separation of its batch processing systems for the individual banks within the group, meaning that an outage in one will no longer ripple across other arms of the business.
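The principle behind that separation is straightforward: if each brand's overnight run is isolated, a scheduler failure in one no longer stops the others. The sketch below illustrates the idea only – the brand names come from the article, but nothing else here reflects how RBS actually implemented it:

```python
# Illustrative sketch of separated overnight batches: each brand's run is
# isolated, so a failure in one no longer blocks the others.

def run_overnight_batch(brand: str) -> None:
    # Stand-in for the real scheduler run; here one brand's batch fails.
    if brand == "Ulster Bank":
        raise RuntimeError("batch scheduler error")
    print(f"{brand}: overnight batch completed")

for brand in ["RBS", "Natwest", "Ulster Bank"]:
    try:
        run_overnight_batch(brand)
    except RuntimeError as err:
        # The failure is contained: the remaining brands still run tonight.
        print(f"{brand}: batch failed ({err}); other brands unaffected")
```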

It is also attempting to halve the number of its core banking platforms over the next two years.

“In this year alone we have gone from a single overnight batch to now running four overnight separated batches on different parts of our business,” said McEwan in a recent earnings call.

“That is one of the heaviest lifting I think any financial services organisation could do in their lives. And we have done that in the last 12 months and nobody noticed it. And at the same time we have double batched, we have doubled security, we've got a little bit of work to do on the ATM and the point-of-sale technology, fronts that we connect into, to do by the end of this year.”

So is the FCA right?

So is the FCA correct to say that the outage was due to a lack of safeguards rather than investment in IT? It is likely that both aspects played a major role, said TechMarketView financial services analyst Peter Roe, with the bank's hugely complex and intertwined legacy systems the real underlying cause.

While huge amounts were clearly spent on IT each year, the money was not necessarily targeted in the right places, he said.

“A huge amount of resource has been devoted to keeping the lights on, and also making systems compliant with all the new regulation, so there was little in the past that was able to be spent on real change,” he said.

“[The batch processing software upgrade] is an example of how legacy systems increase the vulnerability of the banks, particularly when systems have to change, as they do.

“It was not a lack of investment – because of the amount that they were spending – but it shows the complexity of changing IT within a bank, [and the danger of] not having the correct process and quality control, because of the fiendishly complicated structures and systems.”

He added: “It is a bit like a game of ‘pick-up-sticks’ – you try and pick one up without moving the other ones – but with software. If you move one system you are going to move others, and if you are not careful you can bring the whole lot of them down.”

According to Lev Lesokhin, executive vice president at software quality analysis firm CAST, legacy infrastructure was ultimately to blame.

“The underlying issue is the creaking infrastructure which the largest (and oldest) UK banks use. This is under increasing pressure to deliver ‘Google-like’ customer services demanded by customers today,” he said.

“Western banking systems are particularly exposed because they were the first to install computer systems, and investment in those systems has since been neglected as tightening budgets have meant less is spent on modernisation and quality assurance.

“Until these underlying issues are addressed and industry standards put in place, we will continue to see glitches like this.”