I'd hazard a guess that more than half of the money we spend on IT infrastructure goes to ensuring uptime. Whether that takes the form of redundant power supplies, redundant disks, multicore network architectures, high-availability server clusters, or fully replicated backup data centres, we build our infrastructures not "just in case" of failure, but because we know failure will eventually occur.

And we're usually not disappointed. I've lost count of how many failed SAN disks, controllers, servers, and pieces of network equipment I've seen -- probably thousands over the years. It doesn't matter who made the equipment -- it could be the biggest, most respected name in storage or networking -- it all breaks at some point or another.

We're now at a point, however, where failure has become much more complicated. As storage has moved from simple, striped disk volumes to virtualised storage platforms and deduplication, the old standby of throwing more redundant hardware at the reliability problem doesn't always cut it. Every time you add a new feature to optimise or more effectively manage storage, you need more software to drive all the hardware. Often, that software is where the worst problems arise.

The case of the disappearing cartridge

Recently, I worked with a client to design and implement a fairly complex new backup architecture. One key component of that configuration was a large, high-performance VTL (Virtual Tape Library). The VTL was chosen as part of the design because it offered easy integration with the client's existing backup infrastructure and provided excellent deduplication capabilities -- essentially allowing them to keep months of backups in a nearline state where they could be quickly and easily restored, rather than the few days that their previous, non-deduplicated disk-to-disk-to-tape solution had allowed.

Last week, the VTL decided for some reason that one of the emulated tape cartridges was corrupt. There was no real indication as to why; the hardware seemed to be working fine. Obviously, that didn't inspire confidence, but hey, things break. I've seen a few corrupted physical tape cartridges in my time. You learn to have a backup plan for your backup plan and roll with it.

The manufacturer's first-line support suggested that the client delete that cartridge and then reboot the VTL (fourth-level support and engineering would later say this was the last thing you should do, but that's an entirely different topic). Once that was done, the VTL couldn't bring the affected virtual library online at all. That's because the cartridge -- really a massive collection of deduplicated data blocks on a disk array -- wasn't deleted cleanly and had taken with it all of the blocks it shared with many of the other virtual tape cartridges. That one poorly conceived troubleshooting step had rendered 20TB of backups useless in one fell swoop.
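To make that failure mode concrete, here's a minimal sketch of reference-counted, block-level deduplication. It's purely illustrative -- the class, block size, and cartridge names are invented and bear no relation to the vendor's actual design. A safe delete decrements reference counts and reclaims only orphaned blocks; the "unsafe" delete models what apparently happened, where shared blocks vanished along with the deleted cartridge:

    # A minimal, illustrative sketch of reference-counted deduplication.
    # Everything here is invented for the example -- this is not the
    # vendor's actual implementation.
    import hashlib

    class DedupStore:
        def __init__(self):
            self.blocks = {}      # hash -> block data, stored exactly once
            self.refcounts = {}   # hash -> number of cartridges using the block
            self.cartridges = {}  # cartridge name -> ordered list of hashes

        def write_cartridge(self, name, data, block_size=4):
            hashes = []
            for i in range(0, len(data), block_size):
                chunk = data[i:i + block_size]
                h = hashlib.sha256(chunk).hexdigest()
                self.blocks.setdefault(h, chunk)  # dedup: keep one copy only
                self.refcounts[h] = self.refcounts.get(h, 0) + 1
                hashes.append(h)
            self.cartridges[name] = hashes

        def read_cartridge(self, name):
            return b"".join(self.blocks[h] for h in self.cartridges[name])

        def delete_unsafely(self, name):
            # Models the botched delete: every block the cartridge touched
            # is dropped, even blocks other cartridges still depend on.
            for h in self.cartridges.pop(name):
                self.blocks.pop(h, None)
                self.refcounts.pop(h, None)

        def delete_safely(self, name):
            # Correct behaviour: reclaim a block only when nothing uses it.
            for h in self.cartridges.pop(name):
                self.refcounts[h] -= 1
                if self.refcounts[h] == 0:
                    del self.refcounts[h]
                    del self.blocks[h]

    store = DedupStore()
    store.write_cartridge("monday", b"AAAABBBBCCCC")
    store.write_cartridge("tuesday", b"AAAABBBBDDDD")  # shares two blocks

    store.delete_unsafely("monday")
    try:
        store.read_cartridge("tuesday")
    except KeyError:
        print("tuesday is unreadable: its shared blocks left with monday")

Scale that same mistake up to a multi-terabyte block store shared by hundreds of cartridges, and you get exactly the outage described above.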

Several days and many, many hours of engineering time later, the manufacturer had the VTL back up and running with all of its data intact. Though I won't name them, the vendor did quickly realise the severity of the problem (nobody likes disappearing backups) and stepped up to the plate to fix it. But that's not the point. The point is that the more complex we allow our solutions to get, the more likely this kind of catastrophic software failure becomes.

I could rattle off a long list of similar incidents I've seen lately. Take the SAN whose buggy cache-mirroring code crashed both redundant controllers simultaneously -- ironic, given that the mirrored cache exists only to ensure uptime. Or the recent and very public debacle surrounding McAfee's antivirus software identifying part of Windows XP as a virus (I'll resist the urge to make a joke about the potential accuracy of that conclusion).

Buckle up and back up your backup

It has really gotten to the point where, regardless of your spotless history with a piece of storage gear, you may be one firmware upgrade away from a crippling software bug that brings the system down in spite of all the hardware redundancy you've paid for. That's not a fact that will help anybody sleep at night. But what can you do about it?

First, we need to absolutely demand that problems like this get fixed, and be very public about them if they aren't. Companies won't invest more resources in software quality assurance unless their heads are on the chopping block.

Second, never trust anything -- regardless of how redundant it may look. Design a completely independent backup for your backup, and imagine what you'd do if your most critical piece of infrastructure evaporated without warning or explanation. In the case of the VTL I mentioned earlier, the client's backups were protected by duplicate copies sent to a physical tape library for off-site archiving. As a result, they could have cleanly survived that loss.
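If you want to automate that kind of independence, the sketch below copies each backup image to a second, unrelated target and verifies both sides by checksum. The paths and the *.img layout are hypothetical placeholders, stand-ins for whatever your backup software actually produces:

    # Hedged sketch: duplicate each backup image to an independent target
    # and verify the copy by checksum. PRIMARY, SECONDARY, and the *.img
    # naming are hypothetical, not any product's real layout.
    import hashlib
    import shutil
    from pathlib import Path

    PRIMARY = Path("/backups/vtl-staging")     # assumed VTL-backed staging area
    SECONDARY = Path("/backups/tape-staging")  # assumed physical-tape staging

    def sha256_of(path: Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def duplicate_and_verify(image: Path) -> None:
        copy = SECONDARY / image.name
        shutil.copy2(image, copy)
        # Re-read both files so silent corruption on either side fails loudly.
        if sha256_of(image) != sha256_of(copy):
            raise RuntimeError(f"checksum mismatch while duplicating {image.name}")

    for image in sorted(PRIMARY.glob("*.img")):
        duplicate_and_verify(image)

The script itself isn't the point; the point is that the second copy lives on hardware and software that share nothing with the first, so a single vendor's bug can't take out both.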

The bottom line is that as the data explosion continues, the complexity of the solutions we use to combat it will keep growing. Storage virtualisation, online deduplication, and content archiving will be some of our most powerful tools in that battle. They're also all big chunks of black-box software spaghetti, much of it written to get to market quickly and appear in a long list of features on the marketing glossy. Every one of those features has the potential to cause data loss or downtime. Don't lose sight of that as you design your next-generation storage architecture.