Lose unwanted gigabytes overnight

Organisations can no longer afford to simply add more storage to hold their burgeoning data mountain. Fortunately, help is at hand through de-duplication.


Like overstuffed wardrobes, cluttered enterprise backup operations scream for attention. Fortunately, vendors are coming out with data de-duplication functions -- packed into storage software suites or in stand-alone appliances -- that sort through data destined for the archives and eliminate the redundancies.

Analysts say the technology can provide a 20-to-1 reduction of backup data. In other words, 20TB of original data can be shrunk to 1TB for backup purposes.

Eliminating duplicate data seems like a no-brainer, but in the past, corporations were wary of losing data on its way to backup repositories. Only now are they getting comfortable with the reliability of de-duplication technology, which has matured thanks to advancements in data transfer techniques and standards.

Specifically, the rise of Advanced Technology Attachment and Serial ATA technologies, along with huge spikes in processing power, has fostered better de-duplication functionality.

Suddenly, de-duplication is catching on big time, attracting big-name vendors such as EMC and Symantec. In November, EMC acquired de-duplication vendor Avamar Technologies, and now EMC is incorporating de-duplication into its Clariion, Centera and NetWorker product lines. Meanwhile, Symantec is reportedly scrambling to inject de-duplication capability into its Veritas NetBackup storage management software.

Corporate IT's comfort level with the technology has increased to the point where some IT executives wonder whether de-duplication could extend from backup operations to disaster recovery and even primary storage. But first there are lingering questions about where best to insert de-duplication functionality in the backup process: at the client, at the disk or at the virtual tape library (VTL).

IT managers will have to ask vendors hard questions, because de-duplication methods vary significantly by vendor. "There is still a lot of confusion in the market about what data de-duplication is and isn't -- and where it is best done. This confusion can delay adoption," says Heidi Biggar, an analyst at Enterprise Strategy Group in Milford, Mass.

But whatever confusion exists, corporate IT shops shouldn't be stumped for too long. "There are pros and cons to each approach, but all have potentially significant benefits for users by allowing them to reduce the amount of [storage] capacity they need on the back end," Biggar says. The benefits extend to other areas, too. For example, de-duplication can reduce the network bandwidth required for long-distance data replication, she says.

Where to de-dupe

Data de-duplication can take place either at the source or at the point where data is being written to disk systems or VTLs. "The packaging of this functionality can occur in three ways: as software, which can be stand-alone or integrated with the backup software; as a disk gateway or disk array; and, lastly, as a VTL," explains David Russell, an analyst at Gartner.

Avamar and Toronto-based Asigra take the first route by performing de-duplication in backup and recovery software running on a protected server -- before sending the data across the network to backup repositories.
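To make the source-side idea concrete, here is a minimal sketch in Python. It is not Avamar's or Asigra's actual method, which the vendors do not detail here. It assumes fixed-size chunks identified by SHA-256 hashes, and the names `backup_from_source`, `restore` and `CHUNK_SIZE` are hypothetical. A plain dictionary stands in for the remote backup repository.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes; real products vary


def split_chunks(data, size=CHUNK_SIZE):
    """Yield fixed-size chunks of data (the last one may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def backup_from_source(data, repository):
    """De-duplicate on the protected server before data crosses the network.

    repository maps a chunk's SHA-256 digest to the chunk bytes and stands in
    for the remote backup store. Returns the ordered list of digests needed to
    rebuild the data, plus the number of bytes that actually had to be sent.
    """
    recipe = []
    bytes_sent = 0
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in repository:
            # Only chunks the repository has never seen are transmitted.
            repository[digest] = chunk
            bytes_sent += len(chunk)
        recipe.append(digest)
    return recipe, bytes_sent


def restore(recipe, repository):
    """Rebuild the original data from its ordered list of chunk digests."""
    return b"".join(repository[digest] for digest in recipe)
```

In this sketch, backing up the same data a second time sends no chunk data at all, because every digest is already in the repository. The price is that all the hashing runs on the protected server.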

Some vendors take a second approach, relying on in-line, disk-based products that expunge duplicate data after it's shipped to a disk repository. Vendors using this method include Data Domain, Diligent Technologies and ExaGrid Systems.

Data Domain, which uses an approach similar to ExaGrid's, says its appliance can spread the benefits of de-duplication to geographically dispersed sites.

The third approach to data de-duplication is employed by vendors such as FalconStor Software, Quantum and Sepaton. These vendors offer data de-duplication as extensions of their VTL systems and perform the task outside of the backup process.

The systems in this category write all data to the VTL and then run a de-duplication process after the fact. Because de-duplication runs outside the backup job, it won't interfere with backup operations or lengthen backup windows, the vendors claim.

The trade-offs

It all sounds easy, but of course there are complications and trade-offs. For example, regardless of the chosen de-duping method, "performance degradation can be an issue," says Enterprise Strategy Group's Biggar.

Specifically, companies performing de-duplication at the source run the risk that this function will get in the way of the primary task at hand -- protecting data as it is readied for offloading to backup storage systems. "The potential disadvantage here is that de-duplication can steal memory cycles from the backup servers," Gartner's Russell says.

After-the-fact de-duplication poses challenges, too. More upfront capacity is required to store data that will be de-duped in postprocessing, says Biggar. "However, capacity is released after the de-duplication is complete," she notes.
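The capacity trade-off Biggar describes can be illustrated with a short Python sketch. It assumes the backup has already been written in full as a list of chunks and that duplicates are detected by SHA-256 hash. The function name `post_process_dedup` is hypothetical and does not reflect any vendor's implementation.

```python
import hashlib


def post_process_dedup(staged_chunks):
    """De-duplicate chunks that were already written in full to the target.

    Returns (peak_bytes, final_bytes, released_bytes): the capacity needed up
    front, the capacity still in use afterwards, and the amount freed.
    """
    # Every chunk lands on disk before de-duplication starts.
    peak_bytes = sum(len(chunk) for chunk in staged_chunks)

    # Keep one copy per distinct digest; later duplicates are dropped.
    unique = {}
    for chunk in staged_chunks:
        unique.setdefault(hashlib.sha256(chunk).digest(), chunk)

    final_bytes = sum(len(chunk) for chunk in unique.values())
    return peak_bytes, final_bytes, peak_bytes - final_bytes


# Twenty identical 1,000-byte chunks: a 20-to-1 case.
peak, final, released = post_process_dedup([b"x" * 1000] * 20)
print(peak, final, released)  # 20000 1000 19000
```

The full 20,000 bytes must be available while the backup lands, even though only 1,000 bytes remain in use once the pass completes.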

But Biggar is quick to add that any trade-offs pale in comparison to the benefits of de-duplication. Her conclusion: "ESG Labs has tested several vendors' de-duplication technologies and has had no issue implementing or using the technologies, and we have substantiated vendors' data reduction claims. In general, the benefits of data de-duplication far outweigh any negatives."

Measuring shrinkage

Storage vendors touting data de-duplication gains usually state their claims as ratios, comparing the amount of space it took to house multiple copies of the same data with the space required for de-duplicated data.

"We have found that it is not uncommon for these solutions to provide 20-to-1 reduction of backup data," explains Enterprise Strategy Group analyst Tony Asaro in a report on de-duplication. "For example, if customers can back up 20TB of data on 1TB of disk capacity, think about the implications of this. If a storage system costs $100 (£51) per GB, after de-duplication, it would cost $5 (£2.55) per GB."

Of course, de-duplication vendors aren't shy about making big promises. "I've heard claims that range from 20-to-1 all the way to 500-to-1. I think 20-to-1 or 50-to-1 is realistic," says Gartner analyst David Russell.

However, your mileage may vary. "The more duplicated data you have, the more opportunity you have to reduce that volume," says Russell, "and some enterprises could easily get to 100-to-1."
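The ratio arithmetic behind these claims is easy to check. The Python sketch below uses the $100-per-GB figure from Asaro's example and the ratios the analysts mention. The function names are illustrative only, and the calculation ignores any cost of the de-duplication product itself.

```python
def space_saved(ratio):
    """Fraction of capacity saved at a given N-to-1 reduction ratio."""
    return 1 - 1 / ratio


def effective_cost_per_gb(raw_cost_per_gb, ratio):
    """Cost per GB of protected data once the reduction ratio is applied."""
    return raw_cost_per_gb / ratio


RAW_COST_PER_GB = 100.0  # the $100-per-GB figure from Asaro's example

for ratio in (20, 50, 100, 500):  # ratios mentioned by the analysts
    cost = effective_cost_per_gb(RAW_COST_PER_GB, ratio)
    print(f"{ratio}-to-1: {space_saved(ratio):.1%} saved, ${cost:.2f} per GB")
```

At 20-to-1 this reproduces the $5 per GB in Asaro's example, with 95% of the original capacity saved. Each further jump in ratio saves proportionally less: going from 20-to-1 to 100-to-1 lifts the saving from 95% to 99%.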
