Archivas's H-RAIN architecture is based on clusters of server nodes that together provide an archival system. Read the company's white paper to find out how it accomplishes this.
White Paper
Archivas, Inc. | 200 West Street First Floor | Waltham, Massachusetts 02451-1121 | United States
phone 781 890 8353 | fax 781 890 8343 | email firstname.lastname@example.org | web www.archivas.com
Copyright by Archivas, 2004
All rights reserved

H-RAIN: An Architecture for Future-proofing Digital Archives
Andres Rodriguez and Dr. Jack Orenstein, Archivas, Inc.
Presented at the NASA/IEEE MSST2004 Conference on Mass Storage Systems and Technologies
4/16/2004

Traditionally, systems for large-scale data storage have been based on removable media such as tape and, more recently, optical disk (CD, DVD). While the need for increased storage capacity has never been greater, the inadequacies of traditional approaches have never been more apparent. This is especially true for fixed-content data: new government regulations and increasingly competitive market pressures have converged to underscore the importance of finding long-term storage solutions for fixed-content data that offer ready and secure access, scale easily, and are relatively inexpensive.

Shortcomings of removable media archives

Archives that rely exclusively on removable media share the following shortcomings:

• An archive system that commits physical data to removable media is also captive to the specific hardware system that enables read/write access. As technology changes, these systems inevitably tend towards obsolescence. It is questionable whether the devices that are used today to read tape or disk will still be available and viable years hence, never mind the availability and viability of the vendor itself!
• As the archive grows, access becomes increasingly cumbersome and time-consuming. Data is not always readily available when it is wanted. Moreover, the administrative overhead that is required to provide timely access is unacceptably high, and for many organizations prohibitively expensive.
• Government regulations reflect a rising demand to maintain large amounts of data over long periods of time, and to guarantee their authenticity. Data on removable media is especially vulnerable to physical mishandling and corruption, both through physical deterioration and outside intervention, whether inadvertent or deliberate.

An alternative model

In general, when digital data is bound to tape or disk, it ceases to be a digital asset; instead, it becomes simply a physical widget that contains bits, with all the drawbacks previously cited. Long-term mass storage of fixed-content data requires a new type of storage model, where the data's physical location is completely separate from its logical representation. In order to achieve this objective, digital data must be stored in a digital archive that is scalable, reliable, and highly available.

Today's best hope of realizing this model rests in a network or cluster of inexpensive servers, such as IA-compatible machines that can run a full Linux distribution. This model offers the following advantages:

• Various protection schemes (RAID-5, RAID N+K) safeguard files from multiple, simultaneous points of failure in the network, and guarantee that their data remains continuously available.
• Within the network, the archive system autonomously enforces policies that are associated with the stored files. These policies include retention period, file protection, and content authentication.
• Gateways for standard protocols (HTTP, NFS, SMB/CIFS) provide over-the-wire access to the archive.
• The archive is easily extended: as new nodes enter the cluster, the archive automatically invokes its own load-balancing and protection policies, and redistributes existing storage into the new space accordingly.
• A network-based archive can facilitate updates to files so they stay current with the latest applications: for example, format changes that are required by new end-user applications. Data migration of this type, on the scale required for large archives, is virtually impossible to achieve in tape-based systems.
• The data's actual location on the network is transparent to the user. During its lifetime in an archive, a stored file might be relocated across many network machines or nodes as the result of hardware upgrades, replacements, or load balancing. The reference to the file, however, remains constant, giving users ready access to its contents without requiring knowledge of its physical location within the cluster.

Two architectures for online archives: RAIN and H-RAIN

In the last several years, various vendors have come forward with archive systems that implement the network approach just described. These all embody various implementations of RAIN (redundant array of independent nodes) architecture. RAIN archives are based on one or more clusters of networked server nodes. As nodes enter or leave the cluster, the cluster automatically adjusts by redistributing and, when necessary, replicating the stored files.

Currently, RAIN archives are typically delivered as proprietary hardware appliances: closed systems that are built from identical components. Evolution of these systems is carefully controlled by the vendor.

The architecture of H-RAIN (heterogeneous redundant arrays of independent nodes) differs from the RAIN architecture from which it evolved by making minimal assumptions about the archive's underlying hardware and software. In practice, this means that H-RAIN architecture can be implemented with commodity hardware. This relatively open architecture has two advantages over its RAIN progenitor:

• It adapts more readily to technological advances and site-specific contingencies. Administrators are free to replace components with superior hardware as it becomes available, thus improving storage capacity, performance, and reliability. Furthermore, they can choose among hardware options that best suit their requirements, such as CPU, memory, and disk capacity. For example, a cluster might be extended by adding new nodes with higher-performance CPUs, which can be used for CPU-intensive filtering operations. Incremental hardware additions and improvements might thereby measurably improve overall archive performance.
• Archive administrators can start small and scale up capacity incrementally simply by adding nodes as they are needed. Moreover, they are free to seek the best prices for storage cluster components. Given that component costs tend to decrease over time, cost-conscious administrators can reduce their average cost per gigabyte by spreading out purchases.

In general, an H-RAIN architecture enables users to upgrade their technical infrastructure while transparently migrating archive content to more up-to-date nodes. Improvements can be made incrementally, leaving the initial installation intact. As hardware prices fall, archive performance can be enhanced with better-performing nodes, and at lower cost.

Implementing H-RAIN architecture

Archivas' archive management system, Reference Information System (RIS), is based on the H-RAIN model. With RIS, organizations can create large-scale permanent storage for fixed-content information such as satellite images, diagnostic images, check images, voice recordings, video, and documents, with minimal administrative overhead.

In RIS' H-RAIN implementation, two features are salient:

• Distributed processing
• Autonomous management

Distributed processing

All nodes in a cluster are peers, each capable of running any or all of the services that an archive requires. A cluster can be configured so archive services are distributed in a way that best serves the enterprise's storage requirements.

For example, a cluster can be configured symmetrically; that is to say, each node runs the same processes and daemons, including a portal server, metadata manager, policy manager, request manager, and storage manager. Each node bears equal responsibility for processing requests, storing data, and sustaining the archive's overall health. No single node becomes a bottleneck: all nodes are equally capable of handling requests such as put and get operations. Furthermore, in the event of node failure, any other node can take over responsibility for the data that was managed by the failed node, so that user access to this data remains unaffected.

Alternatively, a cluster might be configured so that various services are distributed asymmetrically across different nodes. For example, if read requests are especially heavy for a given archive, several nodes might be dedicated solely to request management and run multiple request managers and metadata managers, in order to maximize throughput to other nodes that store the physical data.

Autonomous management

Through policies that are associated with archived files individually, and the cluster collectively, the archive can manage itself without human intervention. Policies are set for the archive on initial configuration, and can (optionally) be set for individual files as they are archived.
Taken together, these policies determine the archive's day-to-day operation. Through a policy manager that executes on each node, the archive monitors its own compliance with current policies, and when lapses occur, takes the appropriate corrective action.

For example, in the event of a failed disk or node, the system determines what data is missing and how best to restore it from data on the remaining healthy nodes, so that the protection policy for these files is fully enforced. Similarly, the system prohibits removal of an archived file before its retention period has elapsed.

Human intervention is rarely warranted, and usually only in response to system warnings that require outside action: for example, notification that the cluster load has crossed a specified threshold, requiring the addition of new nodes.

Four attributes characterize archive self-management:

• Self-configuring: Setting up large archive systems is error prone. An archive comprises networks, operating systems, storage management systems, and, in the case of RIS, databases and web servers; getting all these components to run together requires teams of experts with a myriad of skills. An autonomous system simplifies installation and integration by setting system configuration through high-level policies.
• Self-protecting: Policies that enforce document retention, content authentication, and file protection combine to protect an archive from loss of valuable digital assets.
• Self-healing: Serious problems with large-scale archives can sometimes take weeks to diagnose and fix manually. When the faulty device is finally identified, administrators must be able to remove and replace it without interrupting ongoing service. Autonomous systems can automatically detect software and hardware malfunctions in a node, and safely detach it from the archive. Further, because data is replicated across many nodes in the cluster, the failure of one or more nodes has no impact on data availability. Archivas' distributed metadata manager can find an alternative source for any data that resides on a failed node.
• Self-optimizing: Storage systems, databases, web servers, and operating systems all have a wide range of tunable parameters that enable administrators to optimize performance. An autonomous system can automatically perform functions such as load balancing as it monitors its own operation.

Extending the H-RAIN model

With its H-RAIN architecture, RIS is capable of integrating with storage systems that use removable media such as tape or optical disk. In this scenario, the tape system is seen by RIS as simply another set of storage nodes; the physical location of data is managed by an RIS storage manager implementation that is specifically targeted to tape-based storage.

This capability is critical for a multi-tier storage and migration strategy, where data is stored in whatever medium best serves external access requirements. For example, a file that is frequently accessed should be archived in primary storage on a high-performance disk, while data that is rarely used can be stored on relatively low-performance media such as tape. Further, it is likely that access requirements for a given file will not remain constant, especially if the file is retained for a long period of time. In general, references to most types of data decline significantly as the data itself ages, as shown in the following figure:

FIGURE 1: Data reference patterns (from: http://www.horison.com/horison/industry_topics/Lifetime_Data_Management.doc)

With the advent of government regulations such as the Sarbanes-Oxley Act, enterprises are required to archive increasing amounts of reference data, and retain them for ever longer periods of time. In order to keep storage costs down, it is increasingly important that archive systems respond to changing access requirements by moving data easily from more expensive disk-based media to less expensive removable media. By encompassing both disk-based and tape-based storage and providing a unified interface to both, RIS can provide a smooth migration path for aging data. Furthermore, RIS policy managers can automatically manage the migration, as determined by archive-wide or file-specific policies.

Conclusion

A digital archive system that is based on H-RAIN architecture offers the most economical, scalable, and effective solution for large-scale storage of reference data. Policy-based management minimizes administrative overhead while it provides the most reliable way to achieve an archive's most important requirements: availability, reliability, and content authentication.

By extending the H-RAIN model to encompass any network node, including those that interface to tape-based systems, the potential exists to implement a multi-tier system where data is stored on the medium that best suits access requirements, and can easily be migrated to another medium as those requirements change.
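
Illustrative sketch: policy-based self-healing and retention

The policy enforcement described under Autonomous management can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the RIS implementation: the names ArchivedFile, Cluster, enforce_protection, copies_required, and retain_until are hypothetical, the protection policy is simplified to a plain replica count (rather than RAID-5 or RAID N+K), and "least-loaded node" stands in for whatever placement rule a real archive would use.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedFile:
    name: str
    copies_required: int   # protection policy (assumed: simple replica count)
    retain_until: int      # retention policy: earliest time deletion is allowed
    nodes: set = field(default_factory=set)  # nodes currently holding a copy


class Cluster:
    """Toy cluster that tracks healthy nodes and the files stored on them."""

    def __init__(self, node_names):
        self.healthy = set(node_names)
        self.files = {}

    def load(self, node):
        # Number of file copies currently stored on the given node.
        return sum(1 for f in self.files.values() if node in f.nodes)

    def _place(self, f):
        # Add copies on the least-loaded healthy nodes until the policy is met.
        candidates = sorted(self.healthy - f.nodes,
                            key=lambda n: (self.load(n), n))
        needed = max(f.copies_required - len(f.nodes), 0)
        f.nodes.update(candidates[:needed])

    def put(self, f):
        self.files[f.name] = f
        self._place(f)

    def enforce_protection(self):
        # Policy-manager pass: forget copies on failed nodes, then
        # re-replicate from a surviving copy wherever the policy is violated.
        repaired = []
        for f in self.files.values():
            f.nodes &= self.healthy
            if f.nodes and len(f.nodes) < f.copies_required:
                self._place(f)
                repaired.append(f.name)
        return repaired

    def fail_node(self, node):
        self.healthy.discard(node)
        return self.enforce_protection()

    def delete(self, name, now):
        # Retention policy: refuse removal before the period has elapsed.
        if now < self.files[name].retain_until:
            raise PermissionError(f"{name} is under retention until "
                                  f"{self.files[name].retain_until}")
        del self.files[name]


cluster = Cluster(["n1", "n2", "n3", "n4"])
cluster.put(ArchivedFile("scan.img", copies_required=2, retain_until=100))
cluster.fail_node("n1")   # the copy lost with n1 is re-created on n3
```

The reference to the file (its name) never changes while its copies move between nodes, which mirrors the separation of logical representation from physical location that the paper argues for. A file whose every copy was on failed nodes is left unrepaired in this sketch, since there is no surviving source to restore from.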