EDS white paper

Zapping data anomalies

The high rate of data warehouse failures and the associated downtime and lost productivity require organizations to take a proactive approach to addressing data quality issues. Data warehouse professionals can greatly improve the quality of data content by preventing, identifying and monitoring data anomalies. Avoid the pitfalls that cause data warehouse failures through a zapping process that can help ensure the success, effectiveness and efficiency of your data warehouse.

Authors: Nanette Ahnell, Stephen Dane and Stephen Fugate, EDS Consulting Services

Table of Contents
Introduction
The details behind the data
  What is a data warehouse?
  What is a data anomaly?
  What is a typical data flow?
  Defining data warehouse quality
  How it all ties in
Insidious data anomalies
Overview of the zapping process
The key strategies
Details of the anomaly zapping process
  Step 1: Identify and recognize anomalies
  Step 2: Monitor and handle anomalies
  Step 3: Repeat the process
Conclusion
About the Authors

Introduction

Given the importance of an enterprise's data warehouse, the keeper of an organization's corporate memory, it's troubling to read reports of data warehousing failure rates of between 20 percent and 75 percent ("The Application Revolution," by Jim D'Addario, posted on The Data Warehousing Institute website, December 2006). According to experts, data quality issues account for up to 70 percent of those data warehouse failures.

It is an overwhelming challenge to address all data quality issues for a large data warehouse as it is being built, and it's even more challenging to address in an existing warehouse. What's needed is a way to prevent, identify and monitor anomalies (irregularities or deviations from the norm) and to improve the quality of, and confidence in, data content.
This document outlines a process that provides a straightforward method for preventing and protecting a data warehouse from data anomalies that can lead to bad decisions, downtime and lost productivity.

The details behind the data

Data can be a volatile thing. An application can function effectively even if the data behind it is bad. It is only when you try to reuse that data for another purpose that you discover how inaccurate, outdated or irrelevant it is, and by then it's usually too late. The fact that the data is working reasonably well for the application that created it doesn't mean it will work for any other application or business use. And it is also possible you will have problems with the data later.

By definition, data warehouses reuse data from source systems, and therein lies the vast potential for data anomalies. That's why it makes sense to address potential problems early on. Data warehouse professionals can protect their data warehouse through a methodical approach to detecting, handling and preventing anomalies. This paper, Volume 1 in a series, describes the framework for improving warehouse data quality by "zapping" (preventing, or at least monitoring and handling) data anomalies.

Why is zapping anomalies so important? The value of the data warehouse begins to diminish as anomalies accumulate. This can result in a 30 percent increase in unscheduled downtime, costing the average $50 billion revenue company up to $25 million a year in lost productivity. The unmonitored flow of data into the warehouse makes it easier for anomalies to accumulate undetected. The sooner anomalies are detected and handled, the better.
Fixing incorrectly loaded data can be hugely expensive, both in terms of the impact on data analysis and in terms of the potential cost to clean up the data.

Before diving into the data anomaly-zapping topic, we will first define the following five key concepts discussed in this paper to provide a common starting point:

- Data warehouse
- Data anomaly
- Data flows
- Data warehouse quality
- How it all ties in

What is a data warehouse?

A data warehouse provides reliable, consistent, enterprisewide, historic data that can be accessed and analyzed. Data marts are sometimes referred to as data warehouses, but these consist of data that supports a more focused business or departmental need. Data warehouses are built using an iterative methodology, developing layer upon layer of data from which increasingly useful analysis is made available to the business units. The essence of data warehouses and data marts is the qualitative aspect of the business decisions made from the reporting and analysis of the data.

Operational data stores (ODS) are another frequent component of a business intelligence solution, used to focus on operational reporting, but they are quite different from data warehouses and marts. The ODS has different business rules, so anomalies in the ODS usually are not altered. The anomaly zapping concepts in this paper are intended to apply primarily to data warehouses and marts, referred to generically as data warehouses through the rest of this paper.

What is a data anomaly?

A data anomaly is any data that is unsuitable for the intended use because it doesn't conform to data warehouse documentation or requirements. The value of the data warehouse begins to diminish as anomalies accumulate. If enough anomalies are allowed to accumulate, they begin to affect user confidence in the data. Without confidence in the data, its value in supporting business decisions diminishes.
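To make this definition concrete, the sketch below treats an anomaly as any value that violates a documented field specification. This is a minimal illustration, not part of the zapping process itself; the field names, types and allowed codes in the spec are invented assumptions.

```python
# Sketch: an anomaly is a value that breaks the documented field spec.
# The spec contents (field names, types, allowed codes) are illustrative
# assumptions, not taken from any real warehouse.
from datetime import date

SPEC = {
    "paid_amount":  {"type": float, "min": 0.01},          # must be > 0
    "gender_code":  {"type": str,   "allowed": {"M", "F", "U"}},
    "service_date": {"type": date},
}

def find_anomalies(row):
    """Return (field, reason) pairs for values that don't conform to SPEC."""
    problems = []
    for field, rules in SPEC.items():
        value = row.get(field)
        if value is None:
            problems.append((field, "missing"))
            continue
        if not isinstance(value, rules["type"]):
            problems.append((field, "wrong type"))
            continue
        if "min" in rules and value < rules["min"]:
            problems.append((field, "below minimum"))
        if "allowed" in rules and value not in rules["allowed"]:
            problems.append((field, "unexpected code value"))
    return problems

row = {"paid_amount": -12.50, "gender_code": "X", "service_date": date(2006, 10, 13)}
print(find_anomalies(row))
# [('paid_amount', 'below minimum'), ('gender_code', 'unexpected code value')]
```

The key point is that the check is driven entirely by the documented spec: a value is an anomaly only relative to what the documentation says the field should contain.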
What is a typical data flow?

Data often flows from one or more source systems through a data transformation process before being loaded into the target data warehouse. Data transformations are also known as Extract, Transform and Load (ETL) processes. The data flows can be in batches, such as daily loads, or continuous feeds. Typically, end users then access the data with a data access tool. Figure 1 illustrates this typical data flow pattern.

Defining data warehouse quality

There are many definitions of data warehouse quality, ranging from data excellence to customer satisfaction with the data. This paper uses the following definition: Data warehouse quality means the data is suitable for the intended use by all users, and the data conforms to all documented specifications. Data warehouse quality ensures the following:

- The data is stored according to its data types.
- The data is consistent.
- The data is not redundant.
- The data follows business rules.
- The data is timely.
- The data is complete.
- There are no duplicate records.

The technical and business teams, as well as end users, must be keenly involved in defining and ensuring data warehouse quality, particularly business users. Strong participation from all stakeholders is required for there to be meaningful definitions of intended use and comprehensive documentation of requirements.

How it all ties in

Proactive and consistent attention is needed throughout the life cycle of the data warehouse to continuously define and control data quality. This paper focuses on the process recommended to identify the anomalies of interest and handle them. Subsequent volumes will cover four key strategies you may want to deploy to systematically detect and handle anomalies. From there, you can select a tool or tools that will help you effectively meet those requirements.

Insidious data anomalies

Data anomalies can exist when the data warehouse is first built.
They can also enter the data warehouse system over time, as changes are made to upstream source systems, as changes occur in the functional business or as changes are made to the data warehouse. The anomaly can occur within a particular column of data, such as an incorrect customer name, or within a table, such as a customer master that, by definition, should not contain duplicate rows.

If a proactive plan to actively zap anomalies is not implemented, it can be several months before the level of anomalies rises to the point they become noticeable. By then, the costs of identifying and eradicating the anomalies can be huge, both in terms of actually fixing the problem and in terms of poor decisions made as a result of bad information.

Figure 1: Typical data flow pattern (Source Systems → Data Extracts → Data Transformations (ETL) → Database Load → Database Access)

As an example, a major health care company was unpleasantly surprised when one of its vendors changed its patient-gender coding scheme, causing all data for a particular warehouse feed to default to female. It took more than six months before the volume of data became so large it stood out as a problem. This resulted in hundreds of hours spent identifying the extent of the problem and cleaning up the data, not to mention the impact on decisions made over the six-month period. A methodical approach to anomaly handling, such as the zapping process, can minimize these impacts and costs.

Keeping anomalies out of the data warehouse requires thorough planning. The planning must address striking a financial balance between the cost of zapping anomalies and the cost of accumulating bad data. This will depend on many factors, such as the size of the data warehouse, budget, expected usage and ability to fix bad data.

Overview of the zapping process

The zapping process contains three iterative steps to systematically handle anomalies.
These steps are used in conjunction with four key strategies to make the process practical, as illustrated in Figure 2. The steps are identifying and recognizing anomalies, monitoring and handling anomalies, and repeating the process. The remainder of this paper discusses these steps in detail.

Having a methodology such as zapping is a start. But practical strategies for identifying, monitoring or handling potential anomalies are needed to successfully implement the anomaly zapping process.

The key strategies

The strategies trace the documentation and flow of data from source systems to the target warehouse. This provides a logical approach to identifying opportunities for handling anomalies. These strategies, which we will cover in subsequent volumes of this paper, are as follows:

1. Use specifications and related documentation.
2. Monitor data sources.
3. Handle data extraction/transformation/load (ETL) loopholes.
4. Monitor the target data.

The three iterative steps of the zapping process, combined with the key strategies, provide a comprehensive framework for limiting anomalies in the data warehouse.

Details of the anomaly zapping process

Our process follows a sequential approach to zapping anomalies. It begins with identifying the potential pool of anomalies that could or do exist. From this pool, a selection is made of the anomalies that warrant measurement and control. The next step is to recognize each of the selected anomalies and mitigate their effects. The last step involves repeating the process on a regular basis for continued refinement. The details of each step follow.

Step 1: Identify and recognize anomalies

In this step, data warehouse fields are analyzed to identify potential anomalies. Data anomalies can arise from a variety of problems, including missing or incomplete data, incorrectly captured data, data that does not conform to business rules, and inclusion of data that does not serve the intended usage.
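One practical way to begin this field-by-field analysis is simple profiling: null counts, distinct values and value ranges often surface candidate anomalies such as missing data or suspicious outliers. The sketch below assumes rows arrive as Python dictionaries; the column names and sample data are invented for illustration.

```python
# Sketch: basic field profiling to surface candidate anomalies
# (null counts, distinct values, value range). Column names and
# sample rows are illustrative assumptions.
from collections import Counter

def profile_field(rows, field):
    """Summarize one column: row count, nulls, distinct values, range."""
    values = [r.get(field) for r in rows]
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }

rows = [
    {"gender_code": "F", "paid_amount": 120.0},
    {"gender_code": "F", "paid_amount": -5.0},   # suspicious negative
    {"gender_code": "F", "paid_amount": None},   # missing value
    {"gender_code": "M", "paid_amount": 80.0},
]
print(profile_field(rows, "paid_amount"))
print(profile_field(rows, "gender_code"))
```

A profile like this would flag the negative paid amount and the null, and a heavy skew in a code field (as in the patient-gender example earlier) would show up in the top-value counts.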
Figure 2: The anomaly zapping process. Three iterative steps (1. Identify and recognize anomalies; 2. Monitor and handle anomalies; 3. Repeat the process) are supported by four strategies (1. Use your documentation; 2. Pay attention to sources; 3. Handle ETL loopholes; 4. Monitor target data).

Identify potential anomalies

Assess all fields in the data warehouse and identify the fields that have potential anomalies. Use as many input sources as is practical to assess fields for potential anomalies, and document the results. Figure 3 suggests the kinds of information that go into the anomaly identification process.

Figure 3: Sources for the anomaly identification step
- Data warehouse documentation: requirements documentation; data dictionaries; source system specifications
- Data models: source system data models; warehouse models
- Source system data analysis: field distributions; record counts; look-up tables for validation
- ETL specifications: detailed mapping specifications; transformation logic
- Warehouse data analysis: warehouse field distributions; record counts; mismatched data between tables
- End-user requirements: types of reports to be produced; types of analysis to be performed; critical usage requirements

Determine potential anomalies to recognize

After the list of potential anomalies is assembled, a determination is made whether to ignore or recognize each anomaly. The decision to recognize a potential anomaly should take into account the relative importance of the field to users. It should also account for the practicality and cost of handling the anomaly. This is where the art of anomaly recognition lies.

Potential anomalies

Part of the anomaly identification process is knowing what to look for.
Here are some typical situations you may discover:

- Non-numeric data going into a numeric field*
- Invalid dates*
- Data size mismatches between source and data warehouse
- Unexpected code values
- Out-of-range values
- Erroneous defaulting of fields
- Unmatched data
- Missing data
- Incomplete data
- Erroneous data aggregations

* This type of data anomaly is not possible if the source and target columns are truly defined with a numeric or date data type.

Factors for recognizing anomalies

Figure 4 lists a variety of factors that will assist in prioritizing the kinds of anomalies that are candidates to recognize. Completing this step will require in-depth reviews of information from a variety of sources. It will also involve making many decisions and judgment calls. This information will be revisited regularly throughout the life of the data warehouse to determine if new anomalies present risks that need to be addressed.

Organizing all the information

Creating a log of potential anomalies will help organize this information. The log is seeded with all data columns and table names from the data warehouse. The log is updated as the strategies are applied and potential anomalies are identified. The update indicates the date, the potential anomaly condition, the decision to recognize the anomaly and the reasoning behind the decision (Figure 5). This information can be critical to future reviews to determine if changes in anomaly monitoring are warranted.

Step 2: Monitor and handle anomalies

This step establishes procedures to monitor and handle the occurrence of recognized anomalies.
The practicality of this step will depend on many variables, including the ability to identify anomaly values, the expense involved to design and operate a monitoring process, the criticality of the data to users, and at what point in the system development life cycle you are addressing this issue.

Anomaly monitoring is accomplished by establishing a procedure that observes the column or row of data at the earliest opportunity in the data warehouse data flow. Monitoring can be established as a pre-ETL process, in the ETL process or after the data is loaded into the data warehouse (Figure 6). The most efficient way to handle anomalies is to zap them at the earliest phase of the data flow. Careful up-front establishment of business rules and/or source data clean-up can prevent many data anomalies.

For example, a U.S. utilities company implemented a step change in the data flow process and improved the up-time performance of its data warehouse by 17 percent. This resulted in a $2 million annual savings to the business.

Figure 4: Factors for recognizing anomalies
- Critical analysis fields. Example of a value to recognize: paid amount. Example of a value to ignore: unstructured fields, such as comments and notes.
- Fields that affect business decisions. Recognize: location where product was sold. Ignore: customer fax number.
- Anomalies that are easy to detect. Recognize: invalid customer number (does not exist in master table). Ignore: incorrect customer number (entered incorrectly in the source system, but does match a number in the customer master table).
- Anomalies that are easy to remedy. Recognize: invoice total does not match the sum of the invoice line items. Ignore: paid amount not recorded.

Figure 5: Potential anomaly log (sample entry)
- Table/column: Paid amount
- Date: 10/13/06
- Potential anomaly: Value is <= zero
- Recognize?: Yes
- Reason: This is a key value that will be the center of analysis for most users.
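The monitoring methods mentioned above can be sketched as small checks placed at the pre-ETL and ETL stages of the data flow. The function names, thresholds and field values below are illustrative assumptions, not prescriptions from the zapping process.

```python
# Sketch: three monitoring checks at different data flow locations.
# Thresholds and sample values are illustrative assumptions.

def check_row_counts(source_count, extracted_count, tolerance=0.01):
    """Pre-ETL: extracted row count should match the source within tolerance."""
    drift = abs(source_count - extracted_count) / max(source_count, 1)
    return drift <= tolerance

def check_range(values, low, high):
    """ETL: return any values that fall outside the expected range."""
    return [v for v in values if not (low <= v <= high)]

def check_default_rate(values, default, max_fraction=0.05):
    """ETL: flag when too many rows carry the default value (e.g. every
    gender code suddenly defaulting to one value, as in the earlier example)."""
    rate = values.count(default) / max(len(values), 1)
    return rate <= max_fraction

assert check_row_counts(10_000, 9_950)                    # within 1% drift
assert check_range([10, 20, 999], low=0, high=100) == [999]
assert not check_default_rate(["F"] * 95 + ["M"] * 5, default="F")
```

Each check is cheap to run on every load, which is what makes early, continuous monitoring practical: an anomaly caught pre-ETL never has to be cleaned out of the warehouse.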
Figure 6: Examples of anomaly monitoring methods
- Compare values with other sources (pre-ETL)
- Validate source row counts (pre-ETL)
- Ensure values fall within an expected range (ETL)
- Ensure the percent of rows with default values is within tolerance (ETL)
- Compare details to aggregates or sub-aggregates (data warehouse)
- Compare actual results to expected results (data warehouse)

Assuming an anomaly can be reasonably detected and monitored, the next logical step is to complete the process to zap the anomaly from the data warehouse. The actual corrective action (handling method) to take will depend on several considerations, including the ability to determine what the correct value should be, the criticality of usage, and the amount of effort and expense involved (Figure 7).

After decisions have been made on identifying and handling a particular anomaly, the decision should be documented in the anomaly identification log created in Step 1. With this additional information, the log might look something like Figure 8.

Step 3: Repeat the process

The entire process should be repeated on an ongoing basis, because in the course of business operations, some level of change is likely to occur over time. User needs will change. Even the source systems may encounter change. All of these factors point to the need to continue the process of identifying recognizable anomalies and establishing procedures to detect and mitigate their effects (see sidebar). The frequency of revisiting these items will depend on many factors, but the review should be done at least annually.

Events that trigger repeating the process
- A year has passed since the last time the anomaly documentation was reviewed.
- The source system has changed, and new kinds of anomalies may be occurring.
- New data sources are now available for the warehouse.
- Usage of the warehouse has matured, and users are now focusing on different kinds of analysis.
- New regulatory requirements have been imposed.

Figure 7: Examples of anomaly handling methods
- Halt the entire load process until the anomaly condition has been corrected (pre-ETL).
- Revise the data warehouse data model to retain fields as entered alongside the corresponding cleansed fields (pre-ETL).
- Default anomaly values to some other value so they are consistent but still stand apart (ETL).
- Correct anomaly values (ETL).
- Warn when an anomaly occurrence approaches a specific predetermined threshold (data warehouse).
- Alert when an anomaly occurrence exceeds a specific predetermined threshold (data warehouse).

Figure 8: Updated potential anomaly log (sample entry)
- Table/column: Paid amount
- Date: 10/13/06
- Potential anomaly: Value is <= zero
- Recognize?: Yes
- Reason: This is a key value that will be the center of analysis for most users.
- Detection: Check each source record to ensure the value is greater than zero.
- Action plan: Reject rows with values <= 0 during the load process into a separate file for research, and reload after corrections.

Conclusion

Whether you are designing a new data warehouse or supporting an existing data warehouse, the zapping process can help you avoid the pitfalls that cause the majority of data warehouse failures. The process can be used in a new data warehouse design project from inception. This will enable you to efficiently build in safety measures that ensure data quality from the receipt of data out of the source systems, through the data transformation process, and as it is loaded into the data warehouse. The process can support an existing data warehouse by retroactively identifying potential anomalies, correcting existing anomalies and preventing potential anomalies from polluting the data warehouse.

The business case for implementing the zapping process is made by measuring the avoided cost of bad decisions resulting from poor-quality data.
Another measure is the return on investment from preventing defective information from getting into the data warehouse. Although both of these are difficult to measure with a dollar value, it is clear that proactive attention to data quality is the prudent and cost-effective thing to do. By documenting the anomaly identification process as suggested, it will be easier to identify new trends and determine where adjustments should be made to meet the ever-changing needs of the warehouse. This proactive attention to data quality in the zapping process will help ensure the success, effectiveness and efficiency of your data warehouse.

About the Authors

Nanette Ahnell, PMP
Nanette Ahnell is a senior consultant in the Business Intelligence area within EDS Consulting Services. In this role, she provides project management expertise and leadership for data warehouse solution implementations for EDS clients. Ahnell has 28 years of information technology experience with an emphasis on project management for business intelligence solutions within the healthcare industry.
Contact: phone 916 636 4435; e-mail firstname.lastname@example.org

Stephen Dane
Stephen Dane is a consultant specialist in Business Intelligence within the EDS Consulting Services group. He provides expertise in designing and implementing data access solutions to a wide variety of data structures, for a broad range of diverse users. Dane has more than 22 years of experience in the IT industry, with a concentration on data mart and data warehouse access over the last 12 years, primarily within the healthcare industry.
Contact: phone 916 636 4283; e-mail email@example.com

Stephen Fugate
Stephen Fugate is a consultant specialist in the Business Intelligence area within EDS Consulting Services, providing expertise in designing and building data warehouses and data marts. Fugate has more than 20 years of experience in the IT industry. For 16 of those years, he expanded his experience in business intelligence data transformation, with a focus on the healthcare insurance industry.
Contact: phone 916 636 4290; e-mail firstname.lastname@example.org

About EDS

EDS (NYSE: EDS) is a leading global technology services company delivering business solutions to its clients. EDS founded the information technology outsourcing industry more than 40 years ago. Today, EDS delivers a broad portfolio of information technology and business process outsourcing services to clients in the manufacturing, financial services, healthcare, communications, energy, transportation, and consumer and retail industries and to governments around the world. Learn more at eds.com.

Copyright 2007 Electronic Data Systems Corporation. All rights reserved.