Identifying Duplicate Records
By Jim Harris, Obsessive-Compulsive Data Quality (www.ocdqblog.com)
In association with Experian QAS

Introduction

One of the most common and challenging data quality problems is the identification of duplicate records, i.e. the redundant representations of the same information within and across systems throughout an enterprise. The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from, and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies. The not so great news is that the wonderful world of data matching has a very weird way with words. Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, the Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, and bipartite graphs.

What is sometimes overlooked is that although technology provides the solution, what is being solved is a business problem. Technology sometimes carries with it a dangerous conceit that what works in the laboratory and the engineering department will work in the boardroom and the accounting department, that what is true for the mathematician and the computer scientist will be true for the business analyst and the data steward.

However, what determines that a duplicate record has been identified is not what scientific techniques or mathematical models can justify, but what your business rules define as a duplicate record. An effective methodology for implementing business rules for identifying duplicate records helps maximise the time and effort, as well as the subsequent return on investment, in whatever data matching technology is selected.

The most significant challenge to solving this specific data quality problem is its highly subjective nature. Although a common definition of data quality is fitness for the purpose of use, the common challenge is that most data has multiple uses, each with its own fitness requirements. Additionally, there are often multiple sources for the same data (e.g. customer master data). Business rules can be different from project to project, and decision makers on the same project can have widely varying perspectives. These challenges highlight the need for an effective methodology in addition to powerful and flexible technology, in order to enable the implementation of a business-driven data matching solution.

The Two-Headed Monster

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels). Ranking is often used as a primary method of differentiating the three possible data matching result categories (illustrated in the brief sketch after the list):

1. Automatic Matches: records that matched (aka true positives)
2. Automatic Non-Matches: records that did not match (aka true negatives)
3. Potential Matches: records requiring manual review by a subject matter expert (SME)
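To make these categories concrete, here is a minimal sketch in Python. The score scale and both thresholds are assumed values chosen purely for illustration; real matching products rank results with their own scoring models and configurable cut-offs.

```python
# Minimal sketch: routing scored record pairs into the three result
# categories. The score scale and both thresholds are assumed values,
# not figures from the article or from any particular matching product.

AUTO_MATCH_THRESHOLD = 0.90      # at or above: automatic match
AUTO_NON_MATCH_THRESHOLD = 0.60  # below: automatic non-match

def classify(match_score):
    """Return the result category for one scored candidate record pair."""
    if match_score >= AUTO_MATCH_THRESHOLD:
        return "automatic match"       # treated as a duplicate without review
    if match_score < AUTO_NON_MATCH_THRESHOLD:
        return "automatic non-match"   # treated as distinct without review
    return "potential match"           # queued for manual review by an SME

# Three example scores fall into the three different categories.
for score in (0.97, 0.75, 0.40):
    print(score, "->", classify(score))
```

Where the two thresholds sit is itself a business decision, which is exactly where The Two-Headed Monster appears.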
All data matching techniques must also face the daunting challenge of The Two-Headed Monster:

- False Negatives: records that did not match, but should have been matched
- False Positives: records that matched, but should not have been matched

As a simple example, imagine determining the business rules for identifying duplicate customer records using customer name and postal address, initially requiring an exact match on both fields. Sherlock Holmes would say it's elementary that two records agreeing exactly on both fields exemplify a true positive, and the Bard of Avon would admit that a customer by any other name would exemplify a true negative. But Papa Haydn would be disappointed that a required exact match orchestrated a false negative for a mere spelling variation of his name. However, by relaxing the exact match on customer name, we might pen a father/son false positive, where two different customers at the same address share a similar name.

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by the identification of duplicate records is the reduction of false negatives. However, the pursuit to reduce false negatives carries with it the risk of creating false positives. The harsh reality is that there is no perfect data matching solution. The risk of false positives can be mitigated but never eliminated.

Most organisations are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by a data matching solution than they are about the false negatives left unlinked; after all, those records were not linked before investing in the data matching solution. Creating a new problem is usually perceived as being worse than failing to solve an existing problem, especially if you may have been blissfully unaware of the existing problem.

The very true fear of false positives often motivates overly cautious approaches to data matching that result in the perpetuation of false negatives. Furthermore, this often restricts implementations to exact (or near-exact) matching techniques (e.g. the Sherlock Holmes example above) and ignores the more robust capabilities of data matching software to find potential matches (e.g. the Joseph Haydn example above).

The pragmatic concerns of these human and business factors, not just the capabilities of the technology, need to be taken into consideration in order to achieve an acceptable balance between false positives and false negatives when defining the business rules for the identification of duplicate records.

Standardisation

Matching often uses a standardisation process to prepare its input. This facilitates a direct evaluation of comparable fields, makes comparisons more reliable by basing them on standardised values, decreases the failure to match records because of data variations, and increases the probability of effective match results.

Standardisation represents the input data the same way, regardless of the data source, using a common set of transformations, consistent formats, and normalised values so data can be directly compared. Free-form fields (e.g. customer name, postal address) will often contain numerous variations resulting from different conventions for representing the same value. Standardisation parses free-form fields to break them down into smaller individual fields, gaining improved visibility of the available input data, applying standard values, and, when possible, populating missing values. Standardisation often creates additional fields (e.g. phonetic keys) for improved selection of candidate records for comparison.
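As a minimal illustration of the standardisation step just described, the sketch below lower-cases and expands a few assumed address abbreviations and derives a crude phonetic key. The abbreviation table and the key algorithm are simplified assumptions, standing in for the far richer, locale-specific rules that real data quality software applies.

```python
import re

# Assumed, deliberately tiny abbreviation table; real standardisation rules
# are far richer and specific to the data source and locale.
ADDRESS_ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def standardise_address(raw):
    """Lower-case, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ADDRESS_ABBREVIATIONS.get(t, t) for t in tokens)

def phonetic_key(name):
    """Crude phonetic key: first letter plus the remaining consonant skeleton.

    A simplified stand-in for a real phonetic algorithm (such as Soundex),
    included only to show why such keys improve candidate selection.
    """
    name = re.sub(r"[^a-z]", "", name.lower())
    if not name:
        return ""
    skeleton = name[0] + re.sub(r"[aeiouhwy]", "", name[1:])
    # Collapse repeated letters so close spelling variants share a key.
    return re.sub(r"(.)\1+", r"\1", skeleton)

# Two variant representations standardise to directly comparable values.
print(standardise_address("123 Main St."))            # -> 123 main street
print(standardise_address("123, MAIN STREET"))        # -> 123 main street
print(phonetic_key("Haydn"), phonetic_key("Hayden"))  # -> hdn hdn
```

The point is simply that once variant representations standardise to the same values, a direct comparison can succeed where it would previously have failed.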
Some Fields Are More Valuable Than Others

In most data matching techniques, fields are compared in order to assemble a collective assessment of the likelihood that a matching record has been found. However, the contribution of each field is often not equivalent, since some fields contribute more to determining whether a matching record has been found (e.g. a non-matching last name is worse than a non-matching house number). Most data matching techniques provide numerous field comparison functions, including not only exact match comparisons but also many error-tolerant comparisons (e.g. edit distance functions). As with everything else, business rules should determine which fields are used in data matching, how much each field should contribute, and what, if any, error tolerance should be allowed during the field value comparison.

Duplicate Consolidation

Duplicate consolidation evaluates groups of identified duplicate records and creates one best of breed representative record (aka survivor or golden record) for each group using one of two techniques:

1. Record Level Consolidation: choosing one complete record from within the group
2. Field Level Consolidation: constructing fields from different records within the group

Business rules for performing consolidation can vary as much as those for identifying duplicate records. Although usually straightforward, complexity can be introduced when multiple criteria with nested levels of tie-breakers are needed to choose or construct best of breed data representing the group. Although not a comprehensive list, here are some of the most common consolidation criteria:

- Completeness: for record level consolidation, selecting the record with the highest number of populated fields. For field level consolidation, selecting the fields with the longest values.
- Frequency: more common in field level consolidation, where the most frequently occurring value is selected for a given field. However, it can be used in record level consolidation to select the record that has the most frequently occurring combination of values across fields. Either way, the assumption is that the most frequently occurring value indicates preferred information.
- Recency: for record level consolidation, selecting the record most recently updated. For field level consolidation, selecting the field value from the record most recently updated. Either way, the assumption is that the most recent update contains the most reliable information.
- Source: more common in record level consolidation, to select the record that originated in a preferred source system. However, it can be used in field level consolidation to select the value for a given field from the record that originated in a preferred source system.

As a simple example, imagine determining the business rules for duplicate consolidation using customer first, middle, and last name, as well as date of birth and National Insurance number (NINO); a sketch of such a group follows.
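The following minimal sketch illustrates record level and field level consolidation for such a group. The record values, the completeness-then-recency tie-break order, and the "longest populated value" rule are invented for illustration and should be replaced by whatever the business rules actually define.

```python
from datetime import date

# Assumed duplicate group; the values are invented purely for illustration.
group = [
    {"first_name": "Joe", "middle_name": None, "last_name": "Smith",
     "dob": "1970-05-12", "nino": None, "last_updated": date(2011, 3, 1)},
    {"first_name": "Joseph", "middle_name": "A", "last_name": "Smith",
     "dob": None, "nino": "QQ123456C", "last_updated": date(2010, 8, 15)},
]

DATA_FIELDS = ["first_name", "middle_name", "last_name", "dob", "nino"]

def record_level_consolidation(records):
    """Choose one whole record: completeness first, recency as the tie-breaker."""
    def completeness(rec):
        return sum(1 for f in DATA_FIELDS if rec.get(f))
    return max(records, key=lambda r: (completeness(r), r["last_updated"]))

def field_level_consolidation(records):
    """Construct the survivor field by field, taking the longest populated value."""
    survivor = {}
    for f in DATA_FIELDS:
        values = [r[f] for r in records if r.get(f)]
        survivor[f] = max(values, key=len) if values else None
    return survivor

print(record_level_consolidation(group))  # the more complete record wins
print(field_level_consolidation(group))   # values assembled across both records
```

Note that the field level survivor combines a date of birth and a NINO that never appear together on any single source record, which is exactly the effect described in the next section.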
Frankenstein Consolidation

Field level consolidation is sometimes referred to as Frankenstein consolidation because it can create a combination of field values not found on any of the individual records within the group, thereby possibly assembling an unnatural data monster by creating an invalid combination of field values.

Physical Removal vs. Logical Linkage

Typically, consolidation is implemented using one of two techniques:

1. Physical Removal: identified duplicate records are replaced with their representative record, either removed (usually archived) from the source system or excluded from the target system.
2. Logical Linkage: identified duplicate records are appended with a new reference identifier field whose value points to the identifier field value of the representative record (sketched below).

Consolidation vs. Cross Population

An alternative strategy to duplicate consolidation is cross population, where the representative record is used to update the identified duplicate records with the best of breed field values. Typically, cross population is implemented using one of two techniques, the first of which is also sketched after the list:

1. Fill in the blanks: field values from the representative record are used to update only the unpopulated fields on the records within the group.
2. Create consistent values: the representative record is used to update all fields in all records within the group to create a single consistent representation of the highest quality data available.
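Here is a minimal sketch of logical linkage and the fill in the blanks form of cross population, assuming plain records keyed by an identifier; the identifiers, field names, and values are invented for illustration.

```python
# Assumed records keyed by identifier; values invented for illustration.
records = {
    101: {"first_name": "Joseph", "last_name": "Smith", "dob": "1970-05-12"},
    102: {"first_name": "Joe",    "last_name": "Smith", "dob": None},
}
survivor_id = 101  # chosen by whatever consolidation criteria the business rules define

def logical_linkage(records, duplicate_ids, survivor_id):
    """Append a reference identifier pointing each duplicate at the survivor."""
    for rec_id in duplicate_ids:
        records[rec_id]["survivor_ref"] = survivor_id

def fill_in_the_blanks(records, duplicate_ids, survivor_id):
    """Copy survivor field values into unpopulated fields on the duplicates only."""
    survivor = records[survivor_id]
    for rec_id in duplicate_ids:
        for field, value in survivor.items():
            if not records[rec_id].get(field):
                records[rec_id][field] = value

logical_linkage(records, [102], survivor_id)
fill_in_the_blanks(records, [102], survivor_id)
print(records[102])  # dob filled in from the survivor, existing name kept, survivor_ref -> 101
```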
Summary

All data quality vendors have viable data matching solutions driven by impressive technology, so avoid technology bias by stating the problem in business terms. Solutions to this common data quality challenge must have the flexibility to meet specific business needs. Therefore, an effective methodology for implementing the business rules for identifying duplicate records helps maximise the time, effort, and subsequent return on investment in whatever data matching technology is selected.

Aggressively identifying duplicates can negatively impact business decisions after consolidation. This common concern will often encourage a cautious approach to duplicate identification, motivated by the fact that there is generally far greater concern about false positives than false negatives. The harsh reality is that there is no perfect data matching solution. The risk of false positives can be mitigated but never eliminated. An acceptable balance between false positives and false negatives must be achieved when defining the business rules for identifying duplicate records.

Matching often uses a standardisation process to prepare its input. This facilitates a direct evaluation of comparable fields, makes comparisons more reliable by basing them on standardised values, decreases the failure to match records because of data variations, and increases the probability of effective match results.

For more information, email email@example.com or call 0800 197 7920.

© 2011 QAS Limited. Registered in England No. 2582055. Landmark House, Experian Way, NG2 Business Park, Nottingham NG80 1ZZ. The words 'Experian' and 'QAS' are registered trade marks in the EU and other countries and are owned by Experian Limited and/or its associated companies. All rights reserved.