Robust Detector Of Fuzzy Duplicates

< Back

Robust Detector Of Fuzzy Duplicates

Abstract: At least one implemenatation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e. records) in a database that represent the same real-world entity or phenomenon.

Get Free WhatsApp Updates!
Notices, Deadlines & Correspondence

Patent Information

Application #

Filing Date

15 June 2005

Publication Number

32/2007

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

MICROSOFT CORPORATION

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA

Inventors

1. RAJEEV MOTWANI

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA

2. SURAJIT CHAUDHURI

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA

3. VENKATESH GANTI

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, UNITED STATES OF AMERICA

Specification

ROBUST DETECTOR OF FUZZY DUPLICATES TECHNICAL FIELD [0001] This invention generally relates to technology related to databases and data warehouses. BACKGROUND [0002] Decision support analysis on data warehouses influences important business decisions; therefore, accuracy of such analysis is important. However, data received at the data warehouse from external sources usually contains errors (e.g., spelling mistakes, inconsistent conventions across data sources). These errors often result in duplicate entry of tuples. Hence, significant time and money are spent on data cleaning, the task of detecting and correcting errors in data. [0003] The problem of detection and elimination of duplicated tuples in a database is one of the major problems in the broad area of data cleaning and data quality. It is often the case that the same logical, real-world entity may have multiple representations in the data warehouse. [0004] For example, when a customer named Isabel purchases products from SuperMegaMarket twice, her name might appear as two different records: [Isabel Christie, Seattle, WA, USA, 98025] and [Christy Isabel, Seattle, WA, United States, 98025]. The discrepancy may be due to data entry errors and/or preferences of the salesperson who enters the data. [0005] Such duplicated information can significantly increase direct mailing costs because several customers, like Isabel, may receive multiple catalogs. In direct mailing campaigns with tight budget constraints such errors can be the difference between success and failure of the campaign. Moreover, such errors can cause incorrect query results (e.g., How many SuperMegaMarket customers are there in Seattle?) as well as erroneous analysis model creation. [0006] Ridding a database of seemingly distinct, but yet duplicate, entries is the fuzzy duplicate elimination problem. Herein, "fuzzy duplicates" are seemingly distinct tuples (i.e., records) that are not exact matches but yet represent the same real world entity or phenomenon. [0007] This problem is different from the standard exact duplicate elimination problem where two tuples are considered duplicates only when they exactly match all attributes. Unless the context clearly indicates otherwise, assume hereinafter that references to duplicate detection and elimination is focused on the fuzzy duplicate elimination problem. [0008] Previous solutions to fuzzy duplicate elimination can be classified into supervised and unsupervised approaches. Supervised approaches learn rules characterizing pairs of duplicates from training data consisting of known duplicates. Further, these approaches assume that training data exhibit the variety and distribution of errors observed in practice. It is difficult, if not impossible, to obtain such comprehensive training data, an issue that was addressed, to a limited extent, by active learning approaches which have the drawback of requiring interactive manual guidance. In many real data integration scenarios, it is not possible to obtain good training data or interactive user guidance. [0009] The problems of unsupervised duplicate elimination are similar to those of clustering, in that both attempt to partition a dataset into disjoint groups. But, there are some distinct differences between standard clustering formulations and the duplicate elimination problem. These differences will be discussed later. [0010] Current unsupervised approaches tend to ignore these differences and, instead, rely on standard textual similarity functions (e.g., well-known single-linkage clustering algorithms such as edit distance and cosine metric) between multi-attribute tuples and threshold-based constraints for detecting duplicate pairs. However, such threshold-based approaches result in large numbers of false positives (tuples which are not true duplicates but predicted to be so) or large number of false negatives (tuples which truly are duplicates but not recognized as such). SUMMARY [0011] At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon. BRIEF DESCRIPTION OF THE DRAWINGS [0012] The same numbers are used throughout the drawings to reference like elements and features. [0013] Fig. 1 is a block diagram of an implementation described herein. [0014] Fig. 2 is a flow diagram showing a methodological implementation described herein. [0015] Fig. 3 is an example of a computing operating environment capable of (wholly or partially) implementing at least one embodiment described herein. DETAILED DESCRIPTION [0016] The following description sets forth techniques that facilitate the detection and elimination of fuzzy duplicate tuples in a database. The techniques may be implemented in many ways, including (but not limited to) program modules, general- and special-purpose computing systems, dedicated electronics, and as part of one or more computer networks. [0017] An exemplary implementation of these techniques may be referred to as an "exemplary fuzzy duplicate detector" and is described below. [0018] The exemplary fuzzy duplicate detector addresses the fuzzy duplicate elimination problem. Herein, "fuzzy duplicates" are seemingly distinct tuples (i.e., records) that are not exact matches but yet represent the same real world entity or phenomenon nonetheless. Detecting and eliminating fuzzy duplicates is the fuzzy duplicate elimination problem. Criteria Characterizing Duplicates [0019] In detecting fuzzy duplicates, the exemplary fuzzy duplicate detector utilizes at least two new constraints that conventional approaches do not use. In particular, these two new criteria are called the compact set (CS) and the sparse neighborhood (SN). These criteria explicitly capture local structural properties of the data to characterize groups of duplicate tuples. [0020] The CS and SN criteria capture these properties: • duplicates in a group are "closer" to each other than to others; and • the "local neighborhood" of duplicate tuples is empty or sparse. [0021] Tuples that satisfy these criteria may be grouped together as duplicates even though they are far from each other, while tuples that are closer but do not satisfy these criteria may not be grouped together. These localized structural properties differentiate the duplicate elimination problem from standard clustering formulations. (Table Removed) Table 1 Examples from a media database. Tuples tagged with asterisks are duplicate tuples. [0022] Table 1 provides an example of a typical music database. The first six tuples (tagged with an asterisk "*") are duplicate tuples , while the remaining tuples (7-14) are unique. Compact Set Criterion [0023] The Compact Set (CS) criterion is that a set of duplicates is a compact set of mutual nearest neighbors. The premise behind this criterion is that duplicate tuples are closer to each other than they are to other distinct tuples. That is, duplicate tuples are usually mutually nearest neighbors. For the example in Table 1, tuple 1 is the nearest neighbor of tuple 2 and vice-versa. In contrast, tuple 8 may be the nearest neighbor of tuple 7 and tuple 9 that of tuple 8. [0024] In contrast, conventional threshold-based approaches based on single linkage clustering assumes transitivity (i.e., if 'a' is a duplicate of 'b' and 'b' that of 'c' then 'a' is a duplicate of 'c') and identify connected components in a threshold-graph. Hence, they are more likely to yield a large number of false positives. Sparse Neighborhood Criterion [0025] The premise behind the sparse neighborhood (SN) criterion is that the local neighborhood of a group of duplicates is sparse. For instance, this criterion is not met around the unique tuples 7-14 in Table 1, which occur in larger (4, for this example) groups than sets of duplicates. [0026] From one perspective, the local neighborhood of a group of tuples is the immediate vicinity defined in terms of a surrounding region of size, dependent on the local distribution of tuples. For example, it may be a sphere of radius 2-nn(v), where nn(v) is the nearest neighbor distance of the tuple v. [0027] If the rate of growth—the number of tuples in the outer sphere— around a tuple is small, its local neighborhood is called "sparse." This notion is extended to a group of tuples, and their joint local neighborhood is called "sparse," if an aggregate of individual growth rates of tuples is small (e.g., less than a threshold c). For instance, the aggregation function max requires that the neighborhood values of all tuples in the group be less than the threshold, whereas the function average only requires the average of all growth rates to be small. The maximum function is more constraining than the average function. Formalization of the Criteria [0028] In the following definitions, let R be a relation (i.e., dataset) and d: R x R → [0, 1] be a symmetric distance function over tuples in R.. For clarity of exposition, it is henceforth assumed that (i) the distance between two tuples is zero only if the tuples are exactly identical; and that (ii) no two tuples in R are identical to each other. The validity of this assumption can be ensured by modifying d to return 0 when tuples are exactly identical, and to otherwise return d(vlt v2) + , for some small e > 0. [0029] CS Criterion: A set S of tuples from R is a compact set if,for every tuple v in S, the distance d(v, v') between v and any other tuple v' in S is less than the distance d(v, v") between v and any other v" in R-S. [0030] SN Criterion: For a tuple v, consider two concentric spheres: the smaller sphere has a radius nn(v), the distance between v and its nearest neighbor, and the larger sphere a radius of g(nn(v)) (>nn(v)). Herein, g(x) = 2x is used. The neighborhood growth ratio ng(v) is the number of points in the larger sphere around v. [0031] Let AGG : 2R→R be an aggregation function and c (>0) be a constant. We say that a set of tuples S is an SN(AGG, c) group if (i) |S| = 1, or (ii) the aggregated value of neighborhood growth ratios of all tuples in S is less than c (i.e., AGG({ng(v): v in S}) < c). [0032] SG (Small Group) Criterion: Another characteristic of groups of duplicates that may be considered is that they are usually very small. A group G of duplicates is small if |G| < K, for some pre-defined constant K > 1. This may also be called the "small cardinality" criterion since the cardinality (i.e., number of members) of the group is small. Exemplary Fuzzy Duplicate Detector [0033] Generally, the exemplary fuzzy duplicate detector partitions input relation R (e.g., a dataset of a database) into the minimum number of "valid" groups where a group is valid if it is small and satisfies the CS and SN criteria. [0034] In the context of the exemplary fuzzy duplicate detector, this is the Duplicate Elimination (DE) Problem: Given a relation R, a distance function d, a positive integer K (>1), an aggregation function AGG, and a positive real number c, the exemplary fuzzy duplicate detector partitions R into the minimum number of groups {G1,.. .,Gm} such that for all 1 ≤ i ≤ m: • |Gi|

Documents

Orders

Section	Controller	Decision Date

Application Documents

#	Name	Date
1	1555-DEL-2005-Correspondence to notify the Controller (Mandatory) [12-10-2017(online)].pdf	2017-10-12
1	1555-DEL-2005-Form-13-(26-08-2008).pdf	2008-08-26
2	1555-DEL-2005-GPA-(02-06-2010).pdf	2010-06-02
2	1555-DEL-2005-HearingNoticeLetter.pdf	2017-10-06
3	Abstract [16-05-2017(online)].pdf	2017-05-16
3	1555-DEL-2005-Correspondence-Others-(02-06-2010).pdf	2010-06-02
4	Claims [16-05-2017(online)].pdf	2017-05-16
4	1555-DEL-2005-Form-1-(22-12-2010).pdf	2010-12-22
5	Correspondence [16-05-2017(online)].pdf	2017-05-16
5	1555-DEL-2005-Correspondence-Others-(22-12-2010).pdf	2010-12-22
6	Description(Complete) [16-05-2017(online)].pdf	2017-05-16
6	1555-del-2005-gpa.pdf	2011-08-21
7	Description(Complete) [16-05-2017(online)].pdf_619.pdf	2017-05-16
7	1555-del-2005-form-5.pdf	2011-08-21
8	Examination Report Reply Recieved [16-05-2017(online)].pdf	2017-05-16
8	1555-del-2005-form-3.pdf	2011-08-21
9	1555-del-2005-form-2.pdf	2011-08-21
9	Other Document [16-05-2017(online)].pdf	2017-05-16
10	1555-DEL-2005-FER.pdf	2017-04-27
10	1555-del-2005-form-18.pdf	2011-08-21
11	1555-del-2005-form-13.pdf	2011-08-21
11	FORM-6-501-600(PRS).15.pdf	2015-03-13
12	1555-del-2005-form-1.pdf	2011-08-21
12	MS to MTL Assignment.pdf	2015-03-13
13	1555-del-2005-drawings.pdf	2011-08-21
13	MTL-GPOA - PRS.pdf	2015-03-13
14	1555-del-2005-abstract.pdf	2011-08-21
14	1555-del-2005-description (complete).pdf	2011-08-21
15	1555-del-2005-assignment.pdf	2011-08-21
15	1555-del-2005-correspondence-others.pdf	2011-08-21
16	1555-del-2005-claims.pdf	2011-08-21
17	1555-del-2005-correspondence-others.pdf	2011-08-21
17	1555-del-2005-assignment.pdf	2011-08-21
18	1555-del-2005-description (complete).pdf	2011-08-21
18	1555-del-2005-abstract.pdf	2011-08-21
19	1555-del-2005-drawings.pdf	2011-08-21
19	MTL-GPOA - PRS.pdf	2015-03-13
20	1555-del-2005-form-1.pdf	2011-08-21
20	MS to MTL Assignment.pdf	2015-03-13
21	1555-del-2005-form-13.pdf	2011-08-21
21	FORM-6-501-600(PRS).15.pdf	2015-03-13
22	1555-DEL-2005-FER.pdf	2017-04-27
22	1555-del-2005-form-18.pdf	2011-08-21
23	1555-del-2005-form-2.pdf	2011-08-21
23	Other Document [16-05-2017(online)].pdf	2017-05-16
24	Examination Report Reply Recieved [16-05-2017(online)].pdf	2017-05-16
24	1555-del-2005-form-3.pdf	2011-08-21
25	Description(Complete) [16-05-2017(online)].pdf_619.pdf	2017-05-16
25	1555-del-2005-form-5.pdf	2011-08-21
26	Description(Complete) [16-05-2017(online)].pdf	2017-05-16
26	1555-del-2005-gpa.pdf	2011-08-21
27	Correspondence [16-05-2017(online)].pdf	2017-05-16
27	1555-DEL-2005-Correspondence-Others-(22-12-2010).pdf	2010-12-22
28	Claims [16-05-2017(online)].pdf	2017-05-16
28	1555-DEL-2005-Form-1-(22-12-2010).pdf	2010-12-22
29	Abstract [16-05-2017(online)].pdf	2017-05-16
29	1555-DEL-2005-Correspondence-Others-(02-06-2010).pdf	2010-06-02
30	1555-DEL-2005-HearingNoticeLetter.pdf	2017-10-06
30	1555-DEL-2005-GPA-(02-06-2010).pdf	2010-06-02
31	1555-DEL-2005-Correspondence to notify the Controller (Mandatory) [12-10-2017(online)].pdf	2017-10-12
31	1555-DEL-2005-Form-13-(26-08-2008).pdf	2008-08-26

Search Strategy

1	search_1555del2005_21-02-2017.pdf