Entity Resolution From Documents

< Back

Entity Resolution From Documents

Abstract: The present subject matter relates to entity resolution, and in particular, relates to providing an entity resolution from documents. The method comprises obtaining a plurality of documents corresponding to a plurality of entities, from at least one data source. Upon receiving the plurality of documents, the plurality of documents is blocked into at least one bucket based on textual similarity. Further, a graph including a plurality of record vertices and at least one bucket vertex is created. The plurality of record vertices and the at least one bucket vertex are indicative of the plurality of documents and the at least one bucket, respectively. Subsequently, a notification is provided to a user for selecting one of a Bucket-Centric Parallelization (BCP) technique and a Record-Centric Parallelization (RCP) technique for resolving entities from the plurality of documents. Based on the selection, a resolved entity-document for each entity is created.

Get Free WhatsApp Updates!
Notices, Deadlines & Correspondence

Patent Information

Application #

Filing Date

06 March 2014

Publication Number

44/2015

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

iprdel@lakshmisri.com

Parent Application

Patent Number

Legal Status

Grant Date

2022-01-21

Renewal Date

Applicants

TATA CONSULTANCY SERVICES LIMITED

Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra 400021

Inventors

1. AGARWAL, Puneet

154B, Block A, Sector 63 Noida, Uttar Pradesh

2. SHROFF, Gautam

Tata Consultancy Services Limited Block C, Kings Canyon, ASF Insignia, Gurgaon-Faridabad Road, Gawal Pahari, Gurgaon - 122003, Haryana,

3. MALHOTRA, Pankaj

154B, Block A, Sector 63 Noida, Uttar Pradesh

Specification

FORM 2 THE PATENTS ACT, 1970 (39 of 1970) & THE PATENTS RULES, 2003 COMPLETE SPECIFICATION (See section 10, rule 13) 1. Title of the invention: ENTITY RESOLUTION FROM DOCUMENTS 2. Applicant(s) NAME NATIONALITY ADDRESS TATA CONSULTANCY SERVICES LIMITED Indian Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra 400021, India 3. Preamble to the description COMPLETE SPECIFICATION The following specification particularly describes the invention and the manner in which it is to be performed. 1 2 TECHNICAL FIELD [0001] The present subject matter relates, in general, to entity resolution and, particularly but not exclusively, to entity resolution from a plurality of documents. 5 BACKGROUND [0002] Generally, when data from different sources is analyzed, often multiple records in the data may belong to the same real-world entity, such as a customer, a product or an organization. In order to find different records that belong to the same entity, a technique known as Entity resolution (ER) is widely used. In various disciplines, ER is also referred to 10 as record linkage, de-duplication, co-reference resolution, reference reconciliation, object consolidation, identity uncertainty and database hardening. ER has a wide scope of application, for example, in government and public health data maintenance, web search, ecommerce and law enforcement. In practice, dynamics pertaining to the ER may keep changing, e.g., corresponding data set may keep changing over a period of time. Therefore, in 15 order to accommodate such changes associated with the data, ER has to be performed regularly to update an ER result set of resolved entities. BRIEF DESCRIPTION OF THE DRAWINGS [0003] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the 20 reference number first appears. The same numbers are used throughout the drawings to reference like features and components. [0004] Fig. 1 illustrates a network environment implementing an entity resolution system, in accordance with an embodiment of the present subject matter. [0005] Fig. 1(a) illustrates an example including a plurality of records and a plurality of 25 buckets for entity resolution, in accordance with an embodiment of the present subject matter. [0006] Fig. 1(b) illustrates an outcome of entity resolution from a plurality of documents by executing Record-Centric Parallelization (RCP) technique for entity resolution, in accordance with an embodiment of the present subject matter. 3 [0007] Fig. 2 illustrates a method for entity resolution from a plurality of documents, in accordance with an embodiment of the present subject matter. DETAILED DESCRIPTION [0008] System(s) and method(s) for entity resolution from a plurality of documents are described. The system(s) and method(s) can be implemented in a variety 5 of computing devices, such as laptops, desktops, workstations, tablet-PCs, notebooks, portable computers, tablet computers, internet appliances, and similar systems. However, a person skilled in the art will comprehend that the embodiments of the present subject matter are not limited to any particular computing system, architecture, or application device, as they may be adapted to 10 new computing systems and platforms as they become available. [0009] In the last few decades, Entity Resolution (ER) has emerged as a growing challenge in the realm of data management across industries. Often, multiple records available in various data sources may pertain to the same real-world entity, such as a person, a product, or an organization. To resolve such situations, ER analysis is performed for identifying those 15 records that refer to the same entity and once identified, merging those records. The various records may be interchangeably referred to as documents or textual documents. Therefore, in the ER analysis, a plurality of documents obtained from the various data sources may be matched, in pairs, for determining similarity among the plurality of textual documents. Based on the determination, a set of textual documents related to an entity may be identified, and the 20 identified set of textual documents may then be combined to create a merged document for the entity. As would be understood, the merged document of an entity may include all the details disclosed in each of the identified set of textual documents. [0010] Usually, ER analysis includes a large number of records to be processed in order to resolve the entities involved. For example, in case of a citizen of a country being considered 25 as an entity, the records may include identity proofs, such as a passport, a voter ID, a driving license, a credit card, a Permanent Account Number (PAN), a telephone number, and a bank account number. Considering that each citizen owns an average of 3 of the above-mentioned IDs, the number of records to be processed for resolving entities may turn out to be in millions, or even billions. 4 [0011] In order to make the ER analysis scalable, the conventional ER techniques employ a blocking technique to divide the records in various blocks based on some pre-defined parameters, such as textual similarity. Now, each block may contain a relatively small number of potentially matching textual documents. Thereafter, a pair-wise comparison of the textual documents is performed in each block to identify a set of textual documents 5 pertaining to an entity. In the pair-wise comparison, based on a match function, two textual documents are considered as matching. The match function may include but is not limited to predefined rules, and binary classifiers derived using machine learning. Therefore, based on the match function, a set of textual documents pertaining to each entity may be identified, within each 10 block. The set of textual documents may then be merged to create a merged document for each entity. As may be understood, the merged document contains all the information as disclosed in each of the set of textual documents pertaining to the entity. Therefore, within each block, the textual documents are resolved to entities, and such resolved entities are referred to as partial entities. 15 [0012] However, the conventional blocking techniques may block different textual documents belonging to a single entity into more than one block. In such a case, multiple partial entities belonging to the same entity may be obtained from multiple blocks. Such partial entities from different blocks may be connected by the fact that the partial entities may share the same textual document. Therefore, the textual documents pertaining to each of the 20 pair of the partial entities can be consolidated to form an entity-resolved document for an entity. As would be gathered, an entity-resolved document of an entity may include all the information pertaining to the entity as disclosed in each of the plurality of documents. [0013] As mentioned previously, in the course of resolving the entities from the records, the blocking techniques may result into formation of a plurality of blocks for collection of 25 potentially matching documents. Further, it may happen that a large number of blocks formed are singletons, i.e., blocks including only one textual document. This may indicate that, within a singleton bucket, such textual documents may not have to be further processed or compared with other textual documents. However, the conventional techniques may involve sending a textual document to a singleton block the textual document is blocked to. As would be 30 understood, since no comparisons have to be performed within a singleton block, sending of the textual document to the singleton block is unnecessary. In fact, sending the textual 5 documents to the singleton blocks would result into wastage of resources and time, and therefore, may add to the cost of ER analysis as well. The cost, resource and time wastage would be more in case the documents are large in size, and therefore, may affect the overall economic facet of the ER analysis. [0014] Further, as a result of the execution of the blocking technique, 5 there may be instances where the records may be blocked in a skewed manner, i.e., size of blocks, in terms of number of hashed textual documents, may turn out to be uneven. In case there are more number of blocks, the textual documents may be processed by employing a parallel computation technique. As may be understood, in parallel computation, the blocks can be 10 distributed across multiple processing units for performing the analysis. In such scenarios, time to be utilized for processing the textual documents in a block having more number of textual documents may be disproportionately more than the time to be utilized for a block having less number of textual documents. Therefore, a processing unit with blocks having larger number of textual documents than other blocks may act as a bottleneck for the overall 15 ER analysis, and an overall time required for completion of the ER analysis would be significantly more. [0015] Furthermore, consolidating merged documents to form entity-resolved documents is a complex process as it involves determination of common textual documents shared among partial entities, which is an iterative process. Therefore, time spent and resources used 20 for determining common textual documents are significant. Thus, as is evident, the conventional ER techniques can be time-extensive, inefficient, and expensive. [0016] According to the present subject matter, an entity resolution system, hereinafter referred to as a system, for entity resolution from a plurality of documents is disclosed. In one implementation, the system may obtain the plurality of documents corresponding to a 25 plurality of entities from at least one data source. The plurality of documents may be blocked into at least one bucket, based on textual similarity among the plurality of documents. Further, a graph including a plurality of record vertices and at least one bucket vertex may be created. Subsequent to the generation of the graph, a notification may be provided to a user for selecting one of a Bucket-Centric Parallelization (BCP) technique and a Record-Centric 30 Parallelization (RCP) technique for resolving entities from the plurality of documents. The 6 notification may include but is not limited to a suggestion for selecting one of the BCP technique and the RCP technique based on the blocking of the plurality of documents. Based on the selection by the user, a resolved entity document for each entity may be generated. [0017] In one implementation, the plurality of documents may be interchangeably referred to as records. As is generally understood, records can include tangible objects, 5 such as paper documents, like birth certificates, driver's licenses, and physical medical x-rays, as well as digital information, such as electronic office documents, data in application databases, web site content, and electronic mail (email). Further, the at least one data source may include, but is not limited to, an external database and/or an in-house database. Once the plurality of 10 documents is obtained, a blocking technique, e.g., Locality Sensitive Hashing (LSH) may be employed to block the plurality of documents. [0018] The LSH technique may use hash functions for grouping or blocking the plurality of documents based on textual similarity among the plurality of documents. In one implementation, a unique identification (ID) may be allotted to each of the plurality of 15 documents, and instead of blocking the plurality of documents themselves, unique IDs of the documents may be blocked into the at least one bucket. Further, singletons buckets, i.e., buckets having one document may be discarded, and may not be considered for the further computations of the ER analysis. As would be gathered, blocking of the plurality of documents may facilitate in avoiding undesired comparisons among the plurality of 20 documents. [0019] In one implementation, computations to be performed for the ER analysis may be distributed across multiple processing units. For example, the buckets can be provided to multiple processing units for the subsequent stages of the ER analysis. This would assist in parallel computation for performing ER analysis and therefore, time to be utilized and 25 complexity involved in the ER analysis can be minimized. [0020] Thereafter, a graph including a plurality of record vertices and at least one bucket vertex may be created. The plurality of record vertices and the at least one bucket vertex correspond to the plurality of documents and the at least one bucket, respectively. In other words, each of the plurality of documents and the at least one bucket may be considered as a 30 vertex in the graph. In one implementation, the plurality of record vertices and the at least one 7 bucket vertex may be connected to each other by edges, depending on the blocking of the plurality of documents. [0021] In one implementation, an adjacency list for each record vertex and each bucket vertex may be generated. In one example, the adjacency list of a record vertex may include details of bucket vertices to which the record vertex is hashed to. The adjacency 5 list of a record vertex may hereinafter be referred to as a record adjacency list. Similarly, the adjacency list of a bucket vertex may include details of record vertices hashed to the bucket vertex. The adjacency list of a bucket vertex may hereinafter be referred to as a bucket adjacency list. 10 [0022] Subsequent to the creation of the graph, a notification may be provided to a user for selecting at least one of a Bucket-Centric Parallelization (BCP) technique and a Record- Centric Parallelization (RCP) technique for resolving entities from the graph. In one implementation, the notification may include but is not limited to a suggestion for selecting one of the BCP technique and the RCP technique for resolving the entities from the plurality 15 of documents. In one implementation, the suggestion may be provided based on the blocking of the plurality of documents. For example, in case the blocking of the plurality of documents may result into substantially uniform distribution of the plurality of documents among the buckets, the BCP technique for entity resolution may be provided as the suggestion. On the other hand, in case the plurality of documents is distributed among the buckets in a non20 uniform manner, then the RCP technique may be provided as the suggestion. This is due to the fact that the RCP technique may utilize relatively lesser time than the BCP technique for entity resolution in case of non-uniform distribution of the plurality of documents. [0023] Further, in the BCP technique, the plurality of documents may be compared at bucket vertices. On the other hand, in the RCP technique, the plurality of documents may be 25 compared at record vertices. In one implementation, the BCP technique and the RCP technique may be employed using a Pregel-based platform. [0024] In one implementation, the user may select the BCP technique for entity resolution. As mentioned earlier, initially, only IDs of documents hashed to a bucket are available at a corresponding bucket vertex. Therefore, a value, i.e., content of a corresponding 30 document, of each record vertex may be provided to one or more bucket vertices as provided 8 in a record adjacency list. Once each bucket vertex receives values of the record vertices hashed to the bucket vertex, the documents are compared at each bucket vertex. In one implementation, an Iterative Match Merge (IMM) technique may be used for comparing the documents at a bucket vertex. In accordance with the IMM technique, at each bucket vertex, at least one matching pair of documents may be identified and merged 5 to create a merged document for each entity. Entities resolved, at a bucket vertex, by creating merged documents may be referred to as partial entities. [0025] As per the IMM technique, multiple partial entities belonging to the same entity can be obtained from multiple buckets. However, such partial entities may share at least one 10 document, and therefore can be considered to be connected. In order to determine such shared or common or connected documents, for each partial entity, one of the corresponding documents may be considered as a central document, and one or more edges between a corresponding central record vertex and each of the remaining record vertices of the partial entity are created. Similar vertex-edge structures may be created for each partial entity. In 15 case a document is shared by multiple partial entities, the document may appear in the vertexedge structure of each of the multiple partial entities. In such a case, all the record vertices belonging to the two partial entities may be connected and may be considered to be belonging to the same entity. Therefore, the connected record vertices, i.e., the connected documents can be consolidated to form an entity-resolved document for the entity. As would be gathered, an 20 entity-resolved document of an entity may include all the information pertaining to the entity as disclosed in each of the plurality of documents. [0026] In an alternate implementation, the user may select the RCP technique for entity resolution. In the RCP technique, from each bucket vertex, a comparison message may be provided to one or more record vertices connected to a bucket vertex, in order to schedule 25 comparisons among the plurality of documents using the IMM technique. For example, for each pair of record vertices, a comparison message may be provided to one of the two record vertices, e.g., {rj} is sent to ri, if i

Documents

Application Documents

#	Name	Date
1	770-MUM-2014-Request For Certified Copy-Online(16-10-2014).pdf	2014-10-16
2	770-MUM-2014-Request For Certified Copy-Online(23-04-2015).pdf	2015-04-23
3	SPEC FOR FILING.pdf	2018-08-11
4	PD012095IN-SC_Request for Priority Documents-PCT.pdf	2018-08-11
5	PD012095IN-SC_Certified copy of Priority doc. and form 13.pdf	2018-08-11
6	FORM 5.pdf	2018-08-11
7	FORM 3.pdf	2018-08-11
8	Form 13 for amendment in figure.pdf	2018-08-11
9	figure for filing.pdf	2018-08-11
10	ABSTRACT1.jpg	2018-08-11
11	770-MUM-2014-Power of Attorney-130215.pdf	2018-08-11
12	770-MUM-2014-POWER OF ATTORNEY(28-4-2015).pdf	2018-08-11
13	770-MUM-2014-FORM 18.pdf	2018-08-11
14	770-MUM-2014-FORM 1 (27-4-2014).pdf	2018-08-11
15	770-MUM-2014-Correspondence-130215.pdf	2018-08-11
16	770-MUM-2014-CORRESPONDENCE(28-4-2015).pdf	2018-08-11
17	770-MUM-2014-CORRESPONDENCE (27-4-2014).pdf	2018-08-11
18	770-MUM-2014-FER.pdf	2019-10-16
19	770-MUM-2014-Information under section 8(2) [20-03-2020(online)].pdf	2020-03-20
20	770-MUM-2014-FORM 3 [20-03-2020(online)].pdf	2020-03-20
21	770-MUM-2014-OTHERS [15-04-2020(online)].pdf	2020-04-15
22	770-MUM-2014-FER_SER_REPLY [15-04-2020(online)].pdf	2020-04-15
23	770-MUM-2014-DRAWING [15-04-2020(online)].pdf	2020-04-15
24	770-MUM-2014-CLAIMS [15-04-2020(online)].pdf	2020-04-15
25	770-MUM-2014-PatentCertificate21-01-2022.pdf	2022-01-21
26	770-MUM-2014-IntimationOfGrant21-01-2022.pdf	2022-01-21
27	770-MUM-2014-RELEVANT DOCUMENTS [26-09-2023(online)].pdf	2023-09-26

Search Strategy

1	SearchStrategyMatrix10_15-10-2019.pdf