Abstract:
The present subject matter relates to entity resolution, and in particular, relates to providing an entity resolution from documents. The method comprises obtaining a plurality of documents corresponding to a plurality of entities, from at least one data source. Upon receiving the plurality of documents, the plurality of documents is blocked into at least one bucket based on textual similarity. Further, a graph including a plurality of record vertices and at least one bucket vertex is created. The plurality of record vertices and the at least one bucket vertex are indicative of the plurality of documents and the at least one bucket, respectively. Subsequently, a notification is provided to a user for selecting one of a Bucket-Centric Parallelization (BCP) technique and a Record-Centric Parallelization (RCP) technique for resolving entities from the plurality of documents. Based on the selection, a resolved entity-document for each entity is created.
Get Free WhatsApp Updates!
Notices, Deadlines & Correspondence
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10, rule 13)
1. Title of the invention: ENTITY RESOLUTION FROM DOCUMENTS
2. Applicant(s)
NAME NATIONALITY ADDRESS
TATA CONSULTANCY
SERVICES LIMITED
Indian Nirmal Building, 9th Floor, Nariman
Point, Mumbai, Maharashtra 400021,
India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it
is to be performed.
1
2
TECHNICAL FIELD
[0001] The present subject matter relates, in general, to entity resolution and, particularly
but not exclusively, to entity resolution from a plurality of documents.
5 BACKGROUND
[0002] Generally, when data from different sources is analyzed, often multiple records in
the data may belong to the same real-world entity, such as a customer, a product or an
organization. In order to find different records that belong to the same entity, a technique
known as Entity resolution (ER) is widely used. In various disciplines, ER is also referred to
10 as record linkage, de-duplication, co-reference resolution, reference reconciliation, object
consolidation, identity uncertainty and database hardening. ER has a wide scope of
application, for example, in government and public health data maintenance, web search, ecommerce
and law enforcement. In practice, dynamics pertaining to the ER may keep
changing, e.g., corresponding data set may keep changing over a period of time. Therefore, in
15 order to accommodate such changes associated with the data, ER has to be performed
regularly to update an ER result set of resolved entities.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The detailed description is described with reference to the accompanying figures.
In the figures, the left-most digit(s) of a reference number identifies the figure in which the
20 reference number first appears. The same numbers are used throughout the drawings to
reference like features and components.
[0004] Fig. 1 illustrates a network environment implementing an entity resolution system,
in accordance with an embodiment of the present subject matter.
[0005] Fig. 1(a) illustrates an example including a plurality of records and a plurality of
25 buckets for entity resolution, in accordance with an embodiment of the present subject matter.
[0006] Fig. 1(b) illustrates an outcome of entity resolution from a plurality of documents
by executing Record-Centric Parallelization (RCP) technique for entity resolution, in
accordance with an embodiment of the present subject matter.
3
[0007] Fig. 2 illustrates a method for entity resolution from a plurality of documents, in
accordance with an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0008] System(s) and method(s) for entity resolution from a plurality of documents are
described. The system(s) and method(s) can be implemented in a variety 5 of computing
devices, such as laptops, desktops, workstations, tablet-PCs, notebooks, portable computers,
tablet computers, internet appliances, and similar systems. However, a person skilled in the art
will comprehend that the embodiments of the present subject matter are not limited to any
particular computing system, architecture, or application device, as they may be adapted to
10 new computing systems and platforms as they become available.
[0009] In the last few decades, Entity Resolution (ER) has emerged as a growing
challenge in the realm of data management across industries. Often, multiple records available
in various data sources may pertain to the same real-world entity, such as a person, a product,
or an organization. To resolve such situations, ER analysis is performed for identifying those
15 records that refer to the same entity and once identified, merging those records. The various
records may be interchangeably referred to as documents or textual documents. Therefore, in
the ER analysis, a plurality of documents obtained from the various data sources may be
matched, in pairs, for determining similarity among the plurality of textual documents. Based
on the determination, a set of textual documents related to an entity may be identified, and the
20 identified set of textual documents may then be combined to create a merged document for
the entity. As would be understood, the merged document of an entity may include all the
details disclosed in each of the identified set of textual documents.
[0010] Usually, ER analysis includes a large number of records to be processed in order to
resolve the entities involved. For example, in case of a citizen of a country being considered
25 as an entity, the records may include identity proofs, such as a passport, a voter ID, a driving
license, a credit card, a Permanent Account Number (PAN), a telephone number, and a bank
account number. Considering that each citizen owns an average of 3 of the above-mentioned
IDs, the number of records to be processed for resolving entities may turn out to be in
millions, or even billions.
4
[0011] In order to make the ER analysis scalable, the conventional ER techniques employ
a blocking technique to divide the records in various blocks based on some pre-defined
parameters, such as textual similarity. Now, each block may contain a relatively small number
of potentially matching textual documents. Thereafter, a pair-wise comparison of the textual
documents is performed in each block to identify a set of textual documents 5 pertaining to an
entity. In the pair-wise comparison, based on a match function, two textual documents are
considered as matching. The match function may include but is not limited to predefined
rules, and binary classifiers derived using machine learning. Therefore, based on the match
function, a set of textual documents pertaining to each entity may be identified, within each
10 block. The set of textual documents may then be merged to create a merged document for
each entity. As may be understood, the merged document contains all the information as
disclosed in each of the set of textual documents pertaining to the entity. Therefore, within
each block, the textual documents are resolved to entities, and such resolved entities are
referred to as partial entities.
15 [0012] However, the conventional blocking techniques may block different textual
documents belonging to a single entity into more than one block. In such a case, multiple
partial entities belonging to the same entity may be obtained from multiple blocks. Such
partial entities from different blocks may be connected by the fact that the partial entities may
share the same textual document. Therefore, the textual documents pertaining to each of the
20 pair of the partial entities can be consolidated to form an entity-resolved document for an
entity. As would be gathered, an entity-resolved document of an entity may include all the
information pertaining to the entity as disclosed in each of the plurality of documents.
[0013] As mentioned previously, in the course of resolving the entities from the records,
the blocking techniques may result into formation of a plurality of blocks for collection of
25 potentially matching documents. Further, it may happen that a large number of blocks formed
are singletons, i.e., blocks including only one textual document. This may indicate that, within
a singleton bucket, such textual documents may not have to be further processed or compared
with other textual documents. However, the conventional techniques may involve sending a
textual document to a singleton block the textual document is blocked to. As would be
30 understood, since no comparisons have to be performed within a singleton block, sending of
the textual document to the singleton block is unnecessary. In fact, sending the textual
5
documents to the singleton blocks would result into wastage of resources and time, and
therefore, may add to the cost of ER analysis as well. The cost, resource and time wastage
would be more in case the documents are large in size, and therefore, may affect the overall
economic facet of the ER analysis.
[0014] Further, as a result of the execution of the blocking technique, 5 there may be
instances where the records may be blocked in a skewed manner, i.e., size of blocks, in terms
of number of hashed textual documents, may turn out to be uneven. In case there are more
number of blocks, the textual documents may be processed by employing a parallel
computation technique. As may be understood, in parallel computation, the blocks can be
10 distributed across multiple processing units for performing the analysis. In such scenarios,
time to be utilized for processing the textual documents in a block having more number of
textual documents may be disproportionately more than the time to be utilized for a block
having less number of textual documents. Therefore, a processing unit with blocks having
larger number of textual documents than other blocks may act as a bottleneck for the overall
15 ER analysis, and an overall time required for completion of the ER analysis would be
significantly more.
[0015] Furthermore, consolidating merged documents to form entity-resolved documents
is a complex process as it involves determination of common textual documents shared
among partial entities, which is an iterative process. Therefore, time spent and resources used
20 for determining common textual documents are significant. Thus, as is evident, the
conventional ER techniques can be time-extensive, inefficient, and expensive.
[0016] According to the present subject matter, an entity resolution system, hereinafter
referred to as a system, for entity resolution from a plurality of documents is disclosed. In one
implementation, the system may obtain the plurality of documents corresponding to a
25 plurality of entities from at least one data source. The plurality of documents may be blocked
into at least one bucket, based on textual similarity among the plurality of documents. Further,
a graph including a plurality of record vertices and at least one bucket vertex may be created.
Subsequent to the generation of the graph, a notification may be provided to a user for
selecting one of a Bucket-Centric Parallelization (BCP) technique and a Record-Centric
30 Parallelization (RCP) technique for resolving entities from the plurality of documents. The
6
notification may include but is not limited to a suggestion for selecting one of the BCP
technique and the RCP technique based on the blocking of the plurality of documents. Based
on the selection by the user, a resolved entity document for each entity may be generated.
[0017] In one implementation, the plurality of documents may be interchangeably referred
to as records. As is generally understood, records can include tangible objects, 5 such as paper
documents, like birth certificates, driver's licenses, and physical medical x-rays, as well as
digital information, such as electronic office documents, data in application databases, web
site content, and electronic mail (email). Further, the at least one data source may include, but
is not limited to, an external database and/or an in-house database. Once the plurality of
10 documents is obtained, a blocking technique, e.g., Locality Sensitive Hashing (LSH) may be
employed to block the plurality of documents.
[0018] The LSH technique may use hash functions for grouping or blocking the plurality
of documents based on textual similarity among the plurality of documents. In one
implementation, a unique identification (ID) may be allotted to each of the plurality of
15 documents, and instead of blocking the plurality of documents themselves, unique IDs of the
documents may be blocked into the at least one bucket. Further, singletons buckets, i.e.,
buckets having one document may be discarded, and may not be considered for the further
computations of the ER analysis. As would be gathered, blocking of the plurality of
documents may facilitate in avoiding undesired comparisons among the plurality of
20 documents.
[0019] In one implementation, computations to be performed for the ER analysis may be
distributed across multiple processing units. For example, the buckets can be provided to
multiple processing units for the subsequent stages of the ER analysis. This would assist in
parallel computation for performing ER analysis and therefore, time to be utilized and
25 complexity involved in the ER analysis can be minimized.
[0020] Thereafter, a graph including a plurality of record vertices and at least one bucket
vertex may be created. The plurality of record vertices and the at least one bucket vertex
correspond to the plurality of documents and the at least one bucket, respectively. In other
words, each of the plurality of documents and the at least one bucket may be considered as a
30 vertex in the graph. In one implementation, the plurality of record vertices and the at least one
7
bucket vertex may be connected to each other by edges, depending on the blocking of the
plurality of documents.
[0021] In one implementation, an adjacency list for each record vertex and each bucket
vertex may be generated. In one example, the adjacency list of a record vertex may include
details of bucket vertices to which the record vertex is hashed to. The adjacency 5 list of a
record vertex may hereinafter be referred to as a record adjacency list. Similarly, the
adjacency list of a bucket vertex may include details of record vertices hashed to the bucket
vertex. The adjacency list of a bucket vertex may hereinafter be referred to as a bucket
adjacency list.
10 [0022] Subsequent to the creation of the graph, a notification may be provided to a user
for selecting at least one of a Bucket-Centric Parallelization (BCP) technique and a Record-
Centric Parallelization (RCP) technique for resolving entities from the graph. In one
implementation, the notification may include but is not limited to a suggestion for selecting
one of the BCP technique and the RCP technique for resolving the entities from the plurality
15 of documents. In one implementation, the suggestion may be provided based on the blocking
of the plurality of documents. For example, in case the blocking of the plurality of documents
may result into substantially uniform distribution of the plurality of documents among the
buckets, the BCP technique for entity resolution may be provided as the suggestion. On the
other hand, in case the plurality of documents is distributed among the buckets in a non20
uniform manner, then the RCP technique may be provided as the suggestion. This is due to
the fact that the RCP technique may utilize relatively lesser time than the BCP technique for
entity resolution in case of non-uniform distribution of the plurality of documents.
[0023] Further, in the BCP technique, the plurality of documents may be compared at
bucket vertices. On the other hand, in the RCP technique, the plurality of documents may be
25 compared at record vertices. In one implementation, the BCP technique and the RCP
technique may be employed using a Pregel-based platform.
[0024] In one implementation, the user may select the BCP technique for entity
resolution. As mentioned earlier, initially, only IDs of documents hashed to a bucket are
available at a corresponding bucket vertex. Therefore, a value, i.e., content of a corresponding
30 document, of each record vertex may be provided to one or more bucket vertices as provided
8
in a record adjacency list. Once each bucket vertex receives values of the record vertices
hashed to the bucket vertex, the documents are compared at each bucket vertex. In one
implementation, an Iterative Match Merge (IMM) technique may be used for comparing the
documents at a bucket vertex. In accordance with the IMM technique, at each bucket vertex,
at least one matching pair of documents may be identified and merged 5 to create a merged
document for each entity. Entities resolved, at a bucket vertex, by creating merged documents
may be referred to as partial entities.
[0025] As per the IMM technique, multiple partial entities belonging to the same entity
can be obtained from multiple buckets. However, such partial entities may share at least one
10 document, and therefore can be considered to be connected. In order to determine such shared
or common or connected documents, for each partial entity, one of the corresponding
documents may be considered as a central document, and one or more edges between a
corresponding central record vertex and each of the remaining record vertices of the partial
entity are created. Similar vertex-edge structures may be created for each partial entity. In
15 case a document is shared by multiple partial entities, the document may appear in the vertexedge
structure of each of the multiple partial entities. In such a case, all the record vertices
belonging to the two partial entities may be connected and may be considered to be belonging
to the same entity. Therefore, the connected record vertices, i.e., the connected documents can
be consolidated to form an entity-resolved document for the entity. As would be gathered, an
20 entity-resolved document of an entity may include all the information pertaining to the entity
as disclosed in each of the plurality of documents.
[0026] In an alternate implementation, the user may select the RCP technique for entity
resolution. In the RCP technique, from each bucket vertex, a comparison message may be
provided to one or more record vertices connected to a bucket vertex, in order to schedule
25 comparisons among the plurality of documents using the IMM technique. For example, for
each pair of record vertices, a comparison message may be provided to one of the two record
vertices, e.g., {rj} is sent to ri, if i
Documents
Application Documents
#
Name
Date
1
770-MUM-2014-Request For Certified Copy-Online(16-10-2014).pdf
2014-10-16
2
770-MUM-2014-Request For Certified Copy-Online(23-04-2015).pdf
2015-04-23
3
SPEC FOR FILING.pdf
2018-08-11
4
PD012095IN-SC_Request for Priority Documents-PCT.pdf
2018-08-11
5
PD012095IN-SC_Certified copy of Priority doc. and form 13.pdf
2018-08-11
6
FORM 5.pdf
2018-08-11
7
FORM 3.pdf
2018-08-11
8
Form 13 for amendment in figure.pdf
2018-08-11
9
figure for filing.pdf
2018-08-11
10
ABSTRACT1.jpg
2018-08-11
11
770-MUM-2014-Power of Attorney-130215.pdf
2018-08-11
12
770-MUM-2014-POWER OF ATTORNEY(28-4-2015).pdf
2018-08-11
13
770-MUM-2014-FORM 18.pdf
2018-08-11
14
770-MUM-2014-FORM 1 (27-4-2014).pdf
2018-08-11
15
770-MUM-2014-Correspondence-130215.pdf
2018-08-11
16
770-MUM-2014-CORRESPONDENCE(28-4-2015).pdf
2018-08-11
17
770-MUM-2014-CORRESPONDENCE (27-4-2014).pdf
2018-08-11
18
770-MUM-2014-FER.pdf
2019-10-16
19
770-MUM-2014-Information under section 8(2) [20-03-2020(online)].pdf