Method And System For Learning To Map Between Schemas Using Knowledge Graph

Abstract: Data migration is a crucial task for data management across industries. Conventional schema mapping requires expert intervention, and automated schema mapping suffers from low accuracy because field names are often unclear and complex. The present disclosure presents an accurate schema mapping method. Initially, a logical field name is identified for each field name of a source schema and a target schema based on a knowledge graph. Further, a plurality of potential matches between source field names and target field names are filtered based on semantic data type. A data similarity score and a conceptual similarity score are computed for the plurality of potential matches using the knowledge graph. Further, the data similarity score and the conceptual similarity score are combined to decide the top scoring matches between the source field names and the target field names. The knowledge graph is updated dynamically via learning from historically mapped fields between source and target schemas. [To be published with FIG. 3A and 3B]

Patent Information

Application #

Filing Date

13 December 2019

Publication Number

25/2021

Publication Type

INA

Invention Field

COMMUNICATION

Status

kcopatents@khaitanco.com

Parent Application

Patent Number

Legal Status

Grant Date

2024-02-28

Renewal Date

Applicants

Tata Consultancy Services Limited

Nirmal Building, 9th Floor, Nariman Point Mumbai 400021 Maharashtra India

Inventors

1. BHATTACHARYA, Indrajit

Tata Consultancy Services Limited Gitanjali Park. Tata Consultancy Services Ltd. IT/ITES SEZ, Plot-IIF/3, Action Area-II, New Town, Rajarhat, Kolkata West Bengal 700160 India

2. SHROFF, Gautam

Tata Consultancy Services Limited GG6, Block-C, Kings Canyon, ASF Insignia, Gurgaon- Faridabad Rd, Gwal Pahari, Gurgaon Haryana 122002 India

3. DASGUPTA, Tirthankar

Tata Consultancy Services Limited Gitanjali Park. Tata Consultancy Services Ltd. IT/ITES SEZ, Plot-IIF/3, Action Area-II, New Town, Rajarhat, Kolkata West Bengal 700160 India

4. PANJA, Arnab

Tata Consultancy Services Limited OVAL- GIII, Candor Tech Space , IT & ITES SEZ, Block G, 3rd, 4th, 5th & 6th Floor, Tower GIII, Plot No. DH1, DH2, DH3 & DH 3/1, Action Area 1, New Town, Kolkata West Bengal 700160 India

5. CHAKRABORTY, Snehasish

Tata Consultancy Services Limited Block -1A,Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal 700160 India

6. MUKHERJEE, Debayan

7. BANDYOPADHYAY, Atreya

Claims

1. A processor-implemented method, the method comprising: receiving, by one or more hardware processors, a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a plurality of target field names (302); performing (304), by the one or more hardware processors: a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph; and identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph; simultaneously performing (306), by the one or more hardware processors: a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph; and identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph; computing, by the one or more hardware processors, a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names (308); computing, by the one or more hardware processors, a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance (310); computing, by the one or more hardware processors, a conceptual similarity score between each of the plurality of potential source field matches with each 
of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationship between a plurality of concept names in the knowledge graph (312); and computing, by the one or more hardware processors, a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score (314).

2. The method as claimed in claim 1, wherein the knowledge graph comprises the plurality of concept names and the plurality of semantic relationships between each of the plurality of concept names, wherein the plurality of semantic relationships comprises a plurality of hyponym-hypernym relationships, a plurality of synonym relationships, a plurality of hasAttribute relationships, and a plurality of acronym relationships, wherein each of the plurality of concept names is associated with a semantic data type and a semantic data, wherein a plurality of associations exist between the semantic data type and each of the plurality of concept names, and wherein a plurality of associations exist between the semantic data type and the semantic data.

3. The method as claimed in claim 1, wherein the knowledge graph is updated dynamically based on a historical matching information between each of the plurality of target field names of the target schema and each of the plurality of source field names associated with a plurality of source schemas.

4. The method as claimed in claim 1, wherein the data based distance is computed by comparing a target data statistics associated with target field name and a source data statistics associated with a source field name, wherein the target data statistics is computed on the semantic data associated with the target field name and the source data statistics is computed on the semantic data associated with the source field name.

5. A system (100), the system (100) comprising: at least one memory (104) storing programmed instructions; one or more Input /Output (I/O) interfaces (112); and one or more hardware processors (102) operatively coupled to the at least one memory (104), wherein the one or more hardware processors (102) are configured by the programmed instructions to: receive a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a plurality of target field names; perform: a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph; and identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph; simultaneously perform: a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph; and identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph; compute a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names; compute a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance; compute a conceptual similarity score between each of the plurality of potential source field 
matches with each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationship between a plurality of concept names in the knowledge graph; and compute a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score.

6. The system as claimed in claim 5, wherein the knowledge graph comprises the plurality of concept names and the plurality of semantic relationships between each of the plurality of concept names, wherein the plurality of semantic relationships comprises a plurality of hyponym-hypernym relationships, a plurality of synonym relationships, a plurality of hasAttribute relationships, and a plurality of acronym relationships, wherein each of the plurality of concept names is associated with a semantic data type and a semantic data, wherein a plurality of associations exist between the semantic data type and each of the plurality of concept names, and wherein a plurality of associations exist between the semantic data type and the semantic data.

7. The system as claimed in claim 5, wherein the knowledge graph is updated dynamically based on a historical matching information between each of the plurality of target field names of the target schema and each of the plurality of source field names associated with a plurality of source schemas.

8. The system as claimed in claim 5, wherein the data based distance is computed by comparing a target data statistics associated with target field name and a source data statistics associated with a source field name, wherein the target data statistics is computed on the semantic data associated with the target field name and the source data statistics is computed on the semantic data associated with the source field name.

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR LEARNING TO MAP BETWEEN SCHEMAS USING KNOWLEDGE GRAPH
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the Description
The following specification particularly describes the invention and the manner in
which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to the field of database management and, more particularly, to a method and system for learning to map between schemas.
BACKGROUND
[002] Data migration remains a crucial task for data management solutions across industries. The data migration includes data mapping, wherein a source schema is mapped to a target schema. The source schema is the incoming schema to be mapped with the existing target schema. Generally the source schemas are large and data types of the source schema are often uninformative. Field names of the source schema may be complex and often available only as cryptic database physical names with little metadata. Moreover, usage information of fields in the source schema is typically unavailable.
[003] Conventional schema mapping tools provide an interface for specifying the mapping rules. This approach requires the intervention of domain experts and is therefore costly and time-consuming. Many automated schema mapping methods are available. However, the accuracy of the existing automated schema mapping methods is low. Hence, mapping complex and unclear source schema fields to the target schema fields remains a challenge.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for learning to map between schemas is provided. The method includes receiving a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a plurality of target field names. Further, the method includes performing a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph and identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph. Further, the method includes simultaneously performing a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph and identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph. Furthermore, the method includes computing a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names. Furthermore, the method includes computing a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance. Furthermore, the method includes computing, by the one or more hardware processors, a conceptual similarity score between each of the plurality of potential source field matches with each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationships between a plurality of concept names in the knowledge graph. Finally, the method includes computing a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score.
[005] In another aspect, a system for learning to map between schemas is provided. The system includes at least one memory storing programmed instructions, one or more Input/Output (I/O) interfaces, and one or more hardware processors operatively coupled to the at least one memory, wherein the one or more hardware processors are configured by the programmed instructions to receive a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a
plurality of target field names. Further, the one or more hardware processors are configured by the programmed instructions to perform a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph and identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph. Further, the one or more hardware processors are configured by the programmed instructions to simultaneously perform a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph and identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph. Furthermore, the one or more hardware processors are configured by the programmed instructions to compute a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names. Furthermore the one or more hardware processors are configured by the programmed instructions to compute a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance. 
Furthermore, the one or more hardware processors are configured by the programmed instructions to compute a conceptual similarity score between each of the plurality of potential source field matches with each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationship between a plurality of concept names in the knowledge graph. Finally, the one or more hardware processors are configured by the programmed instructions to compute a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score.
[006] In yet another aspect, a computer program product including a non-transitory computer-readable medium having embodied therein a computer
program for method and system for learning to map between schemas is provided. The computer readable program, when executed on a computing device, causes the computing device to receive a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a plurality of target field names. Further, the computer readable program, when executed on a computing device, causes the computing device to perform a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph and identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph. Further, the computer readable program, when executed on a computing device, causes the computing device to simultaneously perform a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph and identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph. Further, the computer readable program, when executed on a computing device, causes the computing device to compute a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names. 
Furthermore, the computer readable program, when executed on a computing device, causes the computing device to compute a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance. Furthermore, the computer readable program, when executed on a computing device, causes the computing device to compute a conceptual similarity score between each of the plurality of potential source field matches with each of the plurality of logical target field names based on a linguistic distance using a
plurality of semantic relationship between a plurality of concept names in the knowledge graph. Finally, the computer readable program, when executed on a computing device, causes the computing device to compute a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score.
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[009] FIG. 1 is a functional block diagram of a system for learning to map between schemas, according to some embodiments of the present disclosure.
[010] FIG. 2 is an exemplary knowledge graph associated with a processor implemented method for learning to map between schemas, according to some embodiments of the present disclosure.
[011] FIG. 3A and 3B are exemplary flow diagrams for the processor implemented method for learning to map between schemas implemented by the system of FIG. 1, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[012] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[013] Embodiments herein provide a method and system for learning to map between schemas. The system for learning to map between schemas provides a one-to-one mapping between each field name of a source schema and each field name of a target schema. In an embodiment, the field name can be a physical field name. Initially, a logical field name is identified for each physical field name of the source schema and the target schema based on a knowledge graph. Further, a semantic data type is assigned to each logical field name using the knowledge graph, and a plurality of potential matches between source logical field names and target logical field names are filtered based on the semantic data type. A data similarity score and a conceptual similarity score are computed for the plurality of potential matches using the knowledge graph. Further, the data similarity score and the conceptual similarity score are combined to decide the top scoring matches between the source logical field names and the target logical field names, providing a one-to-one mapping between a complex and unclear source schema and the target schema. The knowledge graph is updated dynamically via learning from historically mapped fields between the source schema and the target schema.
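The final step of the overview above, combining the two scores and choosing the top scoring one-to-one matches, can be illustrated with a minimal Python sketch. The disclosure does not state how the scores are combined at this point, so the weighted average, the `weight` parameter, the greedy assignment, the function names and the score values below are all assumptions made for illustration only.

```python
def combine_scores(data_sim, concept_sim, weight=0.5):
    """Weighted average of the two scores per (source, target) pair.

    Assumption: the combination rule and the weight are illustrative only.
    """
    return {
        pair: weight * data_sim[pair] + (1.0 - weight) * concept_sim.get(pair, 0.0)
        for pair in data_sim
    }


def top_matches(scores):
    """Greedy one-to-one assignment: take pairs in descending score order."""
    used_sources, used_targets, mapping = set(), set(), {}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (source, target), _score in ranked:
        if source not in used_sources and target not in used_targets:
            mapping[target] = source
            used_sources.add(source)
            used_targets.add(target)
    return mapping


# Hypothetical scores for two source fields and two target fields.
data_sim = {
    ("st_dt", "start date"): 0.9,
    ("st_dt", "end date"): 0.6,
    ("exp_dt", "end date"): 0.8,
    ("exp_dt", "start date"): 0.5,
}
concept_sim = {
    ("st_dt", "start date"): 0.8,
    ("st_dt", "end date"): 0.3,
    ("exp_dt", "end date"): 0.7,
    ("exp_dt", "start date"): 0.2,
}

combined = combine_scores(data_sim, concept_sim)
mapping = top_matches(combined)
print(mapping)
```

A greedy assignment is used here only because it is short; an optimal assignment method could be substituted without changing the rest of the sketch.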
[014] Referring now to the drawings, and more particularly to FIG. 1 through 3B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[015] FIG. 1 is a functional block diagram of a system for learning to map between schemas, according to some embodiments of the present disclosure. The system 100 includes or is otherwise in communication with hardware processors 102, at least one memory such as a memory 104, and an I/O interface 112. The hardware processors 102, the memory 104, and the Input/Output (I/O) interface 112 may be coupled by a system bus such as a system bus 108 or a similar mechanism. In an embodiment, the hardware processors 102 can be one or more hardware processors.
[016] The I/O interface 112 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a printer and the like. Further, the I/O interface 112 may enable the system 100 to communicate with other devices, such as web servers and external databases.
[017] The I/O interface 112 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interface 112 may include one or more ports for connecting a number of computing systems or devices with one another or to another server computer.
[018] The one or more hardware processors 102 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the hardware processor 102 is configured to fetch and execute computer-readable instructions stored in the memory 104.
[019] The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 104 includes a plurality of modules 106 and a schema mapping unit 120. The memory 104 also includes a data repository 110 for storing data processed, received, and generated by one or more of the modules 106 and the schema mapping unit 120. The modules 106 may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
[020] The memory 104 also includes module(s) 106 and the data repository 110. The module(s) 106 include programs or coded instructions that supplement applications or functions performed by the system 100 for learning to map between schemas. The modules 106, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The modules 106 may also be used as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the modules 106 can be used by hardware, by computer-readable instructions executed by a processing unit, or by a combination thereof. The modules 106 can include various sub-modules (not shown).
[021] The data repository 110 may include a historical knowledge graph and other data. Further, the other data may serve as a repository for storing data that is processed, received, or generated as a result of the execution of one or more modules in the module(s) 106 and the modules associated with the schema mapping unit 120.
[022] Although the repository 110 is shown internal to the system 100, it will be noted that, in alternate embodiments, the repository 110 can also be implemented external to the system 100, where the repository 110 may be stored within a database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1) and/or existing data may be modified and/or non-useful data may be deleted from the database (not shown in FIG. 1). In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS).

[023] The schema mapping unit 120, executed by the one or more processors of the system 100, receives the source schema, the target schema and the knowledge graph. The source schema includes at least one source table and the target schema includes at least one target table. The at least one source table includes a plurality of source field names and the at least one target table includes a plurality of target field names. In an embodiment, each of the plurality of source field names and each of the plurality of target field names are physical field names.
[024] FIG. 2 is an exemplary knowledge graph associated with a processor implemented method for learning to map between schemas, according to some embodiments of the present disclosure.
[025] Referring to FIG. 2, the knowledge graph includes a plurality of concept names and a plurality of semantic relationships between each of the plurality of concept names. For example, the plurality of concept names includes pol, st_dt, Eff_dt, and end date. The plurality of semantic relationships includes a plurality of hyponym-hypernym relationships, a plurality of synonym relationships, a plurality of hasAttribute relationships, and a plurality of acronym relationships as depicted in FIG. 2. For example, the concept name “Policy” has a “hasAttribute” relationship with the concept name “Start date”. The concept name “St_dt” has an “acronym” relationship with “Start date”. The concept name “End date” and the concept name “Sem type: Date” have a “hyponym-hypernym” relationship. Some of the plurality of concept names are associated with a semantic data type. The logical field names (source logical field name, target logical field name) include one or more non-overlapping concept names. For example, a user account has a “birth date” as a field name, wherein the user account and birth date are associated concepts.
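One possible in-memory representation of such a knowledge graph is a set of typed edges between concept names. The following Python sketch is illustrative only: the class name, the method names, the edge direction and the convention of reading a semantic data type from a “Sem type: ” node are assumptions, and only the three example relationships come from the description of FIG. 2.

```python
from collections import defaultdict

# Edges are (subject, relation, object) triples taken from the FIG. 2 examples.
# Assumption: a "hypernym" edge points from the narrower to the broader concept.
EDGES = [
    ("Policy", "hasAttribute", "Start date"),
    ("St_dt", "acronym", "Start date"),
    ("End date", "hypernym", "Sem type: Date"),
]

SEM_TYPE_PREFIX = "Sem type: "


class KnowledgeGraph:
    """Hypothetical adjacency-list store for typed concept relationships."""

    def __init__(self, edges):
        self._out = defaultdict(list)
        for subject, relation, obj in edges:
            self._out[subject].append((relation, obj))

    def related(self, concept, relation):
        """Concepts reachable from `concept` over one edge of type `relation`."""
        return [obj for rel, obj in self._out.get(concept, []) if rel == relation]

    def expand_acronym(self, token):
        """Return the concept a token abbreviates, or the token itself."""
        targets = self.related(token, "acronym")
        return targets[0] if targets else token

    def semantic_type(self, concept):
        """Semantic data type of a concept, if a 'Sem type' hypernym exists."""
        for obj in self.related(concept, "hypernym"):
            if obj.startswith(SEM_TYPE_PREFIX):
                return obj[len(SEM_TYPE_PREFIX):]
        return None


kg = KnowledgeGraph(EDGES)
print(kg.expand_acronym("St_dt"))
print(kg.semantic_type("End date"))
```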
[026] Each of the plurality of field names is associated with a semantic type and semantic data. A plurality of associations exist between the semantic data type and each of the plurality of concept names, and a plurality of associations exist between the semantic data type and the semantic data. The knowledge graph is updated dynamically based on historical matching information between each of the plurality of target field names of the target schema and each of the plurality of source field names associated with the source schemas.
[027] Further, the schema mapping unit 120, executed by the one or more processors of the system 100, performs a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph. Further, a source semantic data type is identified for each of the plurality of logical source field names by utilizing the knowledge graph. The computation of the logical field name (logical source field name or logical target field name) from the physical source name or physical target name is performed by utilizing a dynamic programming approach, for example, the Viterbi algorithm. Here, a Hidden Markov Model is utilized for finding the most likely logical field name for a given physical field name.
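As a rough illustration of the dynamic programming step, the Python sketch below runs Viterbi decoding over a toy Hidden Markov Model. The modelling choices are assumptions: hidden states are taken to be concept names, observations are taken to be the tokens of a physical field name, and every probability is invented for the example. The disclosure does not specify how the model parameters are obtained.

```python
import math

FLOOR = 1e-12  # stands in for zero probability so that log() stays defined


def _log(p):
    return math.log(max(p, FLOOR))


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # columns[t][s] = (best log-probability of a path ending in s at t, backpointer)
    first = observations[0]
    columns = [{
        s: (_log(start_p.get(s, 0.0)) + _log(emit_p.get(s, {}).get(first, 0.0)), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            score, back = max(
                (prev[r][0] + _log(trans_p.get(r, {}).get(s, 0.0)), r) for r in states
            )
            column[s] = (score + _log(emit_p.get(s, {}).get(obs, 0.0)), back)
        columns.append(column)

    # Follow backpointers from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for t in range(len(columns) - 1, 0, -1):
        state = columns[t][state][1]
        path.append(state)
    return path[::-1]


# Toy model (all numbers are made up for illustration).
states = ["contributor", "contract", "benefit", "expiry", "date"]
start_p = {"contributor": 0.3, "contract": 0.3, "benefit": 0.2, "expiry": 0.1, "date": 0.1}
trans_p = {
    "contributor": {"benefit": 0.7},
    "contract": {"benefit": 0.2, "date": 0.5},
    "benefit": {"expiry": 0.6},
    "expiry": {"date": 0.9},
}
emit_p = {
    "contributor": {"contr": 0.5},
    "contract": {"contr": 0.5},
    "benefit": {"ben": 0.9},
    "expiry": {"exp": 0.8},
    "date": {"dt": 0.9},
}

decoded = viterbi("contr ben exp dt".split(), states, start_p, trans_p, emit_p)
print(" ".join(decoded))  # contributor benefit expiry date
```

In this toy model the token "contr" is equally likely to come from "contributor" or "contract"; the transition probabilities resolve the ambiguity, which is the reason for decoding the whole token sequence jointly rather than token by token.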
[028] Further, the schema mapping unit 120, executed by one or more processors of the system 100, simultaneously performs a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph. Further, a target semantic data type is identified for each of the plurality of logical target field names by utilizing the knowledge graph.
[029] A schema (the source schema or the target schema) is represented as $S = \{f_i\}$, a plurality of fields $f_i$, where each field is defined as $f_i = (p_i, l_i, s_i)$: $p_i$ is the physical or database name of the field, $l_i$ is its logical name and $s_i$ is the semantic type of the field. For brevity of description, the terms “semantic type” and “semantic data type” are used interchangeably throughout the description. For example, a physical field name is represented in the format “table physical name.field physical name”, and an example physical name is “ben esc.contr ben exp dt”. The logical name for this physical name is “benefit escalation.contributor benefit expiry date”. The semantic type $s_i$ takes values from a set of categorical types including “Date”, “Amount”, “Rate”, and the like.
[030] Let a target schema be $S_x$ and a source schema be $S_y$ for a similar domain. Let the mapping between the plurality of target field names $f_i$ of the target schema $S_x$ and the plurality of source field names $f_{i'}$ of the source schema $S_y$ be represented as $M(S_x, S_y)$, where $f_i \in S_x$ and $f_{i'} \in S_y$.
[031] Further, the schema mapping unit 120, executed by one or more processors of the system 100, computes a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names. The semantic data type is utilized to reduce number of potential matches.
[032] In an embodiment, a set of semantic types is defined and one semantic type from the set is assigned as the semantic type $s_i$ for a field $f_i$ in both the source and the target. For example, the set of semantic types includes date, numeric data, categorical data, string and the like. The numeric data includes frequency, number, rate, money, amount, age, day, month, year, century, period and the like. In another embodiment, a semantic type map $M_s(s_1, s_2)$ is defined that specifies whether or not a target field having semantic type $s_1$ can map to a source field having semantic type $s_2$. The knowledge of the semantic type map is obtained from domain experts. The candidate maps for a target field $f_i^x$ are then all source fields $f_j^y$ having semantic type $s_j^y$ such that $M_s(s_i^x, s_j^y) \neq 0$.
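The filtering by semantic type can be sketched as follows. The `Field` tuple mirrors the (physical name, logical name, semantic type) triple used in this description; the contents of `TYPE_MAP` and the example fields are hypothetical and stand in for the map obtained from domain experts.

```python
from typing import NamedTuple


class Field(NamedTuple):
    physical: str       # physical or database name
    logical: str        # logical name
    semantic_type: str  # e.g. "Date", "Amount", "Rate"


# Hypothetical semantic type map: a non-zero entry means that a target field of
# the first type may map to a source field of the second type.
TYPE_MAP = {
    ("Date", "Date"): 1,
    ("Amount", "Amount"): 1,
    ("Rate", "Rate"): 1,
    ("Amount", "Rate"): 0,
}


def candidate_matches(target, source_fields, type_map):
    """Keep only the source fields whose semantic type is compatible with the target."""
    return [f for f in source_fields
            if type_map.get((target.semantic_type, f.semantic_type), 0) != 0]


target = Field("exp_dt", "expiry date", "Date")
sources = [
    Field("end_dt", "end date", "Date"),
    Field("prem_amt", "premium amount", "Amount"),
    Field("eff_dt", "effective date", "Date"),
]
print([f.physical for f in candidate_matches(target, sources, TYPE_MAP)])
```

Pairs that are absent from the map are treated as incompatible in this sketch; that default is a design choice of the example.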
[033] Further, the schema mapping unit 120, executed by one or more processors of the system 100, computes the data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance. The data based distance is computed by comparing a target data statistics associated with target field name and a source data statistics associated with a source field name. The target data statistics is computed on the semantic data associated with the target field name and the source data statistics is computed on the semantic data associated with the source field name. For example, the source data statistics and the target data statistics includes mean and median for numeric data types and distribution of values (normalized histogram) for categorical data types like FLAG, ID, and CODE. Here ID is identification number and CODE is categorical code assigned for a category.
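A minimal sketch of a data based distance follows. The description names the statistics (mean and median for numeric data, normalized histogram for categorical data) but not how they are compared, so the comparison functions below, a relative gap for numeric statistics and half the L1 difference between histograms, are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, median


def relative_gap(a, b):
    """Relative difference between two statistics, clipped to [0, 1]."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0 else min(1.0, abs(a - b) / scale)


def numeric_distance(target_values, source_values):
    """Average of the relative gaps between the means and between the medians."""
    return 0.5 * (relative_gap(mean(target_values), mean(source_values))
                  + relative_gap(median(target_values), median(source_values)))


def normalized_histogram(values):
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def categorical_distance(target_values, source_values):
    """Half the L1 difference between the two normalized histograms (0 = identical)."""
    ht = normalized_histogram(target_values)
    hs = normalized_histogram(source_values)
    return 0.5 * sum(abs(ht.get(k, 0.0) - hs.get(k, 0.0)) for k in set(ht) | set(hs))


print(numeric_distance([10, 20, 30], [20, 40, 60]))
print(categorical_distance(["Y", "Y", "N", "N"], ["Y", "N", "N", "N"]))
```

Both functions return values between 0 and 1, which keeps the data distance on a bounded scale when it is later combined with another distance.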
[034] Further, the schema mapping unit 120, executed by one or more processors of the system 100, computes the conceptual similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationships between a plurality of concept names in the knowledge graph. The linguistic distance uses the notion of derivation length or derivation confidence from Natural Logic Inference, where the source logical name is derived from the target logical name using a plurality of transformations following relations from the knowledge graph, such as synonym, hyponym/hypernym, etc. The plurality of transformations includes drop prefix concept (generalization), add prefix concept (specialization), replace by synonym, replace by hypernym, and replace by hyponym. In general, if multiple derivation paths are available in the knowledge graph for the source logical name from the target logical name, the length of the shortest derivation path is taken as the linguistic distance between the source logical name and the target logical name.
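The shortest derivation path can be sketched as a breadth-first search over logical names, where each step applies one of the listed transformations. The representation of a logical name as a tuple of concepts, the `replacements` table (standing in for synonym, hypernym and hyponym edges of the knowledge graph), and the bounds `max_len` and `max_steps` are assumptions of this sketch.

```python
from collections import deque


def neighbours(name, replacements, vocabulary, max_len):
    """Logical names reachable from `name` by a single transformation."""
    if len(name) > 1:
        yield name[1:]                      # drop prefix concept (generalization)
    if len(name) < max_len:
        for concept in vocabulary:
            yield (concept,) + name         # add prefix concept (specialization)
    for i, concept in enumerate(name):
        for other in replacements.get(concept, ()):
            yield name[:i] + (other,) + name[i + 1:]  # synonym / hypernym / hyponym


def linguistic_distance(target, source, replacements, vocabulary, max_len=4, max_steps=6):
    """Length of the shortest derivation of `source` from `target`, or None if not found."""
    target, source = tuple(target), tuple(source)
    queue = deque([(target, 0)])
    seen = {target}
    while queue:
        name, steps = queue.popleft()
        if name == source:
            return steps
        if steps == max_steps:
            continue
        for nxt in neighbours(name, replacements, vocabulary, max_len):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


# Hypothetical relations: "start" and "effective" are treated as synonyms.
replacements = {"start": {"effective"}, "effective": {"start"}}
vocabulary = ["date", "effective", "policy", "start"]

print(linguistic_distance(["policy", "start", "date"], ["effective", "date"],
                          replacements, vocabulary))
```

Here the derivation drops the prefix concept "policy" and then replaces "start" by its synonym "effective", giving a distance of 2.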
[035] Further, the schema mapping unit 120, executed by one or more processors of the system 100, computes a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score. The data similarity score and the conceptual similarity score are combined to obtain the overall distance between a target field and a source field, as given in equation 1.
$$d(x, y) = a \, d_D(x, y) + (1 - a) \, d_L(x, y) \tag{1}$$
where $a$ is a smoothing constant, $d_D$ is the data distance and $d_L$ is the linguistic distance.
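Equation 1 and the selection of top scoring matches can be sketched as below. The sketch assumes both distances have already been scaled to a comparable range; the candidate names, the distance values and the default of 0.5 for the smoothing constant are illustrative only.

```python
def combined_distance(data_distance, linguistic_distance, a=0.5):
    """Equation 1: weighted combination of the data and linguistic distances."""
    return a * data_distance + (1 - a) * linguistic_distance


def rank_matches(candidates, a=0.5, top_k=3):
    """Sort (name, data distance, linguistic distance) triples by overall distance."""
    scored = [(combined_distance(d, l, a), name) for name, d, l in candidates]
    return sorted(scored)[:top_k]


# Hypothetical candidate source fields for one target field.
candidates = [
    ("birth_dt", 0.70, 0.80),
    ("eff_dt", 0.10, 0.20),
    ("end_dt", 0.15, 0.60),
]
for distance, name in rank_matches(candidates):
    print(name, round(distance, 3))
```

A smaller overall distance indicates a better match, so the first entry of the ranked list is the top scoring candidate.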
[036] In an embodiment, the knowledge graph is updated based on a graph updation algorithm, described here. If the historical mapping data includes annotations of the individual concepts in the target logical name and the source logical name and their alignments, then the knowledge graph can be updated by incrementing weights associated with the individual source and target concepts and weights associated with the aligned source and target concept pairs. The weights are calculated based on the number of occurrences in the historical mapping data. This is called the Parameter estimation step, and in this case it constitutes the only step in the Knowledge Graph Updation algorithm. If the historical mapping data does not contain annotations of the individual concepts in the target logical name and the source logical name and their alignments, then these need to be inferred. This is called the Inference step. In this step, the most likely concept annotation of the source and target logical names along with their most likely alignment is computed using the current concept and relationship weights in the knowledge graph. The overall algorithm for Knowledge Graph Updation then iterates over the Inference step and the Parameter estimation step until no more changes happen.
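The alternation between the Inference step and the Parameter estimation step can be sketched as a simple iterate-until-stable loop. This is a simplified illustration, not the disclosed algorithm: it assumes logical names are already split into concepts, considers only one-to-one alignments found by brute force, adds a bonus for identical concepts as a stand-in for prior knowledge, and tracks only pair weights.

```python
from collections import Counter
from itertools import permutations


def infer_alignment(target, source, pair_weight):
    """Inference step: most likely one-to-one alignment under the current weights.

    Assumes len(source) >= len(target); identical concepts receive a bonus of 1.
    """
    def score(candidate):
        return sum(pair_weight[(t, s)] + (t == s) for t, s in zip(target, candidate))

    best = max(permutations(source, len(target)), key=score)
    return list(zip(target, best))


def estimate_weights(alignments):
    """Parameter estimation step: count occurrences of aligned concept pairs."""
    weights = Counter()
    for alignment in alignments:
        weights.update(alignment)
    return weights


def update_graph_weights(history, max_iters=20):
    """Iterate inference and estimation until the alignments stop changing."""
    pair_weight = Counter()
    previous = None
    for _ in range(max_iters):
        alignments = [infer_alignment(t, s, pair_weight) for t, s in history]
        if alignments == previous:
            break
        pair_weight = estimate_weights(alignments)
        previous = alignments
    return pair_weight


# Hypothetical historical mappings (target concepts, source concepts) without annotations.
history = [
    (["policy", "start", "date"], ["policy", "effective", "date"]),
    (["start", "date"], ["effective", "date"]),
]
weights = update_graph_weights(history)
print(weights[("start", "effective")])
```

When the historical data already carries alignments, only `estimate_weights` would be needed, which corresponds to the annotated case described above.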
[037] FIG. 3A and 3B are exemplary flow diagrams for a processor implemented method for learning to map between schemas implemented by the system of FIG. 1, according to some embodiments of the present disclosure. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 300 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 300 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 300, or an alternative method. Furthermore, the method 300 can be implemented in any suitable hardware, software, firmware, or combination thereof.
[038] At 302, the method 300, receives, by one or more hardware processors, the source schema, the target schema and the knowledge graph. The source schema includes at least one source table and the target schema includes at least one target table. The at least one source table includes a plurality of source field names and the at least one target table includes a plurality of target field names. The knowledge graph includes the plurality of concept names and the plurality of semantic relationships between each of the plurality of concept names. The plurality of semantic relationships comprises the plurality of hyponym-hypernym relationships, the plurality of synonym relationships, the plurality of hasAttribute relationships, and the plurality of acronym relationships. Each of the plurality of concept names is associated with a semantic data type and a semantic data. A plurality of associations exist between the semantic data type and each of the plurality of concept names. A plurality of associations exist between the semantic data type and the semantic data. The knowledge graph is updated dynamically based on a historical matching information between each of the plurality of target field names of the target schema and each of the plurality of source field names associated with a plurality of source schemas.
[039] At 304, the method 300, performs, by the one or more hardware processors, the computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph and identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph.
[040] At 306, the method 300, simultaneously performs, by the one or more hardware processors, the computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph and identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph.
[041] At 308, the method 300, computes, by the one or more hardware processors, the plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names.

[042] At 310, the method 300, computes, by the one or more hardware processors, the data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance. The data based distance is computed by comparing a target data statistics associated with target field name and a source data statistics associated with a source field name, wherein the target data statistics is computed on the semantic data associated with the target field name and the source data statistics is computed on the semantic data associated with the source field name.
[043] At 312, the method 300, computes, by the one or more hardware processors, the conceptual similarity score between each of the plurality of potential source field matches with each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationship between the plurality of concept names in the knowledge graph.
[044] At 314, the method 300, computes, by the one or more hardware processors, a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score. The initial filtering based on semantic type provided potential matches. Further filtering based on the combination of data similarity score and the conceptual similarity score has led to more accurate matches between the fields of the source schema and the target schema.
[045] In an embodiment, the system 100 is evaluated experimentally as follows. One target schema and 8 source schemas with known field mappings are utilized. Each source schema is tested by using the known target-source mappings of the remaining 7 source schemas as historical mapping information. The knowledge graph is updated using the historical mapping information, and the updated knowledge graph is then utilized for creating field mappings between the target schema and the test source schema. The accuracy of this target-source mapping is evaluated and denoted as A1. Similarly, a baseline target-source mapping is created for the same test source schema without using the historical mapping information. The accuracy of the baseline mapping is evaluated and denoted as A2. A1 is compared with A2 to quantify the improvement in mapping accuracy; the mapping accuracy improved by 30-60% by using knowledge graph updation.
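The leave-one-out protocol of this experiment can be sketched as a loop over the source schemas. The `evaluate` callable is hypothetical: it stands for "update the knowledge graph from the given history, map the test schema to the target schema, and return the accuracy". The stub used below only counts the history size to show how the data is split and does not reproduce any reported result.

```python
def leave_one_out(source_schemas, evaluate):
    """Return (test schema, A1, A2) for every source schema.

    A1 uses the mappings of the remaining schemas as history; A2 is the
    baseline obtained with no historical mapping information.
    """
    results = []
    for i, test_schema in enumerate(source_schemas):
        history = source_schemas[:i] + source_schemas[i + 1:]
        a1 = evaluate(test_schema, history)
        a2 = evaluate(test_schema, [])
        results.append((test_schema, a1, a2))
    return results


# Placeholder evaluation: returns the number of historical schemas that were used.
schemas = [f"source_{k}" for k in range(1, 9)]
results = leave_one_out(schemas, lambda test_schema, history: len(history))
print(results[0])
```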
[046] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[047] The embodiments of the present disclosure herein address the unresolved problem of automating schema mapping in an accurate manner. Further, the intervention of the domain expert is avoided in the present disclosure. The historical schema mapping information available in the knowledge graph is utilized for computing logical field names from the physical field names. Here, complex and unclear source schema fields and target schema fields are refined initially and further mapped by utilizing the scoring technique. The knowledge graph is updated dynamically via learning from historically mapped fields between the source schema and the target schema. The updated knowledge graph acts as a repository for future schema mappings.
[048] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[049] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[050] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.

[051] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e. non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[052] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.

WE CLAIM:
1. A processor-implemented method, the method comprising:
receiving, by one or more hardware processors, a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a plurality of target field names (302);
performing (304), by the one or more hardware processors:
a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph; and
identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph;
simultaneously performing (306), by the one or more hardware processors:
a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph; and
identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph;
computing, by the one or more hardware processors, a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names (308);
computing, by the one or more hardware processors, a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance (310);

computing, by the one or more hardware processors, a conceptual similarity score between each of the plurality of potential source field matches with each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationship between a plurality of concept names in the knowledge graph (312); and
computing, by the one or more hardware processors, a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score (314).
2. The method as claimed in claim 1, wherein the knowledge graph comprises the plurality of concept names and the plurality of semantic relationships between each of the plurality of concept names, wherein the plurality of semantic relationships comprises a plurality of hyponym-hypernym relationship, a plurality of synonym relationships, a plurality of hasAttribute relationships, and a plurality of acronym relationships, wherein each of the plurality of concept names is associated with a semantic data type and a semantic data, wherein a plurality of associations exist between the semantic data type and each of the plurality of concept names, and wherein a plurality of associations exist between the semantic data type and the semantic data.
3. The method as claimed in claim 1, wherein the knowledge graph is updated dynamically based on a historical matching information between each of the plurality of target field names of the target schema and each of the plurality of source field names associated with a plurality of source schemas.
4. The method as claimed in claim 1, wherein the data based distance is computed by comparing a target data statistics associated with target field name and a source data statistics associated with a source field name, wherein the target data statistics is computed on the semantic data associated with the target field name and the source data statistics is computed on the semantic data associated with the source field name.
5. A system (100), the system (100) comprising:
at least one memory (104) storing programmed instructions;
one or more Input/Output (I/O) interfaces (112); and
one or more hardware processors (102) operatively coupled to the at least one memory (104), wherein the one or more hardware processors (102) are configured by the programmed instructions to:
receive a source schema, a target schema and a knowledge graph, wherein the source schema comprises at least one source table and the target schema comprises at least one target table, wherein the at least one source table comprises a plurality of source field names and the at least one target table comprises a plurality of target field names;
perform:
a computation of a logical source field name for each of the plurality of source field names to obtain a plurality of corresponding logical source field names by utilizing the knowledge graph; and
identification of a source semantic data type for each of the plurality of logical source field names by utilizing the knowledge graph;
simultaneously perform:
a computation of a logical target field name for each of the plurality of target field names to obtain a plurality of corresponding logical target field names by utilizing the knowledge graph; and
identification of a target semantic data type for each of the plurality of logical target field names by utilizing the knowledge graph;
compute a plurality of potential source field matches for each of the plurality of target field names by matching the source semantic data type corresponding to each of the plurality of logical source field names and the target semantic data type corresponding to each of the plurality of logical target field names;
compute a data similarity score between each of the plurality of potential source field matches and each of the plurality of logical target field names based on a data based distance;
compute a conceptual similarity score between each of the plurality of potential source field matches with each of the plurality of logical target field names based on a linguistic distance using a plurality of semantic relationship between a plurality of concept names in the knowledge graph; and
compute a plurality of matching scores between the source schema and the target schema based on the data similarity score and the conceptual similarity score.
6. The system as claimed in claim 5, wherein the knowledge graph comprises the plurality of concept names and the plurality of semantic relationships between each of the plurality of concept names, wherein the plurality of semantic relationships comprises a plurality of hyponym-hypernym relationship, a plurality of synonym relationships, a plurality of hasAttribute relationships, and a plurality of acronym relationships, wherein each of the plurality of concept names is associated with a semantic data type and a semantic data, wherein a plurality of associations exist between the semantic data type and each of the plurality of concept names, and wherein a plurality of associations exist between the semantic data type and the semantic data.
7. The system as claimed in claim 5, wherein the knowledge graph is updated dynamically based on a historical matching information between each of the plurality of target field names of the target schema and each of the plurality of source field names associated with a plurality of source schemas.

8. The system as claimed in claim 5, wherein the data based distance is computed by comparing a target data statistics associated with target field name and a source data statistics associated with a source field name, wherein the target data statistics is computed on the semantic data associated with the target field name and the source data statistics is computed on the semantic data associated with the source field name.

Documents

Application Documents

#	Name	Date
1	201921051886-STATEMENT OF UNDERTAKING (FORM 3) [13-12-2019(online)].pdf	2019-12-13
2	201921051886-REQUEST FOR EXAMINATION (FORM-18) [13-12-2019(online)].pdf	2019-12-13
3	201921051886-FORM 18 [13-12-2019(online)].pdf	2019-12-13
4	201921051886-FORM 1 [13-12-2019(online)].pdf	2019-12-13
5	201921051886-FIGURE OF ABSTRACT [13-12-2019(online)].jpg	2019-12-13
6	201921051886-DRAWINGS [13-12-2019(online)].pdf	2019-12-13
7	201921051886-DECLARATION OF INVENTORSHIP (FORM 5) [13-12-2019(online)].pdf	2019-12-13
8	201921051886-COMPLETE SPECIFICATION [13-12-2019(online)].pdf	2019-12-13
9	Abstract1.jpg	2019-12-17
10	201921051886-FORM-26 [24-03-2020(online)].pdf	2020-03-24
11	201921051886-Proof of Right [19-06-2020(online)].pdf	2020-06-19
12	201921051886-FER.pdf	2021-10-19
13	201921051886-OTHERS [20-01-2022(online)].pdf	2022-01-20
14	201921051886-FER_SER_REPLY [20-01-2022(online)].pdf	2022-01-20
15	201921051886-CLAIMS [20-01-2022(online)].pdf	2022-01-20
16	201921051886-US(14)-HearingNotice-(HearingDate-03-11-2023).pdf	2023-09-25
17	201921051886-FORM-26 [26-10-2023(online)].pdf	2023-10-26
18	201921051886-Correspondence to notify the Controller [26-10-2023(online)].pdf	2023-10-26
19	201921051886-Written submissions and relevant documents [09-11-2023(online)].pdf	2023-11-09
20	201921051886-PatentCertificate28-02-2024.pdf	2024-02-28
21	201921051886-IntimationOfGrant28-02-2024.pdf	2024-02-28

Search Strategy

1	Search_Strategy_201921051886E_12-10-2021.pdf