System And Method For Data Management

Abstract: System and method for data management are disclosed. The method includes receiving source data associated with various data formats corresponding to distinct sources. Prominent and non-prominent attributes corresponding to the distinct sources are defined. A conformance score associated with the prominent attributes, and a deviation score associated with the non-prominent attributes are computed. The conformance score is indicative of conformance of the prominent attributes to a source of the plurality of distinct sources, and the deviation score indicative of deviation of the non-prominent attributes from the source. A lexical similarity score is computed based on a lexical analysis of the source data and a target data based on an ontological model. A matching score is assigned to the source data based on the conformance score, the deviation score and the lexical similarity score, and the source data is matched with the target data based on the matching score.

Patent Information

Application #

Filing Date

27 March 2017

Publication Number

39/2018

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

ip@legasis.in

Parent Application

Patent Number

Legal Status

Grant Date

2024-02-05

Renewal Date

Applicants

Tata Consultancy Services Limited

Nirmal Building, 9th Floor, Nariman Point, Mumbai 400021, Maharashtra, India

Inventors

1. SWAIN, Debiprasad

Tata Consultancy Services Limited, Kalinga Park IT/ITES Special Economic Zone, Plot - 35, Chandaka Industrial Estate, Patia, Chandrasekharpur, Bhubaneswar - 751 024, Odisha, India

2. DASH, Raiguru Birendra Kumar

Tata Consultancy Services Limited, Kalinga Park IT/ITES Special Economic Zone, Plot - 35, Chandaka Industrial Estate, Patia, Chandrasekharpur, Bhubaneswar - 751 024, Odisha, India

3. SAHOO, Parshuram

Tata Consultancy Services Limited, Kalinga Park IT/ITES Special Economic Zone, Plot - 35, Chandaka Industrial Estate, Patia, Chandrasekharpur, Bhubaneswar - 751 024, Odisha, India

Specification

Claims:1. A processor-implemented method for data management comprising:
receiving source data associated with a plurality of data formats corresponding to a plurality of distinct sources, via one or more hardware processors;
defining one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources, via the one or more hardware processors;
computing a conformance score associated with the one or more prominent attributes in the source data, via the one or more hardware processors, the conformance score indicative of conformance of the one or more prominent attributes to a source of the plurality of distinct sources;
computing a deviation score associated with the one or more non-prominent attributes in the source data, via the one or more hardware processors, the deviation score indicative of deviation of the non-prominent attributes from the source;
computing a lexical similarity score based on a lexical analysis of the source data and a target data based on an ontological model, via the one or more hardware processors;
assigning a matching score to the source data based on the conformance score, the deviation score and the lexical similarity score, via the one or more hardware processors, the matching score indicative of extent of matching of formats of the source data and the target data; and
matching the source data with the target data based on the matching score, via the one or more hardware processors, to obtain a matched source data-target data pair.

2. The method as claimed in claim 1, wherein receiving the source data comprises receiving one or more input files comprising the source data.

3. The method as claimed in claim 2, wherein the one or more prominent attributes comprises at least one of prominent table names, prominent field names, and master data columns for each table in the one or more input files.

4. The method as claimed in claim 3, wherein computing the conformance score comprises:
examining naming convention of the one or more input files, and
examining prominent table names, prominent field names, and master data columns for each table in the one or more input files, wherein said examining determines conformance of the one or more prominent attributes to the source of the plurality of distinct sources.

5. The method as claimed in claim 1, wherein the lexical analysis of the source data and the target data comprises:
accessing, from the ontological model, a set of compiled nouns derived from the source data associated with the plurality of sources;
facilitating, by the ontological model, matching the set of compiled nouns with domain specific precompiled nouns to obtain a set of matching noun-domain pairs;
assigning, by the ontological model, the lexical similarity score to the matching noun-domain pairs; and
selecting one or more domains associated with the compiled nouns-domain pairs having highest values of the lexical similarity score as the target data.

6. The method as claimed in claim 1, further comprising storing the matched source data-target data pair and the matching score associated with the matching.

7. The method as claimed in claim 6, further comprising:
comparing the matching score associated with the matched source data-target data pair with a threshold value of the matching score; and
executing transformation of the format of source data to the format of the target data on determination of the matching score being greater than the threshold value of the matching score.

8. A system for data management comprising:
one or more memories storing instructions; and
one or more hardware processors coupled to the one or more memories, wherein said one or more hardware processors are configured by said instructions to:
receive source data associated with a plurality of data formats corresponding to a plurality of distinct sources;
define one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources;
compute a conformance score associated with the one or more prominent attributes in the source data, the conformance score indicative of conformance of the one or more prominent attributes to a source of the plurality of distinct sources;
compute a deviation score associated with the one or more non-prominent attributes in the source data, the deviation score indicative of deviation of the non-prominent attributes from the source;
compute a lexical similarity score based on a lexical analysis of the source data and a target data based on an ontological model;
assign a matching score to the source data based on the conformance score, the deviation score and the lexical similarity score, the matching score indicative of extent of matching of formats of the source data and the target data; and
match the source data with the target data based on the matching score to obtain a matched source data-target data pair.

9. The system as claimed in claim 8, wherein to receive the source data, the one or more hardware processors are further configured by the instructions to receive one or more input files comprising the source data.

10. The system as claimed in claim 8, wherein the one or more prominent attributes comprises at least one of prominent table names, prominent field names, and master data columns for each table in the one or more input files.

11. The system as claimed in claim 10, wherein to compute the conformance score, the one or more hardware processors are further configured by the instructions to:
examine naming convention of the one or more input files, and
examine prominent table names, prominent field names, and master data columns for each table in the one or more input files to determine conformance of the one or more prominent attributes to the source of the plurality of distinct sources.

12. The system as claimed in claim 8, wherein to perform the lexical analysis of the source data and the target data, the one or more hardware processors are further configured by the instructions to:
access, from the ontological model, a set of compiled nouns derived from the source data associated with the plurality of sources;
facilitating, by the ontological model, matching the set of compiled nouns with domain specific precompiled nouns to obtain a set of matching noun-domain pairs;
assign, by the ontological model, the lexical similarity score to the matching noun-domain pairs; and
select one or more domains associated with the compiled nouns-domain pairs having highest values of the lexical similarity score as the target data.

13. The system as claimed in claim 8, wherein the one or more hardware processors are further configured by the instructions to store the matched source data-target data pair and the matching score associated with the matching.

14. The system as claimed in claim 8, wherein the one or more hardware processors are further configured by the instructions to
compare the matching score associated with the matched source data-target data pair with a threshold value of the matching score; and
execute transformation of the format of source data to the format of the target data on determination of the matching score being greater than the threshold value of the matching score.
Description: FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
SYSTEM AND METHOD FOR DATA MANAGEMENT

Applicant:
Tata Consultancy Services Limited
A Company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The present disclosure in general relates to data management, and more particularly, to a system and method for data induction by identifying and mapping source data to target data.
BACKGROUND
[002] In an enterprise, data management is viewed as a common point for various core functions, including governance, compliance, risk management, and effective client relationships. These functions rely on the accuracy of data for effective decision making. Various business groups, such as risk mitigation strategy groups, compliance groups, and operations management groups, consume the same information in different ways. However, in the service industry, there are mismatches between the format of data or information retrieved from the customer end and the format acceptable for processing said information at the vendor's end. Such discrepancies may lead to material disputes about data quality, definitions, information storage, and control.
[003] Typically, enterprises incorporate data management systems to avoid such disputes. These systems are responsible for establishing standards of conformity, integrity and reliability, thereby increasing efficiency and throughput. For instance, conventional data management systems first interpret the input data source and then map the input data to an existing desired format. However, establishing the mapping between the input data source and the existing desired format is a time-intensive and resource-intensive task. Moreover, such a task depends on the expertise of the professionals working on the mapping.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor-implemented method for data management is provided. The method includes receiving, by one or more hardware processors, source data associated with a plurality of data formats corresponding to a plurality of distinct sources. Further, the method includes defining, by the one or more hardware processors, one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources. Furthermore, the method includes computing, by the one or more hardware processors, a conformance score associated with the one or more prominent attributes in the source data, the conformance score indicative of conformance of the one or more prominent attributes to a source of the plurality of distinct sources. Moreover the method includes computing, by the one or more hardware processors a deviation score associated with the one or more non-prominent attributes in the source data, the deviation score indicative of deviation of the non-prominent attributes from the source. Also, the method includes computing, by the one or more hardware processors a lexical similarity score based on a lexical analysis of the source data and a target data based on an ontological model. Additionally, the method includes assigning, by the one or more hardware processors, a matching score to the source data based on the conformance score, the deviation score and the lexical similarity score. The matching score is indicative of extent of matching of formats of the source data and the target data. Also, the method includes matching, by the one or more hardware processors, the source data with the target data based on the matching score to obtain a matched source data-target data pair.
[005] In another embodiment, a system for data management is provided. The system includes one or more memories storing instructions; and one or more hardware processors coupled to the one or more memories. The one or more hardware processors are configured by said instructions to receive source data associated with a plurality of data formats corresponding to a plurality of distinct sources. Further, the one or more hardware processors are configured by said instructions to define one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources. Furthermore, one or more hardware processors are configured by said instructions to compute a conformance score associated with the one or more prominent attributes in the source data, the conformance score indicative of conformance of the one or more prominent attributes to a source of the plurality of distinct sources. Also, the one or more hardware processors are configured by said instructions to compute a deviation score associated with the one or more non-prominent attributes in the source data, the deviation score indicative of deviation of the non-prominent attributes from the source. Additionally, the one or more hardware processors are configured by said instructions to compute a lexical similarity score based on a lexical analysis of the source data and a target data based on an ontological model. Also, the one or more hardware processors are configured by said instructions to assign a matching score to the source data based on the conformance score, the deviation score and the lexical similarity score. The matching score is indicative of extent of matching of formats of the source data and the target data. Also, the one or more hardware processors are configured by said instructions to match the source data with the target data based on the matching score to obtain a matched source data-target data pair.
[006] In yet another implementation, a non-transitory computer-readable medium having embodied thereon a computer program for executing a method for data management is provided. The method includes receiving source data associated with a plurality of data formats corresponding to a plurality of distinct sources. Further, the method includes defining one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources. Furthermore, the method includes computing a conformance score associated with the one or more prominent attributes in the source data, the conformance score indicative of conformance of the one or more prominent attributes to a source of the plurality of distinct sources. Moreover the method includes computing a deviation score associated with the one or more non-prominent attributes in the source data, the deviation score indicative of deviation of the non-prominent attributes from the source. Also, the method includes computing a lexical similarity score based on a lexical analysis of the source data and a target data based on an ontological model. Additionally, the method includes assigning a matching score to the source data based on the conformance score, the deviation score and the lexical similarity score. The matching score is indicative of extent of matching of formats of the source data and the target data. Also, the method includes matching the source data with the target data based on the matching score to obtain a matched source data-target data pair.
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF DRAWINGS
[008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[009] FIG. 1 illustrates a network implementation of a system for data management according to some embodiments of the present disclosure.
[010] FIG. 2 illustrates a system for data management according to some embodiments of the present disclosure.
[011] FIGS. 3A and 3B illustrate example representation of matching of source data with target data for data management according to some embodiments of the present disclosure.
[012] FIG. 4 illustrates a flowchart for data management according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
[013] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[014] In a data warehouse (DWH) setup, data is ingested from a variety of sources. Typically, business intelligence is derived from said data by following a Load-Transform-Aggregate (LTA) path. For example, data from input sources is first mapped to a standardized/prescribed format, and necessary transformations are carried out, if needed, to comply with the standardized/prescribed format. Thereafter, the data is moved to staging, from where further transformation, aggregation and analytics are performed. Subsequently, the data is visualized/analysed through a user interface. On the input side of the DWH, the source formats need to be interpreted and mapped to a predefined format. Said mapping is a manual and time-consuming activity, and is often prohibitive in large programmes. The present disclosure provides a system and method that address the above-mentioned technical problems recognized in conventional data management systems. The data supplied to a DWH is mostly from enterprise application databases: a) these by and large follow standard/interpretable naming conventions, and b) they are also often recognizable through the presence of a specific set of files and formats. This lexical and compositional metadata can be utilized to infer/recognize the input file format and carry out mapping and transformation in a predefined and automated way. The disclosed embodiments facilitate eliminating or reducing the challenge associated with understanding the input data source and mapping data received from said data sources to an existing desired format, thereby making the mapping efficient and accurate. For instance, the disclosed system facilitates reducing the cycle time of researching/understanding enterprise data sources, removing the pain of laboriously establishing mappings, eliminating error through automation, and improving accuracy through continuous learning in data management.
[015] While aspects of the described system and method for data management are described with respect to an enterprise, they may be implemented in any number of different computing systems, environments, and/or configurations. The embodiments are described in the context of the following exemplary system.
[016] Referring now to FIG. 1, a network implementation 100 of a system 102 for data management in an enterprise is illustrated, in accordance with an embodiment of the present disclosure. Although the present disclosure is explained by considering that the system 102 is implemented as a software program on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, cloud, and the like. It will be understood that the system 102 may be accessed by multiple users through one or more user devices 104-1, 104-2…104-N, collectively referred to as user devices 104 hereinafter, or applications residing on the user devices 104. Examples of the user devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a hand-held device, and a workstation.
[017] The user devices 104 are communicatively coupled to the system 102 through a network 106. Said devices can be enhanced by analytics (such as predictive analytics, transactional analytics, and real-time analytics) and/or AI (Artificial Intelligence) engines.
[018] In one implementation, the network 106 may be a wireless network, a wired network or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like. Various components and functionalities of the system 102 are described further with reference to FIG. 2.
[019] FIG. 2 illustrates a block diagram of a system 200 for data management in accordance with an example embodiment. In an example embodiment, the system 200 may be embodied in, or be in direct communication with, a system such as the system 102 (FIG. 1). In an embodiment, the system 200 facilitates management of data in an enterprise. The system 200 includes or is otherwise in communication with at least one processor such as a processor 202, at least one memory such as a memory 204, and an I/O interface 206. The processor 202, the memory 204, and the I/O interface 206 may be coupled by a system bus such as a system bus 208 or a similar mechanism.
[020] The at least one processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 204.
[021] The I/O interface 206 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. Further, the I/O interface 206 may enable the system 200 to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 206 may facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 206 may include one or more ports for connecting a number of devices to one another or to another server.
[022] The memory 204 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 204 includes a plurality of modules 220 and a repository 240 for storing data processed, received, and generated by one or more of the modules 220. The modules 220 may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types. In one implementation, the modules 220 may include a receiving module 222, a matching score assigning module 224, and a matching module 226, and other modules 228. The other modules 236 may include programs or coded instructions that supplement applications and functions of the system 200.
[023] The repository 240, amongst other things, includes a system database 242 and other data 244. The other data 244 may include data generated as a result of the execution of one or more modules in the other modules 236. The repository 240 is further configured to include target data 246 and a source data 248 for data management. The source data 248 and the target data 246 are described further in the description.
[024] In an embodiment, the system 200 is caused to receive source data associated with a plurality of data formats corresponding to a plurality of distinct sources. The plurality of data formats may include data formats associated with Siebel® Application, Oracle Application, PeopleSoft, SAP, and other such applications. In an embodiment, the source data may be received in the form of input files. In an embodiment, the system 200 is caused to collect information regarding the structure and format of specific data structures associated with the input files, and store said information as the source data 248. For instance, in a target system, target data pertaining to “Account information” of a customer may be read as “myCustomer_Account_Info”. However, the source data exported from a source such as a Siebel® application may be a file named “S_Account” with a known attribute/column list. The OracleApp exports the “Account information” as “o_acc_info”, which contains specific known attributes/columns.
[025] In an embodiment, the source data may be received in the form of a plurality of bundled input files. In an embodiment, the source data associated with the plurality of source formats may provide information regarding the input files. For example, a Siebel® system exports Siebel® files with an S_ prefix. If the input file bundle contains only files with names starting with S_, then it may be derived that the input files are received from a Siebel® application.
[026] Additionally or alternatively, the source data associated with the plurality of data formats may include structural nomenclature of the sources of said formats. The structural nomenclature may include structural content of the input file. For example, the Cust_ID can be numeric and alphanumeric (<10), the Cust_Name can be CHAR(<50), Address can be multi-line, and so on. In certain scenarios, the source data may include input files in the form of a set/bundle. In an example scenario, for examination of Siebel® data files, the bundle of input files may be in .csv format.
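The structural nomenclature described above can be illustrated with a minimal sketch. The rule table `STRUCTURE_RULES` and the function `conforms_to_structure` are hypothetical names, and reading "(<10)" and "CHAR(<50)" as "fewer than 10 (or 50) characters" is an assumption; the specification does not state how such rules are stored or evaluated.

```python
# Hypothetical rule table: column name -> predicate on the raw string value.
# Assumption: "(<10)" and "CHAR(<50)" are read as length limits in characters.
STRUCTURE_RULES = {
    "Cust_ID": lambda v: v.isalnum() and len(v) < 10,
    "Cust_Name": lambda v: len(v) < 50,
}


def conforms_to_structure(row, rules=STRUCTURE_RULES):
    """Return True if every ruled column present in `row` satisfies its rule."""
    return all(rule(row[col]) for col, rule in rules.items() if col in row)


good_row = {"Cust_ID": "A1234", "Cust_Name": "Acme Ltd"}
bad_row = {"Cust_ID": "A-1234567890", "Cust_Name": "Acme Ltd"}
print(conforms_to_structure(good_row), conforms_to_structure(bad_row))
```

Columns without a rule, such as a multi-line Address, are simply not checked in this sketch.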
[027] The system is caused to perform a set of examinations of the input source files, and based on said examinations, determine data files (or source data) associated with a specific data source. For instance, the system 200 may be caused to examine a bundle of input files, and based on the examination of said bundle of input files, determine the list of data files which are from Siebel® Data systems. Herein, the system is caused to assign a matching score to each examination of the source data, and based on said examination, determine mapping of source data with a target data. The set of examinations of the source data is explained below in further detail.
[028] In an embodiment, the system 200 is caused to identify one or more input files that are associated with a specific data source by examining the file names of said files. For instance, the input files having file names starting with S_ or EIM_ may be aggregated for further examination. In an example scenario, names of the files determined to be associated with a specific input source may be stored in a text file.
[029] In an embodiment, the system 200 may be caused to perform a set of examinations on the input files based on the attribute nomenclature. During the examination of the input files, the system 200 may be caused to define one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources of the source data. The one or more prominent attributes may include prominent fields and prominent tables corresponding to a data source, and may refer to the specific significant or important fields associated with the particular applications that may be useful in identifying the data source from the input files. In an embodiment, the system 200 may facilitate defining the master data columns for each table, the prominent table name list, and the prominent field name list in the properties file of the input file. In an embodiment, the prominent attributes may include a set of user-defined attributes expected to be present in the input file.
[030] Considering an example of the Siebel® application as the data source, Siebel® data file names start with S_ or EIM_. Accordingly, the system 200 may take all the data files whose names start with S_ or EIM_ as the source data. Also, Siebel® data systems have table names starting with S_OPTY, S_SRV_REQ, S_CONTACT, S_ASSET, S_ACCOUNT, S_EVT_ACT, S_PROD_INT, and so on. Such tables may be termed prominent tables for the Siebel® application. The system 200 may be caused to examine the source data for prominent tables and compute a Prominent_Table_Score. The Prominent_Table_Score may refer to a weightage assigned to the tables of the source data. Herein, the Prominent_Table_Score represents the fraction of typical file/table names expected in the input set. The system 200 may compare all the stored file names with the prominent table name list from the properties file and find a count of matched prominent table names across all the file names. The system 200 may determine the Prominent_Table_Score as:
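The file-name examination described above may be sketched as follows. This is only an illustration: the mapping `SOURCE_PREFIXES` and the function `files_for_source` are hypothetical names, and only the Siebel® prefixes S_ and EIM_ are taken from the description.

```python
from pathlib import PurePath

# Hypothetical prefix table; only the Siebel prefixes come from the description.
SOURCE_PREFIXES = {"Siebel": ("S_", "EIM_")}


def files_for_source(file_names, source, prefixes=SOURCE_PREFIXES):
    """Return the input files whose base name starts with a prefix of `source`."""
    wanted = prefixes[source]
    # str.startswith accepts a tuple of candidate prefixes.
    return [n for n in file_names if PurePath(n).name.startswith(wanted)]


bundle = ["S_ACCOUNT.csv", "EIM_CONTACT.csv", "o_acc_info.csv", "readme.txt"]
siebel_files = files_for_source(bundle, "Siebel")
print(siebel_files)  # ['S_ACCOUNT.csv', 'EIM_CONTACT.csv']
```

The resulting list corresponds to the file names that the description says may be stored in a text file for further examination.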
Prominent_Table_Score = Total no of matched prominent Tables / count of defined prominent tables in properties file.
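A minimal sketch of the Prominent_Table_Score computation follows. The function name `prominent_table_score` is hypothetical, and matching a file to a table by its exact base name (without extension, case-insensitive) is an assumption; the description does not specify the matching rule.

```python
from pathlib import PurePath


def prominent_table_score(file_names, prominent_tables):
    """Matched prominent tables divided by the count of defined prominent tables."""
    if not prominent_tables:
        return 0.0
    # Assumption: compare on the base name without extension,
    # e.g. "S_ACCOUNT.csv" is treated as table "S_ACCOUNT".
    present = {PurePath(n).stem.upper() for n in file_names}
    matched = sum(1 for t in prominent_tables if t.upper() in present)
    return matched / len(prominent_tables)


# Prominent table list as it might appear in the properties file.
prominent = ["S_OPTY", "S_SRV_REQ", "S_CONTACT", "S_ASSET",
             "S_ACCOUNT", "S_EVT_ACT", "S_PROD_INT"]
files = ["S_ACCOUNT.csv", "S_CONTACT.csv", "S_OPTY.csv", "EIM_LOAD.csv"]
score = prominent_table_score(files, prominent)
print(round(score, 3))  # 3 of the 7 defined tables are present
```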
[031] Additionally or alternatively, the system 200 may be caused to examine the source data for prominent fields or columns. For instance, in case of Seibel application, there are some columns which Siebel® data system use most commonly. For instance, the prominent fields for Siebel® data systems may include fields such as ROW_ID, CREATED, CREATED_BY, LAST_UPD, LAST_UPD_BY, SEX_MF, DESC_TEXT, REASON_CD, STATUS_CD, ACTIVE_FLG, and so on. In an embodiment, the prominent file and/or attributes may be defined by the system-user and gets saved as a property file. These are 1-time definitions for a specific deployment, but can be altered if needed. The system may be caused to compute a Prominent_Field_Score based on the prominent columns identified for the specific data source. For instance, the system may read the first line of data files (all .csv files) one by one and compare the fields with the prominent fields defined in properties file. The system is configured to compute the Prominet_Field_Score as:
Prominent_Field_Score = Sum of all (sum of matched prominent fields / total no of prominent fields defined in properties file)
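As a rough illustration of the prominent-field score, the sketch below parses the header line of each CSV file and sums the per-file ratio of matched prominent fields. The helper names and the in-code field list (standing in for the properties file) are hypothetical. The sum is taken as written in the formula, without dividing by the number of files, so the result may exceed 1 when several files are checked.

```python
import csv
import io

# Hypothetical stand-in for the prominent fields defined in the properties file.
PROMINENT_FIELDS = ["ROW_ID", "CREATED", "CREATED_BY", "LAST_UPD", "LAST_UPD_BY",
                    "SEX_MF", "DESC_TEXT", "REASON_CD", "STATUS_CD", "ACTIVE_FLG"]


def header_fields(first_line):
    """Parse one CSV header line into a set of upper-cased column names."""
    row = next(csv.reader(io.StringIO(first_line)), [])
    return {column.strip().upper() for column in row}


def prominent_field_score(header_lines, prominent_fields=PROMINENT_FIELDS):
    """Sum over files of (matched prominent fields / defined prominent fields)."""
    defined = set(prominent_fields)
    total = 0.0
    for line in header_lines:
        total += len(header_fields(line) & defined) / len(defined)
    return total


# First lines of two hypothetical data files.
headers = ["ROW_ID,CREATED,CREATED_BY,NAME,STATUS_CD", "ROW_ID,LAST_UPD,OPTY_AMT"]
score = prominent_field_score(headers)  # 4/10 for the first file plus 2/10 for the second
```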
[032] In an embodiment, the system 200 may be caused to determine a conformance score based on the Prominent_Table_Score and the Prominent_Field_Score associated with the source data. Herein, the conformance score indicates whether or not the files/attributes conform to (that is, belong to) a source of the plurality of distinct sources. In an embodiment, computing the conformance score includes examining the naming convention of the one or more input files and examining prominent table names, prominent field names, and master data columns for each table in the one or more input files.
[033] Herein, as described above, the system 200 is caused to define master data column names in the properties file for each table. The system 200 is further caused to check columns other than the prominent fields and compute a Non_Prominent_Field_Score based on the fields other than the prominent fields. The system 200 may examine the first line of the data files (all .csv files) one by one and compare the fields with the non-prominent fields, excluding all the prominent fields defined in the properties file. The system 200 may be caused to compute a deviation score based on the comparison of the fields with the non-prominent fields. The deviation score may also be termed the Non_Prominent_Field_Score, and may be computed as:
Non_Prominent_Field_Score = Total_Non-Prominent_Value / No. of input files checked
[034] In an embodiment, the system 200 is caused to compute Total_Non-Prominent_Value as:
Total_Non-Prominent_Value = Sum of all (matched non-prominent fields / total non-prominent values)
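The two formulas above can be sketched together in Python. The text does not say precisely what "total non-prominent values" denotes; this sketch assumes it is the number of non-prominent (master data) columns defined in the properties file. The function name and both field sets are hypothetical.

```python
# Hypothetical stand-ins for the definitions held in the properties file.
PROMINENT_FIELDS = {"ROW_ID", "CREATED", "CREATED_BY", "LAST_UPD", "LAST_UPD_BY"}
NON_PROMINENT_FIELDS = {"NAME", "OPTY_AMT", "CLOSE_DT", "ACCNT_ID"}


def non_prominent_field_score(file_columns,
                              prominent=PROMINENT_FIELDS,
                              non_prominent=NON_PROMINENT_FIELDS):
    """Total_Non-Prominent_Value divided by the number of input files checked."""
    expected = set(non_prominent) - set(prominent)
    if not file_columns or not expected:
        return 0.0
    total_value = 0.0
    for columns in file_columns:
        # Exclude the prominent fields before comparing, as the text describes.
        remaining = {c.strip().upper() for c in columns} - set(prominent)
        # Assumed denominator: count of defined non-prominent fields.
        total_value += len(remaining & expected) / len(expected)
    return total_value / len(file_columns)


# Header columns of two hypothetical input files.
files = [["ROW_ID", "NAME", "OPTY_AMT", "FOO"], ["ROW_ID", "CREATED", "ACCNT_ID"]]
score = non_prominent_field_score(files)  # (2/4 + 1/4) / 2
```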
[035] Additionally or alternatively, the system 200 is further caused to compute a lexical similarity score based on a lexical analysis of the source data. For example, in case it is determined that the attribute name is inconclusive or absent from the source data, lexical inference or analysis can be used for inferring and mapping the incoming fields. The lexical analysis of the source data may be performed by an ontological model. The ontological model may include a compilation of nouns derived from the plurality of sources, and may facilitate matching said nouns with domain specific precompiled nouns (for example, in the target data). The ontological model may further facilitate assigning the lexical similarity score to the matching nouns and domains. An example of the lexical similarity score being assigned by the ontological model is described further with reference to FIGS. 3A and 3B.
[036] In an embodiment, the system 200 is caused to determine a second weightage for the input sources by employing a machine learning model. In an embodiment, the machine learning model may include one or more rules for ontology based lexical analysis of the source data. For example, the machine learning model may include a rule for noun combination, a rule for synonyms, a rule for detection of attributes, and so on. For instance, the noun combination rule may be that attribute names may be formed using noun parts joined by a separator, such as Customer_name, Cust_Acct_ID, and so on. The synonym rule may be that the terms Customer, Cust, and Party may represent the same business entity. The attribute detection rule may be that if the data under a particular column resembles a pattern, the attribute is inferred from that pattern; for example, "john smith" would be a name, a 9 digit number with or without delimiters would be a telephone number, and so on.
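The noun combination and synonym rules can be illustrated with a short Python sketch that splits an attribute name on separators and maps each part to a canonical term. The synonym table, the separator set, and the function names are hypothetical; only the Customer/Cust/Party equivalence comes from the text.

```python
import re

# Hypothetical synonym table; the text names Customer, Cust and Party as equivalent.
SYNONYMS = {"cust": "customer", "party": "customer", "acct": "account"}


def split_attribute_name(name):
    """Noun combination rule: split a name into lower-cased parts on separators."""
    return [part.lower() for part in re.split(r"[_\-\s.]+", name) if part]


def canonical_parts(name, synonyms=SYNONYMS):
    """Synonym rule: replace each part with its canonical business term, if known."""
    return [synonyms.get(part, part) for part in split_attribute_name(name)]


parts_a = canonical_parts("Cust_Acct_ID")   # ['customer', 'account', 'id']
parts_b = canonical_parts("Customer_name")  # ['customer', 'name']
parts_c = canonical_parts("Party-Name")     # ['customer', 'name']
```

Under these assumptions, Customer_name and Party-Name reduce to the same canonical parts, which is the kind of equivalence the synonym rule is meant to expose.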
[037] The system 200 is caused to compute or assign a matching score for the source data received from the plurality of sources based on the conformance score, the deviation score, and the lexical similarity score.
[038] In an embodiment, the system 200 may be caused to continuously update the source templates based on the matchings arrived at and approved by the SME. For instance, in an embodiment, the system 200 may be caused to store a matched source data-target data pair obtained by matching of the source data with the target data, and the matching score associated with the matching.
[039] In an embodiment, the system 200 may determine whether the matching score is greater than or equal to a threshold matching score. If the system 200 is caused to determine that the matching score is greater than or equal to a threshold matching score, the system 200 may assume the mapping as the final mapping. Subsequently, the system 200 may cause execution of transformation of the data from the source data format to the target data format. Alternatively, if the system is caused to determine that the matching score is less than the threshold matching score, the system 200 may present mapping of the source data-target data pair for user workflow for review. In an embodiment, the threshold matching score may be defined by a user.
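The threshold decision described above can be sketched as follows. The `Mapping` record, the 0 to 1 score scale, and the function names are assumptions for illustration; how the matching score itself is combined from the three component scores is not specified in the text and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """A stored source-target pair with its matching score (hypothetical record)."""
    source_field: str
    target_field: str
    matching_score: float  # assumed to lie on a 0..1 scale


def route_mapping(mapping, threshold):
    """Return 'final' when the score meets the threshold, otherwise 'review'."""
    return "final" if mapping.matching_score >= threshold else "review"


def partition(mappings, threshold):
    """Split mappings into those accepted as final and those sent for user review."""
    final, review = [], []
    for mapping in mappings:
        bucket = final if route_mapping(mapping, threshold) == "final" else review
        bucket.append(mapping)
    return final, review


candidates = [
    Mapping("CUST_NM", "Customer Name", 0.92),
    Mapping("ACCT_ID", "Account Identifier", 0.80),
    Mapping("ATTR_19", "Home Address", 0.41),
]
final, review = partition(candidates, threshold=0.80)  # user-defined threshold
```

Mappings in `final` would proceed to transformation into the target data format, while those in `review` would be presented to the user.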
[040] FIGS. 3A and 3B illustrate an example representation of matching of source data with target data for data management, according to some embodiments of the present disclosure. In particular, FIGS. 3A and 3B illustrate the lexical similarity score being assigned by the ontological model for matching of the source data with the target data. A lexical similarity score is assigned based on a lexical analysis of the source data. As explained with reference to FIG. 2, the lexical analysis of the source data may be performed by an ontological model.
[041] Referring now to FIG. 3A, a process flow 310 for matching the source data with the target data is illustrated. At process step 312, the ontological model may include a compilation, or a set of compiled nouns, derived from the source data associated with the plurality of sources. For instance, the ontological model may include a compilation of nouns such as Customer, Bank, Capital, Revenue, Receipt, and Address. At process step 314, the ontological model may facilitate matching said set of compiled nouns with domain specific precompiled nouns (for example, in the target data) to obtain a set of matching noun-domain pairs. In an embodiment, the ontological model may facilitate assigning the lexical similarity score to the matching noun-domain pairs, so as to deduce the target data. For instance, in the present example, the compiled noun-precompiled noun pairs may be associated with lexical similarity scores as given below:
Customer-BFS-NonExclusive-20%
Customer-Retail-NonExclusive-20%
Client-BFS-NonExclusive-35%
Bank-BFS-NonExclusive-80%
Capital-BFS-Exclusive-95%
Revenue-BFS-NonExclusive-70%
Receipt-Retail-NonExclusive-40%
Address-All-NonExclusive-0%
[042] At step 316, the target data may be deduced by selecting one or more domains associated with the compiled nouns-domain specific precompiled nouns pairs having highest values of the lexical similarity score. For instance, in the present example, the following domains may be selected as the target data:
BFS – 85%
Retail – 17%
Telecom – 2%
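The text does not state how the per-noun scores are aggregated into the per-domain figures. As one possible illustration only, the Python sketch below sums the listed scores per domain and normalises by the grand total; this aggregation rule, the handling of the "All" domain, and the function name are assumptions. With the listed pairs it yields roughly 83% for BFS and 17% for Retail, which is close to, but not identical with, the figures quoted above, and Telecom does not appear because no listed pair refers to it.

```python
from collections import defaultdict

# Noun-domain pairs and lexical similarity scores (percent) as listed in the text.
NOUN_DOMAIN_SCORES = [
    ("Customer", "BFS", 20), ("Customer", "Retail", 20), ("Client", "BFS", 35),
    ("Bank", "BFS", 80), ("Capital", "BFS", 95), ("Revenue", "BFS", 70),
    ("Receipt", "Retail", 40), ("Address", "All", 0),
]


def domain_shares(pairs):
    """Assumed aggregation: per-domain score sum divided by the grand total."""
    totals = defaultdict(float)
    for _noun, domain, score in pairs:
        if domain != "All":  # domain-neutral nouns carry no evidence in this sketch
            totals[domain] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {domain: total / grand_total for domain, total in totals.items()}


shares = domain_shares(NOUN_DOMAIN_SCORES)
best_domain = max(shares, key=shares.get)  # domain with the highest share
```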
[043] In the present example, the system 200 has been described to facilitate matching of the source data with the target data, and in response, infer domains from terminology or precompiled nouns. In another example, the system 200 may facilitate in matching the source data with the target data, as described further with reference to FIG. 3B.
[044] Referring now to FIG. 3B, the source data may include an entity suffix (represented as 322) for the word ‘customer’ and an attribute suffix (represented as 324) for the word ‘name’. The entity suffix and the attribute suffix may be joined by joiners, such as joiners 326, to generate various combinations (or lexical possibilities) 328. The different combinations may be assigned the lexical similarity score 330. The target data may be deduced by selecting the lexical combination having the highest value of the lexical similarity score. For instance, in the present example, the target data may be “Customer Name”, which is associated with the highest similarity score (i.e. 99%).
[045] The disclosed system, for instance the system 200 (FIG. 2), further facilitates data discovery based on said mapping. For instance, the system may read values of the attributes and provide suggestions as below:
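The generation of lexical possibilities from entity forms, joiners, and attribute forms can be sketched in Python as below. The specific forms and joiners are hypothetical, and `difflib.SequenceMatcher` from the standard library is used only as a stand-in similarity measure; the disclosure does not say how the lexical similarity score 330 is actually computed.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical entity forms, attribute forms and joiners for illustration.
ENTITY_FORMS = ["customer", "cust"]
ATTRIBUTE_FORMS = ["name", "nm"]
JOINERS = ["", "_", " ", "."]


def lexical_possibilities(entities, attributes, joiners):
    """All entity + joiner + attribute combinations."""
    return [e + j + a for e, j, a in product(entities, joiners, attributes)]


def score_against_target(incoming, target_label="Customer Name"):
    """Best stand-in similarity between an incoming field name and any combination."""
    candidates = lexical_possibilities(ENTITY_FORMS, ATTRIBUTE_FORMS, JOINERS)
    best = max(SequenceMatcher(None, incoming.lower(), c).ratio() for c in candidates)
    return target_label, best


combinations = lexical_possibilities(ENTITY_FORMS, ATTRIBUTE_FORMS, JOINERS)
label, similarity = score_against_target("CUST_NM")  # 'cust_nm' is among the combinations
```

An incoming field whose name coincides with one of the generated combinations receives the maximum similarity under this stand-in measure, and other names receive lower values.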
[046] The value under Attribute19 is “Apollo Bunder, Mumbai”, which looks like an Address linked to Customer.Profile.HomeAddress
[047] The value under Attribute62 is between 1 and 78 with a normal distribution, which looks like an Age linked to Customer.Profile.Self
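Value-based suggestions of this kind can be approximated with simple range and pattern heuristics, as in the Python sketch below. The function name, the age bounds, and the regular expression are assumptions; the check for a normal distribution mentioned above is deliberately omitted, and address detection is not attempted.

```python
import re

# Nine digits with optional single delimiters between them (assumed pattern).
PHONE_PATTERN = re.compile(r"^\d(?:[\s\-.]?\d){8}$")


def suggest_attribute(values):
    """Suggest 'Telephone' or 'Age' from column values, or None if undecided."""
    cleaned = [str(v).strip() for v in values if str(v).strip()]
    if not cleaned:
        return None
    if all(PHONE_PATTERN.match(v) for v in cleaned):
        return "Telephone"
    # Assumed plausible age range; the text's example spans 1 to 78.
    if all(v.isdecimal() and 1 <= int(v) <= 120 for v in cleaned):
        return "Age"
    return None


age_guess = suggest_attribute(["34", "7", "78"])
phone_guess = suggest_attribute(["123-456-789", "987654321"])
unknown = suggest_attribute(["Apollo Bunder, Mumbai"])
```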
[048] A flow-diagram illustrating a method for data management is described further with reference to FIG. 4.
[049] FIG. 4 illustrates a flow diagram of a method 400 for data management, in accordance with the present disclosure. The method 400 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 400 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 400 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 400, or an alternative method. Furthermore, the method 400 can be implemented in any suitable hardware, software, firmware, or combination thereof. In an embodiment, the method 400 depicted in the flow chart may be executed by a system, for example, the system 200 of FIG. 2. In an example embodiment, the system 200 may be embodied in a computing device, for example, the computing device 110 (FIG. 1).
[050] At 402, the method includes receiving source data associated with a plurality of data formats corresponding to a plurality of distinct sources. The plurality of distinct sources may include a Siebel® application, an Oracle application, and so on. At 404, the method includes defining one or more prominent attributes and one or more non-prominent attributes corresponding to each of the plurality of distinct sources. At 406, the method includes computing a conformance score associated with the one or more prominent attributes in the source data. The conformance score is indicative of conformance of the one or more prominent attributes to a source of the plurality of distinct sources. At 408, the method includes computing a deviation score associated with the one or more non-prominent attributes in the source data. The deviation score is indicative of deviation of the non-prominent attributes from the source. At 410, the method includes computing a lexical similarity score based on a lexical analysis of the source data and a target data based on an ontological model. At 412, the method includes assigning a matching score to the source data based on the conformance score, the deviation score and the lexical similarity score. At 414, it is determined whether the matching score is less than a threshold matching score. If it is determined at 414 that the matching score is less than the threshold matching score, the matching of the source data-target data pair may be presented for human workflow for review at 416. If, however, it is determined at 414 that the matching score is greater than or equal to the threshold matching score, the source data is matched with the target data and said matching may be assumed as the final matching. Subsequently, the transformation of the data from the source data format to the target data format is executed at 420.
[051] Various embodiments of the disclosed method and system provide a method and system for data management. The disclosed embodiments rely on multiple sources of knowledge for identifying the source data and matching with the target data. The system exploits information available in the input files/ sources and assigns an aggregated matching score based on various factors such as potential sources of data available for a specific system, taxonomy of the source data format versus the desired (or target) format, taxonomy of the data contained in the source format, and so on. In an embodiment, the system includes an ontological model for lexical analysis of the source data and matching of the input source data to the desired format. The system is further caused to store the inferred mapping of the source data with the target data, reasoning for inferring said mapping, and confidence associated with the mapping. A significant outcome of the disclosed system is that the system is caused to utilize the stored inferred mapping, reasoning for same and the matching score associated therewith for training a machine learning model. Said machine learning model can be utilized for intelligent management of data.
[052] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[053] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

Documents

Application Documents

#	Name	Date
1	Form 3 [27-03-2017(online)].pdf	2017-03-27
2	Form 20 [27-03-2017(online)].jpg	2017-03-27
3	Form 18 [27-03-2017(online)].pdf_316.pdf	2017-03-27
4	Form 18 [27-03-2017(online)].pdf	2017-03-27
5	Drawing [27-03-2017(online)].pdf	2017-03-27
6	Description(Complete) [27-03-2017(online)].pdf_317.pdf	2017-03-27
7	Description(Complete) [27-03-2017(online)].pdf	2017-03-27
8	Form 26 [06-05-2017(online)].pdf	2017-05-06
9	Other Patent Document [08-05-2017(online)].pdf	2017-05-08
10	201721010677-ORIGINAL UNDER RULE 6(1A)-12-05-2017.pdf	2017-05-12
11	ABSTRACT1.jpg	2018-08-11
12	201721010677-OTHERS [27-02-2021(online)].pdf	2021-02-27
13	201721010677-FER_SER_REPLY [27-02-2021(online)].pdf	2021-02-27
14	201721010677-COMPLETE SPECIFICATION [27-02-2021(online)].pdf	2021-02-27
15	201721010677-CLAIMS [27-02-2021(online)].pdf	2021-02-27
16	201721010677-ABSTRACT [27-02-2021(online)].pdf	2021-02-27
17	201721010677-FER.pdf	2021-10-18
18	201721010677-US(14)-HearingNotice-(HearingDate-15-12-2023).pdf	2023-11-30
19	201721010677-FORM-26 [10-12-2023(online)].pdf	2023-12-10
20	201721010677-FORM-26 [10-12-2023(online)]-1.pdf	2023-12-10
21	201721010677-Correspondence to notify the Controller [10-12-2023(online)].pdf	2023-12-10
22	201721010677-US(14)-ExtendedHearingNotice-(HearingDate-22-12-2023).pdf	2023-12-18
23	201721010677-Correspondence to notify the Controller [20-12-2023(online)].pdf	2023-12-20
24	201721010677-Written submissions and relevant documents [04-01-2024(online)].pdf	2024-01-04
25	201721010677-PatentCertificate05-02-2024.pdf	2024-02-05
26	201721010677-IntimationOfGrant05-02-2024.pdf	2024-02-05

Search Strategy

1	SearchStrategy201721010677E_28-08-2020.pdf