“System And Method For Systematically Populating Unstructured Dataset In A Data Model”

Abstract: The present invention describes a system and method of performing data modelling in a data science lifecycle environment. The method (500) recites receiving (502) a plurality of unstructured datasets from a plurality of data sources, wherein each unstructured dataset is associated with a domain. The method (500) further describes grouping (504) the plurality of unstructured datasets into a set of clusters such that each cluster comprises unstructured datasets having a similar domain. Said method (500) further discloses identifying (506) one or more entities, by applying a NER technique on the unstructured datasets, for each cluster. Further, the method (500) recites generating (508) a set of data models corresponding to the set of clusters based on the one or more entities identified for each cluster. [FIG. 5]

Patent Information

Application #

Filing Date

10 February 2022

Publication Number

32/2023

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

ZENSAR TECHNOLOGIES LTD.

Zensar Knowledge Park, Plot#4, MIDC, Kharadi, Off Nagar Road, Pune, Maharashtra – 411014, India

Inventors

1. Sridhar Gadi

ZENSAR TECHNOLOGIES LTD., Zensar Knowledge Park, Plot#4, MIDC, Kharadi, Off Nagar Road, Pune, Maharashtra – 411014, India

2. Manish Kumar

ZENSAR TECHNOLOGIES LTD., Zensar Knowledge Park, Plot#4, MIDC, Kharadi, Off Nagar Road, Pune, Maharashtra – 411014, India

3. Pavan Jakati

ZENSAR TECHNOLOGIES LTD., Zensar Knowledge Park, Plot#4, MIDC, Kharadi, Off Nagar Road, Pune, Maharashtra – 411014, India

4. Abhishek Upadhyay

ZENSAR TECHNOLOGIES LTD., Zensar Knowledge Park, Plot#4, MIDC, Kharadi, Off Nagar Road, Pune, Maharashtra – 411014, India

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
The Patents Rules, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
TITLE OF THE INVENTION
SYSTEM AND METHOD FOR SYSTEMATICALLY POPULATING UNSTRUCTURED DATASET IN A DATA MODEL
APPLICANT:
Zensar Technologies Limited, A company Incorporated in India under the Companies Act, 1956
Having address:
Zensar knowledge park,
Plot # 4, MIDC, Kharadi, off Nagar road, Pune-411014,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[0001] The present subject matter described herein, in general, discloses a system and method for performing data modelling in a data science lifecycle environment. In other words, the present application discloses techniques for intelligently analysing datasets and defining a data model using said datasets.
BACKGROUND
[0002] Traditionally, working on a data science project or building a data-driven application involves multiple steps such as data gathering, transformation, data analysis, data modelling and data visualization. Some of these steps, such as data analysis, take a significant amount of time and effort, because the data must be manually analysed and processed before it can be fed to the next step of the data science lifecycle. Current data engineering steps depend on manual generation of data models from raw data stored in different systems. Current systems store the raw data in different types of datastores. Data engineers have to go through the raw data to analyse, define and create a data model which defines the structure and relationships within the data. Once the data model is defined, the incoming data stream may be stored as per the defined data model. Once the data is stored as per the defined model, it is easy for data scientists to apply different data science techniques and algorithms for creating and deploying statistical models to derive insights, take decisions and predict outcomes.
[0003] Further, critical decisions made from anomalous data are not only inconvenient but also extremely expensive. This could be avoided with an advanced data quality management platform. Also, the impact of bad data doesn’t stop at dollars and cents. It slows employees down and decreases productivity. Accumulation of bad data is both time-consuming and expensive. According to various reports, nearly one-third of analysts spend more than forty percent of their time vetting and validating their analytics data before it can be used for strategic decision-making.

[0004] The whole paradigm of the Data Life Cycle revolves around ensuring that the data retains its true characteristics such as Quality, Precision, Accuracy, Completeness, Reliability, Relevance and Timeliness. Operational efficiency is typically derived based on Processes, Products and People. Streamlining of Data Operations is complex, as Enterprises often neglect foundational activities such as data discovery, data profiling, data cleansing, data transformation etc.
[0005] Furthermore, there are multiple compliance guidelines enforced by each industry vertical. With each country crafting its own data privacy laws, compliance has become a major challenge. It requires diligent focus to foster a culture of data that doesn’t work in isolation. Today, each business is processing abundant personal information and consuming sensitive information in different processes and systems. Identifying, classifying, and documenting internal and external personal information is critical to data privacy compliance.
[0006] Thus, there exists a need for a system and method to automatically generate the data model, which overcome the above-mentioned problems.
SUMMARY
[0007] The present disclosure overcomes one or more shortcomings of the prior art and provides additional advantages discussed throughout the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered a part of the claimed disclosure.
[0008] In one non-limiting embodiment of the present disclosure, a method for systematically populating unstructured dataset in a data model is disclosed. The method comprises receiving real-time unstructured dataset from a data source and identifying one or more real-time entities by applying a Named Entity Recognition (NER) technique upon the real-time unstructured dataset. The method further comprises determining a real-time frequency of occurrence of each of the one or more real-time entities, and comparing the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models, wherein the pre-associated frequency of occurrence is stored in form of metadata in association with the one or more entities. The method further comprises generating a set of Programming Structured Document scores (PSD scores) based on the comparison, wherein the set of PSD scores indicate likelihood of the set of pre-generated data models to be selected for the real-time unstructured dataset. The method further comprises selecting a pre-generated data model, among the set of pre-generated data models, based on a PSD-max score and a predefined threshold PSD score, wherein the PSD-max score is a highest PSD score among the set of PSD scores, and systematically populating the one or more real-time entity data against the one or more entities of the selected pre-generated data model.
[0009] In another non-limiting embodiment, the present disclosure recites that the set of pre-generated data models are generated by receiving a plurality of unstructured datasets from a plurality of data sources, wherein each unstructured dataset is associated with a domain, grouping the plurality of unstructured datasets into a set of clusters such that each cluster comprises unstructured datasets having a similar domain, identifying one or more entities, by applying the NER technique on the unstructured datasets, for each cluster, wherein the one or more entities indicate information type provided within the cluster, and generating the set of pre-generated data models corresponding to the set of clusters based on the one or more entities identified for each cluster.
[0010] In another non-limiting embodiment, the present disclosure recites that the pre-associated frequency of occurrence of one or more entities is determined by: for each of the set of clusters, determining a frequency of occurrence of the one or more entities present within their corresponding cluster, associating the frequency of occurrence, determined for the one or more entities present within each of the set of clusters, with the corresponding set of pre-generated data models, and storing the frequency of occurrence in form of the metadata in association with the one or more entities corresponding to each of the set of pre-generated data models.
[0011] In yet another non-limiting embodiment, the present disclosure recites that the method further comprises generating a new data model corresponding to the real-time unstructured dataset when the set of PSD scores are less than the predefined threshold PSD score.
[0012] In yet another non-limiting embodiment, the present disclosure recites that the grouping of the plurality of unstructured datasets into a set of clusters is based on a K-means clustering technique.
[0013] In yet another non-limiting embodiment of the present disclosure, the present application discloses a data modelling system for systematically populating unstructured dataset in a data model. The system comprises a receiving unit to receive real-time unstructured dataset from a data source. The system further comprises an identifying unit to identify one or more real-time entities by applying a Named Entity Recognition (NER) technique upon the real-time unstructured dataset. The system further comprises a determining unit to determine a real-time frequency of occurrence of each of the one or more real-time entities. The system further comprises a comparing unit to compare the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models, wherein the pre-associated frequency of occurrence is stored in form of metadata in association with the one or more entities. The system further comprises a score generating unit to generate a set of Programming Structured Document scores (PSD scores) based on the comparison, wherein the set of PSD scores indicate likelihood of the set of pre-generated data models to be selected, for being populated, for the real-time unstructured dataset, and a selecting unit to select a pre-generated data model, among the set of pre-generated data models, based on a PSD-max score and a predefined threshold PSD score, wherein the PSD-max score is a highest PSD score among the set of PSD scores. The system also comprises a populating unit to systematically populate the one or more real-time entity data against the one or more entities of the selected pre-generated data model.
[0014] In yet another non-limiting embodiment of the present disclosure, to generate the set of pre-generated data models, the receiving unit is configured to receive a plurality of unstructured datasets from a plurality of data sources, wherein each unstructured dataset is associated with a domain. The data modelling system comprises a grouping unit to group the plurality of unstructured datasets into a set of clusters, such that each cluster comprises unstructured datasets having a similar domain. The identifying unit is configured to identify one or more entities, by applying the NER technique on the unstructured datasets, for each cluster, wherein the one or more entities indicate information type provided within the cluster. The system also comprises a model generating unit to generate the set of pre-generated data models corresponding to the set of clusters based on the one or more entities identified for each cluster.
[0015] In yet another non-limiting embodiment of the present disclosure, to determine the pre-associated frequency of occurrence of one or more entities, the determining unit is configured to determine, for each of the set of clusters, a frequency of occurrence of the one or more entities present within their corresponding cluster. The system comprises an associating unit to associate the frequency of occurrence, determined for the one or more entities present within each of the set of clusters, with the corresponding set of pre-generated data models, and a storage unit to store the frequency of occurrence in form of the metadata in association with the one or more entities corresponding to each of the set of pre-generated data models.
[0016] In yet another non-limiting embodiment of the present disclosure, the present application discloses that the model generating unit is further configured to generate a new data model corresponding to the real-time unstructured dataset when the set of PSD scores are less than the predefined threshold PSD score.
[0017] In yet another non-limiting embodiment of the present disclosure, the present application discloses that the grouping unit groups the plurality of unstructured datasets into a set of clusters based on a K-means clustering technique.
[0018] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed embodiments. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of system and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:
[0020] Fig. 1 illustrates a network implementation of a system for performing data modelling, in accordance with an embodiment of the present subject matter.
[0021] Fig. 2 illustrates a block diagram of a system for performing data modelling, in accordance with an embodiment of the present subject matter.
[0022] Fig. 3 illustrates an example of entity extraction from a respective cluster, in accordance with an embodiment of the present subject matter.

[0023] Fig. 4a-4c show an exemplary data model generated by the system, in accordance with an embodiment of the present subject matter.
[0024] Fig. 5 illustrates a flow chart of a method of performing data modelling in a data science lifecycle environment, in accordance with an embodiment of the present subject matter.
[0025] Fig. 6 illustrates a process of generating the data model, in accordance with an embodiment of the present subject matter.
[0026] Fig. 7 illustrates a process of storing/writing data into the generated data model, in accordance with an embodiment of the present subject matter.
[0027] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0028] In the present document, the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or implementation of the present subject-matter described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
[0029] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0030] The terms “comprises”, “comprising”, “include(s)”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, system or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or system or method. In other words, one or more elements in a system or apparatus preceded by “comprises… a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or apparatus.
[0031] In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings that form a part hereof, and which are shown by way of illustration specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.
[0032] The present invention will be described herein below with reference to the accompanying drawings. In the following description, well known functions or constructions are not described in detail since they would obscure the description with unnecessary detail.
[0033] The present disclosure discloses a system for data modelling to simplify and automate the data engineering steps. The system may be connected to different sources of data to collect raw data in any form and volume. The system then processes the raw data to transform and analyse it, and dynamically creates a data relationship model to store the data in an automatically defined schema, thereby helping data scientists and analysts to apply data analytics techniques and learnings with ease on the already defined data model. The present disclosure will accelerate the development of data-driven applications and help businesses take decisions quickly based on the derived data insights.
[0034] Referring to figure 1, a network implementation 100 of a system 102 for systematically populating unstructured dataset in a data model is illustrated. The system 102 may be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, or a cloud-based computing environment. The system 102 may be in communication with a plurality of data sources 104a, 104b…104n to receive a plurality of unstructured datasets. Each unstructured dataset is associated with a domain. One or more datasets of the plurality of datasets may belong to the same domain. According to an embodiment, the data source can be a mailbox, a log, etc., but is not limited thereto. The system 102 captures the unstructured datasets, stores them, provides structure to the datasets by extracting key information, and creates a data design in a NOSQL datastore. NOSQL system designs are scalable and fast, and meet the demand of exponentially growing data. Thus, the structured datasets are saved in a NOSQL datastore.
[0035] Further referring now to Figure 2, the system 102 is illustrated in accordance with an embodiment of the present disclosure. In one embodiment, the system 102 may include a receiving unit 202, an identifying unit 204, a determining unit 206, a comparing unit 208, a score generating unit 210, a selecting unit 212, a populating unit 214, a grouping unit 216, a model generating unit 218, an associating unit 220, and a storage unit 222. Various units 202-220 may comprise one or more processing devices implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processing devices may be configured to fetch and execute computer-readable instructions stored in the storage unit 222.

[0036] The receiving unit 202 may include one or more I/Os to receive a real-time unstructured dataset from a data source. The data source may be a mailbox, a log, a medical data source, etc., but is not limited thereto. The receiving unit 202 may store the received real-time unstructured data into the storage unit 222. The received real-time unstructured data is further processed by the identifying unit 204 to identify one or more real-time entities by applying a Named Entity Recognition (NER) technique upon the real-time unstructured dataset. According to an embodiment, the entities may be any one of name, location, date, etc.
[0037] Further, the determining unit 206 may analyse the received data to determine a real-time frequency of occurrence of each of the one or more real-time entities. After determining the frequency of occurrence, the comparing unit 208 may compare the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models. The pre-associated frequency of occurrence is stored in form of metadata in association with the one or more entities.
[0038] In order to generate the set of data models, the receiving unit 202 may receive a plurality of unstructured datasets from a plurality of data sources. The received plurality of unstructured datasets may be stored in the storage unit 222.
[0039] Further, the grouping unit 216 may group the plurality of unstructured datasets into a set of clusters such that each cluster comprises unstructured datasets having a similar domain. According to an embodiment, the domain may be any one of medical, finance, logistics, etc. The grouping unit 216 reads a copy of the unstructured data from the storage unit 222 and applies a clustering technique (ML techniques) to simplify the datasets. The grouping unit 216 may assign cluster IDs to the formed clusters. The cluster information may be stored in a NOSQL datastore in the storage unit 222 for further analysis. The metadata for the clusters is also stored in the NOSQL datastore. Below is an example of information stored in the NOSQL datastore.

Description | Cluster No.
Unreachable Interface Standard 92ms | 1
Available memory of VINC1APPPR0DAN5IBLE is warning because its value is 71<75 % | 2
Issue with Printer configuration in the system | 3
No Response from VI NCI DEMOCRM | 4
VINCIAPPDEMOZEUOSSCollettor has same issue | 5
MySQL server Down | 6
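The grouping step described above can be sketched as follows. This is a minimal pure-Python K-means over term-count vectors, offered as an illustration only: the specification names K-means but does not state the features, the distance measure or the initialisation, so the bag-of-words vectors, the Euclidean distance, the fixed initial centroids and the sample records below are all assumptions.

```python
import math
from collections import Counter


def vectorize(texts):
    """Map each text to a term-count vector over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([float(counts[term]) for term in vocab])
    return vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, initial_centroids, max_iter=20):
    """Plain K-means; returns one 0-based cluster index per input vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical records, loosely modelled on the log descriptions above.
records = [
    "mysql server down",
    "mysql server not responding",
    "printer configuration issue",
    "printer configuration error",
]
vectors = vectorize(records)
# Fixed seeds (records 0 and 2) keep the sketch deterministic; an assumption.
cluster_ids = [label + 1 for label in kmeans(vectors, [vectors[0], vectors[2]])]
for record, cluster_id in zip(records, cluster_ids):
    print(cluster_id, record)
```

The 1-based `cluster_ids` play the role of the cluster numbers shown in the example table; a production system would more likely rely on a tested clustering library than on this hand-written loop.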
[0040] Further, the identifying unit 204 may identify one or more entities, by applying the NER technique on the unstructured datasets, for each cluster. The one or more entities indicate information type provided within the cluster. The cluster information is passed through the NER channel to identify and segment the named entities and classify them under different categories. In an exemplary embodiment, NLP-based NER is applied to extract information from the data. One such NLP-based NER technique is GLOVE 6, which when applied facilitates information extraction from the data. Fig. 3 of the present disclosure illustrates an example of information extraction from a respective cluster.
[0041] Further, the model generating unit 218 generates a set of data models corresponding to the set of clusters based on the one or more entities identified for each cluster. Each data model generated is capable of being systematically populated in real-time, with one or more real-time entity data of the real-time unstructured dataset against their corresponding one or more entities provided in the data model. Fig. 4 shows an exemplary data model generated by the model generating unit 218.
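To make the input and output of the entity identification step concrete, the sketch below uses a few regular expressions as a stand-in for NER. It is not a trained NER model and is not the technique named in the specification; the labels `DURATION`, `PERCENT` and `SERVICE`, the patterns and the sample sentences are all hypothetical.

```python
import re

# Hypothetical (label, pattern) rules standing in for a trained NER model.
PATTERNS = [
    ("DURATION", re.compile(r"\b\d+\s?ms\b")),
    ("PERCENT", re.compile(r"\b\d+\s?%")),
    ("SERVICE", re.compile(r"\b(?:MySQL|Printer)\b")),
]


def extract_entities(text):
    """Return (label, matched_text) pairs found in text, grouped by rule order."""
    entities = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((label, match.group(0)))
    return entities


print(extract_entities("MySQL server Down"))
print(extract_entities("Unreachable Interface Standard 92ms"))
```

Each pair gives an entity category and the text span it covers, which is the shape of output the later frequency and data model steps rely on.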
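As an illustration of generating one data model per cluster from the identified entities, the following sketch builds a document-style schema as a plain dictionary. The specification does not define the schema format, so the `fields` and `records` keys, the `DM<n>` naming and the sample entities are assumptions.

```python
def generate_data_models(cluster_entities):
    """Build one data model per cluster from its entity labels.

    cluster_entities maps a cluster ID to a list of (label, value) pairs.
    """
    models = {}
    for cluster_id, entities in cluster_entities.items():
        # The distinct entity labels of the cluster become the model's fields.
        fields = sorted({label for label, _ in entities})
        models[f"DM{cluster_id}"] = {
            "cluster_id": cluster_id,
            "fields": fields,
            "records": [],  # filled later, when real-time data is populated
        }
    return models


# Hypothetical entities for two clusters.
cluster_entities = {
    1: [("HOST", "server-a"), ("DURATION", "92ms"), ("HOST", "server-b")],
    2: [("SERVICE", "MySQL")],
}
models = generate_data_models(cluster_entities)
print(models["DM1"]["fields"])
```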
[0042] Further, the determining unit 206 determines a frequency of occurrence of the one or more entities present within each of the set of clusters. The determining unit 206 reads each record of a cluster to determine the frequency of entities. Furthermore, the associating unit 220 associates the frequency of occurrence, determined for the one or more entities present within each of the set of clusters, with the corresponding set of pre-generated data models. The storage unit 222 stores the frequency of occurrence in form of the metadata in association with the one or more entities corresponding to each of the set of pre-generated data models. Fig. 4b shows an exemplary data model generated by the model generating unit 218.
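The frequency determination and its storage as metadata can be sketched as below. The record layout (a list of entity labels per record), the `metadata` and `entity_frequency` keys and the sample labels are assumptions made for illustration.

```python
from collections import Counter


def entity_frequencies(cluster_records):
    """Count how often each entity label occurs across a cluster's records."""
    counts = Counter()
    for record_labels in cluster_records:
        counts.update(record_labels)
    return dict(counts)


def attach_metadata(model, frequencies):
    """Return a copy of the model with the frequencies stored as metadata."""
    updated = dict(model)
    updated["metadata"] = {"entity_frequency": frequencies}
    return updated


# Hypothetical cluster: each record lists the entity labels found in it.
cluster_records = [
    ["NER1", "NER3", "NER4", "NER4"],
    ["NER1", "NER3", "NER4", "NER4"],
    ["NER1", "NER3", "NER4"],
    ["NER3", "NER4"],
]
frequencies = entity_frequencies(cluster_records)
dm1 = attach_metadata({"fields": ["NER1", "NER3", "NER4"], "records": []}, frequencies)
print(dm1["metadata"])
```

Here the counts come out as 3, 4 and 6 for `NER1`, `NER3` and `NER4`, the kind of per-entity frequency that is later compared against the real-time frequencies.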
[0043] Referring to fig. 2 again, the score generating unit 210 generates a set of Programming Structured Document (PSD) scores based on comparison between the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models. The set of PSD scores indicates likelihood of the set of pre-generated data models to be selected for the real-time unstructured dataset.
[0044] Further, the selecting unit 212 selects a pre-generated data model, among the set of pre-generated data models, based on a PSD-max score and a predefined threshold PSD score. It is to be noted that the PSD-max score indicates a highest PSD score among the set of PSD scores. Thereafter, the populating unit 214 systematically populates the one or more real-time entity data against the one or more entities of the selected data model. If the determined scores are below the predefined threshold score, the real-time unstructured dataset is not stored in any of the pre-generated data models. Fig. 4c shows real-time data stored in the exemplary data model generated by the model generating unit 218.
[0045] According to an embodiment of the present disclosure, the model generating unit 218 generates a new data model corresponding to the real-time unstructured dataset when the set of PSD scores are below the predefined threshold score. The process of storing the real-time unstructured dataset may easily be understood by way of the following example.
[0046] Consider that there are two data models generated by the system, i.e., DM1 and DM2, wherein DM1 is generated based on entities NER1, NER3, and NER4. The frequencies of occurrence of entities associated with DM1 are 3, 4, and 6

respectively, i.e., (NER1)3, (NER3)4, and (NER4)6. DM2, on the other hand, is generated based on entities NER1, NER3, and NER4. The frequencies of occurrence of entities associated with DM2 are 2, 4, and 3 respectively, i.e., (NER1)2, (NER3)4, and (NER4)3.
[0047] Now, upon receiving the real-time unstructured dataset, the real-time unstructured dataset is processed to determine one or more entities and the frequency of occurrence of each of the determined entities. Assuming that the real-time unstructured dataset includes entities NER1, NER3, and NER4 with frequencies 2, 2, 3 respectively, a set of PSD scores is calculated for both data models DM1 and DM2 by dividing, for each entity, the real-time frequency by the frequency of occurrence of that entity associated with the data model, and summing the resulting ratios.
[0048] Score for DM1 would be: Score_DM1 = (2/3 + 2/4 + 3/6) × 100%
Score_DM1 = 67% + 50% + 50% = 167%
[0049] Score for DM2 would be: Score_DM2 = (2/2 + 2/4 + 3/3) × 100%
Score_DM2 = 100% + 50% + 100% = 250%
[0050] Since the Score_DM2 is greater than the Score_DM1, the real-time dataset would be stored in data model DM2.
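The worked example in paragraphs [0046] to [0050] can be reproduced with a short sketch. The function name `psd_score` and the choice to skip real-time entities that a model does not contain are assumptions; the frequencies are the ones given in the example.

```python
def psd_score(realtime_freq, model_freq):
    """Sum of per-entity ratios (real-time / model frequency), as a percentage."""
    ratios = [
        realtime_freq[entity] / model_freq[entity]
        for entity in realtime_freq
        if entity in model_freq  # assumption: entities unknown to the model are skipped
    ]
    return 100.0 * sum(ratios)


# Frequencies taken from the example in the description.
model_frequencies = {
    "DM1": {"NER1": 3, "NER3": 4, "NER4": 6},
    "DM2": {"NER1": 2, "NER3": 4, "NER4": 3},
}
realtime = {"NER1": 2, "NER3": 2, "NER4": 3}

scores = {name: psd_score(realtime, freq) for name, freq in model_frequencies.items()}
best_model = max(scores, key=scores.get)  # model holding the PSD-max score

for name, score in scores.items():
    print(f"{name}: {score:.0f}%")
print("selected:", best_model)
```

With these inputs DM1 scores about 167% and DM2 scores 250%, so DM2 holds the PSD-max score and is the candidate for selection.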
[0051] In this manner, the present disclosure provides a system which automates the entire life cycle of understanding the unstructured data from any source like events, logs, chats, texts and creating the data design in a NOSQL system to store and analyse the data. With the growing size of data, it becomes necessary to use the NOSQL structure to store the data and to define the relationships among the data using NLP. The data model generated by the system is more efficient, cost effective, precise, and accurate, while the time taken for generation of the model is reduced.
[0052] Fig. 5 discloses a method 500 for systematically populating unstructured dataset in a data model. At step 502, the method 500 comprises receiving real-time unstructured dataset from a data source. According to an embodiment, the datasets may be mailbox data, log data, medical report data, financial data etc., but not limited thereto.
[0053] At step 504, the method 500 recites identifying one or more real-time entities by applying a Named Entity Recognition (NER) technique upon the real-time unstructured dataset. Further, at step 506, the method 500 describes determining a real-time frequency of occurrence of each of the one or more real-time entities.
[0054] Moving ahead, at step 508 the method 500 discloses comparing the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models. The pre-associated frequency of occurrence is stored in form of metadata in association with the one or more entities.
[0055] According to an embodiment, the method further recites that the set of pre-generated data models are generated by receiving a plurality of unstructured datasets from a plurality of data sources. Each unstructured dataset is associated with a domain. According to an embodiment, the domain may be any one of medical, finance, logistics, etc. The method further comprises grouping the plurality of unstructured datasets into a set of clusters such that each cluster comprises unstructured datasets having a similar domain. Further, one or more entities are identified, by applying the NER technique on the unstructured datasets, for each cluster. The one or more entities indicate information type provided within the cluster. Further, the set of pre-generated data models corresponding to the set of clusters are generated based on the one or more entities identified for each cluster.
[0056] According to an embodiment, the method recites that the pre-associated frequency of occurrence of one or more entities is determined by determining a frequency of occurrence of the one or more entities present within their corresponding cluster, for each of the set of clusters. Further, the frequency of occurrence, determined for the one or more entities present within each of the set of clusters, is associated with the corresponding set of pre-generated data models, and the frequency of occurrence is stored in form of the metadata in association with the one or more entities corresponding to each of the set of pre-generated data models.
[0057] At step 510, the method 500 describes generating a set of Programming Structured Document (PSD) scores based on comparison between the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models. The set of PSD scores indicates likelihood of the set of pre-generated data models to be selected for the real-time unstructured dataset.
[0058] At step 512, the method describes selecting a pre-generated data model, among the set of pre-generated data models, based on a PSD-max score and a predefined threshold PSD score. It is to be noted that the PSD-max score indicates a highest PSD score among the set of PSD scores.
[0059] At step 514, the method describes systematically populating the one or more real-time entity data against the one or more entities of the selected pre-generated data model. Further, the method 500 recites that a new data model is generated corresponding to the real-time unstructured dataset when the set of PSD scores are less than the predefined threshold PSD score.
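The selection, population and fallback behaviour described above may be sketched as follows. This is a minimal illustration only: the function names select_model and populate, the in-memory store, the naming scheme for a new model and the sample values are all hypothetical. The sketch also assumes that a PSD-max score equal to the threshold selects the existing model, a case the specification does not address.

```python
def select_model(scores, threshold):
    """Return the name of the best-scoring model, or None if it is below threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)  # model holding the PSD-max score
    return best if scores[best] >= threshold else None

def populate(store, scores, threshold, entity_data):
    """Write entity_data into the selected model, or into a newly created one."""
    name = select_model(scores, threshold)
    if name is None:
        # All PSD scores are below the threshold: generate a new data model.
        name = f"model_{len(store) + 1}"  # hypothetical naming scheme
        store[name] = {"entities": sorted(entity_data), "records": []}
    model = store[name]
    # Populate each entity of the model; entities absent from the input stay None.
    record = {e: entity_data.get(e) for e in model["entities"]}
    model["records"].append(record)
    return name

store = {"medical": {"entities": ["DOSAGE", "DRUG", "PATIENT"], "records": []}}
chosen = populate(store, {"medical": 0.87}, 0.5, {"PATIENT": "P-001", "DRUG": "aspirin"})
created = populate(store, {"medical": 0.10}, 0.5, {"ACCOUNT": "A-42"})
```

Here the first call populates the existing "medical" model, while the second call falls below the threshold and therefore creates and populates a new model.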
[0060] In this manner, the entire life cycle of understanding the unstructured data from any source, such as events, logs, chats and texts, and of creating the data design in a NoSQL system to store and analyse the data, is automated. The resulting data model is more efficient, more cost effective, more precise and more accurate, while the time taken to generate the model is also reduced.
[0061] The system and method disclosed in the present disclosure can be easily understood with the help of Figs. 6 and 7 of the present disclosure. Fig. 6 illustrates a process 600 of generating the data model, in accordance with an embodiment of the present subject matter. According to Fig. 6, the unstructured datasets from a mailbox, logs and other sources are received at S602 and are stored in a raw store. Further, at S604, the unstructured datasets are grouped into a set of clusters such that each cluster comprises unstructured datasets having a similar domain. The clusters are stored in a cluster store. Each cluster is assigned an identification number, which is also stored in the cluster store, at S606.
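The grouping at S604 and the assignment of cluster identification numbers at S606 may be illustrated with the following minimal sketch, which uses the K-means technique named in the claims. The bag-of-words features, the seeding of centroids from the first k documents, the value of k and the sample texts are assumptions made for illustration; the specification does not prescribe them.

```python
from collections import Counter

def vectorize(texts):
    """Turn each text into a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = []
    for t in texts:
        counts = Counter(t.lower().split())
        rows.append([counts[w] for w in vocab])
    return rows

def kmeans(vectors, k, iterations=10):
    """Plain K-means; the first k vectors seed the centroids (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "patient prescribed drug dosage",
    "invoice amount account payment",
    "patient drug dosage follow up",
    "account payment amount overdue",
]
cluster_ids = kmeans(vectorize(docs), k=2)

# Cluster store: cluster identification number -> member datasets.
cluster_store = {}
for doc, cid in zip(docs, cluster_ids):
    cluster_store.setdefault(cid, []).append(doc)
```

In this toy input, the two medical texts share one cluster identification number and the two finance texts share the other.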
[0062] Further, at S608, one or more entities are identified for each cluster by applying a Named Entity Recognition (NER) technique on the unstructured datasets, wherein the one or more entities indicate the information type provided within the cluster. S608 refers to the PSD layer, which provides structure to the data by extracting entity information and creates the data design in a NoSQL datastore.
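The specification does not name a particular NER implementation. Purely to illustrate the kind of output such a step produces, the sketch below substitutes a toy lookup table and one regular expression for a trained NER model. The GAZETTEER table, the DOSAGE_PATTERN expression, the entity labels and the function name extract_entities are all hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> entity label. A practical system would
# use a trained NER model rather than a lookup table.
GAZETTEER = {
    "aspirin": "DRUG",
    "ibuprofen": "DRUG",
    "invoice": "DOCUMENT",
}
# Hypothetical pattern for dosage mentions such as "500 mg".
DOSAGE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|ml)\b")

def extract_entities(text):
    """Return (surface text, label) pairs found in the text."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        label = GAZETTEER.get(token.lower())
        if label:
            found.append((token, label))
    for match in DOSAGE_PATTERN.finditer(text):
        found.append((match.group(), "DOSAGE"))
    return found

entities = extract_entities("Patient took Aspirin 500 mg after the invoice arrived.")
```

The labels returned by such a step (here DRUG, DOCUMENT and DOSAGE) correspond to the information types from which the fields of a data model could be derived.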
[0063] Further, at S610, a set of data models corresponding to the set of clusters are generated based on the one or more entities identified for each cluster. Each data model generated is capable of being systematically populated with one or more real-time entity data of real-time unstructured dataset against their corresponding one or more entities provided in the data model.
[0064] Further, Fig. 7 illustrates a process 700 of storing/writing data into the generated data model, in accordance with an embodiment of the present subject matter. According to Fig. 7, real-time data is received at S702. Said real-time data is processed to determine entity information, and a set of PSD scores is generated based on the entity information as described in previous embodiments. Based on the set of PSD scores, the real-time data is stored in the data model at S704. Furthermore, the stored real-time data may be accessed by the end user via an interface at S706.
[0065] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0066] Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments.
[0067] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., to be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0068] Suitable processors/controllers include, by way of example, a general-purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine.

WE CLAIM:
1. A method (500) for systematically populating unstructured dataset in a data model, the method comprising:
receiving (502) real-time unstructured dataset from a data source;
identifying (504) one or more real-time entities by applying a Named Entity Recognition (NER) technique upon the real-time unstructured dataset;
determining (506) a real-time frequency of occurrence of each of the one or more real-time entities;
comparing (508) the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models, wherein the pre-associated frequency of occurrence are stored in form of metadata in association with the one or more entities;
generating (510) a set of Programming Structured Document scores (PSD scores) based on the comparison, wherein the set of PSD scores indicate likelihood of the set of pre-generated data models to be selected, for being populated, for the real-time unstructured dataset;
selecting (512) a pre-generated data model, among the set of pre-generated data models, based on a PSD-max score and a predefined threshold PSD score, wherein the PSD-max score is a highest PSD score among the set of PSD scores; and
systematically (514) populating the one or more real-time entity data against the one or more entities of the selected pre-generated data model.
2. The method (500) as claimed in claim 1, wherein the set of pre-generated data models are generated by:
receiving a plurality of unstructured datasets from a plurality of data sources, wherein each unstructured dataset is associated with a domain;
grouping the plurality of unstructured datasets into a set of clusters such that each cluster comprises unstructured datasets having a similar domain;
identifying one or more entities, by applying the NER technique on the unstructured datasets, for each cluster, wherein the one or more entities indicate information type provided within the cluster; and
generating the set of pre-generated data models corresponding to the set of clusters based on the one or more entities identified for each cluster.
3. The method (500) as claimed in claim 2, wherein the pre-associated frequency of occurrence of one or more entities is determined by:
for each of the set of clusters, determining a frequency of occurrence of the one or more entities present within their corresponding cluster;
associating the frequency of occurrence, determined for the one or more entities present within each of the set of clusters, with the corresponding set of pre-generated data models; and
storing the frequency of occurrence in form of the metadata in association with the one or more entities corresponding to each of the set of pre-generated data models.
4. The method (500) as claimed in claim 1, further comprising generating a new data model corresponding to the real-time unstructured dataset when the set of PSD scores are less than the predefined threshold PSD score.
5. The method (500) as claimed in claim 1, wherein the grouping the plurality of unstructured datasets into a set of clusters is based on K-means clustering technique.
6. A data modelling system (102) for systematically populating unstructured dataset in a data model, the data modelling system (102) comprises:
a receiving unit (202) to receive real-time unstructured dataset from a data source;
an identifying unit (204) to identify one or more real-time entities by applying a Named Entity Recognition (NER) technique upon the real-time unstructured dataset;
a determining unit (206) to determine a real-time frequency of occurrence of each of the one or more real-time entities;
a comparing unit (208) to compare the real-time frequency of occurrence of each of the one or more real-time entities with pre-associated frequency of occurrence of one or more entities corresponding to a set of pre-generated data models, wherein the pre-associated frequency of occurrence are stored in form of metadata in association with the one or more entities;
a score generating unit (210) to generate a set of Programming Structured Document scores (PSD scores) based on the comparison, wherein the set of PSD scores indicate likelihood of the set of pre-generated data models to be selected, for being populated, for the real-time unstructured dataset;
a selecting unit (212) to select a pre-generated data model, among the set of pre-generated data models, based on a PSD-max score and a predefined threshold PSD score, wherein the PSD-max score is a highest PSD score among the set of PSD scores; and
a populating unit (214) to systematically populate the one or more real-time entity data against the one or more entities of the selected pre-generated data model.
7. The data modelling system (102) as claimed in claim 6, wherein to generate the set of pre-generated data models, the data modelling system comprises:
the receiving unit (202) to receive a plurality of unstructured datasets from a plurality of data sources, wherein each unstructured dataset is associated with a domain;
a grouping unit (216) to group the plurality of unstructured datasets into a set of clusters, such that each cluster comprises unstructured datasets having a similar domain;
the identifying unit (204) to identify one or more entities, by applying the NER technique on the unstructured datasets, for each cluster, wherein the one or more entities indicate information type provided within the cluster; and
a model generating unit (218) to generate the set of pre-generated data models corresponding to the set of clusters based on the one or more entities identified for each cluster.
8. The data modelling system (102) as claimed in claim 7, wherein to determine the pre-associated frequency of occurrence of one or more entities, the data modelling system comprises:
the determining unit (206) to determine, for each of the set of clusters, a frequency of occurrence of the one or more entities present within their corresponding cluster;
an associating unit (220) to associate the frequency of occurrence, determined for the one or more entities present within each of the set of clusters, with the corresponding set of pre-generated data models; and
a storage unit (222) to store the frequency of occurrence in form of the metadata in association with the one or more entities corresponding to each of the set of pre-generated data models.
9. The data modelling system (102) as claimed in claim 7, wherein the model generating unit (218) is further configured to generate a new data model corresponding to the real-time unstructured dataset when the set of PSD scores are less than the predefined threshold PSD score.
10. The data modelling system (102) as claimed in claim 6, wherein the grouping unit (216) groups the plurality of unstructured datasets into a set of clusters based on a K-means clustering technique.

Documents

Application Documents

#	Name	Date
1	202221007074-STATEMENT OF UNDERTAKING (FORM 3) [10-02-2022(online)].pdf	2022-02-10
2	202221007074-REQUEST FOR EXAMINATION (FORM-18) [10-02-2022(online)].pdf	2022-02-10
3	202221007074-POWER OF AUTHORITY [10-02-2022(online)].pdf	2022-02-10
4	202221007074-FORM 18 [10-02-2022(online)].pdf	2022-02-10
5	202221007074-FORM 1 [10-02-2022(online)].pdf	2022-02-10
6	202221007074-DRAWINGS [10-02-2022(online)].pdf	2022-02-10
7	202221007074-DECLARATION OF INVENTORSHIP (FORM 5) [10-02-2022(online)].pdf	2022-02-10
8	202221007074-COMPLETE SPECIFICATION [10-02-2022(online)].pdf	2022-02-10
9	202221007074-Proof of Right [11-02-2022(online)].pdf	2022-02-11
10	Abstract1.jpg	2022-06-11
11	202221007074-FER.pdf	2025-03-03

Search Strategy

1	SearchHistoryE_06-03-2024.pdf