System And Method Of Managing Data Cleaning

Abstract: The present invention relates to a system (108) and a method (600) of managing data cleaning. The method (600) includes the step of receiving data from one or more data sources (110). The method (600) further includes the step of training a model (220) utilizing the data to identify correlations and relationships within the data to enable data cleaning. The method (600) further includes the step of performing the data cleaning based on the identified correlations and relationships.

Patent Information

Application #

Filing Date

07 October 2023

Publication Number

15/2025

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Email

Parent Application

Applicants

JIO PLATFORMS LIMITED

OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA

Inventors

1. Aayush Bhatnagar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

2. Navi Mumbai,

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

3. Jugal Kishore

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

4. Chandra Ganveer

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

5. Sanjana Chaudhary

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

6. Gourav Gurbani

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

7. Yogesh Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

8. Avinash Kushwaha

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

9. Dharmendra Kumar Vishwakarma

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

10. Sajal Soni

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

11. Niharika Patnam

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

12. Shubham Ingle

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

13. Harsh Poddar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

14. Sanket Kumthekar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

15. Mohit Bhanwria

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

16. Shashank Bhushan

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

17. Vinay Gayki

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

18. Aniket Khade

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

19. Durgesh Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

20. Zenith Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

21. Gaurav Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

22. Manasvi Rajani

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

23. Kishan Sahu

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

24. Sunil Meena

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

25. Supriya Kaushik De

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

26. Kumar Debashish

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

27. Mehul Tilala

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

28. Satish Narayan

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

29. Rahul Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

30. Harshita Garg

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

31. Kunal Telgote

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

32. Ralph Lobo

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

33. Girish Dange

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003

COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD OF MANAGING DATA CLEANING
2. APPLICANT(S)
NAME: JIO PLATFORMS LIMITED
NATIONALITY: INDIAN
ADDRESS: OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3. PREAMBLE TO THE DESCRIPTION

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.

FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication systems and, more particularly, to a method and a system for managing data cleaning.
BACKGROUND OF THE INVENTION
[0002] Generally, telecommunication networks include processing systems for executing a diverse range of algorithms and predictive tasks, including anomaly detection. These processing systems may be powered by Large Language Models (LLMs) and their functions may include conducting thorough analysis of network and operational data using Machine Learning (ML) techniques to extract deep insights into the network data.
[0003] Input network data utilized for training the ML models is expected to be well-defined and cleansed. As such, data cleaning is a crucial step that includes identification and rectification or removal of inaccurate records, inconsistencies, errors, and other noise in the dataset. The quality and reliability of the dataset significantly impact the performance and dependability of ML models. However, in traditional systems, the process of data cleaning is typically carried out manually, which is time-consuming and demands substantial human resources. This is because multiple parts of the data must be checked during cleaning. Further, there is a need to identify inconsistent, inaccurate, duplicate, or erroneous records, or the presence of any noise, in the dataset.
[0004] There is, therefore, a need for effective solutions for integrating data sources and cleaning data for Machine Learning (ML) models.
SUMMARY OF THE INVENTION
[0005] One or more embodiments of the present disclosure provide a method and a system for managing data cleaning.
[0006] In one aspect of the present invention, the method for managing the data cleaning is disclosed. The method includes the step of receiving, by one or more processors, the data from one or more data sources. The method further includes the step of training, by the one or more processors, a model utilizing the data to identify correlations and relationships within the data to enable data cleaning, wherein the correlations and the relationships pertain to at least one of duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data. The method further includes the step of performing, by the one or more processors, the data cleaning based on the identified correlations and relationships.
[0007] In another embodiment, the one or more data sources include one or more network functions, and the data is received from the one or more data sources either on receipt of a request or on a continuous basis.
[0008] In yet another embodiment, the method comprises the steps of retraining, by the one or more processors, the model based on updated data received from the one or more data sources; identifying, by the one or more processors, updated correlations and relationships within the updated data; and refining, by the one or more processors, the data cleaning process based on the updated correlations and relationships within the updated data.
[0009] In yet another embodiment, the updated data corresponds to latest data received from the one or more data sources.
[0010] In another aspect of the present invention, the system for managing the data cleaning is disclosed. The system includes a receiving unit configured to receive the data from one or more data sources. The system further includes a training unit configured to train a model utilizing the data to identify correlations and relationships within the data to enable data cleaning, wherein the correlations and the relationships pertain to at least one of duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data. The system further includes a data cleaning unit configured to perform the data cleaning based on the identified correlations and relationships.
[0011] In yet another aspect of the present invention, a non-transitory computer-readable medium having stored thereon computer-readable instructions is disclosed. The instructions, when executed by a processor, configure the processor to receive data from one or more data sources. The processor is further configured to train a model utilizing the data to identify correlations and relationships within the data to enable data cleaning, wherein the correlations and the relationships pertain to at least one of duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data. The processor is further configured to perform data cleaning based on the identified correlations and relationships.
[0012] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0014] FIG. 1 is an exemplary block diagram of an environment for managing data cleaning, according to one or more embodiments of the present invention;
[0015] FIG. 2 is an exemplary block diagram of a system for managing the data cleaning, according to one or more embodiments of the present invention;
[0016] FIG. 3 is an exemplary architecture of the system of FIG. 2, according to one or more embodiments of the present invention;
[0017] FIG. 4 is an exemplary architecture for managing the data cleaning, according to one or more embodiments of the present disclosure;
[0018] FIG. 5 is an exemplary signal flow diagram illustrating the flow for managing the data cleaning, according to one or more embodiments of the present disclosure; and
[0019] FIG. 6 is a flow diagram of a method for managing the data cleaning, according to one or more embodiments of the present invention.
[0020] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0021] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0022] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure including the definitions listed here below are not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0023] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0024] Various embodiments of the present invention provide a system and a method for managing the data cleaning. The system is configured to perform automated data cleaning. The system identifies and rectifies inaccuracies and inconsistencies in the input network data without the need for manual intervention. In particular, the present invention identifies, corrects, or removes inaccurate records, inconsistencies, and errors in incoming raw or analyzed network data in near real-time. Further, the system utilizes a model which dynamically adjusts its ability to detect and handle various types of incorrect, inconsistent, duplicate, or erroneous data. The adaptability of the model ensures that the system remains effective in identifying and addressing new types of data issues as data patterns evolve with time.
[0025] FIG. 1 illustrates an exemplary block diagram of an environment 100 for managing the data cleaning, according to one or more embodiments of the present invention. The environment 100 includes a User Equipment (UE) 102, a server 104, a network 106, a system 108, and one or more data sources 110. Herein, managing the data cleaning refers to removing or correcting at least one of, but not limited to, duplicate data, blank data, inaccurate records, inconsistent data and erroneous data included in the data.
[0026] For the purpose of description and explanation, the description will be explained with respect to one or more User Equipments (UEs) 102, or to be more specific with respect to a first UE 102a, a second UE 102b, and a third UE 102c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the at least one UE 102, namely the first UE 102a, the second UE 102b, and the third UE 102c, is configured to connect to the server 104 via the network 106.
[0027] In an embodiment, each of the first UE 102a, the second UE 102b, and the third UE 102c is one of, but not limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more such devices, such as a smartphone, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a laptop, a general-purpose computer, a desktop, a personal digital assistant, a tablet computer, a mainframe computer, or any other computing device.
[0028] The network 106 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 106 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0029] The network 106 may also include, by way of example but not limitation, at least a portion of one or more networks having one or more nodes that transmit, receive, forward, generate, buffer, store, route, switch, process, or a combination thereof, etc. one or more messages, packets, signals, waves, voltage or current levels, some combination thereof, or so forth.
[0030] The environment 100 includes the server 104 accessible via the network 106. The server 104 may include, by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, a processor executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, or some combination thereof. In an embodiment, the server 104 may be associated with an entity that includes, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise side, a defense facility side, or any other facility that provides service.
[0031] The environment 100 further includes the one or more data sources 110. In one embodiment, the one or more data sources 110 are origins from which the data is collected and utilized for at least one of, but not limited to, analysis, research, and decision-making. In one embodiment, the one or more data sources 110 are at least one of, but not limited to, network functions, network elements, and cell towers. In particular, the one or more data sources 110 include sources located within the network 106 and outside the network 106.
[0032] The environment 100 further includes the system 108 communicably coupled to the server 104, the UE 102, and the one or more data sources 110 via the network 106. The system 108 is adapted to be embedded within the server 104 or to be implemented as an individual entity.
[0033] Operational and construction features of the system 108 will be explained in detail with respect to the following figures.
[0034] FIG. 2 is an exemplary block diagram of the system 108 for managing the data cleaning, according to one or more embodiments of the present invention.
[0035] As per the illustrated and preferred embodiment, the system 108 for managing the data cleaning, includes one or more processors 202, a memory 204, a storage unit 206 and a model 220. The one or more processors 202, hereinafter referred to as the processor 202, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions. However, it is to be noted that the system 108 may include multiple processors as per the requirement and without deviating from the scope of the present disclosure. Among other capabilities, the processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 204.
[0036] As per the illustrated embodiment, the processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 204 as the memory 204 is communicably connected to the processor 202. The memory 204 is configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed for managing the data cleaning. The memory 204 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as disk memory, EPROMs, FLASH memory, unalterable memory, and the like.
[0037] As per the illustrated embodiment, the storage unit 206 is configured to store data received from the one or more data sources 110. The storage unit 206 is at least one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a No-Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of the storage unit 206 types are non-limiting and may not be mutually exclusive, e.g., the database can be both commercial and cloud-based, or both relational and open-source, etc.
[0038] As per the illustrated embodiment, the system 108 includes the model 220. Herein, the model 220 is at least one of, but not limited to, an Artificial Intelligence/Machine Learning (AI/ML) model. In an alternate embodiment, the system 108 includes a plurality of models 220. The model 220 is a machine learning model that performs tasks such as recognizing patterns, making predictions, solving problems, enhancing decision-making, and providing insights across various fields. For example, the model 220 facilitates solving real-world problems without extensive manual intervention.
[0039] As per the illustrated embodiment, the system 108 includes the processor 202 for managing the data cleaning. The processor 202 includes a receiving unit 208, a training unit 210, a data cleaning unit 212, and a retraining unit 214. The processor 202 is communicably coupled to the one or more components of the system 108 such as the memory 204, the storage unit 206 and the model 220. In an embodiment, operations and functionalities of the receiving unit 208, the training unit 210, the data cleaning unit 212, the retraining unit 214, and the one or more components of the system 108 can be used in combination or interchangeably.
[0040] In one embodiment, initially the receiving unit 208 of the processor 202 is configured to receive the data from one or more data sources 110. Herein, the one or more data sources 110 are at least one of, but not limited to, one or more network functions. The one or more network functions refer to the various tasks or services that the network 106 performs to facilitate communication and data transfer between devices. The one or more network functions are implemented in hardware or software and are essential for managing network operations effectively. For example, the one or more network functions are deployed as physical devices such as routers and switches, or as software applications such as Virtualized Network Functions (VNFs). In yet another example, the one or more network functions include at least one of, but not limited to, an Access and Mobility Management Function (AMF), a Session Management Function (SMF) and a Policy Control Function (PCF). The receiving unit 208 receives data from the one or more data sources 110 which are present within the network 106 and outside the network 106. In particular, the data from the one or more data sources 110 is received to train the model 220. In one embodiment, the data received from the one or more data sources 110 includes at least one of, but not limited to, network performance data, UE 102 data, subscriber data, and historical data related to the one or more data sources 110.
[0041] In one embodiment, the receiving unit 208 receives the data from the one or more data sources 110 based on one or more requests transmitted by the receiving unit 208 to the one or more data sources 110. In another embodiment, the one or more data sources 110 are configured to continuously transmit the data to the receiving unit 208 on a periodic basis. For example, the receiving unit 208 receives the data from the one or more data sources 110 every hour.
[0042] Upon receiving the data from the one or more data sources 110, the training unit 210 of the processor 202 is configured to train the model 220 utilizing the data to identify correlations and relationships within the data to enable data cleaning. In an alternate embodiment, the system 108 includes a plurality of models 220 from which the training unit 210 selects an appropriate model 220 for training. Thereafter, the selected model 220 is trained using the data to identify correlations and relationships within the data to enable data cleaning. In one embodiment, the relationship describes how two or more variables are connected within the data. Herein, the relationship is at least one of a linear relationship and a nonlinear relationship. In one embodiment, the correlation quantifies the strength and direction of the relationship between two variables. Herein, the correlation is at least one of a positive correlation and a negative correlation. For example, let us assume that the data is received from the one or more network functions and the data includes variables related to traffic and latency within the one or more network functions. In one scenario, when the traffic increases, the latency also tends to increase. In this way, the relationship and the correlation among the data are identified.
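The traffic and latency example can be illustrated with a Pearson correlation coefficient, one common way to quantify the strength and direction of a linear relationship. The following is a minimal sketch only: the specification does not name a particular correlation measure, and the function name `pearson` and the sample values are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical samples from a network function: traffic (Mbps) and latency (ms).
traffic = [100, 200, 300, 400, 500]
latency = [2.0, 2.5, 3.1, 3.4, 4.0]

r = pearson(traffic, latency)
# The sign of r gives the direction; its magnitude gives the strength.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction} correlation)")
```

A coefficient close to +1 here corresponds to the positive correlation described above, where latency rises as traffic rises. A nonlinear relationship would need a different measure.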
[0043] In one embodiment, for training the model 220, the training unit 210 splits the data into at least one of, but not limited to, training data and testing data. Further, the training unit 210 feeds the training data to the model 220. In particular, the training unit 210 uses the training data to teach the model 220 to identify at least one of, but not limited to, correlations, patterns and relationships within the data. Based on the fed training data, the model 220 learns at least one of, but not limited to, one or more patterns/trends, the correlations, and the relationships of the data in the network 106. Subsequent to training, the trained model 220 is fed with the testing data to evaluate performance of the trained model 220. When the trained model 220 generates an output based on the testing data, the training unit 210 evaluates the performance of the trained model 220.
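The split, train, and evaluate sequence described above can be sketched as follows. This is an illustrative sketch under stated assumptions: a simple least-squares line stands in for the model 220, whose actual form the specification leaves open, and the helper names (`split`, `fit_line`, `mean_abs_error`), the 80/20 split, and the synthetic data are hypothetical.

```python
import random

def split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and split them into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(pairs):
    """Least-squares fit of y = slope * x + intercept (stand-in for training)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, pairs):
    """Evaluate the fitted line on held-out pairs."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in pairs) / len(pairs)

# Synthetic (traffic, latency) pairs; values are assumptions for illustration.
data = [(t, 1.0 + 0.005 * t) for t in range(100, 1100, 100)]

train, test = split(data)
model = fit_line(train)               # learn from the training data
error = mean_abs_error(model, test)   # evaluate on the testing data
print(f"train={len(train)} test={len(test)} error={error:.6f}")
```

Keeping the testing data out of the fit is what makes the evaluation meaningful: the error is measured on records the model has not seen.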
[0044] In one embodiment, based on the evaluation of the performance of the trained model 220, the training unit 210 may tune one or more hyperparameters of the trained model 220 to optimize the performance of the trained model 220. Herein, the one or more hyperparameters of the trained model 220 includes at least one of, but not limited to, a learning rate, a batch size, and a number of epochs. In one embodiment, when the performance of the trained model 220 is optimized, then the trained model 220 is used to identify the correlations and the relationships within the data to enable data cleaning.
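One possible way to tune hyperparameters such as the learning rate and the number of epochs is a small grid search scored on held-out data. The sketch below is an assumption made for illustration, not the tuning procedure of the specification: it fits a one-parameter model by gradient descent and keeps the combination with the lowest validation error. The function names and the candidate values are hypothetical.

```python
from itertools import product

def train(xs, ys, learning_rate, epochs):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mse(w, xs, ys):
    """Mean squared error of the fitted weight on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy training and validation data following y = 2x (illustrative only).
train_x, train_y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

# Candidate (learning rate, epochs) combinations; values are assumptions.
grid = product([0.001, 0.01, 0.1], [10, 100])
best = min(grid, key=lambda hp: mse(train(train_x, train_y, *hp), val_x, val_y))
best_w = train(train_x, train_y, *best)
print(f"best learning rate={best[0]}, epochs={best[1]}, w={best_w:.4f}")
```

Batch size is omitted because this toy model uses the full training set in every step; a mini-batch variant would add it as a third grid dimension.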
[0045] In one embodiment, the trained model 220 identifies the patterns, the correlations and the relationships within the data by applying one or more logics. In one embodiment, the one or more logics may include at least one of, but not limited to, k-means clustering, hierarchical clustering, Principal Component Analysis (PCA), Independent Component Analysis (ICA), deep learning logics such as Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Generative Adversarial Networks (GANs), Q-Learning, Deep Q-Networks (DQN), reinforcement learning logics, etc.
[0046] Based on the identified/learned patterns, the correlations and the relationships within the data, the model 220 identifies at least one of, but not limited to, duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data within the data received from the one or more data sources 110. Hereinafter, the at least one of, but not limited to, the duplicate data, the blank data, the inaccurate records, the inconsistent data, and the erroneous data is referred to as the unwanted data without limiting the scope of the invention.
[0047] For example, let us assume that the one or more network functions are the one or more data sources 110, and that the data received from the one or more network functions pertains to performance metrics of the one or more network functions, which include data associated with at least one of, but not limited to, latency, throughput, packet loss, resource utilization, and bandwidth. Herein, the data pertaining to the performance metrics is received in at least one of, but not limited to, a tabular format including rows and columns. Further, let us assume that the trained model 220 has learnt the patterns, the correlations and the relationships related to the performance metrics of the one or more network functions. Based on the learnt patterns, the correlations and the relationships, the trained model 220 checks the rows and the columns of each of the performance metrics. While checking, the trained model 220 identifies the presence of the duplicate data within the rows and the columns of the performance metrics. For instance, while checking the row and the column of the resource utilization, the trained model 220 identifies repeated entries.
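As a simplified illustration of the duplicate check, the sketch below flags rows that exactly repeat an earlier row. It is a rule-based stand-in, not the learnt behaviour of the trained model 220; the function name `find_duplicate_rows`, the column names, and the sample rows are assumptions.

```python
def find_duplicate_rows(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        # Sort the items so that column order does not affect the comparison.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

# Hypothetical resource-utilization rows reported by network functions.
rows = [
    {"nf": "AMF", "cpu_pct": 41.0},
    {"nf": "SMF", "cpu_pct": 37.5},
    {"nf": "AMF", "cpu_pct": 41.0},  # repeats the first row
]
dup = find_duplicate_rows(rows)
print(dup)
```

Only the later copies are reported, so the first occurrence of each row can be kept when the duplicates are removed.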
[0048] In one example, while checking the row and the column of the latency, the trained model 220 identifies that there are no entries available in the row and the column of the latency. In other words, the row and the column of the latency includes null values or are blank. So, the trained model 220 infers the null values as the blank data.
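The blank-data check can be sketched in the same simplified way. The example below treats a missing value, an empty or whitespace-only string, and a floating-point NaN as blank; that definition, the function name `find_blank_cells`, and the sample rows are assumptions for illustration.

```python
import math

def is_blank(value):
    """True for None, empty or whitespace-only strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, float):
        return math.isnan(value)
    return False

def find_blank_cells(rows, columns):
    """Return (row index, column name) for every blank cell."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column in columns
        if is_blank(row.get(column))
    ]

# Hypothetical latency rows; the second and third rows contain blank entries.
rows = [
    {"nf": "AMF", "latency_ms": 2.3},
    {"nf": "SMF", "latency_ms": None},
    {"nf": "", "latency_ms": float("nan")},
]
blanks = find_blank_cells(rows, ["nf", "latency_ms"])
print(blanks)
```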
[0049] In yet another example, based on the learnt patterns, the correlations and the relationships, the trained model 220 checks the rows and the columns of the performance metrics. Herein, based on the learnt patterns, the correlations and the relationships, the entries within the rows and the columns should be in the same case. While checking, the trained model 220 identifies that the entries within the rows and the columns are in a combination of cases, which includes uppercase and lowercase. So, the trained model 220 infers the entries with the combination of cases as the inconsistent data.
[0050] In yet another example, based on the learnt patterns, the correlations and the relationships, the trained model 220 has set an ideal range for values of the performance metrics of the one or more network functions. While checking the entries of the performance metrics within the rows and the columns, the trained model 220 identifies that the values of the performance metrics are not within the ideal range. Let us consider an ideal range for the latency of 1 millisecond (ms) to 5 ms. While checking, the trained model 220 identifies that the latency ranges from 20 ms to 25 ms, which is not within the ideal range. So, the trained model 220 infers the entries of the latency as at least one of the inaccurate records and the erroneous data.
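A simplified version of the case-consistency check is sketched below. It assumes that the dominant case style in a column is the expected one and flags entries that deviate from it; this heuristic, and the names `case_style` and `find_case_outliers`, are illustrative assumptions rather than the learnt behaviour of the trained model 220.

```python
from collections import Counter

def case_style(text):
    """Classify a string as 'upper', 'lower', or 'mixed'."""
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    return "mixed"

def find_case_outliers(values):
    """Return indices of entries whose case differs from the dominant style."""
    styles = [case_style(value) for value in values]
    dominant, _ = Counter(styles).most_common(1)[0]
    return [index for index, style in enumerate(styles) if style != dominant]

# Hypothetical network-function names; the third entry is lowercase.
names = ["AMF", "SMF", "pcf", "AMF"]
outliers = find_case_outliers(names)
print(outliers)
```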
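The range check in this example can be sketched as follows, using the 1 ms to 5 ms latency range from the text. In the described system the range is derived from learnt patterns; here it is passed in as fixed bounds, and the function name `out_of_range` and the sample values are assumptions.

```python
def out_of_range(values, low, high):
    """Return indices of values that fall outside the inclusive range."""
    return [index for index, value in enumerate(values) if not (low <= value <= high)]

# Hypothetical latency samples in milliseconds.
latency_ms = [2.1, 3.4, 22.0, 4.8, 25.0]

# Ideal range taken from the example above: 1 ms to 5 ms.
flagged = out_of_range(latency_ms, 1.0, 5.0)
print(flagged)
```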
[0051] Upon identifying the unwanted data within the data received from the one or more data sources 110, the data cleaning unit 212 of the processor 202 is configured to perform data cleaning to remove the unwanted data based on the identified patterns, the correlations and the relationships within the data. In an alternate embodiment, the data cleaning unit 212 of the processor 202 is configured to perform data cleaning to correct the unwanted data. Herein, the data cleaning is performed in real time. In one embodiment, the data cleaning unit 212 is configured to preprocess the unwanted data to ensure the data consistency and quality within the system 108. Herein, the preprocessing includes operations performed on the unwanted data such as at least one of, but not limited to, correcting data errors, and removing/deleting the unwanted data. In an alternate embodiment, the data cleaning unit 212 performs at least one of, but not limited to, data normalization.
[0052] In one embodiment, while preprocessing, the data cleaning unit 212 performs at least one of, but not limited to, reorganizing the unwanted data, removing the redundant data within the unwanted data, removing null values from the unwanted data, and handling missing values in the unwanted data. The main goal of the data cleaning unit 212 is to achieve a standardized data format of the data with no errors. The data cleaning unit 212 eliminates duplicate data and inconsistencies, which reduces manual effort. Subsequent to preprocessing, the data is referred to as the pre-processed data. The data cleaning unit 212 ensures that the pre-processed data is stored appropriately in at least one of, the storage unit 206, and that the pre-processed data is ready for subsequent retrieval and further model 220 training. In one embodiment, the data cleaning unit 212 uses the trained model 220 to perform the data cleaning in near real-time. The data cleaning unit 212 identifies and removes/corrects the unwanted data, thereby ensuring that the data used for training remains up-to-date and accurate, leading to more reliable model 220 training.
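As an illustration of handling missing values and of the data normalization mentioned above, the sketch below fills blanks with the median of the values that are present and then rescales a column to the range 0 to 1. The specification does not say which techniques are used, so median imputation and min-max scaling are assumptions, as are the function names `impute_missing` and `min_max_normalize`.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the values that are present.

    Median imputation is an assumed technique chosen for illustration.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    fill = median(present)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # avoid division by zero on constant columns
    return [(v - lowest) / (highest - lowest) for v in values]
```

For a latency column `[2.0, None, 4.0]`, `impute_missing` yields `[2.0, 3.0, 4.0]`, and `min_max_normalize` of that result yields `[0.0, 0.5, 1.0]`.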
[0053] Upon performing the data cleaning, a retraining unit 214 of the processor 202 is configured to retrain the model 220 based on updated data received from the one or more data sources 110. In particular, the receiving unit 208 is configured to receive the updated data from the one or more data sources 110 on periodic basis. Herein, the updated data corresponds to latest data received from the one or more data sources 110. Utilizing the updated data the retraining unit 214 trains the model 220 periodically. In particular, the updated data is fed to the model 220 by the retraining unit 214, so that based on the updated data, the model 220 is trained again. Herein, the model 220 keeps on training and updating itself based on the updated data received from the one or more data sources 110.
[0054] Upon retraining the model 220 with the updated data, the retraining unit 214 is further configured to identify the updated patterns, correlations and relationships within the updated data using the retrained model 220. In one embodiment, the updated patterns, correlations and relationships within the updated data are identified by the retrained model 220, similar to how the previous patterns, correlations and relationships were determined by the same trained model 220 without limiting the scope of the invention.
[0055] Upon identifying the updated patterns, correlations and relationships within the updated data, the retraining unit 214 is further configured to refine the data cleaning process using the retrained model 220, based on the updated patterns, correlations and relationships within the updated data. For example, if the retrained model 220 identifies that the patterns, the correlations and the relationships within the updated data have changed based on a comparison with the previous patterns, the correlations and the relationships, then the retraining unit 214 refines the data cleaning process using the retrained model 220. In other words, the retraining unit 214 and the model 220 dynamically adjust their ability to identify and handle unwanted data. This adaptability ensures that the system 108 remains effective in identifying and addressing new types of data issues as at least one of, the patterns, the correlations and the relationships related to the data evolves with time.
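The comparison between previous and updated patterns can be sketched as follows. Here the "pattern" is reduced to an accepted value range learnt as the mean plus or minus three standard deviations. That choice, the tolerance value, and the names `fit_range` and `refine` are assumptions for illustration, since the specification does not fix a model type.

```python
from statistics import mean, pstdev


def fit_range(values, k=3.0):
    """Learn an accepted range as mean +/- k population standard deviations."""
    centre = mean(values)
    spread = pstdev(values)
    return (centre - k * spread, centre + k * spread)


def refine(current_range, updated_values, tolerance=0.5):
    """Retrain on updated data and adopt the new range only if it has moved.

    Returns (range_to_use, changed). The tolerance is a hypothetical
    threshold for deciding that the learnt pattern has changed.
    """
    new_range = fit_range(updated_values)
    changed = any(
        abs(old - new) > tolerance
        for old, new in zip(current_range, new_range)
    )
    return (new_range, True) if changed else (current_range, False)
```

If the range was first learnt from latencies around 3 ms and the updated data clusters around 22 ms, `refine` reports a change and returns the new range, so later cleaning passes no longer treat 22 ms as erroneous.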
[0056] Herein, refining the data cleaning process is related to performing the data cleaning process associated with the updated data. In one embodiment, subsequent to refining the data cleaning process using the retrained model 220, the data cleaning unit 212 is configured to perform data cleaning to remove the unwanted data from the updated data. In an alternate embodiment, the data cleaning unit 212 of the processor 202 is configured to perform data cleaning to correct the unwanted data from the updated data. Upon cleaning the updated data, the data is inferred as the pre-processed data. Further, the data cleaning unit 212 ensures that the pre-processed data is stored appropriately in at least one of, the storage unit 206 and the pre-processed data is ready for subsequent retrieval and further model 220 training.
[0057] Upon performing the data cleaning process on at least one of, the received data and the updated data, the pre-processed data is used by the at least one of, training unit 210 and the retraining unit 214 for training the model 220. In particular, the pre-processed data is directly used for training the model 220 without going through the data cleaning part manually. Further the trained model 220 is utilized for at least one of, but not limited to, detecting one or more anomalies in the network 106 and predictive analysis. Advantageously, the system 108 automates the data cleaning process that eliminates the need for manual data cleaning, thereby saving time and manual effort.
[0058] The receiving unit 208, the training unit 210, the data cleaning unit 212, and the retraining unit 214 in an exemplary embodiment, are implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 202. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 202 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the processor may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 204 may store instructions that, when executed by the processing resource, implement the processor 202. In such examples, the system 108 may comprise the memory 204 storing the instructions and the processing resource to execute the instructions, or the memory 204 may be separate but accessible to the system 108 and the processing resource. In other examples, the processor 202 may be implemented by electronic circuitry.
[0059] FIG. 3 illustrates an exemplary architecture for the system 108, according to one or more embodiments of the present invention. More specifically, FIG. 3 illustrates the system 108 for managing the data cleaning. It is to be noted that the embodiment with respect to FIG. 3 will be explained with respect to the UE 102 for the purpose of description and illustration and should nowhere be construed as limited to the scope of the present disclosure.
[0060] FIG. 3 shows communication between the UE 102, the system 108, and the one or more data sources 110. For the purpose of description of the exemplary embodiment as illustrated in FIG. 3, the UE 102, uses network protocol connection to communicate with the system 108, and the one or more data sources 110. In an embodiment, the network protocol connection is the establishment and management of communication between the UE 102, the system 108, and the one or more data sources 110 over the network 106 (as shown in FIG. 1) using a specific protocol or set of protocols. The network protocol connection includes, but not limited to, Session Initiation Protocol (SIP), System Information Block (SIB) protocol, Transmission Control Protocol (TCP), User Datagram Protocol (UDP), File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), Simple Network Management Protocol (SNMP), Internet Control Message Protocol (ICMP), Hypertext Transfer Protocol Secure (HTTPS) and Terminal Network (TELNET).
[0061] In an embodiment, the UE 102 includes a primary processor 302, a memory 304, and a User Interface (UI) 306. In alternate embodiments, the UE 102 may include more than one primary processor 302 as per the requirement of the network 106. The primary processor 302 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0062] In an embodiment, the primary processor 302 is configured to fetch and execute computer-readable instructions stored in the memory 304. The memory 304 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed for managing the data cleaning. The memory 304 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as disk memory, EPROMs, FLASH memory, unalterable memory, and the like.
[0063] In an embodiment, the User Interface (UI) 306 includes a variety of interfaces, for example, a graphical user interface, a web user interface, a Command Line Interface (CLI), and the like. The UI 306 of the UE 102 allows the user to transmit data to the system 108 for the data cleaning. Herein, the UE 102 acts as at least one data source 110. In one embodiment, the user may be at least one of, but not limited to, a network operator. In an embodiment, the user visualizes the pre-processed or cleaned data on the UI 306.
[0064] As mentioned earlier with reference to FIG. 2, the system 108 includes the processor 202 and the memory 204 for managing the data cleaning, which are already explained with reference to FIG. 2. For the sake of brevity, a similar description related to the working and operation of the system 108 as illustrated in FIG. 2 has been omitted to avoid repetition.
[0065] Further, as mentioned earlier the processor 202 includes the receiving unit 208, the training unit 210, the data cleaning unit 212, and the retraining unit 214 which are already explained in FIG. 2. Hence, for the sake of brevity, a similar description related to the working and operation of the system 108 as illustrated in FIG. 2 has been omitted to avoid repetition. The limited description provided for the system 108 in FIG. 3, should be read with the description provided for the system 108 in the FIG. 2 above, and should not be construed as limiting the scope of the present disclosure.
[0066] FIG. 4 is an exemplary architecture 400 of the system 108 for managing the data cleaning, according to one or more embodiments of the present disclosure.
[0067] The architecture 400 includes the one or more data sources 110 such as NF 402a, NF 402b and a NF 402c. Further, the architecture 400 includes a data integration unit 412, a data pre-processing unit 404, a model training unit 406, a prediction unit 408, a data lake 410 and the UI 306 communicably coupled to each other via the network 106.
[0068] In one embodiment, the NF 402a, the NF 402b and the NF 402c are various origins from which the data is collected, and the collected data is used for at least one of, model 220 training, analysis or other purposes. Herein, the data collection typically involves gathering performance metrics and logs from the NF 402a, the NF 402b and the NF 402c within the network 106.
[0069] In one embodiment, the data integration unit 412 integrates the data received from at least one of, the NF 402a, the NF 402b and the NF 402c within the network 106 and the one or more data sources 110 outside the network 106. Herein, integrating data involves combining data from at least one of, the NF 402a, the NF 402b and the NF 402c to provide a unified view or to enable comprehensive analysis. The process of integrating data is essential for gaining insights, improving decision-making, and ensuring consistency across the system 108.
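A minimal sketch of such a unified view is given below, assuming each network function delivers a list of records as dictionaries. The record fields, the source labels, and the function name `integrate` are hypothetical and are not taken from the specification.

```python
def integrate(sources):
    """Combine per-source record lists into one list tagged with the origin.

    `sources` maps a source label (for example a network function name)
    to a list of record dictionaries. Both shapes are assumptions.
    """
    unified = []
    for source_name, records in sources.items():
        for record in records:
            # Keep the origin with each record so later cleaning steps can
            # still tell which network function produced it.
            unified.append({"source": source_name, **record})
    return unified
```

For example, `integrate({"nf-402a": [{"latency_ms": 2.0}], "nf-402b": [{"latency_ms": 3.5}]})` returns two records, each carrying its `source` label alongside the metric.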
[0070] In one embodiment, the data pre-processing unit 404 preprocesses the data received from the at least one of, the NF 402a, the NF 402b and the NF 402c. The data pre-processing unit 404 performs the data cleaning process to identify inaccurate records, inconsistencies, and errors and further removes or corrects the data among the data received from the at least one of, the NF 402a, the NF 402b and the NF 402c. For example, the data undergoes preprocessing to ensure data consistency within the system 108. In particular, the preprocessing involves tasks like data cleaning, normalization, removing unwanted data like outliers, duplicate records and handling missing values.
[0071] In one embodiment, the model training unit 406 trains the model 220 using the pre-processed data stored in the storage unit 112. In particular, the model 220 is trained by the model training unit 406 using the collected data to discover patterns and connections between different variables. Because the model 220 is trained using the pre-processed data, the trained model 220 can be used for various purposes such as predictive analysis and detection of the one or more anomalies.
[0072] In one embodiment, the prediction unit 408 performs at least one of, but not limited to, the predictive analysis and the detection of the one or more anomalies utilizing the trained model 220 and generates an output accordingly.
[0073] In one embodiment, the data lake 410 includes a structured collection of pre-processed data that is managed and organized in a way that allows the system 108 easy access, retrieval, and manipulation. The data lake 410 is used to store, manage, and retrieve large amounts of information efficiently. The data lake 410 stores the pre-processed data and the outputs generated by the model 220 and the prediction unit 408. Further, based on the request provided by the user, the notifications regarding the outputs generated by the model 220 and the prediction unit 408 are provided to the user via the UI 306. Herein, the users are the network operators who analyze the outputs to resolve the significant issues/one or more anomalies.
[0074] FIG. 5 is a signal flow diagram illustrating the flow for managing the data cleaning, according to one or more embodiments of the present disclosure.
[0075] At step 502, the system 108 receives data from at least one of, the one or more data sources 110 present within the network 106 and the one or more data sources 110 present outside the network 106. In one embodiment, in order to retrieve data from at least one of, the one or more data sources 110, the system 108 transmits at least one of, but not limited to, an HTTP request to the one or more data sources 110. Herein, the connection between the system 108 and the one or more data sources 110 is established before retrieving the data. In one embodiment, the retrieved data is the raw data or analyzed input from the one or more data sources 110. The retrieved data serves as the foundation for training the model 220.
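The retrieval in step 502 can be sketched with the Python standard library, under the assumption that a data source exposes its metrics as a JSON array over HTTP. The specification names only the HTTP request; the JSON payload shape, the `Accept` header, and the function names `build_request`, `parse_metrics` and `fetch_metrics` are hypothetical.

```python
import json
import urllib.request


def build_request(url):
    """Build a GET request asking the data source for JSON (assumed format)."""
    return urllib.request.Request(
        url,
        headers={"Accept": "application/json"},
        method="GET",
    )


def parse_metrics(payload):
    """Decode a response body that is assumed to be a JSON array of records."""
    records = json.loads(payload.decode("utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of metric records")
    return records


def fetch_metrics(url, timeout=5.0):
    """Send the request and return the parsed records.

    The connection is opened by urlopen before any data is read, which
    mirrors the connection being established before retrieval.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return parse_metrics(response.read())
```

Splitting request construction and payload parsing from the network call keeps the two pure steps easy to check without a live data source.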
[0076] At step 504, the system 108 trains the model 220 using the received data. In particular, the model 220 is trained to identify the unwanted data among the received data. The trained model 220 learns the patterns, the correlations, and the relationships within the received data to enable data cleaning abilities. Based on training, the trained model 220 identifies the unwanted data which includes at least one of, but not limited to, the duplicate data, inaccurate data, inconsistencies, the blank data and the erroneous data.
[0077] At step 506, the system 108 performs the data cleaning process. While performing the data cleaning process, the system 108 removes the unwanted data or corrects the data. Herein, the system 108 makes the data ready for training the model 220. In one embodiment, the process of receiving data from the one or more data sources 110, training the model 220 and data cleaning is an iterative process. In particular, the system 108 continues to receive new data, retrain the model 220 periodically, and refine the process of data cleaning as at least one of, the patterns, the correlations, and the relationships within the received new data changes. The system 108 adapts as per the change in the at least one of, the patterns, the correlations, and the relationships within the received new data. This adaptability ensures that the system 108 remains effective in identifying and addressing new types of data issues as data patterns evolve.
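The iterative flow of steps 502 to 506 can be sketched as a loop over incoming batches, where each batch is used to relearn an accepted range and is then cleaned against it. The quartile-based range (often called Tukey fences) is an assumed stand-in for the model 220, and the names `train`, `clean` and `run_pipeline` are hypothetical.

```python
from statistics import quantiles


def train(values, k=1.5):
    """Learn an accepted range from the quartiles of the batch.

    Requires at least two values. The factor k=1.5 is the conventional
    choice for this rule and is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return (q1 - k * spread, q3 + k * spread)


def clean(values, learned_range):
    """Keep only the values that fall inside the learnt range."""
    low, high = learned_range
    return [v for v in values if low <= v <= high]


def run_pipeline(batches):
    """Receive each batch, retrain on it, then clean it (steps 502 to 506)."""
    cleaned_batches = []
    for batch in batches:
        present = [v for v in batch if v is not None]  # drop blank entries
        learned = train(present)
        cleaned_batches.append(clean(present, learned))
    return cleaned_batches
```

With a first batch centred near 3 ms and a second batch centred near 22 ms, the loop removes 22.0 from the first batch and 2.0 from the second, which shows the learnt range following the data as it shifts.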
[0078] FIG. 6 is a flow diagram of a method 600 for managing the data cleaning, according to one or more embodiments of the present invention. For the purpose of description, the method 600 is described with the embodiments as illustrated in FIG. 2 and should nowhere be construed as limiting the scope of the present disclosure.
[0079] At step 602, the method 600 includes the step of receiving data from the one or more data sources 110. In one embodiment, the receiving unit 208 retrieves the data from the one or more data sources 110. For example, the receiving unit 208 receives data from at least one of, but not limited to, the network functions periodically. Herein, the data is at least one of, but not limited to, the network performance data, the subscriber data, and the UE 102 data.
[0080] At step 604, the method 600 includes the step of training the model 220 utilizing the data to identify correlations and relationships within the data to enable data cleaning. In one embodiment, the training unit 210 is configured to train the model 220 utilizing the received data to identify correlations and relationships within the data. For example, let us assume that the received data is associated with the one or more network functions and the received data includes at least one of, the duplicate data. Based on the learned patterns, the correlations and the relationships, the trained model identifies the duplicate data within the data received from the one or more network functions such as the data includes the repeated and same entries of the performance metrics of the one or more network functions.
[0081] In another example, the data received from the one or more network functions includes at least one of, the inconsistent data such as data is in form of any case which may be uppercase and lowercase and combination thereof. In yet another example, the data received from the one or more network functions includes at least one of, blank data such as some entries are missing within the received data.
[0082] At step 606, the method 600 includes the step of performing the data cleaning. In one embodiment, the data cleaning unit 212 performs the data cleaning. In particular, the data cleaning refers to preprocessing of the data, which prepares the data with no inconsistencies and errors so that, subsequent to preprocessing the data, the preprocessed data is utilized for the model 220 training. In one embodiment, based on training, the model 220 has identified the unwanted data which includes at least one of, the duplicate data, the blank data, the inaccurate records, the inconsistent data, and the erroneous data.
[0083] In one embodiment, while preprocessing, the data cleaning unit 212 removes the identified duplicate data within the data received from the one or more network functions. For example, the repeated or same entries of the performance metrics of the one or more network functions within the data received from the one or more network functions are removed by the data cleaning unit 212. In another embodiment, while preprocessing, the data cleaning unit 212 converts the identified inconsistent data into consistent data. For example, the inconsistent data which is in form of any case is converted into a common case such as all lowercase by the data cleaning unit 212 to make the data consistent. In another embodiment, while preprocessing, the blank data received from the one or more network functions is removed from the data received from the one or more network functions. For example, the entries which are missing within the received data are removed from the data received from the one or more network functions.
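The three cleaning actions of paragraph [0083] can be sketched in a few lines. The row layout of (network function id, label, latency in ms) and the function name `clean_rows` are assumptions for illustration.

```python
def clean_rows(rows):
    """Drop blank rows, convert labels to lowercase, and remove duplicates.

    Each row is assumed to be (network function id, label, latency in ms).
    """
    cleaned = []
    seen = set()
    for nf, label, latency in rows:
        # Blank data: skip rows with a missing label or a missing latency.
        if latency is None or label is None or label.strip() == "":
            continue
        # Inconsistent data: convert the label to a common case (lowercase).
        normalized = (nf, label.strip().lower(), latency)
        # Duplicate data: keep only the first occurrence of each row.
        if normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned
```

The case conversion is applied before the duplicate check so that rows differing only in case, such as `("nf-a", "West", 2.0)` and `("nf-a", "WEST", 2.0)`, are recognized as the same entry.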
[0084] In one embodiment, the one or more data sources 110 transmit the updated data to the system 108 periodically. Based on the updated data received from the one or more data sources 110, the system 108 retrains the model 220 and identifies the updated correlations and relationships within the updated data. Further, the system 108 refines the data cleaning process based on the updated correlations and relationships within the updated data. Herein, subsequent to cleaning the updated data, the updated data is referred to as the pre-processed data. The data cleaning process is thus iterative in nature. The system 108 continues to collect new data, retrain the model 220 periodically, and refine the data cleaning process as the data patterns, the correlations and the relationships change. This adaptability is crucial in maintaining the system's effectiveness.
[0085] Subsequent to cleaning the data, the pre-processed data is used directly for training the model 220 without going through the data cleaning part manually. Further the trained model 220 is used for performing at least one of, but not limited to, predictive analysis and detecting the one or more anomalies in the network 106.
[0086] In yet another aspect of the present invention, a non-transitory computer-readable medium is provided, having stored thereon computer-readable instructions that are executed by a processor 202. When executing the instructions, the processor 202 is configured to receive the data from the one or more data sources 110. The processor 202 is further configured to train the model 220 utilizing the data to identify correlations and relationships within the data to enable data cleaning. Herein, the correlations and the relationships pertain to at least one of, duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data. The processor 202 is further configured to perform the data cleaning based on the identified correlations and relationships.
[0087] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIG.1-6) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0088] The present disclosure provides the technical advancement of efficient data cleaning by automating the data cleaning process through the model, which eliminates the need for manual data cleaning. Due to the automated data cleaning process, time and human resources are saved. With automated data cleaning, the scope for human error is reduced, which ensures a more consistent and reliable data processing pipeline. The model swiftly identifies and rectifies inaccuracies and inconsistencies in the data, streamlining the data preparation stage. The system utilizes the model to perform the data cleaning in real time, identifying and correcting inaccurate records and ensuring that the dataset used for training remains up to date and accurate, leading to more reliable models. By using well-defined and cleansed data for training, the model achieves a higher level of accuracy and efficiency, which leads to more precise predictions and insights, optimizing network and operational data analysis.
[0089] The present invention offers multiple advantages over the prior art and the above listed are a few examples to emphasize on some of the advantageous features. The listed advantages are to be read in a non-limiting manner.

REFERENCE NUMERALS

[0090] Environment - 100;
[0091] User Equipment (UE) - 102;
[0092] Server - 104;
[0093] Network - 106;
[0094] System - 108;
[0095] One or more data sources - 110;
[0096] Processor - 202;
[0097] Memory - 204;
[0098] Storage unit - 206;
[0099] Receiving unit - 208;
[00100] Training unit - 210;
[00101] Data cleaning unit - 212;
[00102] Retraining unit - 214;
[00103] Model - 220;
[00104] Primary Processor - 302;
[00105] Memory - 304;
[00106] User Interface (UI) - 306;
[00107] NF - 402a, 402b, 402c;
[00108] Data integration unit - 412;
[00109] Data pre-processing unit - 404;
[00110] Model training unit - 406;
[00111] Prediction unit - 408;
[00112] Data lake - 410.
CLAIMS
We Claim:
1. A method (600) of managing data cleaning, the method (600) comprising the steps of:
receiving, by one or more processors (202), the data from one or more data sources (110);
training, by the one or more processors (202), a model (220) utilizing the data to identify correlations and relationships within the data to enable data cleaning, wherein the correlations and the relationships pertains to at least one of, duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data; and
performing, by the one or more processors (202), the data cleaning based on the identified correlations and relationships.

2. The method (600) as claimed in claim 1, wherein the one or more data sources (110) includes one or more network functions and wherein the data is received from the one or more data sources (110) on one of a receipt of a request and a continuous basis.

3. The method (600) as claimed in claim 1, wherein the method (600) comprises the steps of:
retraining, by the one or more processors (202), the model (220) based on updated data received from the one or more data sources (110);
identifying, by the one or more processors (202), updated correlations and relationships within the updated data; and
refining, by the one or more processors (202), the data cleaning process based on the updated correlations and relationships within the updated data.

4. The method (600) as claimed in claim 3, wherein the updated data corresponds to a latest data received from the one or more data sources (110).

5. A system (108) for managing data cleaning, the system (108) comprising:
a receiving unit (208) configured to receive, the data from one or more data sources (110);
a training unit (210) configured to train, a model (220) utilizing the data to identify correlations and relationships within the data to enable data cleaning, wherein the correlations and the relationships pertains to at least one of, duplicate data, blank data, inaccurate records, inconsistent data, and erroneous data; and
a data cleaning unit (212) configured to perform, the data cleaning based on the identified correlations and relationships.

6. The system (108) as claimed in claim 5, wherein the one or more data sources (110) includes one or more network functions and wherein the data is received from the one or more data sources (110) on one of a receipt of a request and a continuous basis.

7. The system (108) as claimed in claim 5, wherein the system (108) comprises a retraining unit (214) configured to:
retrain, the model (220) based on updated data received from the one or more data sources (110);
identify, updated correlations and relationships within the updated data; and
refine, the data cleaning process based on the updated correlations and relationships within the updated data.

8. The system (108) as claimed in claim 7, wherein the updated data corresponds to a latest data received from the one or more data sources (110).

Documents

Application Documents

#	Name	Date
1	202321067377-STATEMENT OF UNDERTAKING (FORM 3) [07-10-2023(online)].pdf	2023-10-07
2	202321067377-PROVISIONAL SPECIFICATION [07-10-2023(online)].pdf	2023-10-07
3	202321067377-FORM 1 [07-10-2023(online)].pdf	2023-10-07
4	202321067377-FIGURE OF ABSTRACT [07-10-2023(online)].pdf	2023-10-07
5	202321067377-DRAWINGS [07-10-2023(online)].pdf	2023-10-07
6	202321067377-DECLARATION OF INVENTORSHIP (FORM 5) [07-10-2023(online)].pdf	2023-10-07
7	202321067377-FORM-26 [27-11-2023(online)].pdf	2023-11-27
8	202321067377-Proof of Right [12-02-2024(online)].pdf	2024-02-12
9	202321067377-DRAWING [06-10-2024(online)].pdf	2024-10-06
10	202321067377-COMPLETE SPECIFICATION [06-10-2024(online)].pdf	2024-10-06
11	Abstract.jpg	2024-12-07
12	202321067377-Power of Attorney [24-01-2025(online)].pdf	2025-01-24
13	202321067377-Form 1 (Submitted on date of filing) [24-01-2025(online)].pdf	2025-01-24
14	202321067377-Covering Letter [24-01-2025(online)].pdf	2025-01-24
15	202321067377-CERTIFIED COPIES TRANSMISSION TO IB [24-01-2025(online)].pdf	2025-01-24
16	202321067377-FORM 3 [31-01-2025(online)].pdf	2025-01-31