
System And Method For Normalizing Data

Abstract: A system (120) and a method (500) for normalizing data are disclosed. The system (120) includes a data source unit (220) configured to retrieve data from one or more data sources, a preprocessing unit (225) configured to perform an operation on the retrieved data, and a generating unit (230) configured to generate the normalized data based on the performed operation on the retrieved data. Ref. Fig. 2


Patent Information

Application #
Filing Date
06 October 2023
Publication Number
15/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

JIO PLATFORMS LIMITED
OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA

Inventors

1. Aayush Bhatnagar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
2. Ankit Murarka
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
3. Jugal Kishore
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
4. Chandra Ganveer
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
5. Sanjana Chaudhary
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
6. Gourav Gurbani
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
7. Yogesh Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
8. Avinash Kushwaha
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
9. Dharmendra Kumar Vishwakarma
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
10. Sajal Soni
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
11. Niharika Patnam
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
12. Harsh Poddar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
13. Shubham Ingle
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
14. Sanket Kumthekar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
15. Mohit Bhanwria
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
16. Shashank Bhushan
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
17. Vinay Gayki
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
18. Aniket Khade
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
19. Durgesh Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
20. Zenith Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
21. Gaurav Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
22. Manasvi Rajani
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
23. Kishan Sahu
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
24. Sunil Meena
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
25. Supriya Kaushik De
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
26. Kumar Debashish
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
27. Mehul Tilala
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
28. Satish Narayan
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
29. Rahul Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
30. Kunal Telgote
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
31. Ralph Lobo
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
32. Girish Dange
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
33. Harshita Garg
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

Specification

DESC:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003

COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD FOR NORMALIZING DATA
2. APPLICANT(S)
NAME: JIO PLATFORMS LIMITED
NATIONALITY: INDIAN
ADDRESS: OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3. PREAMBLE TO THE DESCRIPTION

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.

FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication networks and, more particularly, to a method and a system for normalizing data.
BACKGROUND OF THE INVENTION
[0002] With the increase in the number of users, network services have been upgraded to enhance service quality and keep pace with the high demand. For various purposes, such as improving network quality, managing traffic, delegating node allocation, and managing the performance of routing devices, many network elements, network functions, and micro-services are integrated into a network. A large amount of data is therefore generated by these services and network functions. Such data needs to be analyzed and assessed for purposes such as determining network health, ensuring security, and modifying management policy.
[0003] However, the extracted data may include irrelevant and unwanted content that hinders its direct use for Machine Learning (ML) training. This inconsistency and noise in the data make it challenging to perform effective ML training and analysis. The data therefore needs to be cleaned and normalized before it can be fed to an ML model for training. Presently, no mechanism is available by which such data cleaning can be performed as required.
[0004] Hence, there is a need for a system and method for cleaning and normalizing data in order to facilitate machine learning training. Such a method and system should produce clean, normalized data by performing suitable operations on the retrieved data.
SUMMARY OF THE INVENTION
[0005] One or more embodiments of the present disclosure provide a method and a system for normalizing data.
[0006] In one aspect of the present invention, the method for normalizing the data is disclosed. The method includes the step of retrieving, by one or more processors, data from one or more data sources. The method includes the step of performing, by the one or more processors, an operation on the retrieved data. The method includes the step of generating, by the one or more processors, normalized data based on performing the operation on the retrieved data.
[0007] In one embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
[0008] In another embodiment, the retrieved data is stored in a data frame.
[0009] In yet another embodiment, the operation is performed on one of columns and field values of the data frame. The operation is at least one of a substring extraction, a concatenation, a regular expression-based extraction, string pattern matching, and conditional replacements.
[0010] In yet another embodiment, the generated normalized data is stored in a storage unit and used for training one or more target models.
[0011] In yet another embodiment, the operation is selected by a user for each data source from the one or more data sources.
[0012] The step of performing, by the one or more processors, an operation on the retrieved data includes identifying one or more transformations for the retrieved data using a pre-trained model, and selecting the one or more operations using the pre-trained model based on the identified one or more transformations for the retrieved data.
[0013] In another aspect of the present invention, the system for normalizing the data is disclosed. The system includes a data source unit configured to retrieve data from one or more data sources. The system includes a preprocessing unit configured to perform an operation on the retrieved data. The system includes a generating unit configured to generate the normalized data based on performing the operation on the retrieved data.
[0014] In another aspect of the embodiment, a non-transitory computer-readable medium having stored thereon computer-readable instructions is disclosed. When executed by a processor, the instructions cause the processor to retrieve data from one or more data sources, perform an operation on the retrieved data, and generate normalized data based on performing the operation on the retrieved data.
[0015] In another aspect of the embodiment, a User Equipment (UE) is disclosed. One or more primary processors are communicatively coupled to the one or more processors. The one or more primary processors are coupled with a memory unit. The memory unit stores instructions which when executed by the one or more primary processors cause the UE to select the operation to perform on each data source from one or more data sources.
[0016] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0018] FIG. 1 is an exemplary block diagram of an environment for normalizing data, according to one or more embodiments of the present disclosure;
[0019] FIG. 2 is an exemplary block diagram of a system for normalizing the data, according to the one or more embodiments of the present disclosure;
[0020] FIG. 3 is a schematic representation of the system in which the operations of various entities are explained, according to the one or more embodiments of the present disclosure;
[0021] FIG. 4 is a block diagram of an architecture that can be implemented in the system of FIG.2, according to the one or more embodiments of the present disclosure;
[0022] FIG. 5 is a signal flow diagram illustrating a process for normalizing the data, according to the one or more embodiments of the present disclosure; and
[0023] FIG. 6 is a flow diagram illustrating the method for normalizing the data, according to the one or more embodiments of the present disclosure.
[0024] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0025] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0026] Various modifications to the embodiment will be readily apparent to those skilled in the art, and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure, including the definitions listed herein below, is not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0027] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0028] Referring to FIG. 1, FIG. 1 illustrates an exemplary block diagram of an environment 100 for normalizing data, according to one or more embodiments of the present invention. The environment 100 includes the network 105, a User Equipment (UE) 110, a server 115, and a system 120. The UE 110 aids a user to interact with the system 120 for normalizing the data. In an embodiment, the user is at least one of a network operator and a service provider. Normalizing data refers to cleaning the data by detecting and removing any inaccuracies or inconsistencies from a data set.
[0029] For the purpose of description and explanation, the description will be explained with respect to the UE 110, or to be more specific will be explained with respect to a first UE 110a, a second UE 110b, and a third UE 110c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the UE 110 from the first UE 110a, the second UE 110b, and the third UE 110c is configured to connect to the server 115 via the network 105. In an embodiment, each of the first UE 110a, the second UE 110b, and the third UE 110c is one of, but not limited to, any electrical, electronic, electro-mechanical or an equipment and a combination of one or more of the above devices such as smartphones, virtual reality (VR) devices, augmented reality (AR) devices, laptop, a general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device.
[0030] The network 105 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 105 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0031] The server 115 may include by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, some combination thereof. In an embodiment, the entity may include, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise, a defense facility, or any other facility that provides content.
[0032] The environment 100 further includes the system 120 communicably coupled to the server 115 and each of the first UE 110a, the second UE 110b, and the third UE 110c via the network 105. The system 120 is configured for normalizing the data. The system 120 is adapted to be embedded within the server 115 or is embedded as the individual entity, as per multiple embodiments of the present invention.
[0033] Operational and construction features of the system 120 will be explained in detail with respect to the following figures.
[0034] FIG. 2 is an exemplary block diagram of a system 120 for normalizing the data, according to one or more embodiments of the present disclosure.
The system 120 includes a processor 205, a memory 210, a user interface 215, and a storage unit 235. For the purpose of description and explanation, the description will be explained with respect to one or more processors 205, or to be more specific will be explained with respect to the processor 205, and should nowhere be construed as limiting the scope of the present disclosure. The one or more processors 205, hereinafter referred to as the processor 205, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0036] As per the illustrated embodiment, the processor 205 is configured to fetch and execute computer-readable instructions stored in the memory 210. The memory 210 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 210 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0037] The User Interface (UI) 215 includes a variety of interfaces, for example, interfaces for a Graphical User Interface (GUI), a web user interface, a Command Line Interface (CLI), and the like. The user interface 215 facilitates communication of the system 120. In one embodiment, the user interface 215 provides a communication pathway for one or more components of the system 120. Examples of the one or more components include, but are not limited to, the UE 110, and the storage unit 235. The term “storage unit” and “database” are used interchangeably hereinafter, without limiting the scope of the disclosure.
[0038] The storage unit 235 is one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a No-Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of storage unit 235 types are non-limiting and may not be mutually exclusive (e.g., a database can be both commercial and cloud-based, or both relational and open-source).
[0039] Further, the processor 205, in an embodiment, may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 205. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 205 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for processor 205 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 210 may store instructions that, when executed by the processing resource, implement the processor 205. In such examples, the system 120 may comprise the memory 210 storing the instructions and the processing resource to execute the instructions, or the memory 210 may be separate but accessible to the system 120 and the processing resource. In other examples, the processor 205 may be implemented by electronic circuitry.
[0040] In order for the system 120 to normalize the data, the processor 205 includes a data source unit 220, a preprocessing unit 225, and a generating unit 230 communicably coupled to each other. In an embodiment, the operations and functionalities of the data source unit 220, the preprocessing unit 225, and the generating unit 230 can be used in combination or interchangeably.
[0041] Initially, a request is transmitted by the user via the UI 215 for selecting an operation to perform on each data from a data frame. The data frame includes one of columns and field values. The data frame refers to a structured data set that is organized into rows and columns, similar to a table in the storage unit 235. Each column represents an attribute or field values (e.g., name, age, transaction ID), and each row contains a record with corresponding values for these attributes. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
[0042] Upon receiving the request, the data source unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the data source unit 220 fetches the data for processing. The data source unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing.
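The file-input retrieval described above can be sketched as follows. This is an illustrative Python sketch only, not part of the claimed system; the CSV payload and column names are hypothetical, and an in-memory stream stands in for a file on disk or on the server 115.

```python
import csv
import io

# Hypothetical CSV payload standing in for a file retrieved by the
# data source unit (220); in practice it would be read from disk or
# a server path.
csv_text = "device_id,device_type\nNET001,Mobile\nNET002,Router\n"

def load_rows(stream):
    """Load CSV rows into a list of dicts (a minimal 'data frame')."""
    return list(csv.DictReader(stream))

rows = load_rows(io.StringIO(csv_text))
# rows[0] is {'device_id': 'NET001', 'device_type': 'Mobile'}
```

A real deployment would pass a file handle (e.g., `open(path)`) instead of the `io.StringIO` wrapper used here.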
[0043] The source path typically refers to the directory or network location where the data files are stored. The data source unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The data source unit 220 navigates to a designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the data source unit 220 fetches in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the data source unit 220 retrieves the data from the web server using the HTTP 2. The data source unit 220 uses the HTTP 2 to fetch the data from remote web servers or APIs.
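The source-path retrieval in the embodiment above, where the data source unit navigates to a designated directory and collects all files matching a criterion (e.g., .jpg images), can be sketched as below. The directory and file names are hypothetical; a temporary directory is used so the sketch is self-contained.

```python
import pathlib
import tempfile

# Create a throwaway directory with a few hypothetical files so the
# sketch runs anywhere; a real system would use the configured source path.
with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        (root / name).touch()
    # Retrieve only the files matching the required criterion (*.jpg).
    images = sorted(p.name for p in root.glob("*.jpg"))
# images == ['a.jpg', 'b.jpg']  (the .txt file is filtered out)
```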
[0044] The DFS is a distributed file system used to store large datasets across multiple machines. The DFS is commonly used in big data environments to store and retrieve large amounts of data. The data source unit 220 connects to the DFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The data source unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the data source unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in the data frame for further processing.
[0045] Upon retrieving the data and storing it in the data frame, the preprocessing unit 225 is configured to perform an operation on the retrieved data. In an embodiment, the operation is performed on one of the columns and field values of the data frame. The preprocessing unit 225 retrieves the data, either the entire data frame or the specific columns and field values that need to be processed. In another embodiment, the operation is at least one of a substring extraction, a concatenation, a regular expression-based extraction, string pattern matching, and conditional replacements. If the operation is the substring extraction, the preprocessing unit 225 defines a start position and an end position of the substring to be extracted from the field value. The preprocessing unit 225 locates a portion of text within the field values (e.g., extracting part of an ID, date, or name). In an exemplary embodiment, if the field value is a full date string (e.g., "2024-09-19"), the preprocessing unit 225 extracts the month or day; the month is located at positions 5 and 6 in the date string. The name and ID columns are concatenated to create a new column, and the month extracted from the date column is stored in a new column. In another exemplary embodiment, to extract the first octet from an IP address, the user extracts characters from the start position (0) to the position of the first dot (.) in 192.168.1.10, which is found using string operations available in various programming languages. After performing the substring extraction, the first octet, such as 192, is used to classify IP addresses, aiding in network segmentation analysis. By extracting and analyzing the first octet, the user can assess whether the UE 110 is operating within expected ranges and identify potential misconfigurations. In a further exemplary embodiment, the username is extracted from an email address such as David Johnson@example.com (the part before the @ symbol), the last name (Johnson) is extracted from the full name column, and the first name (David) is concatenated with a department (HR) to create a new column containing a first name-department value (David-HR).
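The substring extractions described in the embodiment above (month from a date string, first octet from an IP address, username and last name from name fields) can be sketched with plain string operations. This is an illustrative sketch using the specification's own example values, not a definitive implementation.

```python
# Month occupies positions 5 and 6 of the date string "2024-09-19".
date_value = "2024-09-19"
month = date_value[5:7]                     # "09"

# First octet: characters from position 0 up to the first dot.
ip = "192.168.1.10"
first_octet = ip[: ip.index(".")]           # "192"

# Username: the part of the email before the @ symbol.
email = "David Johnson@example.com"
username = email.split("@")[0]

# Last name from the full name, and first name joined with a department.
full_name = "David Johnson"
last_name = full_name.split()[-1]           # "Johnson"
first_name_department = full_name.split()[0] + "-" + "HR"   # "David-HR"
```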
[0046] If the operation is the concatenation, the preprocessing unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns). In an exemplary embodiment, the user wants to create an identifier for each UE 110 by concatenating multiple fields. The user needs to create a new column, a device identifier, which includes a combination of a device ID, a device type, and a location in a single string. In an embodiment, the concatenation is performed using various programming languages. The device ID is a unique identifier for the device. The device type is a type of device, such as mobile, router, or switch. The location of the device includes a data center or a branch office. The unique device identifier for each network device is created by concatenating the fields. In an exemplary embodiment, the device ID is NET001, the device type is Mobile, and the location is Branch office 1; the concatenation result for the device identifier is NET001_Mobile_Branch office 1. After executing the concatenation, the device identifier column displays the identifier for each UE 110 as a combination of the device ID, the device type, and the location. The device identifier is used in monitoring dashboards for easy identification of the UEs 110.
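The concatenation embodiment above can be sketched as follows, using the specification's example field values (device ID NET001, device type Mobile, location Branch office 1) to build the device identifier.

```python
# Hypothetical field values from one row of the data frame.
device_id = "NET001"
device_type = "Mobile"
location = "Branch office 1"

# Concatenate the three fields with "_" to form the device identifier.
device_identifier = "_".join([device_id, device_type, location])
# device_identifier == "NET001_Mobile_Branch office 1"
```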
[0047] Upon performing the operation of the substring extraction and the concatenation, the preprocessing unit 225 updates the data frame with the modified data and prepares the modified data for pre-processing. The one or more data sources are then pre-processed on the field values of the updated data frame. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display, and includes data cleaning and data transformation. The data cleaning refers to removing or correcting any errors or inconsistencies in the data, which involves handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure; if the operation is concatenation, this involves combining different data elements into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enables a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
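The data-cleaning step above (handling missing values and removing duplicates) can be sketched as below. The records and the `"Unknown"` fill value are illustrative; a production pipeline would typically use a data frame library instead of plain dicts.

```python
# Hypothetical records: one duplicate row and one row with a missing name.
records = [
    {"name": "Alice", "age": 30},
    {"name": "Alice", "age": 30},   # duplicate
    {"name": None,    "age": 25},   # missing name
]

def clean(rows, fill="Unknown"):
    """Fill missing values and drop exact duplicate rows."""
    seen, out = set(), []
    for r in rows:
        fixed = {k: (v if v is not None else fill) for k, v in r.items()}
        key = tuple(sorted(fixed.items()))
        if key not in seen:
            seen.add(key)
            out.append(fixed)
    return out

cleaned = clean(records)
# cleaned has 2 rows; the missing name is replaced with "Unknown".
```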
[0048] The preprocessing unit 225 is further configured to detect and execute the operation by using Artificial Intelligence/Machine Learning (AI/ML) techniques. The system 120 utilizes AI/ML algorithms such as support vector machines (SVM) for language detection and tokenization, enabling operations like transliteration (e.g., converting "Москва" from Cyrillic script to "Moskva" in Latin script) or segmenting the sentence "Machine learning is fun" into individual tokens: ['Machine', 'learning', 'is', 'fun']. The system 120 automatically identifies missing or incomplete data and suggests optimal replacement values or patterns (e.g., recommending "Unknown" for missing names) using imputation algorithms like K-Nearest Neighbours (KNN) or mean/mode imputation. Additionally, the system 120 enriches the dataset by auto-detecting named entities using Named Entity Recognition (NER) algorithms (e.g., recognizing "New York" as a location) and parsing dates from text using rule-based or machine learning-based date extraction methods. The AI/ML can also suggest multi-column transformations, such as concatenating the "First Name" and "Last Name" fields to create a "Full Name" column, based on its analysis of the data's relationships and interactions using graph-based algorithms or relational data models. This ensures that preprocessing is efficient, relevant, and dynamically tailored to the dataset's specific needs.
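Two of the operations named above, tokenization and mean imputation, can be sketched in a few lines. This is a minimal illustration using the specification's example sentence; real systems would use trained models (SVM, KNN, NER) rather than these simple rules, and the numeric values are hypothetical.

```python
# Tokenization: segment the sentence into individual tokens.
sentence = "Machine learning is fun"
tokens = sentence.split()
# tokens == ['Machine', 'learning', 'is', 'fun']

# Mean imputation: replace missing numeric values with the column mean.
ages = [30, None, 25, None, 35]
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)          # (30 + 25 + 35) / 3 == 30.0
imputed = [a if a is None else a for a in ages]
imputed = [a if a is not None else mean_age for a in ages]
```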
[0049] The preprocessing unit 225 is further configured to identify one or more transformations for the retrieved data using a pre-trained model. The preprocessing unit 225 is further configured to select the one or more operations using the pre-trained model based on the identified one or more transformations for the retrieved data. The identification is based on contextual information associated with the retrieved data or a target model selected by the user using the UI 215. The pre-trained model is trained on historical patterns of the data's relationships and interactions using graph-based algorithms or relational data models. The preprocessing unit 225 employs the AI/ML algorithms, such as decision trees and clustering algorithms, to automatically detect and recommend dynamic operations tailored to the data's structure and content. The preprocessing unit 225 intelligently identifies the most relevant operations, such as substring extraction and concatenation, and suggests advanced transformations like regex-based extraction, string pattern matching, and conditional replacements based on the contextual analysis of field values. In an exemplary embodiment, using natural language processing (NLP) techniques, the system 120 can analyze the patterns in the data to recommend extracting the first 5 characters or concatenating columns based on the identified naming conventions. By leveraging contextual information through algorithms such as Long Short-Term Memory (LSTM) networks, the AI/ML can autonomously determine when to apply a specific operation, such as executing substring extraction based on the surrounding content. The AI/ML algorithms ensure that the most effective transformation is applied with minimal user intervention.
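A toy heuristic conveys the idea of recommending operations from column context. The rules below are assumptions standing in for the pre-trained model described above; real recommendations would come from learned patterns, not hand-written conditions.

```python
# Toy stand-in for operation recommendation: suggest transformations
# from column names alone. The rules are illustrative assumptions.

def suggest_operations(columns):
    suggestions = []
    # naming-convention rule: first/last name columns suggest concatenation
    if "First Name" in columns and "Last Name" in columns:
        suggestions.append("concatenate 'First Name' + 'Last Name' -> 'Full Name'")
    # naming-convention rule: identifier columns suggest substring extraction
    for col in columns:
        if col.lower().endswith("_id"):
            suggestions.append(f"substring extraction on '{col}'")
    return suggestions

print(suggest_operations(["First Name", "Last Name", "txn_id"]))
```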
[0050] Upon pre-processing the one or more data sources, the generating unit 230 is configured to generate the normalized data based on performing the operation on the retrieved data. The generating unit 230 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts the data to a standard format or scale, making the data easier to analyze and compare. The normalization is a process to standardize the data, which is one of the columns and field values of the data frame. The one or more data sources are tokenized, and each token is converted into its base form. The token refers to a unit of text obtained by splitting the text into individual tokens, typically words or phrases. Each token represents a unit of meaning in the text. In the normalization process, the inflectional form of a word is converted to the base form. In an exemplary embodiment, the words can appear in various forms due to tense, plurality, or other grammatical structures. The normalization process removes the inflectional form to return each token to the base form. Converting words to the base form significantly reduces the number of distinct tokens in the dataset. In an example, the words such as run, running, and ran get normalized to run, which removes the variations in the text and also cleans the text by removing redundant data. Upon completing the normalization, the generated normalized data is stored in the storage unit 235 and used for training one or more target models. The one or more target models refer to the specific application or task to be achieved by the trained model in real time.
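The run/running/ran example above can be sketched with a small lookup table mapping inflected forms to a base form. The table is a hand-built assumption; a production system would use a proper lemmatizer or stemmer to cover the full vocabulary.

```python
# Minimal token normalization sketch: map inflected forms to a common
# base form, reducing the number of distinct tokens. Illustrative only.

BASE_FORMS = {
    "running": "run", "ran": "run", "runs": "run",
    "cars": "car",
}

def normalize(tokens):
    """Lowercase each token and replace known inflected forms."""
    return [BASE_FORMS.get(t.lower(), t.lower()) for t in tokens]

tokens = ["run", "running", "ran"]
print(normalize(tokens))            # ['run', 'run', 'run']
print(len(set(normalize(tokens))))  # distinct tokens reduced to 1
```

Collapsing three surface forms to one base form is exactly the reduction in unique tokens that the paragraph credits with cleaner, less redundant training text.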
[0051] Upon cleaning the data, the valid data is stored and transmitted to the ML training unit 415 (as shown in FIG. 4) for training. Further, the data normalization reduces the number of unique tokens present in the text, removing the variations in the text and also cleaning the text by removing redundant data. It helps to improve the accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the trained model. The normalization process ensures that the valid data is retained for model training. The normalized data enhances the performance of the ML training unit 415. The trained data is used for efficient information retrieval and knowledge extraction, enhancing the effectiveness of the system 120.
[0052] FIG. 3 is a schematic representation of the system 120 in which various entities operations are explained, according to one or more embodiments of the present disclosure. Referring to FIG. 3, it describes the system 120 for normalizing the data. It is to be noted that the embodiment with respect to FIG. 3 will be explained with respect to the first UE 110a for the purpose of description and illustration and should not be construed as limiting the scope of the present disclosure.
[0053] As mentioned earlier in FIG. 1, in an embodiment, the first UE 110a may encompass electronic apparatuses. These devices are illustrative of, but not restricted to, personal computers, laptops, tablets, smartphones, or other devices enabled for web connectivity. The scope of the first UE 110a explicitly extends to a broad spectrum of electronic devices capable of executing computing operations and accessing networked resources, thereby providing users with a versatile range of functionalities for both personal and professional applications. This embodiment acknowledges the evolving nature of electronic devices and their integral role in facilitating access to digital services and platforms. In an embodiment, the first UE 110a can be associated with multiple users. Each UE 110 is communicatively coupled with the processor 205 via the network 105.
[0054] The first UE 110a includes one or more primary processors 305 communicably coupled to the one or more processors 205 of the system 120. The one or more primary processors 305 are coupled with a memory unit 310 storing instructions which are executed by the one or more primary processors 305. Execution of the stored instructions by the one or more primary processors 305 enables the first UE 110a to select the operation to perform on each data from the one or more data sources.
[0055] Furthermore, the one or more primary processors 305 within the UE 110 are uniquely configured to execute a series of steps as described herein. This configuration underscores the processor 205 capability to normalize the data. The operational synergy between the one or more primary processors 305 and the additional processors, guided by the executable instructions stored in the memory unit 310, facilitates the normalized data.
[0056] As mentioned earlier in FIG.2, the system 120 includes the one or more processors 205, the memory 210, and the user interface 215. The operations and functions of the one or more processors 205, the memory 210, and the user interface 215 are already explained in FIG. 2. For the sake of brevity, a similar description related to the working and operation of the system 120 as illustrated in FIG. 2 has been omitted to avoid repetition.
[0057] Further, the processor 205 includes the data source unit 220, the preprocessing unit 225, and the generating unit 230. The operations and functions of the data source unit 220, the preprocessing unit 225, and the generating unit 230 are already explained in FIG. 2. Hence, for the sake of brevity, a similar description related to the working and operation of the system 120 as illustrated in FIG. 2 has been omitted to avoid repetition. The limited description provided for the system 120 in FIG. 3, should be read with the description provided for the system 120 in the FIG. 2 above, and should not be construed as limiting the scope of the present disclosure.
[0058] FIG. 4 is a block diagram of an architecture 300 that can be implemented in the system of FIG. 2, according to one or more embodiments of the present disclosure. The architecture 300 of the system 120 includes an integrated system 405, a load balancer 410, and the processor 205. The processor 205 includes the preprocessing unit 225, the data source unit 220, and the Machine Learning (ML) training unit 415.
[0059] The architecture 300 of the system 120 is configured to interact with the integrated system 405 and the load balancer 410. The integrated system 405 is configured to access the data in the network 105 and is capable of interacting with the server 115 and the storage unit 235 to collect the data. In an embodiment, the integrated system 405 includes, but is not limited to, the one or more data sources from which the data can be retrieved. In an embodiment, the data can be retrieved as the file input, the source path, the input stream, the HTTP 2, the DFS, and the NAS.
[0060] The load balancer 410 distributes the request traffic of the one or more data sources across the one or more processors 205. The distribution of the one or more data source request traffic helps in managing and optimizing the workload, ensuring that no single processor is overwhelmed while improving overall system performance and reliability.
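One common distribution policy consistent with the description above is round-robin rotation, sketched below. The policy choice and the processor names are illustrative assumptions; real load balancers add health checks and weighting.

```python
# Round-robin sketch of distributing data-source requests across
# processors so that no single processor is overwhelmed. Illustrative.
import itertools

class RoundRobinBalancer:
    def __init__(self, processors):
        self._cycle = itertools.cycle(processors)

    def route(self, request):
        """Assign the request to the next processor in rotation."""
        return (next(self._cycle), request)

lb = RoundRobinBalancer(["proc-1", "proc-2"])
print([lb.route(r)[0] for r in ("a", "b", "c", "d")])
# ['proc-1', 'proc-2', 'proc-1', 'proc-2']
```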
[0061] The processor 205 normalizes the one or more data sources by converting the token into its base form. In the normalization process, the inflectional form of word is removed so that the base form is obtained.
[0062] The preprocessing unit 225 is configured to perform the operation on the retrieved data. In an embodiment, the operation is performed on one of columns and field values of the data frame. The preprocessing unit 225 retrieves the data, either the entire data frame or specific columns and field values that need to be processed. In another embodiment, the operation is at least one of a substring extraction, a concatenation, a regular expression-based extraction, string pattern matching, and conditional replacements. If the operation is the substring extraction, the preprocessing unit 225 defines a start position and an end position of the substring to be extracted from the field value. The preprocessing unit 225 locates the portion of text within the field values (e.g., extracting part of an ID, date, or name). If the operation is the concatenation, the preprocessing unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns).
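The substring extraction and concatenation operations described above can be sketched over a simple dict-of-lists stand-in for a data frame. The column names and sample values are illustrative assumptions.

```python
# Sketch of the two column-level operations: substring extraction with
# start/end positions, and concatenation of two columns. Illustrative.

def substring_extract(values, start, end):
    """Extract the [start, end) slice from each field value."""
    return [v[start:end] for v in values]

def concatenate(left, right, sep=" "):
    """Merge two columns into a single combined column."""
    return [f"{a}{sep}{b}" for a, b in zip(left, right)]

frame = {
    "txn_id": ["TXN-2024-001", "TXN-2024-002"],
    "first": ["Ada", "Alan"],
    "last": ["Lovelace", "Turing"],
}

print(substring_extract(frame["txn_id"], 4, 8))    # ['2024', '2024']
print(concatenate(frame["first"], frame["last"]))  # ['Ada Lovelace', 'Alan Turing']
```

After either operation, the modified columns would be written back into the frame, matching the update step the following paragraph describes.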
[0063] Upon performing the operation of the substring extraction and the concatenation, the pre-processing unit 225 updates the data frame with the modified data and prepares the modified data for pre-processing. The pre-processing unit 225 then pre-processes the field values of the one or more data sources using the updated data frame. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data cleaning and data transformation. If the operation is the concatenation, different data elements are combined into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enables a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
[0064] The data source unit 220 is configured to update the data retrieved post preprocessing and store the updated data source in the data frame. The ML training unit 415 is configured to train the one or more data sources using one or more machine learning algorithms. Due to the operation of the substring extraction and concatenation, the ML training unit 415 generates high quality data with better consistency. Further, the ML training unit 415 understands the structure of the data and learns the patterns, which helps in generating more accurate and contextually appropriate responses.
[0065] FIG. 5 is a signal flow diagram illustrating a process for normalizing the data, according to one or more embodiments of the present disclosure.
[0066] At 505, initially, a user transmits the request via the UI 215 for selecting the operation to perform on each data from a data frame that includes one of columns and field values of the data frame. The data frame refers to a structured data set that is organized into rows and columns, similar to a table in the storage unit 235. Each column represents an attribute or field values (e.g., name, age, transaction ID), and each row contains a record with corresponding values for these attributes. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
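The row-and-column structure described above can be sketched as a dict of equal-length column lists, with a record read by indexing each column. This is a simplified stand-in; the attribute names and values are illustrative.

```python
# Sketch of a data frame as columns of equal length; row i is the
# record formed by taking element i of every column. Illustrative.

frame = {
    "name": ["Alice", "Bob"],
    "age": [30, 25],
    "transaction_id": ["T001", "T002"],
}

def get_row(frame, i):
    """Return record i as an attribute -> value mapping."""
    return {col: values[i] for col, values in frame.items()}

print(get_row(frame, 0))  # {'name': 'Alice', 'age': 30, 'transaction_id': 'T001'}
```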
[0067] At 510, the data source unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least file input, source path, input stream, Hyper Text Transfer Protocol 2 (HTTP 2), Distributed File System (DFS), and Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), eXtensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the data source unit 220 fetches the data for processing. The data source unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing.
[0068] The source path typically refers to the directory or network location where the data files are stored. The data source unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The data source unit 220 navigates to the designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the data source unit 220 fetches it in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the data source unit 220 retrieves the data from the web server using the HTTP 2. The data source unit 220 uses HTTP 2 to fetch the data from remote web servers or APIs.
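The file-input and source-path retrieval modes described above can be sketched with the standard library. The CSV content and the glob pattern are illustrative assumptions, not paths from the disclosed system.

```python
# Sketch of two retrieval modes: file input (parsing CSV rows) and
# source path (collecting files matching a criteria pattern).
import csv
import glob
import io

def read_csv_input(stream):
    """File input: parse CSV rows into attribute -> value dictionaries."""
    return list(csv.DictReader(stream))

def list_source_path(pattern):
    """Source path: collect files matching the required criteria."""
    return sorted(glob.glob(pattern))

rows = read_csv_input(io.StringIO("name,age\nAlice,30\nBob,25\n"))
print(rows)  # [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
print(list_source_path("images/*.jpg"))  # matching files, e.g. [] if none
```

An input-stream mode would differ mainly in reading records incrementally as they arrive rather than loading a whole file.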
[0069] At 515, the preprocessing unit 225 is configured to perform the operation on the retrieved data. In an embodiment, the operation is performed on one of columns and field values of the data frame. The preprocessing unit 225 retrieves the data, either the entire data frame or specific columns and field values that need to be processed. In another embodiment, the operation is at least one of the substring extraction and the concatenation. If the operation is the substring extraction, the preprocessing unit 225 defines the start position and the end position of the substring to be extracted from the field value. The preprocessing unit 225 locates a portion of text within the field values (e.g., extracting part of an ID, date, or name). If the operation is the concatenation, the preprocessing unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns). Upon performing the operation of the substring extraction and the concatenation, the preprocessing unit 225 updates the data frame with the modified data and prepares the modified data for pre-processing.
[0070] The pre-processing unit 225 pre-processes the one or more data sources on the field values. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data cleaning and data transformation. The data cleaning refers to removing or correcting any errors or inconsistencies in the data. This could involve handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure. If the operation is the concatenation, different data elements are combined into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enables a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
[0071] At 520, upon pre-processing the one or more data sources, the generating unit 230 is configured to generate the normalized data based on performing the operation on the retrieved data. The generating unit 230 is responsible for transforming the pre-processed data into the normalized format. The normalization is a process that adjusts the data to the standard format or scale, making the data easier to analyze and compare. The generated normalized data is stored in the storage unit 235 and used for training one or more target models. Upon cleaning the data, the valid data is stored and transmitted to the ML training unit 415 (as shown in FIG. 4) for training. Further, the data normalization reduces the number of unique tokens present in the text, removing the variations in the text and also cleaning the text by removing redundant data. It helps to improve the accuracy, reduce the time and resources required to train the ML model, prevent overfitting, and improve the interpretability of the ML model. The normalization process ensures that the valid data is retained for ML training. The normalized data enhances the performance of the ML training unit 415. The trained data is used for efficient information retrieval and knowledge extraction, enhancing the effectiveness of the system 120.
[0072] FIG. 6 is a flow diagram illustrating a method 600 for normalizing the data, according to one or more embodiments of the present disclosure.
[0073] At step 605, the method 600 includes the step of retrieving the data from the one or more data sources by the data source unit 220. In an embodiment, the one or more data sources include at least the file input, the source path, the input stream, the Hyper Text Transfer Protocol 2 (HTTP 2), the Distributed File System (DFS), and the Network Attached Storage (NAS). Upon retrieving the data, the retrieved data is stored in the data frame for further processing.
[0074] At step 610, the method 600 includes the step of performing the operation on the retrieved data by the preprocessing unit 225. In an embodiment, the operation is performed on one of columns and field values of the data frame. The preprocessing unit 225 retrieves the data, either the entire data frame or specific columns and field values that need to be processed. In another embodiment, the operation is at least one of the substring extraction, the concatenation, the regular expression-based extraction, the string pattern matching, and the conditional replacements. If the operation is the substring extraction, the preprocessing unit 225 defines the start position and the end position of the substring to be extracted from the field value. The preprocessing unit 225 locates the portion of text within the field values (e.g., extracting part of the ID, date, or name). If the operation is the concatenation, the preprocessing unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns).
[0075] Upon performing the operation of the substring extraction and the concatenation, the preprocessing unit 225 updates the data frame with the modified data and prepares the modified data for pre-processing. The preprocessing unit 225 then pre-processes the field values of the one or more data sources using the updated data frame. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data cleaning and data transformation. The data cleaning refers to removing or correcting any errors or inconsistencies in the data. This could involve handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure. If the operation is the concatenation, different data elements are combined into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enables a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
[0076] At step 615, the method 600 includes the step of generating the normalized data based on performing the operation on the retrieved data by the generating unit 230. The generating unit 230 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts the data to a standard format or scale, making the data easier to analyze and compare. The normalization is a process to clean the data, which is one of the columns and field values of the data frame. The one or more data sources are normalized by converting each token into a common base form. The token refers to a unit of text obtained by splitting the text into individual tokens, typically words or phrases. Each token represents a unit of meaning in the text. In the normalization process, the inflectional form of a word is removed so that the base form is obtained. In an exemplary embodiment, the words can appear in various forms due to tense, plurality, or other grammatical structures (e.g., running → run or cars → car). The normalization process removes the inflections to return each token to its base form. Converting words to the base form significantly reduces the number of distinct tokens in the dataset. In an example, the words such as run, running, and ran get normalized to run, which removes the variations in the text and also cleans the text by removing redundant data. Upon completing the normalization, the generated normalized data is stored in the storage unit 235 and used for training one or more target models.
[0077] Upon cleaning the data, the valid data is stored and transmitted to the ML training unit 415 to train the data. Further, the data normalization reduces the number of unique tokens present in the text, removing the variations in the text and also cleaning the text by removing redundant data. It helps to improve the accuracy, reduce the time and resources required to train the ML model, prevent overfitting, and improves the interpretability of the ML model. The normalization process ensures that the valid data is retained for ML training. The normalized data enhances the performance of the ML training unit 415. The trained data is used for efficient information retrieval and knowledge extraction, enhancing the effectiveness of the system 120.
[0078] In another aspect of the embodiment, a non-transitory computer-readable medium has stored thereon computer-readable instructions that, when executed by a processor 205, cause the processor 205 to perform the following. The processor 205 is configured to retrieve data from one or more data sources. The processor 205 is configured to perform an operation on the retrieved data. The processor 205 is configured to generate the normalized data based on performing the operation on the retrieved data.
[0079] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIGS.1-6) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0080] The present disclosure provides technical advancement for normalizing the data, which removes unwanted rows and helps in increasing the quality, quantity, and diversity of the training data, and reduces inflectional forms of the text, and sometimes derivationally related forms of the word, to a common base form. The present disclosure performs data normalization using the operations, which reduces the number of unique tokens present in the text, removes the variations in the text, and also cleans the text by removing redundant data. With fewer unique tokens, the ML training unit can be trained more efficiently, which helps to improve the accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the ML model. The normalization process ensures that the valid data is retained for ML training.
[0081] The present invention offers multiple advantages over the prior art and the above listed are a few examples to emphasize on some of the advantageous features. The listed advantages are to be read in a non-limiting manner.

REFERENCE NUMERALS

[0082] Environment - 100
[0083] Network - 105
[0084] User equipment - 110
[0085] Server - 115
[0086] System - 120
[0087] Processor - 205
[0088] Memory - 210
[0089] User interface - 215
[0090] Data source unit - 220
[0091] Preprocessing unit - 225
[0092] Generating unit - 230
[0093] Storage unit - 235
[0094] Integrated system - 405
[0095] Load balancer - 410
[0096] Data source unit - 315
[0097] ML training unit - 415
CLAIMS
We Claim:
1. A method (500) of normalizing data, the method (500) comprising the steps of:
retrieving, by one or more processors (205), data from one or more data sources;
performing, by the one or more processors (205), an operation on the retrieved data; and
generating, by the one or more processors (205), normalized data based on performing the operation on the retrieved data.

2. The method (500) as claimed in claim 1, wherein the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).

3. The method (500) as claimed in claim 1, wherein the retrieved data is stored in a data frame.

4. The method (500) as claimed in claim 1, wherein the operation is performed on one of columns and field values of the data frame, wherein the operation is at least one of a substring extraction, a concatenation, a regular expression-based extraction, string pattern matching, and conditional replacements.

5. The method (500) as claimed in claim 1, wherein the generated normalized data is stored in a storage unit (235) and used for one or more target models training.

6. The method (500) as claimed in claim 1, wherein the operation is selected by a user for each data source from the one or more data sources.

7. The method (500) as claimed in claim 1, wherein the step of performing, by the one or more processors (205), an operation on the retrieved data comprises the steps of:
identifying one or more transformations for the retrieved data using a pre-trained model; and
selecting the one or more operations using the pre-trained model based on the identified one or more transformations for the retrieved data.

8. A system (120) for normalizing data, the system (120) comprises:
a data source unit (220) configured to retrieve data from one or more data sources;
a preprocessing unit (225) configured to perform an operation on the retrieved data; and
a generating unit (230) configured to generate normalized data based on performing the operation on the retrieved data.

9. The system (120) as claimed in claim 8, wherein the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).

10. The system (120) as claimed in claim 8, wherein the retrieved data is stored in a data frame.

11. The system (120) as claimed in claim 8, wherein the operation is performed on one of columns and field values of the data frame, wherein the operation is at least one of a substring extraction, a concatenation, a regular expression-based extraction, string pattern matching, and conditional replacements.

12. The system (120) as claimed in claim 8, wherein the generated normalized data is stored in a storage unit (235) and used for one or more target models training.

13. The system (120) as claimed in claim 8, wherein the operation is selected by a user for each data source from the one or more data sources.

14. The system (120) as claimed in claim 8, wherein the preprocessing unit (225) is further configured to:
identify one or more transformations for the retrieved data using a pre-trained model; and
select the one or more operations using the pre-trained model based on the identified one or more transformations for the retrieved data.

15. A User Equipment (UE) (110), comprising:
one or more primary processors (305) communicatively coupled to the one or more processors (205), the one or more primary processors (305) coupled with a memory unit (310), wherein said memory unit (310) stores instructions which when executed by the one or more primary processors (305) cause the UE (110) to:
select an operation to perform on each data source from the one or more data sources; and
wherein the one or more processors (205) are configured to perform the steps as claimed in claim 1.

Documents

Application Documents

# Name Date
1 202321067263-STATEMENT OF UNDERTAKING (FORM 3) [06-10-2023(online)].pdf 2023-10-06
2 202321067263-PROVISIONAL SPECIFICATION [06-10-2023(online)].pdf 2023-10-06
3 202321067263-FORM 1 [06-10-2023(online)].pdf 2023-10-06
4 202321067263-FIGURE OF ABSTRACT [06-10-2023(online)].pdf 2023-10-06
5 202321067263-DRAWINGS [06-10-2023(online)].pdf 2023-10-06
6 202321067263-DECLARATION OF INVENTORSHIP (FORM 5) [06-10-2023(online)].pdf 2023-10-06
7 202321067263-FORM-26 [27-11-2023(online)].pdf 2023-11-27
8 202321067263-Proof of Right [12-02-2024(online)].pdf 2024-02-12
9 202321067263-DRAWING [06-10-2024(online)].pdf 2024-10-06
10 202321067263-COMPLETE SPECIFICATION [06-10-2024(online)].pdf 2024-10-06
11 Abstract.jpg 2024-12-07