
System And Method For Normalizing Data

Abstract: A system (120) and a method (500) for normalizing data are disclosed. The system (120) includes a retrieving unit (220) configured to retrieve data from one or more data sources. The system (120) includes an applying unit (225) configured to apply a string operation filter on the retrieved data to clean the retrieved data. The system (120) includes a normalizing unit (230) configured to normalize the retrieved data via at least one normalization technique. Ref. Fig. 2


Patent Information

Application #
Filing Date
11 October 2023
Publication Number
16/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

JIO PLATFORMS LIMITED
JIO PLATFORMS LIMITED, WHOSE ADDRESS IS: OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA

Inventors

1. Aayush Bhatnagar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
2. Ankit Murarka
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
3. Jugal Kishore
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
4. Chandra Ganveer
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
5. Sanjana Chaudhary
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
6. Gourav Gurbani
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
7. Yogesh Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
8. Avinash Kushwaha
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
9. Dharmendra Kumar Vishwakarma
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
10. Sajal Soni
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
11. Niharika Patnam
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
12. Shubham Ingle
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
13. Harsh Poddar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
14. Sanket Kumthekar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
15. Mohit Bhanwria
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
16. Shashank Bhushan
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
17. Vinay Gayki
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
18. Aniket Khade
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
19. Durgesh Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
20. Zenith Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
21. Gaurav Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
22. Manasvi Rajani
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
23. Kishan Sahu
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
24. Sunil Meena
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
25. Supriya Kaushik De
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
26. Kumar Debashish
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
27. Mehul Tilala
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
28. Satish Narayan
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
29. Rahul Kumar
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
30. Harshita Garg
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
31. Kunal Telgote
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
32. Ralph Lobo
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India
33. Girish Dange
Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

Specification

DESC:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003

COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD FOR NORMALIZING DATA
2. APPLICANT(S)
NAME NATIONALITY ADDRESS
JIO PLATFORMS LIMITED INDIAN OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3.PREAMBLE TO THE DESCRIPTION

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.

FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication networks, and more particularly, to a method and a system for normalizing data.
BACKGROUND OF THE INVENTION
[0002] Machine Learning (ML) plays a significant role in various technology applications, including telecommunications. During training of a Machine Learning (ML) model, it is preferable to avoid training the ML model on irrelevant or unwanted data. For example, the irrelevant or unwanted data may include data which is not clean, or data having some errors. As a result of the irrelevant or unwanted data used during the training, the ML model is unable to deliver optimal results. Further, the ML model may require training for longer durations, thereby consuming more system resources. As such, the overall processing capabilities of the system are affected.
[0003] Hence, there is a need for a system and a method for cleaning and normalizing data in order to facilitate machine learning training. The proposed method and system should ensure the cleaning and normalization of data by implementing a string operation filter.
SUMMARY OF THE INVENTION
[0004] One or more embodiments of the present disclosure provide a method and a system for normalizing data.
[0005] In one aspect of the present invention, the method for normalizing the data is disclosed. The method includes the step of retrieving, by one or more processors, data from one or more data sources. The method includes the step of applying, by the one or more processors, a string operation filter on the retrieved data to clean the retrieved data. The method includes the step of normalizing, by the one or more processors, the retrieved data via at least one normalization technique.
[0006] In one embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
[0007] In yet another embodiment, the string operation filter includes substring operation and concatenation operation.
[0008] In yet another embodiment, on retrieving the data, the method comprises the step of performing, by the one or more processors, operations pertaining to scaling, normalization, cleaning, and lowercasing of the data to ensure uniformity and consistency of the data.
[0009] In yet another embodiment, the at least one normalization technique is one of a stemming technique and a lemmatization technique. The normalization technique is identified utilizing historic and current data retrieved from the one or more data sources.
[0010] In yet another embodiment, the normalized data is stored in a storage unit and used for training a model.
[0011] In another aspect of the present invention, the system for normalizing the data is disclosed. The system includes a retrieving unit configured to retrieve data from one or more data sources. The system includes an applying unit configured to apply a string operation filter on the retrieved data to clean the retrieved data. The system includes a normalizing unit configured to normalize the retrieved data via at least one normalization technique.
[0012] In another aspect of the present invention, a non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to perform operations is disclosed. The processor is configured to retrieve data from one or more data sources. The processor is configured to apply a string operation filter on the retrieved data to clean the retrieved data. The processor is configured to normalize the retrieved data via at least one normalization technique.
[0013] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0015] FIG. 1 is an exemplary block diagram of an environment for normalizing data, according to one or more embodiments of the present disclosure;
[0016] FIG. 2 is an exemplary block diagram of a system for normalizing the data, according to the one or more embodiments of the present disclosure;
[0017] FIG. 3 is a block diagram of an architecture that can be implemented in the system of FIG.2, according to the one or more embodiments of the present disclosure;
[0018] FIG. 4 is a signal flow diagram illustrating a process for normalizing the data, according to the one or more embodiments of the present disclosure; and
[0019] FIG. 5 is a flow diagram illustrating the method for normalizing the data, according to the one or more embodiments of the present disclosure.
[0020] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0021] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0022] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure including the definitions listed here below are not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0023] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0024] Referring to FIG. 1, FIG. 1 illustrates an exemplary block diagram of an environment 100 for normalizing data, according to one or more embodiments of the present invention. The environment 100 includes a network 105, a User Equipment (UE) 110, a server 115, and a system 120. The UE 110 aids a user to interact with the system 120 for normalizing the data. In an embodiment, the user is at least one of, a network operator, and a service provider. The normalization of data refers to the process of organizing data within a storage unit 240 (as shown in FIG.2) to reduce redundancy such as reducing duplicate data across tables, and improving data integrity, which ensures accuracy and consistency of the data.
[0025] For the purpose of description and explanation, the description will be explained with respect to the UE 110, or to be more specific will be explained with respect to a first UE 110a, a second UE 110b, and a third UE 110c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the UE 110 from the first UE 110a, the second UE 110b, and the third UE 110c is configured to connect to the server 115 via the network 105. In an embodiment, each of the first UE 110a, the second UE 110b, and the third UE 110c is one of, but not limited to, any electrical, electronic, electro-mechanical or an equipment and a combination of one or more of the above devices such as smartphones, virtual reality (VR) devices, augmented reality (AR) devices, laptop, a general-purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computing device.
[0026] The network 105 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 105 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0027] The server 115 may include by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, some combination thereof. In an embodiment, the entity may include, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise, a defense facility, or any other facility that provides content.
[0028] The environment 100 further includes the system 120 communicably coupled to the server 115 and each of the first UE 110a, the second UE 110b, and the third UE 110c via the network 105. The system 120 is configured for normalizing the data. The system 120 is adapted to be embedded within the server 115 or is embedded as the individual entity, as per multiple embodiments of the present invention.
[0029] Operational and construction features of the system 120 will be explained in detail with respect to the following figures.
[0030] FIG. 2 is an exemplary block diagram of the system 120 for normalizing the data, according to one or more embodiments of the present disclosure.
[0031] The system 120 includes a processor 205, a memory 210, a user interface 215, and a storage unit 240. For the purpose of description and explanation, the description will be explained with respect to one or more processors 205, or to be more specific will be explained with respect to the processor 205, and should nowhere be construed as limiting the scope of the present disclosure. The one or more processors 205, hereinafter referred to as the processor 205, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0032] As per the illustrated embodiment, the processor 205 is configured to fetch and execute computer-readable instructions stored in the memory 210. The memory 210 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 210 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0033] The User Interface (UI) 215 includes a variety of interfaces, for example, interfaces for a Graphical User Interface (GUI), a web user interface, a Command Line Interface (CLI), and the like. The user interface 215 facilitates communication of the system 120. In one embodiment, the user interface 215 provides a communication pathway for one or more components of the system 120. Examples of the one or more components include, but are not limited to, the UE 110, and the storage unit 240.
[0034] The storage unit 240 is one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a No-Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of storage unit 240 types are non-limiting and may not be mutually exclusive, e.g., a database can be both commercial and cloud-based, or both relational and open-source, etc.
[0035] Further, the processor 205, in an embodiment, may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 205. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 205 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for processor 205 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 210 may store instructions that, when executed by the processing resource, implement the processor 205. In such examples, the system 120 may comprise the memory 210 storing the instructions and the processing resource to execute the instructions, or the memory 210 may be separate but accessible to the system 120 and the processing resource. In other examples, the processor 205 may be implemented by electronic circuitry.
[0036] In order for the system 120 to normalize the data, the processor 205 includes a retrieving unit 220, an applying unit 225, a normalizing unit 230, and a handling unit 235, communicably coupled to each other. In an embodiment, operations and functionalities of the retrieving unit 220, the applying unit 225, the normalizing unit 230, and the handling unit 235 can be used in combination or interchangeably.
[0037] Initially, a request is transmitted by the user via the UI 215 for cleaning and normalization of the data from a data frame. The data frame includes columns and field values. The data frame refers to a structured data set that is organized into rows and columns, similar to a table in the storage unit 240. Each column represents an attribute or field values (e.g., name, age, transaction ID), and each row contains a record with corresponding values for these attributes. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for training a model.
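By way of illustration only, a data frame of the kind described above may be constructed as follows; the column names and values (name, age, transaction_id) are assumptions introduced for this sketch and are not part of the disclosure.

```python
import pandas as pd

# Hypothetical data frame: each column is an attribute, each row is a record (data point).
df = pd.DataFrame({
    "name":           ["Alice", "Bob", "Charlie"],
    "age":            [34, 29, 41],
    "transaction_id": ["TXN-001", "TXN-002", "TXN-003"],
})
print(df)
```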
[0038] Upon receiving the request, the retrieving unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing.
[0039] The source path typically refers to the directory or network location where the data files are stored. The retrieving unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The retrieving unit 220 navigates to a designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the retrieving unit 220 fetches the data in real time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the retrieving unit 220 retrieves the data from the web server using the HTTP 2. The retrieving unit 220 uses the HTTP 2 to fetch the data from remote web servers or APIs.
[0040] The DFS is a distributed file system used to store large datasets across multiple machines. The DFS is commonly used in big data environments to store and retrieve large amounts of data. The retrieving unit 220 connects to the DFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The retrieving unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the retrieving unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in the data frame for further processing.
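A minimal sketch of data retrieval from two of the data sources listed above, the file input and the source path, is given below; the file name network_logs.csv and the directory /data/images are hypothetical, and pandas and glob are assumed merely as one possible implementation.

```python
import glob
import pandas as pd

# File input: read a locally stored CSV file into a data frame (file name is hypothetical).
df = pd.read_csv("network_logs.csv")

# Source path: retrieve all files in a designated directory that match a criterion (e.g., .jpg images).
image_paths = glob.glob("/data/images/*.jpg")

print(len(df), "records loaded;", len(image_paths), "images found")
```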
[0041] In another embodiment, the retrieved data includes historic and current data. Upon retrieving the data and storing it in the data frame, the identification unit 315 (as shown in FIG. 3) is configured to identify a normalization and cleaning process for the retrieved data with the help of the historic data and the current data. The historic data refers to the data that has been collected over a period of time in the past. The historic data typically includes records of events, transactions, measurements, or observations. The historic data involves timestamping, trend analysis, and model training. The timestamping refers to the fact that each entry usually includes a timestamp indicating when the data was collected. The trend analysis is useful for analysing trends, patterns, and changes over time. The model training is used in training predictive models to identify the historical trends that might influence future outcomes. The current data refers to the data that is actively being collected and reflects the most recent observations or events, such as real-time data. The current data refers to real-time data, which is collected continuously or at regular intervals, providing up-to-date information. The current data is used for monitoring and managing ongoing processes or operations of the data.
[0042] Upon identifying the cleaning and normalization technique utilizing the historic and current data retrieved from the one or more data sources, the applying unit 225 is configured to apply a string operation filter on the retrieved data. In an embodiment, the string operation filter includes a substring operation and a concatenation. The applying unit 225 retrieves the data, either the entire data frame or specific columns and field values that need to be processed. In another embodiment, the operation is at least one of the substring operation, the concatenation, a regular expression-based extraction, a string pattern matching, and conditional replacements. If the operation is the substring extraction, the applying unit 225 defines a start position and an end position of the substring to be extracted from the field value. The applying unit 225 locates a portion of text within the field values (e.g., extracting part of an ID, date, or name). In an exemplary embodiment, if the field value is a full date string (e.g., "2024-09-19"), the applying unit 225 extracts the month or day. The month is located at positions 5 and 6 in the date string. The name and ID columns are concatenated to create a new column, and the month is extracted from the date column and stored in a new column. In another exemplary embodiment, to extract the first octet from an IP address, the user extracts characters from the start position (0) to the position of the first dot (.) in 192.168.1.10, which is found using string operations in various programming languages. After performing the substring extraction, the data frame contains the first octet, such as 192, which is used to classify the IP addresses, aiding in network segmentation analysis. By extracting and analyzing the first octet, the user can assess whether the UE 110 is operating within expected ranges and identify potential misconfigurations.
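The substring operations described above may be sketched as follows; the pandas string accessor is assumed as one possible implementation, and the column names date, ip, month, and first_octet are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-09-19", "2024-10-02"],
    "ip":   ["192.168.1.10", "10.0.0.5"],
})

# Substring extraction: the month occupies positions 5 and 6 of a "YYYY-MM-DD" string.
df["month"] = df["date"].str.slice(5, 7)

# First octet: the characters from position 0 up to the first dot of the IP address.
df["first_octet"] = df["ip"].str.split(".").str[0]

print(df)  # month -> "09", "10"; first_octet -> "192", "10"
```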
[0043] If the operation is the concatenation, the applying unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns). In an exemplary embodiment, the user wants to create an identifier for each UE 110 by concatenating multiple fields. The user needs to create a new column, which is a device identifier. The device identifier includes a combination of a device ID, a device type, and a location in a single string. In an embodiment, the concatenation is performed using various programming languages. The device ID is a unique identifier for the device. The device type is a type of device, such as a mobile, a router, or a switch. The location of the device includes a data center and a branch office. The unique device identifier for each network device is created by concatenating the fields. In an exemplary embodiment, the device ID is NET001, the device type is mobile, and the location is branch office 1. The concatenation result for the device identifier is NET001_Mobile_Branch office 1. After executing the concatenation, the device identifier column displays the identifier for each UE 110 as a combination of the device ID, the device type, and the location. The device identifier is used in monitoring dashboards for easy identification of the UEs 110.
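A sketch of the concatenation example above follows; pandas is assumed as one possible implementation, and the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "device_id":   ["NET001", "NET002"],
    "device_type": ["Mobile", "Router"],
    "location":    ["Branch office 1", "Data center"],
})

# Concatenation: combine the device ID, device type, and location into a single identifier column.
df["device_identifier"] = df["device_id"] + "_" + df["device_type"] + "_" + df["location"]

print(df["device_identifier"].iloc[0])  # NET001_Mobile_Branch office 1
```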
[0044] Upon applying the string operation filter to the retrieved data, the preprocessing unit 320 (as shown in FIG. 3) updates the data frame with the modified data and prepares the modified data for pre-processing. The data frame with the modified data is then pre-processed on the field values. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data scaling/filtering, data cleaning, data transformation, and lowercasing of the data to ensure uniformity and consistency of the data. The data scaling/filtering to be applied to the dataset is automatically identified based on the historic data and the current data. The data cleaning refers to removing or correcting any errors or inconsistencies in the data, which involves handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure; if the operation is the concatenation, the transformation combines different data elements into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enhances the ability of the model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
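The pre-processing operations described above (handling missing values, removing duplicates, lowercasing, and min-max scaling as one form of data scaling) may be sketched as follows; the column names, sample values, and fill strategies are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "alice", None, "Bob"],
    "value": [10.0, 10.0, 3.5, None],
})

# Data cleaning: handle missing values and remove duplicate rows.
df["name"] = df["name"].fillna("Unknown")
df["value"] = df["value"].fillna(df["value"].mean())
df = df.drop_duplicates()

# Lowercasing: ensure uniformity of text fields.
df["name"] = df["name"].str.lower()

# Data scaling: rescale a numeric column to the 0-1 range (min-max scaling).
df["value_scaled"] = (df["value"] - df["value"].min()) / (df["value"].max() - df["value"].min())

print(df)
```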
[0045] The applying unit 225 is further configured to detect and execute the operation by using an Artificial Intelligence/Machine Learning (AI/ML) technique. The system 120 utilizes AI/ML algorithms such as Support Vector Machines (SVM) for language detection and tokenization, enabling operations like transliteration (e.g., converting "Москва" from Cyrillic script to "Moskva" in Latin script) or segmenting the sentence "Machine learning is fun" into individual tokens: ['Machine', 'learning', 'is', 'fun']. The system 120 automatically identifies missing or incomplete data and suggests optimal replacement values or patterns (e.g., recommending "Unknown" for missing names) using imputation algorithms like K-Nearest Neighbours (KNN) or mean/mode imputation. The AI/ML can also suggest multi-column transformations, such as concatenating the "First Name" and "Last Name" fields to create a "Full Name" column, based on its analysis of the data's relationships and interactions using graph-based algorithms or relational data models. This ensures that the preprocessing is efficient, relevant, and dynamically tailored to the dataset's specific needs.
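As one possible, non-authoritative rendering of the imputation and multi-column suggestions mentioned above, the sketch below uses scikit-learn's KNNImputer for a missing numeric value, a constant fill for a missing name, and a derived "Full Name" column; the library choice, column names, and sample values are assumptions.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "first_name": ["Ada", None, "Grace"],
    "last_name":  ["Lovelace", "Turing", "Hopper"],
    "age":        [36.0, None, 85.0],
    "tenure":     [5.0, 7.0, 30.0],
})

# Suggest a replacement value for missing names (constant imputation with "Unknown").
df["first_name"] = df["first_name"].fillna("Unknown")

# KNN imputation: fill the missing age using the two nearest neighbours on the observed columns.
df[["age", "tenure"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "tenure"]])

# Suggested multi-column transformation: derive a "Full Name" column from two fields.
df["full_name"] = df["first_name"] + " " + df["last_name"]

print(df)
```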
[0046] Upon pre-processing the one or more data sources, the normalizing unit 230 is configured to normalize the retrieved data via at least one normalization technique. In an embodiment, the at least one normalization technique includes, but is not limited to, a stemming technique, a lemmatization technique, lowercasing, tokenization, removal of punctuation, removal of stop words, and text encoding. The normalization technique is identified utilizing the historic and current data retrieved from the one or more data sources. The normalization technique to be used for normalizing a given dataset is identified automatically through a data-driven approach. This process utilizes machine learning algorithms that analyse the characteristics of the data, such as its distribution, variability, and structure. By evaluating these factors, the system 120 can select the appropriate normalization technique, such as stemming, lemmatization, or scaling, tailored to the one or more attributes of the dataset. The automatic identification not only improves efficiency but also enhances the effectiveness of data preprocessing, ensuring that the appropriate normalization technique aligns with the data's unique requirements.
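Purely to illustrate the idea of selecting a technique from data characteristics, a toy heuristic is sketched below; the rules and thresholds are assumptions and do not represent the data-driven algorithm of the disclosure.

```python
def select_normalization_technique(values):
    """Pick a normalization technique from simple data characteristics (illustrative heuristic only)."""
    if all(isinstance(v, (int, float)) for v in values):
        return "min-max scaling"   # numeric data: rescale to a common range
    if any(" " in str(v) for v in values):
        return "lemmatization"     # free text with sentences: context-aware base forms
    return "stemming"              # short isolated tokens: fast suffix stripping

print(select_normalization_technique([3, 7, 11]))                        # min-max scaling
print(select_normalization_technique(["the routers are functioning"]))   # lemmatization
```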
[0047] The normalizing unit 230 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts or converts the data to a standard format or scale, making the data easier to analyze and compare. The normalization is a process to standardize the data, that is, the columns and field values of the data frame. The data from the one or more data sources is split into tokens, and each token is converted into its base form. The token refers to a unit of text obtained when the text is split into individual tokens, typically words or phrases. Each token represents a unit of meaning in the text. In the normalization process, the inflectional form of a word is converted to the base form.
[0048] The stemming technique is the process of reducing words to their base or root form by removing suffixes and prefixes. In an exemplary embodiment, the words can appear in various forms due to tense, plurality, or other grammatical structures. The stemming technique removes the inflectional form to return each token to the base form. Converting words to the base form significantly reduces the number of distinct tokens in the dataset. In an example, words such as running, runner, and ran are all normalized to run, which removes the variations in the text and also cleans the text by removing redundant data. The lemmatization is the process of reducing the words to their base form (lemma) based on their intended meaning and grammatical context. The lemmatization process ensures understanding of the context of the word, including its part of speech. In an exemplary embodiment, the input data is a sample set of log entries related to network activities. The input data is "the routers are functioning well". The lowercasing refers to converting all characters in the text to lowercase to ensure uniformity (e.g., "Data" to "data"). The tokenization involves splitting a log entry into individual words (tokens). The log entry is split into "The", "routers", "are", "functioning", "well". Upon splitting the log entry, the part of speech is identified for each token. In this exemplary embodiment, "routers" is a noun and "functioning" is a verb. Upon identifying the part of speech, the lemmatization technique is applied to convert each token to its base form based on the part of speech. As a result of the lemmatization, the tokens are "the", "router", "be", "function", "well". The removing punctuation refers to eliminating punctuation marks from the text to focus on the words themselves (e.g., "Hello, world!" to "Hello world").
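A sketch of the stemming and lemmatization step on the log entry above is given below, using NLTK as one possible (assumed) library; the part-of-speech tags are supplied manually to mirror the example, and the WordNet corpus must be downloaded once with nltk.download("wordnet").

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Tokens of the log entry with their parts of speech (n = noun, v = verb, r = adverb).
tokens_with_pos = [("the", "n"), ("routers", "n"), ("are", "v"),
                   ("functioning", "v"), ("well", "r")]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips suffixes; lemmatization converts each token to its base form using its part of speech.
stems = [stemmer.stem(token) for token, _ in tokens_with_pos]
lemmas = [lemmatizer.lemmatize(token.lower(), pos=pos) for token, pos in tokens_with_pos]

print(stems)   # e.g. ['the', 'router', 'are', 'function', 'well']
print(lemmas)  # ['the', 'router', 'be', 'function', 'well']
```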
[0049] Upon completing the normalization, the normalized data is stored in the storage unit 240 and used for training the model. The normalized data is intended for use in training the models. The normalized data ensures that the ML model learns effectively from consistent data, extracts relevant features without noise from variations or inconsistencies in the data, and aids the ML model in generalizing better to unseen data by reducing overfitting to noise in the original dataset.
[0050] Further, the data normalization reduces the number of unique tokens present in the text, removing the variations in the text and also cleaning the text by removing redundant data. It helps to improve the accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the trained model. The normalization process ensures that the valid data is retained for model training. The normalized data enhances the performance of the ML training unit 330. The trained data is used for efficient information retrieval and knowledge extraction, enhancing the effectiveness of the system 120.
[0051] In an alternate embodiment, the normalizing unit 230 is configured to normalize the retrieved data from the one or more data sources. The normalizing unit 230 is configured to notify the normalized data to at least one of, a service, a microservice, an application, and the like. For example, the normalizing unit 230 is configured to notify the normalized data by transmitting an acknowledgment to one of, the service, the microservice, and the application using the handling unit 235. The handling unit 235 is configured to keep a record of mappings of interactions of the entities (such as the service, the microservice, the application, and the component) with the system 120. Mappings of the interactions of the entities with the system 120 pertain to at least one of, the entities transmitting commands and/or requests to the system 120 to normalize the data. Based on the mapping, the handling unit 235 informs the normalizing unit 230 to which entity the acknowledgment has to be transmitted pertaining to the normalized data. For example, let us consider that the microservice 1 had transmitted the command at the outset to normalize the data; the handling unit 235 keeps track of this event. Based on this, the handling unit 235 informs the normalizing unit 230 to transmit the acknowledgment (response) to the microservice 1 pertaining to the normalized data.
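A minimal sketch of the kind of request-to-requester mapping the handling unit 235 may keep is shown below; the class name, method names, and identifiers are hypothetical and intended only to illustrate routing the acknowledgment back to the originating entity.

```python
class HandlingUnit:
    """Records which entity requested normalization, so the acknowledgment can be routed back (sketch)."""

    def __init__(self):
        self.mappings = {}

    def record_request(self, request_id, entity):
        # Keep a record of the mapping between the request and the requesting entity.
        self.mappings[request_id] = entity

    def entity_for(self, request_id):
        # Look up which entity should receive the acknowledgment for this request.
        return self.mappings.get(request_id)

handler = HandlingUnit()
handler.record_request("req-42", "microservice 1")
# After the data is normalized, the acknowledgment is transmitted to the recorded entity.
print("acknowledge:", handler.entity_for("req-42"))  # microservice 1
```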
[0052] FIG. 3 is a block diagram of an architecture 300 that can be implemented in the system of FIG.2, according to one or more embodiments of the present disclosure. The architecture 300 of the system 120 includes an integrated system 305, a load balancer 310, and the processor 205. The processor 205 includes the identification unit 315, the preprocessing unit 320, the data source unit 325, and the Machine Learning (ML) training unit 330.
[0053] The architecture 300 of the system 120 is configured to interact with the integrated system 305 and the load balancer 310. The integrated system 305 is connected to the processor 205 via the load balancer 310. The integrated system 305 is configured to access the data in the network 105 and is capable of interacting with the server 115 and the storage unit 240 to collect the data. In an embodiment, the integrated system 305 includes, but is not limited to, the one or more data sources from where the data can be retrieved. In an embodiment, the data can be retrieved as the file input, the source path, the input stream, the HTTP 2, the DFS, and the NAS. In another embodiment, the retrieved data includes historic and current data.
[0054] The load balancer 310 distributes the data source request traffic across the one or more processors 205. The distribution of the data source request traffic helps in managing and optimizing the workload, ensuring that no single processor is overwhelmed while improving overall system performance and reliability. The processor 205 normalizes the one or more data sources by converting each token into its base form. In the normalization process, the inflectional form of a word is removed so that the base form is obtained.
[0055] Upon retrieving the data and storing it in the data frame, the identification unit 315 is configured to identify a normalization and cleaning process for the retrieved data with the help of the historic data and the current data. The historic data refers to the data that has been collected over a period of time in the past. The historic data typically includes records of events, transactions, measurements, or observations. The historic data involves timestamping, trend analysis, and model training. The timestamping refers to the fact that each entry usually includes a timestamp indicating when the data was collected. The trend analysis is useful for analysing trends, patterns, and changes over time. The model training is used in training predictive models to identify the historical trends that might influence future outcomes. The current data refers to the data that is actively being collected and reflects the most recent observations or events, such as real-time data. The current data refers to real-time data, which is collected continuously or at regular intervals, providing up-to-date information. The current data is used for monitoring and managing ongoing processes or operations of the data.
[0056] Upon identifying the cleaning and normalization technique utilizing the historic and current data retrieved from the one or more data sources, the identification unit 315 is configured to apply the string operation filter on the retrieved data. In an embodiment, the string operation filter includes the substring operation and the concatenation. The identification unit 315 retrieves the data, either the entire data frame or specific columns and field values that need to be processed. In another embodiment, the operation is at least one of the substring operation, the concatenation, the regular expression-based extraction, the string pattern matching, and conditional replacements. If the operation is the substring extraction, the preprocessing unit 320 defines the start position and the end position of the substring to be extracted from the field value. The preprocessing unit 320 locates a portion of text within the field values (e.g., extracting part of an ID, date, or name).
[0057] If the operation is the concatenation, the identification unit 315 combines two or more field values or strings together (e.g., merging data from multiple columns). In an exemplary embodiment, the user wants to create an identifier for each UE 110 by concatenating multiple fields. The user needs to create a new column, which is a device identifier. The device identifier includes a combination of a device ID, a device type, and a location in a single string. In an embodiment, the concatenation is performed using various programming languages. The device ID is a unique identifier for the device. The device type is a type of device, such as a mobile, a router, or a switch. The location of the device includes a data center and a branch office. The unique device identifier for each network device is created by concatenating the fields.
[0058] Upon applying the string operation filter to the retrieved data, the preprocessing unit 320 updates the data frame with the modified data and prepares the modified data for pre-processing. The data frame with the modified data is then pre-processed on the field values. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data cleaning and data transformation. The data cleaning refers to removing or correcting any errors or inconsistencies in the data, which involves handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure; if the operation is the concatenation, the transformation combines different data elements into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enhances the ability of a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
[0059] The preprocessing unit 320 is further configured to detect and execute the operation by using an Artificial Intelligence/Machine Learning (AI/ML) technique. The system 120 utilizes AI/ML algorithms such as Support Vector Machines (SVM) for language detection and tokenization, enabling operations like transliteration (e.g., converting "Москва" from Cyrillic script to "Moskva" in Latin script) or segmenting the sentence "Machine learning is fun" into individual tokens: ['Machine', 'learning', 'is', 'fun']. The system 120 automatically identifies missing or incomplete data and suggests optimal replacement values or patterns (e.g., recommending "Unknown" for missing names) using imputation algorithms like K-Nearest Neighbours (KNN) or mean/mode imputation. The AI/ML can also suggest multi-column transformations, such as concatenating the "First Name" and "Last Name" fields to create a "Full Name" column, based on its analysis of the data's relationships and interactions using graph-based algorithms or relational data models. This ensures that the preprocessing is efficient, relevant, and dynamically tailored to the dataset's specific needs.
[0060] Upon pre-processing the one or more data sources, the preprocessing unit 320 is configured to normalize the retrieved data via at least one normalization technique. In an embodiment, the at least one normalization technique is one of a stemming technique and a lemmatization technique. The normalization technique is identified utilizing the historic and current data retrieved from the one or more data sources. The preprocessing unit 320 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts or converts the data to a standard format or scale, making the data easier to analyze and compare. The normalization is a process to standardize the data, that is, the columns and field values of the data frame. The data from the one or more data sources is split into tokens, and each token is converted into its base form. The token refers to a unit of text obtained when the text is split into individual tokens, typically words or phrases. Each token represents a unit of meaning in the text. In the normalization process, the inflectional form of a word is converted to the base form.
[0061] The stemming technique is the process of reducing words to their base or root form by removing suffixes and prefixes. In an exemplary embodiment, the words can appear in various forms due to tense, plurality, or other grammatical structures. The stemming technique removes the inflectional form to return each token to the base form. Converting words to the base form significantly reduces the number of distinct tokens in the dataset. The lemmatization is the process of reducing the words to their base form (lemma) based on their intended meaning and grammatical context. The lemmatization process ensures understanding of the context of the word, including its part of speech.
[0062] Upon preprocessing the data, the data source unit 325 is configured to update the preprocessed data and store the updated data source in the data frame. The ML training unit 330 is configured to train the one or more data sources using one or more machine learning algorithms. Due to the string operation filter such as the substring extraction and concatenation, the ML training unit 330 generates high quality data with better consistency. Further, the ML training unit 330 understands the structure of the data and learns the patterns, which helps in generating more accurate and contextually appropriate responses.
[0063] FIG. 4 is a signal flow diagram illustrating a process for normalizing the data, according to one or more embodiments of the present disclosure.
[0064] At 405, initially, the user transmits a request via the UI 215 for selecting the string operation filter to be performed on data from a data frame that includes columns and field values. The data frame refers to a structured data set that is organized into rows and columns, similar to a table in the storage unit 240. Each column represents an attribute or field values (e.g., name, age, transaction ID), and each row contains a record with corresponding values for these attributes. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
[0065] At 410, the retrieving unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least, file input, source path, input stream, Hyper Text Transfer Protocol 2 (HTTP 2), Distributed File System (DFS), and Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing.
[0066] At 415, upon retrieving the data from the one or more data sources, the applying unit 225 is configured to apply the string operation filter on the retrieved data. In an embodiment, the string operation filter includes a contains operation, the substring operation, and the concatenation. The applying unit 225 retrieves the data, either the entire data frame or specific columns and field values that need to be processed. In another embodiment, the operation is at least one of the substring operation, the concatenation, the regular expression-based extraction, the string pattern matching, and conditional replacements. If the operation is the substring extraction, the applying unit 225 defines the start position and the end position of the substring to be extracted from the field value. The applying unit 225 locates the portion of text within the field values (e.g., extracting part of an ID, date, or name).
[0067] If the operation is the concatenation, the applying unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns). In an exemplary embodiment, the user wants to create an identifier for each UE 110 by concatenating multiple fields. The user needs to create a new column, which is a device identifier. The device identifier includes a combination of a device ID, a device type, and a location in a single string. In an embodiment, the concatenation is performed using various programming languages. The device ID is a unique identifier for the device. The device type is a type of device, such as a mobile, a router, or a switch. The location of the device includes a data center and a branch office. The unique device identifier for each network device is created by concatenating the fields.
[0068] Upon applying the string operation filter to the retrieved data, the preprocessing unit 320 updates the data frame with the modified data and prepares the modified data for pre-processing. The data frame with the modified data is then pre-processed on the field values. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data cleaning and data transformation. The data cleaning refers to removing or correcting any errors or inconsistencies in the data, which involves handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure; if the operation is the concatenation, the transformation combines different data elements into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enhances the ability of a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
[0069] At 420, upon pre-processing the one or more data sources, the normalizing unit 230 is configured to normalize the retrieved data via at least one normalization technique. In an embodiment, the at least one normalization technique is one of a stemming technique and a lemmatization technique. The normalization technique is identified utilizing the historic and current data retrieved from the one or more data sources. The normalizing unit 230 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts or converts the data to a standard format or scale, making the data easier to analyze and compare. The normalization is a process to standardize the data, that is, the columns and field values of the data frame. The data from the one or more data sources is split into tokens, and each token is converted into its base form. The token refers to a unit of text obtained when the text is split into individual tokens, typically words or phrases. Each token represents a unit of meaning in the text. In the normalization process, the inflectional form of a word is converted to the base form.
[0070] The stemming technique is the process of reducing words to their base or root form by removing suffixes and prefixes. In an exemplary embodiment, the words can appear in various forms due to tense, plurality, or other grammatical structures. The stemming technique removes the inflectional form to return each token to the base form. Converting words to the base form significantly reduces the number of distinct tokens in the dataset. In an example, words such as running, runner, and ran are all normalized to run, which removes the variations in the text and also cleans the text by removing redundant data. The lemmatization is the process of reducing the words to their base form (lemma) based on their intended meaning and grammatical context. The lemmatization process ensures understanding of the context of the word, including its part of speech. In an exemplary embodiment, the input data is a sample set of log entries related to network activities. The input data is "the routers are functioning well". The tokenization involves splitting a log entry into individual words (tokens). The log entry is split into "The", "routers", "are", "functioning", "well". Upon splitting the log entry, the part of speech is identified for each token. In this exemplary embodiment, "routers" is a noun and "functioning" is a verb. Upon identifying the part of speech, the lemmatization technique is applied to convert each token to its base form based on the part of speech. As a result of the lemmatization, the tokens are "the", "router", "be", "function", "well".
[0071] At 425, upon completing the normalization, the normalized data is stored in the storage unit 240 and used for training the model. The normalized data is intended for use in training machine learning models. The normalized data ensures that the model learns effectively from consistent data, extracts relevant features without noise from variations or inconsistencies in the data, and aids the model in generalizing better to unseen data by reducing overfitting to noise in the original dataset.
[0072] In an embodiment, the raw data is stored and transmitted to the ML training unit 330 to train the data. Further, the data normalization reduces the number of unique tokens present in the text, removing the variations in the text and also cleaning the text by removing redundant data. It helps to improve the accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the trained model. The normalization process ensures that the valid data is retained for model training. The normalized data enhances the performance of the ML training unit 330. The trained data is used for efficient information retrieval and knowledge extraction, enhancing the effectiveness of the system 120.
[0073] FIG. 5 is a flow diagram illustrating a method 500 for normalizing the data, according to one or more embodiments of the present disclosure.
[0074] At step 505, the method 500 includes the step of retrieving the data from the one or more data sources by the retrieving unit 220. Upon receiving the request, the retrieving unit 220 is configured to retrieve the data from the one or more data sources. In an embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing. Upon retrieving the data, the retrieved data is stored in the data frame for further processing.
[0075] In another embodiment, the retrieved data includes historic and current data. Upon retrieving the data and storing it in the data frame, the identification unit 315 is configured to identify the normalization and cleaning process to be applied on the retrieved data with the help of the historic data and the current data. The historic data refers to the data that has been collected over a period of time in the past. The historic data typically includes records of events, transactions, measurements, or observations. The historic data is used for timestamping, trend analysis, and model training. Timestamping means that each entry usually includes a timestamp indicating when the data was collected. Trend analysis is useful for analysing trends, patterns, and changes over time. Model training uses the historical trends in the data to train predictive models that identify patterns which might influence future outcomes. The current data refers to the data that is actively being collected and reflects the most recent observations or events, such as real-time data. The current data is collected continuously or at regular intervals, providing up-to-date information, and is used for monitoring and managing ongoing processes or operations.
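The distinction between historic and current data can be illustrated, for example, by splitting a timestamped data frame at a cut-off time. The cut-off, column names, and sample rows below are assumptions for demonstration only.

```python
# Hedged sketch: separating historic from current (real-time) data by timestamp.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01 10:00", "2024-09-19 09:55", "2024-09-19 10:00"]),
    "event": ["login", "login", "login"],
})

cutoff = pd.Timestamp("2024-09-19 00:00")
historic = df[df["timestamp"] < cutoff]    # collected in the past; used for trend analysis and model training
current = df[df["timestamp"] >= cutoff]    # most recent observations; used for monitoring ongoing operations

print(len(historic), len(current))
```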
[0076] At step 510, the method 500 includes the step of applying the string operation filter on the retrieved data by the applying unit 225. In an embodiment, the string operation filter includes a substring operation and a concatenation. The applying unit 225 retrieves the data, either the entire data frame or the specific columns and field values that need to be processed. In another embodiment, the operation is at least one of the substring operation, the concatenation, a regular expression-based extraction, a string pattern matching, and conditional replacements. If the operation is the substring extraction, the applying unit 225 defines a start position and an end position of the substring to be extracted from the field value. The applying unit 225 locates a portion of text within the field values (e.g., extracting part of an ID, date, or name). In an exemplary embodiment, if the field value is a full date string (e.g., "2024-09-19"), the applying unit 225 extracts the month or day. The month is located at positions 5 and 6 in the date string; the month is extracted from the date column and stored in a new column. The name and ID columns may further be concatenated to create a new column. In another exemplary embodiment, to extract the first octet from an IP address, the characters from the start position (0) up to the position of the first dot (.) in 192.168.1.10 are extracted using the string operations available in various programming languages. After performing the substring extraction, the data frame contains the first octet, such as 192, which is used to classify the IP addresses, aiding in network segmentation analysis. By extracting and analyzing the first octet, the user can assess whether the UE 110 is operating within expected ranges and identify potential misconfigurations.
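A minimal sketch of the substring operation is shown below, assuming a pandas data frame; the column names and sample values mirror the exemplary embodiment but are otherwise illustrative.

```python
# Hedged sketch of the substring operation on data-frame field values.
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-09-19"],
    "ip_address": ["192.168.1.10"],
})

# Extract the month from the date string (characters at positions 5 and 6).
df["month"] = df["date"].str[5:7]

# Extract the first octet: characters from position 0 up to the first dot.
df["first_octet"] = df["ip_address"].str.split(".").str[0]

print(df)   # month -> '09', first_octet -> '192'
```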
[0077] If the operation is the concatenation, the applying unit 225 combines two or more field values or strings together (e.g., merging data from multiple columns). In an exemplary embodiment, the user wants to create an identifier for each UE 110 by concatenating multiple fields. The user needs to create a new column which is a device identifier. The device identifier is a combination of a device ID, a device type, and a location into a single string. In an embodiment, the concatenation is performed using various programming languages. The device ID is a unique identifier for the device. The device type is a type of device, such as a mobile, a router, or a switch. The location of the device includes a data center and a branch office. A unique device identifier for each network device is created by concatenating the fields. In an exemplary embodiment, the device ID is NET001, the device type is mobile, and the location is branch office 1. The concatenation result for the device identifier is NET001_Mobile_Branch office 1. After executing the concatenation, the device identifier column displays the identifier for each UE 110 as a combination of the device ID, the device type, and the location. The device identifier is used in monitoring dashboards for easy identification of the UEs 110.
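The concatenation operation may, for example, be sketched as follows, again assuming a pandas data frame with illustrative column names taken from the exemplary embodiment.

```python
# Hedged sketch of the concatenation operation producing a device identifier column.
import pandas as pd

df = pd.DataFrame({
    "device_id": ["NET001"],
    "device_type": ["Mobile"],
    "location": ["Branch office 1"],
})

# Concatenate the fields into a single device identifier.
df["device_identifier"] = df["device_id"] + "_" + df["device_type"] + "_" + df["location"]

print(df["device_identifier"].iloc[0])   # NET001_Mobile_Branch office 1
```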
[0078] Upon performing the string operation filter on the retrieved data, the preprocessing unit 320 updates the data frame with the modified data and prepares the modified data for pre-processing. The pre-processing is applied to the field values of the one or more data sources held in the updated data frame. The pre-processing involves preparing and transforming the data into a format suitable for further analysis, manipulation, or display. The pre-processing includes data cleaning and data transformation. The data cleaning refers to removing or correcting any errors or inconsistencies in the data, which involves handling missing values, removing duplicates, or correcting inaccuracies. The data transformation refers to converting the data into a suitable format or structure; if the operation is concatenation, different data elements are combined into a unified format. The pre-processing ensures that the data is clean, structured, and ready for subsequent operations, which include analysis, machine learning, reporting, or other tasks. The pre-processing of the data also enables a Machine Learning (ML) model to understand the structure and learn patterns at a more granular level, which helps in generating more accurate and contextually appropriate responses.
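For illustration, the pre-processing (data cleaning and data transformation) on the updated data frame might resemble the sketch below; the columns, the mean-fill strategy, and the type cast are assumptions for the example, not requirements of the system 120.

```python
# Hedged sketch of data cleaning and transformation on the updated data frame.
import pandas as pd

df = pd.DataFrame({
    "device_id": ["NET001", "NET001", "NET002"],
    "latency_ms": [12.0, 12.0, None],
})

df = df.drop_duplicates()                                             # cleaning: remove duplicate rows
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].mean())   # cleaning: handle missing values
df["latency_ms"] = df["latency_ms"].astype(float)                     # transformation: enforce a consistent type

print(df)
```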
[0079] The preprocessing unit 320 is further configured to detect and execute the operation by using an Artificial Intelligence/Machine Learning (AI/ML) technique. The system 120 utilizes AI/ML algorithms such as support vector machines (SVM) for language detection and tokenization, enabling operations like transliteration (e.g., converting "Москва" from Cyrillic script to "Moskva" in Latin script) or segmenting the sentence "Machine learning is fun" into individual tokens: ['Machine', 'learning', 'is', 'fun']. The system 120 automatically identifies missing or incomplete data and suggests optimal replacement values or patterns (e.g., recommending "Unknown" for missing names) using imputation algorithms like K-Nearest Neighbours (KNN) or mean/mode imputation. The AI/ML can also suggest multi-column transformations, such as concatenating the "First Name" and "Last Name" fields to create a "Full Name" column, based on its analysis of the data's relationships and interactions using graph-based algorithms or relational data models. This ensures that preprocessing is efficient, relevant, and dynamically tailored to the dataset's specific needs.
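The AI/ML-assisted suggestions mentioned above can be illustrated with the sketch below, which uses scikit-learn's KNNImputer and simple whitespace tokenization as stand-ins; the numeric values, the two-column layout, and the choice of two neighbours are assumptions for the example.

```python
# Hedged sketch: KNN imputation of a missing value, a suggested placeholder for
# missing names, and simple tokenization of the example sentence.
import numpy as np
from sklearn.impute import KNNImputer

# Two related numeric features; the missing value in the third row is imputed
# from the two rows whose first-feature values are closest to it.
X = np.array([
    [10.0, 100.0],
    [20.0, 110.0],
    [30.0, np.nan],
    [40.0, 130.0],
])
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed[2, 1])   # average of the two nearest rows, here 120.0

# Suggested replacement value for missing names.
names = ["Alice", None, "Bob"]
print([n if n is not None else "Unknown" for n in names])

# Simple tokenization of the sentence from the description.
print("Machine learning is fun".split())   # ['Machine', 'learning', 'is', 'fun']
```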
[0080] At step 515, the method 500 includes the step of normalizing the retrieved data via at least one normalization technique by the normalizing unit 230. In an embodiment, the at least one normalization technique is one of the stemming technique and the lemmatization technique. The normalization technique is identified utilizing historic and current data retrieved from the one or more data sources. The normalizing unit 230 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts or removes data to conform to a standard format or scale, making the data easier to analyze and compare. The normalization standardizes the data, which is one of the column and field values of the data frame. The data from the one or more data sources is split into tokens, and each token is converted into its base form. The token refers to an individual unit into which the text is split, typically a word or phrase, and each token represents a unit of meaning in the text. In the normalization process, the inflectional form of a word is converted to the base form.
[0081] The stemming technique is the process of reducing words to their base or root form by removing suffixes and prefixes. In an exemplary embodiment, the words can appear in various forms due to tense, plurality, or other grammatical structures. The stemming technique removes the inflectional form to return each token to the base form. Converting words to the base form significantly reduces the number of distinct tokens in the dataset. The lemmatization is the process of reducing the words to their base form (lemma) based on their intended meaning and grammatical context. The lemmatization process takes into account the context of the word, including its part of speech.
[0082] Upon completing the normalization, the normalized data is stored in the storage unit 240 and used for training the model. The normalized data ensures that the model learns effectively from consistent data, extracts relevant features without noise from variations or inconsistencies in the data, and generalizes better to unseen data by reducing overfitting to noise in the original dataset.
[0083] In an embodiment, the raw data is stored and transmitted to the ML training unit 330 (as shown in FIG. 3) to train the model. Further, the data normalization reduces the number of unique tokens present in the text, removes the variations in the text, and also cleans the text by removing redundant data. This helps to improve the accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the trained model. The normalization process ensures that the valid data is retained for model training. The normalized data enhances the performance of the ML training unit 330. The trained model is used for efficient information retrieval and knowledge extraction, enhancing the effectiveness of the system 120.
[0084] In another aspect of the embodiment, a non-transitory computer-readable medium has stored thereon computer-readable instructions that are executed by a processor 205. The processor 205 is configured to retrieve data from one or more data sources. The processor 205 is configured to perform a string operation filter on the retrieved data to clean the retrieved data. The processor 205 is configured to normalize the retrieved data via at least one normalization technique.
[0085] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIGS.1-5) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0086] The present disclosure provides a technical advancement for normalizing the data, which removes unwanted rows, helps in increasing the quality, quantity, and diversity of the training data, and reduces inflectional forms of the text, and sometimes derivationally related forms of a word, to a common base form. The present disclosure performs data normalization using the operations, which reduces the number of unique tokens present in the text, removes the variations in the text, and also cleans the text by removing redundant data. With fewer unique tokens, the ML training unit can be trained more efficiently, which helps to improve the accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the ML model. The normalization process ensures that the valid data is retained for ML training.
[0087] The present invention offers multiple advantages over the prior art, and the above listed are a few examples to emphasize some of the advantageous features. The listed advantages are to be read in a non-limiting manner.

REFERENCE NUMERALS

[0088] Environment - 100
[0089] Network - 105
[0090] User equipment - 110
[0091] Server - 115
[0092] System - 120
[0093] Processor - 205
[0094] Memory - 210
[0095] User interface - 215
[0096] Retrieving unit - 220
[0097] Applying unit - 225
[0098] Normalizing unit - 230
[0099] Handling unit - 235
[00100] Storage unit - 240
[00101] Integrated system - 305
[00102] Load balancer - 310
[00103] Identification unit - 315
[00104] Preprocessing unit - 320
[00105] Data source unit - 325
[00106] ML training unit - 330
CLAIMS
We Claim:
1. A method (500) of normalizing data, the method (500) comprising the steps of:
retrieving, by one or more processors (205), data from one or more data sources;
applying, by the one or more processors (205), a string operation filter on the retrieved data to clean the retrieved data; and
normalizing, by the one or more processors (205), the retrieved data via at least one normalization technique.

2. The method (500) as claimed in claim 1, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).

3. The method (500) as claimed in claim 1, wherein the string operation filter comprises contains operation, substring operation and concatenation.

4. The method (500) as claimed in claim 1, wherein on retrieving the data, the method comprises the step of performing, by the one or more processors, operations pertaining to scaling, normalization, cleaning, and lowercasing of the data to ensure uniformity and consistency of the data.

5. The method (500) as claimed in claim 1, wherein the at least one normalization technique is one of a stemming technique and a lemmatization technique and wherein the normalization technique is identified utilizing historic and current data retrieved from the one or more data sources.

6. The method (500) as claimed in claim 1, wherein the normalized data is stored in a storage unit (240) and used for training a model.

7. A system (120) for normalizing data, the system (120) comprises:
a retrieving unit (220) configured to retrieve, data from one or more data sources;
an applying unit (225) configured to apply, a string operation filter on the retrieved data to clean the retrieved data; and
a normalizing unit (230) configured to normalize, the retrieved data via at least one normalization technique.

8. The system (120) as claimed in claim 7, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).

9. The system (120) as claimed in claim 7, wherein the string operation filter comprises contains operation, substring operation and concatenation.

10. The system (120) as claimed in claim 7, wherein on retrieving the data, the system comprises the step of performing, by the one or more processors, operations pertaining to scaling, normalization, cleaning, and lowercasing of the data to ensure uniformity and consistency of the data.

11. The system (120) as claimed in claim 7, wherein the at least one normalization technique is one of a stemming technique and a lemmatization technique and wherein the normalization technique is identified utilizing historic and current data retrieved from the one or more data sources.

12. The system (120) as claimed in claim 7, wherein the normalized data is stored in a storage unit (240) and used for training a model.
