System And Method For Normalizing Data

Abstract: A system (120) and a method (500) for normalizing data are disclosed. The system (120) includes a retrieving unit (220) configured to retrieve the data from one or more data sources. The data represents a plurality of rows. The system (120) includes an identifying unit (225) configured to identify one or more rows including one or more invalid row values from the plurality of rows. The system (120) includes a comparing unit (230) configured to compare the one or more invalid row values with a predefined threshold value. The system (120) includes a normalizing unit (235) configured to normalize the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison. Ref. Fig. 2

Patent Information

Application #

Filing Date

09 November 2023

Publication Number

20/2025

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

JIO PLATFORMS LIMITED

OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA

Inventors

1. Aayush Bhatnagar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

2. Ankit Murarka

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

3. Jugal Kishore

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

4. Chandra Ganveer

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

5. Sanjana Chaudhary

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

6. Gourav Gurbani

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

7. Yogesh Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

8. Avinash Kushwaha

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

9. Dharmendra Kumar Vishwakarma

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

10. Sajal Soni

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

11. Niharika Patnam

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

12. Shubham Ingle

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

13. Harsh Poddar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

14. Sanket Kumthekar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

15. Mohit Bhanwria

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

16. Shashank Bhushan

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

17. Vinay Gayki

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

18. Aniket Khade

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

19. Durgesh Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

20. Zenith Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

21. Gaurav Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

22. Manasvi Rajani

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

23. Kishan Sahu

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

24. Sunil Meena

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

25. Supriya Kaushik De

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

26. Kumar Debashish

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

27. Mehul Tilala

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

28. Satish Narayan

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

29. Rahul Kumar

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

30. Harshita Garg

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

31. Kunal Telgote

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

32. Ralph Lobo

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

33. Girish Dange

Reliance Corporate Park, Thane - Belapur Road, Ghansoli, Navi Mumbai, Maharashtra 400701, India

Specification

DESC: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003

COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD FOR NORMALIZING DATA
2. APPLICANT(S)
NAME: JIO PLATFORMS LIMITED
NATIONALITY: INDIAN
ADDRESS: OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3. PREAMBLE TO THE DESCRIPTION

THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.

FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication networks, and more particularly to a method and a system for normalizing data.
BACKGROUND OF THE INVENTION
[0002] Telecommunication networks include processing systems for executing a diverse range of algorithms and predictive tasks, including anomaly detection. These processing systems may be powered by Large Language Models (LLMs) and their functions may include conducting thorough analysis of network and operational data using Machine Learning (ML) techniques to extract deep insights into the network data.
[0003] Input network data utilized for training the ML models is expected to be well-defined and cleansed. Generally, datasets are fed to the ML model using a data frame, which includes rows and columns. As such, data cleaning is a crucial step that includes identification and rectification or removal of inaccurate records, inconsistencies, errors, and other noise in the dataset. The quality and reliability of the dataset significantly impacts the performance and dependability of ML models.
[0004] There may be instances when the data stored in the data frame has inconsistent values. For example, each row may have a different value compared to the other rows. Because a data frame or dataset used for training may include a large number of data values, such inconsistency and variation make it cumbersome and time-consuming for a user to separate relevant data from irrelevant data based on values. Even after such filtering is performed, there is a high chance of errors, since the task may be performed manually.
[0005] There is, therefore, a need for normalizing data used for training Machine Learning (ML) models to make the data values consistent. There is a requirement for a system and a method to normalize the data based on a specific threshold value, wherein the user has the provision to set thresholds for different data values of the different parameters incorporated in a data frame.
SUMMARY OF THE INVENTION
[0006] One or more embodiments of the present disclosure provide a method and a system for normalizing data.
[0007] In one aspect of the present invention, the method for normalizing the data is disclosed. The method includes the step of retrieving, by one or more processors, data from one or more data sources. The data represents a plurality of rows. The method includes the step of identifying, by the one or more processors, one or more rows including one or more invalid row values from the plurality of rows. The method includes the step of comparing, by the one or more processors, the one or more invalid row values with a predefined threshold value. The method includes the step of normalizing, by the one or more processors, the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison.
[0008] In one embodiment, the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
[0009] In another embodiment, the retrieved data is stored in a data frame.
[0010] In yet another embodiment, the step of identifying, by the one or more processors, one or more rows including one or more invalid row values includes the step of checking, by the one or more processors, whether one or more row values of the plurality of rows are compliant with an acceptable format. The step of identifying further includes the step of identifying, by the one or more processors, that the one or more row values are invalid row values in response to determining, based on the checking, that the one or more row values are not compliant with the acceptable format. The one or more invalid row values include at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string.
[0011] In yet another embodiment, the step of normalizing the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison includes the steps of: analyzing, by the one or more processors, a dataset to learn a structure of the data, including at least one of data fields and data types; identifying, by the one or more processors, the one or more invalid rows based on the analyzed dataset; removing, by the one or more processors, the identified one or more invalid rows; and formatting, by the one or more processors, the dataset into a standard format subsequent to the removal of the one or more invalid rows.
[0012] In yet another embodiment, the normalized data is stored in a storage unit and used to train a model for normalizing the data.
[0013] In another aspect of the present invention, the system for normalizing the data is disclosed. The system includes a retrieving unit configured to retrieve data from one or more data sources. The data represents a plurality of rows. The system includes an identifying unit configured to identify one or more rows including one or more invalid row values from the plurality of rows. The system includes a comparing unit configured to compare the one or more invalid row values with a predefined threshold value. The system includes a normalizing unit configured to normalize the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison.
[0014] In another aspect of the present invention, a non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to perform operations is disclosed. The processor is configured to retrieve data from one or more data sources. The data represents a plurality of rows. The processor is configured to identify one or more rows including one or more invalid row values from the plurality of rows. The processor is configured to compare the one or more invalid row values with a predefined threshold value. The processor is configured to normalize the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison.
[0015] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0017] FIG. 1 is an exemplary block diagram of an environment for normalizing data, according to one or more embodiments of the present disclosure;
[0018] FIG. 2 is an exemplary block diagram of a system for normalizing the data, according to the one or more embodiments of the present disclosure;
[0019] FIG. 3 is a block diagram of an architecture that can be implemented in the system of FIG.2, according to the one or more embodiments of the present disclosure;
[0020] FIG. 4 is a signal flow diagram illustrating the normalizing of the data, according to the one or more embodiments of the present disclosure; and
[0021] FIG. 5 is a flow diagram illustrating the method for normalizing the data, according to the one or more embodiments of the present disclosure.
[0022] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0023] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0024] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure including the definitions listed here below are not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0025] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0026] FIG. 1 illustrates an exemplary block diagram of an environment 100 for normalizing data, according to one or more embodiments of the present invention. The environment 100 includes a network 105, a User Equipment (UE) 110, a server 115, and a system 120. The UE 110 aids a user to interact with the system 120 for normalizing the data. In an embodiment, the user is at least one of a network operator and a service provider. Normalizing the data refers to cleaning the data by identifying, pre-processing, and removing any inaccuracies or inconsistencies from a dataset.
[0027] For the purpose of description and explanation, the description will be explained with respect to the UE 110, or more specifically with respect to a first UE 110a, a second UE 110b, and a third UE 110c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the first UE 110a, the second UE 110b, and the third UE 110c is configured to connect to the server 115 via the network 105. In an embodiment, each of the first UE 110a, the second UE 110b, and the third UE 110c is one of, but not limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more such devices, such as a smartphone, a virtual reality (VR) device, an augmented reality (AR) device, a laptop, a general-purpose computer, a desktop, a personal digital assistant, a tablet computer, a mainframe computer, or any other computing device.
[0028] The network 105 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 105 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0029] The server 115 may include, by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, or some combination thereof. In an embodiment, the server 115 is associated with an entity that may include, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise, a defense facility, or any other facility that provides content.
[0030] The environment 100 further includes the system 120 communicably coupled to the server 115 and each of the first UE 110a, the second UE 110b, and the third UE 110c via the network 105. The system 120 is configured for normalizing the data. The system 120 is adapted to be embedded within the server 115 or to be implemented as an individual entity, as per multiple embodiments of the present invention.
[0031] Operational and construction features of the system 120 will be explained in detail with respect to the following figures.
[0032] FIG. 2 is an exemplary block diagram of a system 120 for normalizing the data, according to one or more embodiments of the present disclosure.
[0033] The system 120 includes a processor 205, a memory 210, a user interface 215, and a storage unit 240. For the purpose of description and explanation, the description will be explained with respect to one or more processors 205, or more specifically with respect to the processor 205, and should nowhere be construed as limiting the scope of the present disclosure. The one or more processors 205, hereinafter referred to as the processor 205, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0034] As per the illustrated embodiment, the processor 205 is configured to fetch and execute computer-readable instructions stored in the memory 210. The memory 210 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 210 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0035] The User Interface (UI) 215 includes a variety of interfaces, for example, interfaces for a Graphical User Interface (GUI), a web user interface, a Command Line Interface (CLI), and the like. The user interface 215 facilitates communication of the system 120. In one embodiment, the user interface 215 provides a communication pathway for one or more components of the system 120. Examples of the one or more components include, but are not limited to, the UE 110, and the storage unit 240.
[0036] The storage unit 240 is one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a Not only Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of storage unit types are non-limiting and may not be mutually exclusive, e.g., a database can be both commercial and cloud-based, or both relational and open-source.
[0037] Further, the processor 205, in an embodiment, may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 205. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 205 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for processor 205 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 210 may store instructions that, when executed by the processing resource, implement the processor 205. In such examples, the system 120 may comprise the memory 210 storing the instructions and the processing resource to execute the instructions, or the memory 210 may be separate but accessible to the system 120 and the processing resource. In other examples, the processor 205 may be implemented by electronic circuitry.
[0038] In order for the system 120 to normalize the data, the processor 205 includes a retrieving unit 220, an identifying unit 225, a comparing unit 230, and a normalizing unit 235 communicably coupled to each other. In an embodiment, operations and functionalities of the retrieving unit 220, the identifying unit 225, the comparing unit 230, and the normalizing unit 235 can be used in combination or interchangeably.
[0039] Initially, a request is transmitted by the user via the UI 215 to extract one or more rows in a data frame that contains one or more invalid row values. The data frame is a two-dimensional, tabular data structure used to store and manipulate the data in rows and columns. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
[0040] Upon receiving the request, the retrieving unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS). The file input refers to reading data from files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in a CSV file, and the retrieving unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing.
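The file-input case can be illustrated with a minimal Python sketch. It is not the implementation of the retrieving unit 220; the function name `load_rows` and the sample CSV text are assumptions made for illustration, and a list of dictionaries stands in for the data frame.

```python
import csv
import io

def load_rows(csv_text):
    """Parse CSV text into a list of row dictionaries (a simple stand-in for a data frame)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

# Hypothetical sample content; in practice the text would be read from a file.
SAMPLE_CSV = (
    "name,age,email\n"
    "john doe,30,john.doe@example.com\n"
    "jane smith,25,jane.smith@example\n"
)

rows = load_rows(SAMPLE_CSV)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as age would still need conversion before range checks are applied.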
[0041] The source path typically refers to the directory or network location where the data files are stored. The retrieving unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory, and the retrieving unit 220 navigates to the designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is received from an Application Programming Interface (API) or a live data stream, and the retrieving unit 220 fetches the data in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the retrieving unit 220 uses the HTTP 2 to fetch the data from remote web servers or APIs.
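As an illustration of the input-stream case, the following Python sketch consumes records one at a time as they arrive. It assumes the stream carries newline-delimited JSON, which the specification does not state; the function name `read_stream` and the in-memory `feed` object are likewise hypothetical stand-ins for a live source.

```python
import io
import json

def read_stream(stream):
    """Yield one parsed record per non-empty line of a newline-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank keep-alive lines
            yield json.loads(line)

# Hypothetical stand-in for a live feed; any iterable of text lines would work.
feed = io.StringIO(
    '{"name": "john doe", "age": 30}\n'
    "\n"
    '{"name": "jane smith", "age": 25}\n'
)
records = list(read_stream(feed))
```

Because `read_stream` is a generator, each record can be processed as soon as it is read rather than after the whole stream has been received.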
[0042] The DFS is a distributed file system used to store large datasets across multiple machines. The DFS is commonly used in big data environments to store and retrieve large amounts of data. The retrieving unit 220 connects to the DFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The retrieving unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the retrieving unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in the data frame.
[0043] Upon the data being retrieved and stored in the data frame, the identifying unit 225 is configured to identify the one or more rows including the one or more invalid row values from the plurality of rows. The one or more invalid row values refer to entries within a dataset that do not meet predefined criteria or standards for validity. In an embodiment, the one or more invalid row values include at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string. The identifying unit 225 is configured to scan the dataset which contains the retrieved data and to check whether one or more row values of the plurality of rows are compliant with an acceptable format. The acceptable format refers to the specific criteria or standards that each value in the dataset must meet to be considered valid and usable. The identifying unit 225 is further configured to identify that the one or more row values are invalid row values in response to determining, based on the checking, that the one or more row values are not compliant with the acceptable format. In an exemplary embodiment, the dataset includes columns such as, but not limited to, name, age, and email. In this embodiment, the acceptable format requires that the name be a string with only letters and spaces, that the age be a non-negative integer, and that the email follow a standard email format (e.g., user@domain.com). In a first row, the name is john doe, the age is 30, and the email is john.doe@example.com. In a second row, the name is jane smith, the age is 25, and the email is jane.smith@example. In a third row, the name is bob brown, the age is -5, and the email is bob.brown@.com. In the first row, the name, the age, and the email are all valid. In the second row, the name and the age are valid, but the email is invalid. In the third row, the name is valid, but the age and the email are invalid. The second and third rows of the plurality of rows therefore contain invalid row values, which are flagged for further action, such as correction or removal.
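The three-row example above can be reproduced with a short Python sketch. It is one possible reading of the acceptable format, not the implementation of the identifying unit 225. The regular expressions and the function name `invalid_fields` are assumptions, and the email pattern is deliberately simplified: it only requires a non-empty local part, an "@", and a dotted domain.

```python
import re

# Assumed patterns for the acceptable format in the example.
NAME_RE = re.compile(r"[A-Za-z ]+")                       # letters and spaces only
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s.]+(\.[^@\s.]+)+")   # user@domain.tld (simplified)

def invalid_fields(row):
    """Return the names of the fields in `row` that fail the format check."""
    bad = []
    if not NAME_RE.fullmatch(row["name"]):
        bad.append("name")
    if not (isinstance(row["age"], int) and row["age"] >= 0):
        bad.append("age")
    if not EMAIL_RE.fullmatch(row["email"]):
        bad.append("email")
    return bad

rows = [
    {"name": "john doe", "age": 30, "email": "john.doe@example.com"},
    {"name": "jane smith", "age": 25, "email": "jane.smith@example"},
    {"name": "bob brown", "age": -5, "email": "bob.brown@.com"},
]

# Map 1-based row numbers to the fields that were flagged as invalid.
flagged = {}
for number, row in enumerate(rows, start=1):
    bad = invalid_fields(row)
    if bad:
        flagged[number] = bad
# flagged == {2: ["email"], 3: ["age", "email"]}
```

The result matches the walkthrough: the first row passes, the second row is flagged for its email, and the third row is flagged for both its age and its email.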
[0044] Upon identifying the one or more rows including the one or more invalid row values from the plurality of rows, the comparing unit 230 is configured to compare the one or more invalid row values with a predefined threshold value for the one or more invalid rows. In an embodiment, the predefined threshold value is defined by the user. The comparison helps determine the severity of the invalid data and informs subsequent actions, such as data correction or deletion. In an exemplary embodiment, the user sets the predefined threshold value for age as a range from 0 to 120, so that any age below 0 is considered invalid. The predefined threshold value for email validation might include rules such as containing "@" and ending with a valid domain.
[0045] The comparing unit 230 analyzes the flagged one or more invalid row values against the predefined threshold value, row by row. Continuing the exemplary embodiment, in the second row, the email lacks the ".com" extension. The predefined threshold for a valid email format requires a complete address (user@domain.com). Since the email does not meet the predefined threshold, it remains flagged as an invalid row value. In the third row, the email lacks a domain name before ".com", and the age value of -5 is checked against the predefined threshold (0-120). Since -5 is below the lower bound of the predefined threshold, it is confirmed as an invalid row value. By comparing the one or more invalid row values to the predefined threshold values, the system 120 can determine how critical each issue is. For instance, an invalid age may require immediate correction, while a minor formatting error in an email might be less urgent. Users are allowed to define the thresholds, which provides flexibility and enables different applications or industries to set relevant criteria based on specific requirements.
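A minimal Python sketch of user-defined thresholds, one per column, is given below. The helper names `within_range`, `email_rule`, and `confirm_invalid`, the dictionary-based configuration, and the choice of ".com" as the only accepted suffix are all assumptions made for illustration; the specification does not prescribe how thresholds are represented.

```python
def within_range(low, high):
    """Build a rule that accepts values in the closed interval [low, high]."""
    return lambda value: low <= value <= high

def email_rule(suffixes):
    """Build a rule that requires 'local@domain' with a domain ending in an accepted suffix."""
    def check(value):
        local, sep, domain = value.partition("@")
        return bool(local) and bool(sep) and not domain.startswith(".") and domain.endswith(suffixes)
    return check

# Hypothetical user-defined thresholds, keyed by column name.
thresholds = {
    "age": within_range(0, 120),
    "email": email_rule((".com",)),
}

def confirm_invalid(flagged_values, rules):
    """Keep only the flagged (column, value) pairs that fail their threshold rule."""
    return [(column, value) for column, value in flagged_values if not rules[column](value)]

flagged_values = [
    ("email", "jane.smith@example"),  # second row: missing ".com"
    ("age", -5),                      # third row: below the 0-120 range
    ("email", "bob.brown@.com"),      # third row: missing domain name
]
confirmed = confirm_invalid(flagged_values, thresholds)
```

Keeping the rules in a dictionary lets a user swap in a different range or suffix list per column without changing the comparison logic.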
[0046] Upon comparing the one or more invalid row values with the predefined threshold value, the subsequent data is transmitted for data pre-processing. The data pre-processing includes data cleaning, data normalization, and data transformation. The data cleaning is the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset to improve its quality and reliability for analysis. The data normalization is the process of organizing and structuring data to reduce redundancy and improve data integrity within the storage unit 240 or the dataset. The data normalization involves transforming data into a standard format, making the data consistent and easier to analyze. The data transformation is the process of converting the data from one format or structure into another to make it suitable for analysis, integration, or storage. The data transformation process is essential in data preparation, allowing organizations to clean, standardize, and manipulate the data to meet specific analytical or operational requirements.
[0047] Upon preprocessing the data, the normalizing unit 235 is configured to normalize the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison. Normalization is a data preprocessing technique used to adjust the scales of numeric data to ensure they are comparable and to improve the performance of machine learning algorithms. The normalization allows different features to contribute equally to the analysis. If one feature has a much larger range than others, it can dominate the distance calculations in algorithms like k-nearest neighbours and the parameter updates in gradient descent. In optimization algorithms (e.g., gradient descent), the normalized data can lead to faster convergence and improved model training.
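The scale adjustment described above may be illustrated with min-max scaling. The description does not name a specific scaling technique, so min-max scaling is an assumption chosen here as one common option; the function name `min_max_scale` and the sample age and income values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [25, 30, 45]               # small range (hypothetical values)
incomes = [20000, 50000, 120000]  # much larger range (hypothetical values)

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, both features lie in the same [0, 1] interval, so neither feature dominates a distance calculation merely because of its original units.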
[0048] The normalization is a process that adjusts the data to a standard format or scale, making the data easier to analyze and compare. The normalization also serves to clean the data of the one or more invalid values. The normalizing unit 235 analyzes the dataset and counts the number of invalid values in each row or column to learn a structure of the data. The structure of the data includes, but is not limited to, data fields, such as rows and columns, and data types, such as strings and integers. The one or more invalid rows are identified based on the analyzed dataset. If the count of the one or more invalid row values exceeds the predefined threshold, the normalizing unit 235 applies normalization techniques, such as removing the one or more invalid rows, filling missing values, and standardizing formats. The dataset is formatted into a standard format subsequent to the removal of the one or more invalid rows. The normalizing unit 235 ensures all data is in a consistent format to reduce errors. After normalization, the system 120 reassesses the dataset to ensure that the one or more invalid row values are now within acceptable limits. The system 120 maintains records of the changes made during the normalization process for future reference and auditing purposes. Furthermore, the system 120 continuously monitors the data for new redundancies or inconsistencies and revisits the normalization process as needed.
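The count-then-normalize flow described above may be sketched as follows. This is an illustrative sketch only: the names `is_valid`, `normalize`, and `max_invalid_rows`, the validity rule, and the choice of title-casing names and lower-casing emails as the "standard format" are assumptions made for this example. The three sample rows are those of the exemplary embodiment.

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s.]+(\.[^@\s.]+)+$")


def is_valid(row):
    """Assumed validity rule: age in 0-120 and a complete email address."""
    return 0 <= row["age"] <= 120 and bool(EMAIL_PATTERN.match(row["email"]))


def normalize(rows, max_invalid_rows):
    """Drop invalid rows and standardize formats when the invalid-row
    count exceeds the threshold; return (rows, audit_log)."""
    audit_log = []
    invalid = [r for r in rows if not is_valid(r)]
    if len(invalid) <= max_invalid_rows:
        audit_log.append("threshold not exceeded; data left unchanged")
        return list(rows), audit_log

    cleaned = []
    for row in rows:
        if not is_valid(row):
            audit_log.append(f"removed row: {row['name']}")
            continue
        # Standardize formats: title-case names, lower-case emails.
        cleaned.append({
            "name": row["name"].strip().title(),
            "age": row["age"],
            "email": row["email"].strip().lower(),
        })

    # Reassess: after normalization no invalid rows should remain.
    remaining = sum(1 for r in cleaned if not is_valid(r))
    audit_log.append(f"invalid rows remaining: {remaining}")
    return cleaned, audit_log


rows = [
    {"name": "john doe", "age": 30, "email": "john.doe@example.com"},
    {"name": "jane smith", "age": 25, "email": "jane.smith@example"},
    {"name": "bob brown", "age": -5, "email": "bob.brown@.com"},
]
normalized, log = normalize(rows, max_invalid_rows=1)
print(normalized)
print(log)
```

The audit log stands in for the record of changes kept for future reference, and the final count of remaining invalid rows corresponds to the reassessment step.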
[0049] Upon normalizing the data, the normalized data is stored in the storage unit 240 and used to train a model for normalizing the data. In an embodiment, the model includes, but is not limited to, an Artificial Intelligence/Machine Learning (AI/ML) model. The normalized data enhances the performance of the model. Further, the model utilizes a variety of ML techniques, such as supervised learning, unsupervised learning, and reinforcement learning. In one embodiment, the supervised learning is a type of machine learning in which the algorithm is trained on a labeled dataset, that is, each training example is paired with an output label. The supervised learning algorithm learns to map inputs to the correct output. The supervised learning uses various algorithms (such as linear regression, decision trees, or neural networks) to learn the mapping from inputs to outputs. It minimizes the error between predicted outputs and actual labels through techniques like gradient descent. In one embodiment, the unsupervised learning is a type of machine learning in which the algorithm is trained on data without any labels. The unsupervised learning algorithm tries to learn the underlying structure or distribution in the data in order to discover patterns or groupings. Techniques such as k-means clustering and hierarchical clustering are used to group similar data points. In one embodiment, the reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. The agent receives feedback in the form of rewards or penalties based on the actions it takes, and it learns a policy that maps states of the environment to the best actions.
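The supervised learning case mentioned above, in which the error between predicted outputs and actual labels is minimized by gradient descent, may be illustrated with a one-variable linear regression. The function name `fit_line`, the learning rate, the number of epochs, and the toy labeled data are assumptions for illustration and do not represent the model of the disclosure.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical labeled dataset generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Each input in `xs` is paired with a label in `ys`, and the loop repeatedly moves the parameters against the gradient of the error until the fitted line approaches the line that generated the labels.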
[0050] In an exemplary embodiment, in the first row, the name is john doe, the age is 30, and the email is john.doe@example.com. In the second row, the name is jane smith, the age is 25, and the email is jane.smith@example. In the third row, the name is bob brown, the age is -5, and the email is bob.brown@.com. Based on the exemplary embodiment, in the first row, the name, the age, and the email are valid; in the second row, the name and the age are valid and the email is invalid; and in the third row, the name is valid, and the age and the email are invalid. In this regard, the second and third rows from the plurality of rows contain the invalid row values. Therefore, the system 120 needs to perform the data normalization process to remove the one or more invalid row values. Upon applying the normalization process to the first, the second, and the third rows, the rows containing the one or more invalid row values (such as the second and third rows) are removed from the plurality of rows, or the missing values in the one or more rows are filled, and the data is standardized.
[0051] By normalizing the data, the system 120 is able to, advantageously, remove the one or more invalid row values, which helps in increasing the quality, quantity, and diversity of the training data. Hence, the disclosed system ensures that the normalized data is retained for machine learning training, which improves processing speed and reduces the memory space requirement. Furthermore, the system 120 enhances the data clarity by deleting the one or more invalid rows, which helps AI/ML models to generate more accurate and contextually appropriate responses.
[0052] FIG. 3 is a block diagram of an architecture 300 that can be implemented in the system of FIG.2, according to one or more embodiments of the present disclosure. The architecture 300 of the system 120 includes an integrated system 305, a load balancer 310, and the processor 205. The processor 205 includes an input unit 315, a pre-processing unit 320, a data source unit 325, and a Machine Learning (ML) training unit 330.
[0053] The architecture 300 of the system 120 is configured to interact with the integrated system 305 and the load balancer 310. The integrated system 305 is configured to access the data in the network 105 and is capable of interacting with the server 115 and the storage unit 240 to collect the data. The function of the integrated system 305 is to facilitate seamless interaction and coordination among various components, enabling them to work together effectively to achieve the normalization of data. The integrated system 305 reduces communication barriers, facilitating quicker information flow and collaboration among various components. In an embodiment, the integrated system 305 includes, but is not limited to, the one or more data sources from where the data can be retrieved. In an embodiment, the data can be retrieved from the file input, the source path, the input stream, the HTTP 2, the DFS, and the NAS.
[0054] The load balancer 310 is a device that distributes network traffic across multiple servers or resources to ensure optimal resource utilization, minimize response times, and prevent any single server from becoming overwhelmed. The load balancer 310 distributes the request traffic from the one or more data sources across the one or more processors 205. This distribution helps in managing and optimizing the workload, ensuring that no single processor is overwhelmed while improving overall system performance and reliability.
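The distribution of requests across processors may be sketched with a round-robin policy. The description does not state which balancing strategy the load balancer 310 uses, so round-robin is an assumption chosen for simplicity; the class name `RoundRobinBalancer` and the processor and request labels are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Assign incoming requests to workers in strict rotation."""

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = cycle(workers)

    def assign(self, request):
        """Pair a request with the next worker in rotation."""
        return next(self._workers), request


balancer = RoundRobinBalancer(["processor-1", "processor-2", "processor-3"])
requests = ["csv-file", "input-stream", "http2", "dfs", "nas"]
assignments = [balancer.assign(r) for r in requests]
for worker, request in assignments:
    print(worker, "<-", request)
```

With three workers and five requests, the fourth request wraps around to the first worker, so no single worker receives more than two requests.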
[0055] The input unit 315 is configured to provide an input of the predefined threshold value for comparing the one or more rows including the one or more invalid row values. Upon receiving the input from the input unit 315, the one or more invalid row values are compared with the predefined threshold value. In an embodiment, the predefined threshold value is defined by the user. The comparison process can help determine the severity of the invalid data and inform subsequent actions, such as data correction or deletion. In an exemplary embodiment, the user might set the predefined threshold value where any age below 0 is considered invalid (e.g., age should be between 0 and 120). The predefined threshold value for email validation might include rules such as containing "@" and ending with the valid domain.
[0056] Upon comparing the one or more invalid row values with the predefined threshold value, the pre-processing unit 320 is configured to pre-process the data. The pre-processing includes data cleaning, data normalization, and data transformation. The data cleaning is the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset to improve its quality and reliability for analysis. The data normalization is the process of organizing and structuring data to reduce redundancy and improve data integrity within the storage unit 240 or the dataset. The data normalization involves transforming data into a standard format, making the data consistent and easier to analyze. The data transformation is the process of converting the data from one format or structure into another to make it suitable for analysis, integration, or storage. The data transformation process is essential in data preparation, allowing organizations to clean, standardize, and manipulate the data to meet specific analytical or operational requirements.
[0057] Upon pre-processing the data, the data source unit 325 is configured to update the preprocessed data and store the updated data, which is further transmitted to the ML training unit 330 to train the model for normalizing the data. In an embodiment, the model includes, but is not limited to, an Artificial Intelligence/Machine Learning (AI/ML) model. The ML training unit 330 is configured to train one or more machine learning models on the updated data by applying various machine learning algorithms to the preprocessed data. Depending on the nature of the data and the problem at hand, the ML algorithms include, but are not limited to, linear regression, decision trees, neural networks, clustering algorithms, etc.
[0058] FIG. 4 is a signal flow diagram illustrating a flow for normalizing the data, according to one or more embodiments of the present disclosure.
[0059] At step 405, initially, the request is transmitted by the user via the UI 215 to extract the one or more rows in the data frame that contains the one or more invalid row values. The data frame is a two-dimensional, tabular data structure used to store and manipulate the data in rows and columns. The data frame allows for easy data analysis and processing, especially in machine learning tasks.
[0060] At step 410, the retrieving unit 220 is configured to retrieve the data from one or more data sources based on receiving the request. In an embodiment, the one or more data sources include at least, the file input, the source path, the input stream, the Hyper Text Transfer Protocol 2 (HTTP 2), the Distributed File System (DFS), and the Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads the data into the memory 210 for further processing.
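The CSV file input described above may be sketched with the Python standard library. The function name `load_csv` is hypothetical, and an in-memory `io.StringIO` object stands in for a file stored locally or on a server so that the example is self-contained; the representation of the loaded data as a list of dictionaries is likewise an assumption.

```python
import csv
import io


def load_csv(stream):
    """Read CSV text from a file-like object into a list of row dicts."""
    reader = csv.DictReader(stream)
    rows = []
    for record in reader:
        # Missing fields arrive as None; surplus fields sit under the key
        # None. Both cases are handled so that every value is a string.
        rows.append({key: (value or "").strip()
                     for key, value in record.items()
                     if key is not None})
    return rows


# In-memory stand-in for a CSV file (assumption made for this sketch).
sample = io.StringIO(
    "name,age,email\n"
    "john doe,30,john.doe@example.com\n"
    "jane smith,25,jane.smith@example\n"
)
rows = load_csv(sample)
print(rows)
```

Note that `csv.DictReader` returns every value as a string, so a numeric column such as age would still need to be converted before any range check is applied.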
[0061] At step 415, the identifying unit 225 is configured to identify the one or more rows including the one or more invalid row values from the plurality of rows based on the retrieved data. The retrieved data is stored in the data frame. The identifying unit 225 is configured to scan the dataset which contains the retrieved data. The identifying unit 225 is configured to check whether one or more row values of the plurality of rows are compliant with the acceptable format. The identifying unit 225 scans the dataset to identify the one or more rows whose row values are not compliant with the acceptable format. The identifying unit 225 is further configured to identify that the one or more row values are the one or more invalid row values in response to determining, based on the checking, that the one or more row values are not compliant with the acceptable format.
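The identification step may be sketched using the kinds of invalid row values recited in claim 4, namely a Not a Number (NaN) value, a zero value, a null value, and an empty string. The function names `is_invalid_value` and `invalid_rows`, the treatment of whitespace-only strings as empty, and the sample rows are assumptions introduced for illustration.

```python
import math


def is_invalid_value(value):
    """True for NaN, zero, null (None), or an empty string."""
    if value is None:
        return True
    if isinstance(value, str):
        # Assumption: whitespace-only strings count as empty.
        return value.strip() == ""
    if isinstance(value, bool):
        return False  # booleans are not treated as numeric zeros here
    if isinstance(value, (int, float)):
        return value == 0 or (isinstance(value, float) and math.isnan(value))
    return False


def invalid_rows(rows):
    """Return the indices of rows holding at least one invalid value."""
    return [i for i, row in enumerate(rows)
            if any(is_invalid_value(v) for v in row.values())]


# Hypothetical rows covering each kind of invalid value.
rows = [
    {"name": "john doe", "age": 30, "email": "john.doe@example.com"},
    {"name": "", "age": 25, "email": "jane.smith@example.com"},
    {"name": "bob brown", "age": float("nan"), "email": None},
    {"name": "amy lee", "age": 0, "email": "amy.lee@example.com"},
]
flagged = invalid_rows(rows)
print(flagged)
```

Format checks, such as the email and age rules of the exemplary embodiment, could be layered on top of this value-level check.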
[0062] At step 420, the comparing unit 230 is configured to compare the one or more invalid row values with the predefined threshold value based on identifying the one or more rows including the one or more invalid row values from the plurality of rows. In an embodiment, the predefined threshold value is defined by the user. The comparison process can help determine the severity of the invalid data and inform subsequent actions, such as data correction or deletion. In an exemplary embodiment, the user might set the predefined threshold value where any age below 0 is considered invalid (e.g., age should be between 0 and 120). The predefined threshold value for email validation might include rules such as containing "@" and ending with a valid domain.
[0063] At step 425, upon comparing the one or more invalid row values with the predefined threshold value, the pre-processing unit 320 pre-processes the data. The data pre-processing includes data cleaning, data normalization, and data transformation. The data cleaning is the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset to improve its quality and reliability for analysis. The data normalization is the process of organizing and structuring data to reduce redundancy and improve data integrity within the storage unit 240 or the dataset. The data normalization involves transforming data into a standard format, making the data consistent and easier to analyze. The data transformation is the process of converting the data from one format or structure into another to make it suitable for analysis, integration, or storage. The data transformation process is essential in data preparation, allowing organizations to clean, standardize, and manipulate the data to meet specific analytical or operational requirements.
[0064] At step 425, the normalizing unit 235 is configured to normalize the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison. The normalization is a data preprocessing technique used to adjust the scales of numeric data to ensure they are comparable and to improve the performance of machine learning algorithms. The normalization allows different features to contribute equally to the analysis. If one feature has a much larger range than others, it can dominate the distance calculations in algorithms like k-nearest neighbours and the parameter updates in gradient descent. In optimization algorithms (e.g., gradient descent), the normalized data can lead to faster convergence and improved model training.
[0065] The normalization is a process that adjusts the data to a standard format or scale, or removes invalid data, making the remaining data easier to analyze and compare. The normalization also serves to clean the data of the one or more invalid values. The normalizing unit 235 analyzes the dataset and counts the number of invalid values in each row or column to learn a structure of the data. The structure of the data includes, but is not limited to, data fields, such as rows and columns, and data types, such as strings and integers. The one or more invalid rows are identified based on the analyzed dataset. If the count of the one or more invalid row values exceeds the predefined threshold, the normalizing unit 235 applies normalization techniques, such as removing the one or more invalid rows, filling missing values, and standardizing formats. The dataset is formatted into a standard format subsequent to the removal of the one or more invalid rows. The normalizing unit 235 ensures all data is in a consistent format to reduce errors. After normalization, the normalizing unit 235 reassesses the dataset to ensure that the one or more invalid row values are now within acceptable limits. The system 120 maintains records of the changes made during the normalization process for future reference and auditing purposes. Furthermore, the system 120 continuously monitors the data for new redundancies or inconsistencies and revisits the normalization process as needed.
[0066] At step 430, the normalized data is stored in the storage unit 240 and used to train a model for normalizing the data. In an embodiment, the model includes, but not limited to, an Artificial Intelligence/Machine Learning (AI/ML) model. The normalized data enhances the performance of the model. Further, the model utilizes a variety of ML techniques, such as supervised learning, unsupervised learning, and reinforcement learning.
[0067] In one embodiment, the supervised learning is a type of machine learning in which the algorithm is trained on a labeled dataset, that is, each training example is paired with an output label. The supervised learning algorithm learns to map inputs to the correct output. The supervised learning uses various algorithms (such as linear regression, decision trees, or neural networks) to learn the mapping from inputs to outputs. It minimizes the error between predicted outputs and actual labels through techniques like gradient descent. In one embodiment, the unsupervised learning is a type of machine learning in which the algorithm is trained on data without any labels. The unsupervised learning algorithm tries to learn the underlying structure or distribution in the data in order to discover patterns or groupings. Techniques such as k-means clustering and hierarchical clustering are used to group similar data points. In one embodiment, the reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. The agent receives feedback in the form of rewards or penalties based on the actions it takes, and it learns a policy that maps states of the environment to the best actions.
[0068] FIG. 5 is a flow diagram illustrating a method 500 for normalizing the data, according to one or more embodiments of the present disclosure.
[0069] At step 505, the method 500 includes the step of retrieving the data from the one or more data sources by the retrieving unit 220. In an embodiment, the one or more data sources include at least, the file input, the source path, the input stream, the Hyper Text Transfer Protocol 2 (HTTP 2), the Distributed File System (DFS), and the Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads it into the memory 210 for further processing.
[0070] The source path typically refers to the directory or network location where the data files are stored. The retrieving unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The retrieving unit 220 navigates to a designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the retrieving unit 220 fetches the data in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the retrieving unit 220 retrieves the data from the web server using the HTTP 2. The retrieving unit 220 uses HTTP 2 to fetch the data from remote web servers or APIs.
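The source-path retrieval described above, in which files matching a criterion such as .jpg are collected from a designated directory, may be sketched as follows. The function name `find_files` and the file names are hypothetical, and a temporary directory is created only so that the example runs on its own.

```python
from pathlib import Path
import tempfile


def find_files(source_path, pattern="*.jpg"):
    """Return sorted file names under source_path that match pattern."""
    return sorted(p.name for p in Path(source_path).glob(pattern)
                  if p.is_file())


# Build a throwaway directory that stands in for the source path.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    matches = find_files(tmp)

print(matches)
```

Only the two .jpg files are returned; the text file in the same directory is skipped because it does not match the pattern.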
[0071] The DFS is a distributed file system used to store large datasets across multiple machines. The DFS is commonly used in big data environments to store and retrieve large amounts of data. The retrieving unit 220 connects to the DFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The retrieving unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the retrieving unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in the data frame.
[0072] At step 510, the method 500 includes the step of identifying the one or more rows including the one or more invalid row values from the plurality of rows by the identifying unit 225. The identifying unit 225 is configured to scan the dataset which contains the retrieved data. The identifying unit 225 is configured to check whether one or more row values of the plurality of rows are compliant with an acceptable format. In an exemplary embodiment, the dataset includes one or more rows and columns. The columns include, but are not limited to, name, age, and email. The identifying unit 225 scans the dataset to identify the one or more rows whose row values are not compliant with the acceptable format. The identifying unit 225 is further configured to identify that the one or more row values are the one or more invalid row values in response to determining, based on the checking, that the one or more row values are not compliant with the acceptable format.
[0073] At step 515, the method 500 includes the step of comparing the one or more invalid row values with a predefined threshold value by the comparing unit 230 based on identifying the one or more rows including the one or more invalid row values from the plurality of rows. In an embodiment, the predefined threshold value is defined by the user. The comparison process can help determine the severity of the invalid data and inform subsequent actions, such as data correction or deletion. In an exemplary embodiment, the user might set the predefined threshold value where any age below 0 is considered invalid (e.g., age should be between 0 and 120). The predefined threshold value for email validation might include rules such as containing "@" and ending with a valid domain.
[0074] Upon comparing the one or more invalid row values with the predefined threshold value, the subsequent data is transmitted for data pre-processing. The data pre-processing includes data cleaning, data normalization, and data transformation. The data cleaning is the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset to improve its quality and reliability for analysis. The data normalization is the process of organizing and structuring data to reduce redundancy and improve data integrity within the storage unit 240 or the dataset. The data normalization involves transforming data into a standard format, making the data consistent and easier to analyze. The data transformation is the process of converting the data from one format or structure into another to make it suitable for analysis, integration, or storage. The data transformation process is essential in data preparation, allowing organizations to clean, standardize, and manipulate the data to meet specific analytical or operational requirements.
[0075] At step 520, the method 500 includes the step of normalizing the data based on the preprocessed data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison by the normalizing unit 235. The normalization is a data preprocessing technique used to adjust the scales of numeric data to ensure they are comparable and to improve the performance of machine learning algorithms. The normalization allows different features to contribute equally to the analysis. If one feature has a much larger range than others, it can dominate the distance calculations in algorithms like k-nearest neighbours and the parameter updates in gradient descent. In optimization algorithms (e.g., gradient descent), the normalized data can lead to faster convergence and improved model training.
[0076] The normalization is a process that adjusts the data to a standard format or scale, or removes invalid data, making the remaining data easier to analyze and compare. The normalization also serves to clean the data of the one or more invalid values. The normalizing unit 235 analyzes the dataset and counts the number of invalid values in each row or column to learn a structure of the data. The structure of the data includes, but is not limited to, data fields, such as rows and columns, and data types, such as strings and integers. The one or more invalid rows are identified based on the analyzed dataset. If the count of the one or more invalid row values exceeds the predefined threshold, the normalizing unit 235 applies normalization techniques, such as removing the one or more invalid rows, filling missing values, and standardizing formats. The dataset is formatted into a standard format subsequent to the removal of the one or more invalid rows. The normalizing unit 235 ensures all data is in a consistent format to reduce errors. After normalization, the system 120 reassesses the dataset to ensure that the one or more invalid row values are now within acceptable limits. The system 120 maintains records of the changes made during the normalization process for future reference and auditing purposes. Furthermore, the system 120 continuously monitors the data for new redundancies or inconsistencies and revisits the normalization process as needed.
[0077] Upon normalizing the data, the normalized data is stored in the storage unit 240 and used to train a model. In an embodiment, the model includes, but is not limited to, an Artificial Intelligence/Machine Learning (AI/ML) model. The normalized data enhances the performance of the model. Further, the model utilizes a variety of ML techniques, such as supervised learning, unsupervised learning, and reinforcement learning. In one embodiment, the supervised learning is a type of machine learning in which the algorithm is trained on a labeled dataset, that is, each training example is paired with an output label. The supervised learning algorithm learns to map inputs to the correct output. The supervised learning uses various algorithms (such as linear regression, decision trees, or neural networks) to learn the mapping from inputs to outputs. It minimizes the error between predicted outputs and actual labels through techniques like gradient descent. In one embodiment, the unsupervised learning is a type of machine learning in which the algorithm is trained on data without any labels. The unsupervised learning algorithm tries to learn the underlying structure or distribution in the data in order to discover patterns or groupings. Techniques such as k-means clustering and hierarchical clustering are used to group similar data points. In one embodiment, the reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. The agent receives feedback in the form of rewards or penalties based on the actions it takes, and it learns a policy that maps states of the environment to the best actions.
[0078] In another aspect of the embodiment, a non-transitory computer-readable medium has stored thereon computer-readable instructions that, when executed by a processor 205, cause the processor 205 to perform the following operations. The processor 205 is configured to retrieve data from one or more data sources. The data represents a plurality of rows. The processor 205 is configured to identify one or more rows including one or more invalid row values from the plurality of rows. The processor 205 is configured to compare the one or more invalid row values with a predefined threshold value. The processor 205 is configured to normalize the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison.
[0079] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIGS.1-5) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0080] The present disclosure provides a technical advancement for normalizing the data, in which removal of the one or more invalid rows helps in increasing the quality, quantity, and diversity of the training data. Hence, the disclosed system and method ensure that the normalized data is retained for machine learning training. Further, the system enables the user to extract the one or more invalid rows in the data frame based on the predefined threshold. The system enables faster data analysis and processing, and removes the one or more invalid rows based on the predefined threshold. By doing so, the system enhances the data clarity by deleting the irrelevant data, which helps AI/ML models to generate more accurate and contextually appropriate responses.
[0081] The present invention offers multiple advantages over the prior art, and those listed above are a few examples to emphasize some of the advantageous features. The listed advantages are to be read in a non-limiting manner.

REFERENCE NUMERALS

[0082] Environment - 100
[0083] Network - 105
[0084] User equipment - 110
[0085] Server - 115
[0086] System - 120
[0087] Processor - 205
[0088] Memory - 210
[0089] User interface - 215
[0090] Retrieving unit - 220
[0091] Identifying unit - 225
[0092] Comparing unit - 230
[0093] Normalizing unit - 235
[0094] Storage unit - 240
[0095] Integrated system - 305
[0096] Load balancer - 310
[0097] Input unit - 315
[0098] Pre-processing unit - 320
[0099] Data source unit - 325
[00100] ML training unit - 330
CLAIMS
We Claim:
1. A method (500) of normalizing data, the method (500) comprising the steps of:
retrieving, by one or more processors (205), the data from one or more data sources, the data representing a plurality of rows;
identifying, by the one or more processors (205), from the plurality of rows, one or more rows including one or more invalid row values;
comparing, by the one or more processors (205), the one or more invalid row values with a predefined threshold value; and
in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison, normalizing, by the one or more processors (205), the data.
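
By way of a non-limiting illustration, the retrieve, identify, compare, and normalize steps of claim 1 may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: rows are modelled as Python dictionaries, the comparison is assumed to be between the total count of invalid values and the threshold, and normalizing is assumed to mean dropping every row that holds an invalid value. The names `is_invalid` and `normalize_if_needed` are hypothetical.

```python
import math


def is_invalid(value):
    """Assumed validity rule: null, NaN, zero, or empty string is invalid."""
    if value is None:
        return True
    if isinstance(value, str):
        # Assumption: whitespace-only strings are treated as empty.
        return value.strip() == ""
    if isinstance(value, bool):
        return False  # booleans are not treated as numeric zero here
    if isinstance(value, (int, float)):
        return value == 0 or (isinstance(value, float) and math.isnan(value))
    return False


def normalize_if_needed(rows, threshold):
    """Drop invalid rows only when the invalid-value count exceeds the threshold."""
    # Identify: count the invalid values held by each row.
    invalid_counts = [sum(is_invalid(v) for v in row.values()) for row in rows]
    # Compare: total number of invalid values against the predefined threshold.
    if sum(invalid_counts) <= threshold:
        return rows  # threshold not exceeded, data is left untouched
    # Normalize: keep only the rows that hold no invalid value.
    return [row for row, n in zip(rows, invalid_counts) if n == 0]


rows = [
    {"a": 1, "b": "x"},
    {"a": 0, "b": ""},            # two invalid values
    {"a": float("nan"), "b": "y"},  # one invalid value
]
cleaned = normalize_if_needed(rows, threshold=2)
```

With three invalid values and a threshold of 2, only the first row is retained; with a threshold of 3 the input is returned unchanged. Other readings of the comparison (for example, a per-row count or a ratio of invalid rows) would change only the condition in the `if` statement.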

2. The method (500) as claimed in claim 1, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).

3. The method (500) as claimed in claim 1, wherein the retrieved data is stored in a data frame.

4. The method (500) as claimed in claim 1, wherein the step of, identifying, one or more rows including one or more invalid row values, includes the steps of:
checking, by the one or more processors (205), whether one or more row values of the plurality of rows are compliant with an acceptable format; and
in response to determining that the one or more row values are not compliant with the acceptable format based on the checking, identifying, by the one or more processors (205), that the one or more row values are one of, the one or more invalid row values, wherein the one or more invalid row values includes, at least one of, a Not a Number (NaN) value, a zero value, a null value, and an empty string.
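
As a hedged illustration of the format check recited in claim 4, the following sketch labels each row value with the kind of invalidity it exhibits. The function names `classify_value` and `find_invalid_values`, the label strings, and the treatment of whitespace-only strings as empty are assumptions made for illustration only.

```python
import math


def classify_value(value):
    """Return a label for an invalid value, or None if the value is acceptable."""
    if value is None:
        return "null"
    if isinstance(value, str):
        # Assumption: whitespace-only strings count as empty strings.
        return "empty_string" if value.strip() == "" else None
    if isinstance(value, bool):
        return None  # booleans are not treated as numeric zero here
    if isinstance(value, float) and math.isnan(value):
        return "nan"
    if isinstance(value, (int, float)) and value == 0:
        return "zero"
    return None


def find_invalid_values(row):
    """Map each non-compliant field of a row to the kind of invalid value found."""
    invalid = {}
    for field, value in row.items():
        kind = classify_value(value)
        if kind is not None:
            invalid[field] = kind
    return invalid


report = find_invalid_values(
    {"id": 7, "name": "", "score": float("nan"), "count": 0, "note": None}
)
```

Here `report` maps `name`, `score`, `count`, and `note` to their respective labels, while the compliant `id` field is omitted.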

5. The method (500) as claimed in claim 1, wherein the step of normalizing, the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison, includes the steps of:
analyzing, by the one or more processors (205), a dataset to learn a structure of the data, including, at least one of, data fields and data types;
identifying, by the one or more processors (205), the one or more invalid rows based on the analyzed dataset;
removing, by the one or more processors (205), the one or more invalid rows based on the identified one or more invalid rows; and
formatting, by the one or more processors (205), the dataset into a standard format subsequent to the removed one or more invalid rows.
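
The analyze, identify, remove, and format steps of claim 5 may, purely as an illustrative sketch, look like the following. The schema representation, the helper names (`_is_invalid`, `infer_schema`, `normalize`), and the choice of "standard format" (a fixed, sorted field order with surrounding whitespace stripped from strings) are assumptions and are not taken from the specification.

```python
import math


def _is_invalid(value):
    """Assumed rule: null, NaN, zero, or (whitespace-only) empty string."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return value == 0 or (isinstance(value, float) and math.isnan(value))
    return False


def infer_schema(rows):
    """Analyze: map each field to the type name of its first valid value."""
    schema = {}
    for row in rows:
        for field, value in row.items():
            if _is_invalid(value):
                schema.setdefault(field, "unknown")
            elif schema.get(field, "unknown") == "unknown":
                schema[field] = type(value).__name__
    return schema


def normalize(rows):
    """Return (schema, formatted_rows) after removing the invalid rows."""
    schema = infer_schema(rows)
    fields = sorted(schema)  # assumed standard format: fixed, sorted field order
    # Identify: a row is invalid if any known field is missing or invalid.
    invalid_idx = {
        i for i, row in enumerate(rows)
        if any(_is_invalid(row.get(f)) for f in fields)
    }
    # Remove: keep only the rows that were not flagged.
    kept = [row for i, row in enumerate(rows) if i not in invalid_idx]
    # Format: emit fields in the fixed order and strip whitespace from strings.
    formatted = [
        {f: row[f].strip() if isinstance(row[f], str) else row[f] for f in fields}
        for row in kept
    ]
    return schema, formatted


schema, formatted = normalize([
    {"b": " x ", "a": 1},
    {"a": 0, "b": "y"},     # removed: zero value
    {"a": 2, "b": None},    # removed: null value
    {"a": 3, "b": "z"},
])
```

In this sketch the two middle rows are removed and the remaining rows are emitted with fields in the order `a`, `b`.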

6. The method (500) as claimed in claim 1, wherein the normalized data is stored in a storage unit (240) and used to train a model for normalizing the data.

7. A system (120) of normalizing data, the system (120) comprising:
a retrieving unit (220) configured to retrieve, the data from one or more data sources, the data representing a plurality of rows;
an identifying unit (225) configured to identify, from the plurality of rows, one or more rows including one or more invalid row values;
a comparing unit (230) configured to compare, the one or more invalid row values with a predefined threshold value; and
a normalizing unit (235) configured to normalize, the data in response to determining that the one or more invalid row values exceed the predefined threshold value based on the comparison.

8. The system (120) as claimed in claim 6, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).

9. The system (120) as claimed in claim 6, wherein the retrieved data is stored in a data frame.

10. The system (120) as claimed in claim 6, wherein the identifying unit (225) is further configured to:
check, whether one or more row values of the plurality of rows are compliant with an acceptable format; and
identify, that the one or more row values are one of, the one or more invalid row values in response to determining that the one or more row values are not compliant with the acceptable format based on the checking, wherein the one or more invalid row values includes, at least one of, a Not a Number (NaN) value, a zero value, a null value, and an empty string.

11. The system (120) as claimed in claim 6, wherein the normalizing unit (235) is further configured to:
analyze, a dataset to learn a structure of the data, including, at least one of, data fields and data types;
identify, the one or more invalid rows based on the analyzed dataset;
remove, the one or more invalid rows based on the identified one or more invalid rows; and
format, the dataset into a standard format subsequent to the removed one or more invalid rows.

12. The system (120) as claimed in claim 6, wherein the normalized data is stored in a storage unit (240) and used to train a model for normalizing the data.

Documents

Application Documents

#	Name	Date
1	202321076737-STATEMENT OF UNDERTAKING (FORM 3) [09-11-2023(online)].pdf	2023-11-09
2	202321076737-PROVISIONAL SPECIFICATION [09-11-2023(online)].pdf	2023-11-09
3	202321076737-FORM 1 [09-11-2023(online)].pdf	2023-11-09
4	202321076737-FIGURE OF ABSTRACT [09-11-2023(online)].pdf	2023-11-09
5	202321076737-DRAWINGS [09-11-2023(online)].pdf	2023-11-09
6	202321076737-DECLARATION OF INVENTORSHIP (FORM 5) [09-11-2023(online)].pdf	2023-11-09
7	202321076737-FORM-26 [27-11-2023(online)].pdf	2023-11-27
8	202321076737-Proof of Right [12-02-2024(online)].pdf	2024-02-12
9	202321076737-DRAWING [08-11-2024(online)].pdf	2024-11-08
10	202321076737-COMPLETE SPECIFICATION [08-11-2024(online)].pdf	2024-11-08
11	202321076737-FORM-5 [26-11-2024(online)].pdf	2024-11-26
12	Abstract-1.jpg	2024-12-27
13	202321076737-Power of Attorney [24-01-2025(online)].pdf	2025-01-24
14	202321076737-Form 1 (Submitted on date of filing) [24-01-2025(online)].pdf	2025-01-24
15	202321076737-Covering Letter [24-01-2025(online)].pdf	2025-01-24
16	202321076737-CERTIFIED COPIES TRANSMISSION TO IB [24-01-2025(online)].pdf	2025-01-24
17	202321076737-FORM 3 [31-01-2025(online)].pdf	2025-01-31