Abstract: SYSTEM AND METHOD FOR NORMALIZING DATA. A system (120) and a method (500) for normalizing data are disclosed. The system (120) includes a retrieving unit (220) configured to retrieve data from one or more data sources. The system (120) includes an identification unit (225) configured to identify one or more rows containing one or more invalid column values. The system (120) includes a pre-processing unit (235) configured to pre-process the data to remove the identified one or more rows containing the invalid column values. The system (120) includes a generating unit (240) configured to generate the normalized data based on pre-processing of the data. Ref. Fig. 2
DESC:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD FOR NORMALIZING DATA
2. APPLICANT(S)
NAME NATIONALITY ADDRESS
JIO PLATFORMS LIMITED INDIAN OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3.PREAMBLE TO THE DESCRIPTION
THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.
FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication networks, and more particularly to a method and a system for normalizing data.
BACKGROUND OF THE INVENTION
[0002] Real-world data often has many missing values. Missing values usually occur due to data corruption, failure to record data, system failure, and other such occurrences that lead to loss of data. The handling of missing data is very important during the preprocessing of a dataset, as many machine learning algorithms do not support missing values.
[0003] Hence, there is a need for a system and a method for normalizing data required for machine learning training. Such a method and system should remove any inaccuracies or inconsistencies from a data set.
SUMMARY OF THE INVENTION
[0004] One or more embodiments of the present disclosure provide a method and a system for normalizing data.
[0005] In one aspect of the present invention, the method for normalizing the data is disclosed. The method includes the step of retrieving, by one or more processors, data from one or more data sources. The method includes the step of identifying, by the one or more processors, one or more rows containing one or more invalid column values. The method includes the step of pre-processing, by the one or more processors, the data to remove the identified one or more rows containing the invalid column values. The method includes the step of generating, by the one or more processors, the normalized data based on pre-processing of the data.
[0006] In one embodiment, the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
[0007] In another embodiment, the retrieved data is stored in a data frame.
[0008] In yet another embodiment, the step of identifying, by the one or more processors, one or more rows containing one or more invalid column values, includes the step of calculating a distance between each missing value and all other non-missing values of the one or more rows in a dataset. The step of identifying, by the one or more processors, one or more rows containing one or more invalid column values, includes the step of identifying a nearest non-missing value on the one or more rows based on the calculation. The step of identifying, by the one or more processors, one or more rows containing one or more invalid column values, includes the step of replacing the missing value of the one or more rows with the identified nearest non-missing value.
[0009] In yet another embodiment, the one or more invalid column values is at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string.
[0010] In yet another embodiment, for identifying the one or more rows, the method comprises parsing, by the one or more processors, a script associated with the retrieved data.
[0011] In yet another embodiment, the generated normalized data is stored in a storage unit and used for Machine Learning (ML) training.
[0012] In another aspect of the present invention, the system for normalizing the data is disclosed. The system includes a retrieving unit configured to retrieve data from one or more data sources. The system includes an identification unit configured to identify one or more rows containing one or more invalid column values. The system includes a pre-processing unit configured to pre-process the data to remove the identified one or more rows containing the invalid column values. The system includes a generating unit configured to generate the normalized data based on pre-processing of the data.
[0013] In another aspect of the present invention, a non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, configure the processor, is disclosed. The processor is configured to retrieve data from one or more data sources. The processor is configured to identify one or more rows containing one or more invalid column values. The processor is configured to pre-process the data to remove the identified one or more rows containing the invalid column values. The processor is configured to generate the normalized data based on pre-processing of the data.
[0014] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0016] FIG. 1 is an exemplary block diagram of an environment for normalizing data, according to one or more embodiments of the present disclosure;
[0017] FIG. 2 is an exemplary block diagram of a system for normalizing the data, according to the one or more embodiments of the present disclosure;
[0018] FIG. 3 is a block diagram of an architecture that can be implemented in the system of FIG.2, according to the one or more embodiments of the present disclosure;
[0019] FIG. 4 is a signal flow diagram illustrating a process for normalizing the data, according to the one or more embodiments of the present disclosure; and
[0020] FIG. 5 is a flow diagram illustrating the method for normalizing the data, according to the one or more embodiments of the present disclosure.
[0021] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0022] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0023] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure including the definitions listed here below are not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0024] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0025] Referring to FIG. 1, FIG. 1 illustrates an exemplary block diagram of an environment 100 for normalizing data, according to one or more embodiments of the present invention. The environment 100 includes a network 105, a User Equipment (UE) 110, a server 115, and a system 120. The UE 110 aids a user to interact with the system 120 for normalizing the data. In an embodiment, the user is at least one of a network operator and a service provider. Normalizing data refers to cleaning the data by identifying, pre-processing, and removing any inaccuracies or inconsistencies from a data set. In an exemplary embodiment, an organization analyzes customer feedback data collected from one or more sources, such as surveys, online reviews, and social media. An original dataset includes a customer ID, feedback, and a rating. A first customer entry includes the feedback "great service" and a rating of 5. A second customer entry includes the feedback "poor service" and a rating of N/A. Missing values, such as N/A, exist in the rating column. The normalization step recognizes the different formats for ratings in the feedback. Upon recognizing the formats, all ratings are converted to a consistent numerical format. Further, the normalization of the data involves replacing the N/A with an appropriate value (e.g., the average rating) or removing the entry if necessary.
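By way of a non-limiting illustration, the customer-feedback example above may be sketched as follows. The record names and values are hypothetical, and replacing N/A with the average rating is only one of the strategies the embodiment describes:

```python
# Illustrative sketch of the customer-feedback example: ratings arrive in
# mixed formats ("5", "N/A"); normalization converts them to a consistent
# numerical format, replacing "N/A" with the average of the valid ratings.
raw_records = [
    {"customer_id": 1, "feedback": "great service", "rating": "5"},
    {"customer_id": 2, "feedback": "poor service", "rating": "N/A"},
    {"customer_id": 3, "feedback": "okay service", "rating": "3"},
]

def normalize_ratings(records):
    # Collect every rating that is a valid number.
    valid = [float(r["rating"]) for r in records if r["rating"] != "N/A"]
    average = sum(valid) / len(valid)
    normalized = []
    for r in records:
        rating = average if r["rating"] == "N/A" else float(r["rating"])
        normalized.append({**r, "rating": rating})
    return normalized

cleaned = normalize_ratings(raw_records)
# The missing rating is replaced by the average of 5 and 3, i.e. 4.0.
```

Removing the incomplete entry entirely, rather than imputing it, is the alternative handled by the pre-processing unit 235 described later.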
[0026] For the purpose of description and explanation, the description will be explained with respect to the UE 110, or to be more specific, with respect to a first UE 110a, a second UE 110b, and a third UE 110c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the UE 110 from the first UE 110a, the second UE 110b, and the third UE 110c is configured to connect to the server 115 via the network 105. In an embodiment, each of the first UE 110a, the second UE 110b, and the third UE 110c is one of, but not limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more of the above devices, such as smartphones, virtual reality (VR) devices, augmented reality (AR) devices, laptops, general-purpose computers, desktops, personal digital assistants, tablet computers, mainframe computers, or any other computing device.
[0027] The network 105 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 105 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0028] The server 115 may include, by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, or some combination thereof. In an embodiment, the entity may include, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise, a defense facility, or any other facility that provides content.
[0029] The environment 100 further includes the system 120 communicably coupled to the server 115 and each of the first UE 110a, the second UE 110b, and the third UE 110c via the network 105. The system 120 is configured for normalizing the data. The system 120 is adapted to be embedded within the server 115 or is embedded as the individual entity, as per multiple embodiments of the present invention.
[0030] Operational and construction features of the system 120 will be explained in detail with respect to the following figures.
[0031] FIG. 2 is an exemplary block diagram of a system 120 for normalizing the data, according to one or more embodiments of the present disclosure.
[0032] The system 120 includes a processor 205, a memory 210, a user interface 215, and a database 250. For the purpose of description and explanation, the description will be explained with respect to one or more processors 205, or to be more specific, with respect to the processor 205, and should nowhere be construed as limiting the scope of the present disclosure. The one or more processors 205, hereinafter referred to as the processor 205, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0033] As per the illustrated embodiment, the processor 205 is configured to fetch and execute computer-readable instructions stored in the memory 210. The memory 210 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 210 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0034] The User Interface (UI) 215 includes a variety of interfaces, for example, interfaces for a Graphical User Interface (GUI), a web user interface, a Command Line Interface (CLI), and the like. The user interface 215 facilitates communication of the system 120. In one embodiment, the user interface 215 provides a communication pathway for one or more components of the system 120. Examples of the one or more components include, but are not limited to, the UE 110 and the storage unit 250. The terms "storage unit" and "database" are used interchangeably hereinafter, without limiting the scope of the disclosure.
[0035] The database 250 is one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a No-Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of database 250 types are non-limiting and may not be mutually exclusive; e.g., a database can be both commercial and cloud-based, or both distributed and open-source, etc.
[0036] Further, the processor 205, in an embodiment, may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 205. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 205 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for processor 205 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 210 may store instructions that, when executed by the processing resource, implement the processor 205. In such examples, the system 120 may comprise the memory 210 storing the instructions and the processing resource to execute the instructions, or the memory 210 may be separate but accessible to the system 120 and the processing resource. In other examples, the processor 205 may be implemented by electronic circuitry.
[0037] In order for the system 120 to normalize the data, the processor 205 includes a retrieving unit 220, an identification unit 225, a parsing unit 230, a pre-processing unit 235, and a generating unit 240 communicably coupled to each other. In an embodiment, operations and functionalities of the retrieving unit 220, the identification unit 225, the parsing unit 230, the pre-processing unit 235, and the generating unit 240 can be used in combination or interchangeably.
[0038] Initially, a request is transmitted by the user via the UI 215 for cleaning the data from a data frame that contains one or more invalid column values. The data frame is a two-dimensional, tabular data structure used to store and manipulate the data in rows and columns. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
[0039] Upon receiving the request, the retrieving unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS). The file input refers to reading data from files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), eXtensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in a CSV file, and the retrieving unit 220 fetches the data from the file and loads it into the memory 210 for further processing.
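As a minimal, non-limiting sketch of the file-input case, the retrieval step may be illustrated as follows. The CSV content is a hypothetical stand-in simulated in memory; in practice the retrieving unit 220 would open a file path, stream, or URL:

```python
import csv
import io

# Hypothetical CSV content; row 2 has an empty rating column, an invalid
# value to be identified and handled in the later steps.
csv_text = (
    "customer_id,feedback,rating\n"
    "1,great service,5\n"
    "2,poor service,\n"
    "3,okay service,3\n"
)

def retrieve_csv(source):
    # Load the CSV rows into a simple row-oriented data frame,
    # represented here as a list of dictionaries (one per row).
    return list(csv.DictReader(io.StringIO(source)))

data_frame = retrieve_csv(csv_text)  # three rows, keyed by column name
```

The list-of-dictionaries representation is only an illustration of the "data frame" the specification describes; a tabular library's data frame type could equally be used.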
[0040] The source path typically refers to the directory or network location where the data files are stored. The retrieving unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The retrieving unit 220 navigates to a designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the retrieving unit 220 fetches it in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the retrieving unit 220 retrieves the data from the web server using the HTTP 2. The retrieving unit 220 uses HTTP 2 to fetch the data from remote web servers or APIs.
[0041] The DFS is a distributed file system used to store large datasets across multiple machines. The DFS is commonly used in big data environments to store and retrieve large amounts of data. The retrieving unit 220 connects to the DFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The retrieving unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the retrieving unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in the data frame for further processing.
[0042] Upon retrieving the data and storing it in the data frame, the identification unit 225 is configured to identify the one or more rows containing one or more invalid column values. The identification unit 225 is configured to scan a dataset which contains the retrieved data. The identification unit 225 is configured to check the retrieved data in each row from the one or more rows to determine if any columns contain the one or more invalid values. In an embodiment, the one or more invalid column values is at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string. The NaN value represents undefined or unrepresentable values. The identification unit 225 checks if any column contains the NaN value by scanning the data row by row. If the identification unit 225 determines that any column contains a NaN value, the identification unit 225 flags the corresponding row. The zero value indicates invalid or missing data in certain columns. The identification unit 225 scans the dataset and identifies the columns where the value is 0.
[0043] As per the above embodiment, the null value represents the absence of any value in the dataset. The null values are often explicitly defined as NULL or empty fields in the databases or datasets. The identification unit 225 checks each column for null entries. The empty string is often considered invalid, especially in text-based columns where data is expected. The identification unit 225 checks for empty string values in columns where text or other values are expected. If the dataset contains more than one type of invalid column value (e.g., NaN, zeros, nulls, and empty strings), the identification unit 225 scans for all of them simultaneously. The identification unit 225 performs the identifying task for the one or more invalid column values by scanning through the rows and checking for each type of invalid value (the NaN, the zero, the null, the empty string).
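The identification step described in the two paragraphs above may be sketched, in a non-limiting way, as a single row-by-row scan that checks for all four invalid value types simultaneously. The sample rows are hypothetical:

```python
import math

# Sketch of the identification unit 225: a value is invalid if it is
# NaN, zero, None (null), or an empty string.
def is_invalid(value):
    if value is None or value == "":
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if value == 0:
        return True
    return False

def identify_invalid_rows(rows):
    # Flag the index of every row containing at least one invalid column value.
    return [i for i, row in enumerate(rows)
            if any(is_invalid(v) for v in row.values())]

rows = [
    {"a": 1, "b": 2.0},            # valid
    {"a": float("nan"), "b": 3.0}, # NaN value
    {"a": 4, "b": ""},             # empty string
    {"a": 0, "b": 5.0},            # zero value
]
flagged = identify_invalid_rows(rows)  # → [1, 2, 3]
```

Note that NaN never compares equal to anything, so it must be detected explicitly with `math.isnan` rather than by equality.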
[0044] The system 120 further identifies patterns in the existing data, such as trends, relationships, and correlations between different variables. Using the identified patterns, the system 120 fills in the missing values with estimates that are likely to be accurate. After filling the missing values, the system assesses the quality of each repair based on the strength of the patterns used. It ranks the repairs to indicate which filled values are more reliable. For each filled value, the system provides a confidence score that reflects the likelihood that the replacement is accurate. Higher scores indicate a stronger basis for the estimate, based on the data patterns and correlations. The user can review the filled values along with their confidence scores, allowing them to make informed decisions about the reliability of the data before proceeding with analysis or reporting. The system 120 fills the data based on patterns or correlations, ranking the quality of repairs and providing confidence scores for user review.
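A heavily simplified, hypothetical sketch of the pattern-based repair with confidence scores described above follows. Here the "pattern" is reduced to the column mean, and the confidence score is the fraction of rows whose value was actually observed; the embodiment itself may use richer correlations:

```python
import math

# Hypothetical sketch: fill each missing value with its column mean and
# report a confidence score reflecting how much valid data supported it.
def repair_with_confidence(rows, column):
    observed = [r[column] for r in rows
                if not (isinstance(r[column], float) and math.isnan(r[column]))]
    estimate = sum(observed) / len(observed)
    # Confidence: fraction of rows whose value was actually observed.
    confidence = len(observed) / len(rows)
    repairs = []
    for i, r in enumerate(rows):
        if isinstance(r[column], float) and math.isnan(r[column]):
            r[column] = estimate
            repairs.append({"row": i, "filled": estimate,
                            "confidence": confidence})
    return repairs

rows = [{"rating": 4.0}, {"rating": float("nan")}, {"rating": 2.0}]
repairs = repair_with_confidence(rows, "rating")
# One repair at row 1, filled with (4 + 2) / 2 = 3.0, confidence 2/3.
```

The returned repair list is what the user would review, as described, before accepting the filled values.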
[0045] The identification unit 225 is further configured to calculate a distance between each missing value and all other non-missing values of the one or more rows in a dataset. In an exemplary embodiment, a first row includes the values 2, 5, and NaN in successive columns, the missing value occurring in the third column of the first row. The identification unit 225 calculates the distance between the missing value and each non-missing value in the one or more rows. The identification unit 225 is configured to identify the nearest non-missing value in the one or more rows based on the calculation. In the exemplary embodiment, the nearest non-missing value in the row is 5, determined by comparing the calculated distances. The identification unit 225 is further configured to replace the missing value of the one or more rows with the identified nearest non-missing value. In the exemplary embodiment, the identified nearest non-missing value, 5, replaces the missing value in the row.
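The nearest-non-missing replacement above may be sketched as follows. This is a non-limiting illustration that assumes "distance" means the column-position distance within a row, which reproduces the specification's example of the row [2, 5, NaN] being repaired with 5 (the closest non-missing column):

```python
import math

# Sketch of nearest-non-missing imputation within a single row, where the
# distance between two values is taken as the gap between their column indices.
def fill_nearest(row):
    filled = list(row)
    for i, value in enumerate(row):
        if isinstance(value, float) and math.isnan(value):
            # Distance from the missing column to every non-missing column.
            candidates = [
                (abs(i - j), row[j])
                for j in range(len(row))
                if not (isinstance(row[j], float) and math.isnan(row[j]))
            ]
            # Replace with the value of the nearest non-missing column.
            filled[i] = min(candidates)[1]
    return filled

repaired = fill_nearest([2, 5, float("nan")])  # → [2, 5, 5]
```

Other distance definitions (e.g., distances in feature space across rows, as in k-nearest-neighbour imputation) would fit the same interface.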
[0046] Upon identification of the one or more rows containing the one or more invalid column values, the parsing unit 230 is configured to parse a script associated with the retrieved data to identify the one or more rows containing the one or more invalid column values. The script processes the retrieved data to identify the one or more rows that contain the one or more invalid column values. The script is specifically designed to examine both the structure and content of the dataset, ensuring data integrity and quality. By flagging the one or more invalid column values, the script aids in maintaining accurate and reliable data for further analysis. Upon flagging the one or more invalid column values, the parsing unit 230 specifically identifies, at a code level, the one or more rows in the data that contain the one or more invalid column values. By parsing the data through the script, the system 120 transmits the problematic one or more rows to the pre-processing unit 235 for further processing.
[0047] Upon parsing the data through the script, the pre-processing unit 235 is configured to pre-process the data to remove the identified one or more rows containing the one or more invalid column values. The pre-processing unit 235 processes the data according to the list of one or more rows provided by the parsing unit 230. The pre-processing unit 235 scans the dataset to identify the one or more rows that contain the one or more invalid column values. Upon detection, the one or more rows containing the one or more invalid column values are removed from the data frame or the dataset, ensuring that only the valid entries remain for analysis. The removal of the one or more invalid column values is performed by using functions that specifically target and eliminate the one or more rows with invalid data. Upon removing the one or more invalid column values, the pre-processing unit 235 ensures that the remaining data is clean and well-structured. Whether the data is being prepared for analysis, modeling, or storage, this ensures that only valid data is retained, making the downstream processes more efficient and reliable. Further, the system 120 learns user preferences from past pre-processing steps and automatically adapts future transformations; it observes previous cleanings (e.g., how missing data was handled) and proposes transformations for new datasets, ensuring consistency and reducing manual effort in pre-processing.
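A minimal, non-limiting sketch of the row-removal step follows; the dataset values are hypothetical, and the invalid-value test mirrors the four types named by the embodiment (NaN, zero, null, empty string):

```python
import math

# Sketch of the pre-processing unit 235: drop every row that contains an
# invalid column value, keeping only valid entries for downstream use.
def has_invalid(row):
    for v in row.values():
        if v is None or v == "" or v == 0:
            return True
        if isinstance(v, float) and math.isnan(v):
            return True
    return False

def remove_invalid_rows(rows):
    # Keep only the rows in which every column value is valid.
    return [row for row in rows if not has_invalid(row)]

dataset = [
    {"customer_id": 1, "rating": 5.0},
    {"customer_id": 2, "rating": float("nan")},  # will be removed
    {"customer_id": 3, "rating": 3.0},
]
clean = remove_invalid_rows(dataset)  # keeps customers 1 and 3 only
```

With a tabular library, the same removal is typically a single drop-missing-rows call; the explicit loop above only makes the per-row decision visible.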
[0048] As per the above embodiment, the system 120 automatically creates a new version of the dataset, enabling the user to keep track of all changes made throughout the data preparation process. A detailed change log recording what alterations were made and when they occurred is stored with the dataset. The log records serve as a comprehensive history of transformations. The user can easily revert to previous versions of the dataset, if necessary, which is particularly useful for correcting mistakes or analyzing data prior to specific changes. By providing visibility into how the data has been transformed, the system 120 promotes accountability and trust in the data processing pipeline. The ability to track modifications creates an audit trail that can be critical for compliance and validation purposes. By doing so, the system 120 ensures that the user can confidently manage and analyze the data while maintaining integrity and accountability. The system 120 generates reports for all pre-processing decisions, detailing why certain rows are removed, how the missing values are filled, and the potential impact of these changes on downstream processes.
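The versioning and change-log behaviour described above may be sketched, purely as a hypothetical illustration, with a small wrapper class that snapshots the dataset before each transformation and records what changed and when:

```python
import copy
import datetime

# Hypothetical sketch of dataset versioning: every pre-processing step
# stores a snapshot and appends a change-log entry, so earlier versions
# can be restored and an audit trail is preserved.
class VersionedDataset:
    def __init__(self, rows):
        self.versions = [copy.deepcopy(rows)]  # version 0 = original data
        self.change_log = []

    def apply(self, description, transform):
        # Transform the latest version and record the change.
        new_rows = transform(copy.deepcopy(self.versions[-1]))
        self.versions.append(new_rows)
        self.change_log.append({
            "version": len(self.versions) - 1,
            "description": description,
            "timestamp": datetime.datetime.now().isoformat(),
        })
        return new_rows

    def revert(self, version):
        # Return a copy of any earlier version of the dataset.
        return copy.deepcopy(self.versions[version])

ds = VersionedDataset([{"rating": 5}, {"rating": None}])
ds.apply("drop rows with null rating",
         lambda rows: [r for r in rows if r["rating"] is not None])
# ds.versions[0] still holds both original rows; ds.change_log records the step.
```

The change log doubles as the raw material for the pre-processing reports the embodiment describes.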
[0049] Upon pre-processing the data, the generating unit 240 is configured to generate the normalized data based on the pre-processing of the data. The generating unit 240 is responsible for transforming the pre-processed data into a normalized format. Normalization is a process that adjusts the data to a standard format or scale, making the data easier to analyze and compare. As part of the normalization, the data is cleaned by filtering out the one or more invalid column values so that only the valid data remains. Upon cleaning the data, the valid data is stored and transmitted to the ML training unit 320 (as shown in FIG. 3) for training. The normalized data enhances the performance of the ML training unit 320. Further, the AI/ML model utilizes a variety of ML techniques, such as supervised learning, unsupervised learning, and reinforcement learning.
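As a non-limiting illustration of adjusting cleaned data to a standard scale, min-max scaling maps each value of a numeric column onto the range [0, 1]. This is only one possible scaling; the specification does not mandate a particular formula:

```python
# Sketch of the generating unit 240's scaling step: min-max normalization
# of a numeric column, mapping its values onto the standard [0, 1] range.
def min_max_normalize(values):
    low, high = min(values), max(values)
    # Each value's position between the column minimum and maximum.
    return [(v - low) / (high - low) for v in values]

scaled = min_max_normalize([2.0, 5.0, 8.0])  # → [0.0, 0.5, 1.0]
```

Scaling features to a common range is a common reason normalized data improves ML training behaviour, since no single feature dominates purely by magnitude.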
[0050] In one embodiment, supervised learning is a type of machine learning in which an algorithm is trained on a labeled dataset, wherein each training example is paired with an output label. The supervised learning algorithm learns to map inputs to the correct output, using various algorithms (such as linear regression, decision trees, or neural networks) and minimizing the error between predicted outputs and actual labels through techniques like gradient descent. In one embodiment, unsupervised learning is a type of machine learning in which an algorithm is trained on data without any labels. The unsupervised learning algorithm tries to learn the underlying structure or distribution in the data in order to discover patterns or groupings, using techniques such as k-means clustering and hierarchical clustering to group similar data points. In one embodiment, reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives feedback in the form of rewards or penalties based on the actions it takes, and it learns a policy that maps states of the environment to the best actions.
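The supervised-learning case above (linear regression fitted by gradient descent) may be sketched minimally as follows; the training data is hypothetical and chosen to follow y = 2x so that convergence is easy to verify:

```python
# Minimal sketch of supervised learning: a line y = w*x + b fitted by
# gradient descent, minimizing the mean squared error between the
# predicted outputs and the actual labels.
def fit_line(xs, ys, lr=0.01, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])  # labeled data follows y = 2x
# After training, w is close to 2 and b close to 0.
```

This is the error-minimization loop the paragraph describes in its simplest form; neural networks repeat the same pattern with many more parameters.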
[0051] By normalizing the data, the system 120 is, advantageously, able to remove the one or more invalid column values, which helps in increasing the quality, quantity, and diversity of the training data. Hence, the disclosed system ensures that only valid data is retained for machine learning training, which improves processing speed and reduces the memory space requirement.
[0052] FIG. 3 is a block diagram of an architecture 300 that can be implemented in the system of FIG.2, according to one or more embodiments of the present disclosure. The architecture 300 of the system 120 includes an integrated system 305, a load balancer 310, and the processor 205. The processor 205 includes the identification unit 225, the pre-processing unit 235, a data source unit 315, and a machine learning training unit 320.
[0053] The architecture 300 of the system 120 is configured to interact with the integrated system 305 and the load balancer 310. The integrated system 305 is configured to access the data in the network 105 and is capable of interacting with the server 115 and the database 250 to collect the data. In an embodiment, the integrated system 305 includes, but is not limited to, the one or more data sources from which the data can be retrieved. In an embodiment, the data can be retrieved via the file input, the source path, the input stream, the HTTP 2, the HDFS, and the NAS.
[0054] The load balancer 310 distributes the requesting traffic from the one or more data sources across the one or more processors 205. This distribution of the one or more data source requests helps in managing and optimizing the workload, ensuring that no single processor is overwhelmed while improving overall system performance and reliability.
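The load-balancing behaviour described above can be illustrated with a minimal round-robin sketch in Python; the processor names and the round-robin policy are assumptions made for illustration, since the disclosure does not fix a specific distribution strategy:

```python
from itertools import cycle

# Hypothetical pool of processors 205 behind the load balancer 310.
processors = ["processor-1", "processor-2", "processor-3"]

# Assign each incoming data-source request to the next processor in turn,
# so that no single processor receives all of the traffic.
assignment = {}
rr = cycle(processors)
for request_id in range(7):
    assignment[request_id] = next(rr)
```

With seven requests and three processors, the requests cycle 1-2-3-1-2-3-1, keeping the per-processor load within one request of even.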
[0055] The identification unit 225 is configured to identify the one or more rows containing the one or more invalid column values. In an embodiment, the one or more invalid column values is at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string. The NaN value indicates missing or undefined numerical data. The zero value indicates invalid or missing data in certain columns. The null value represents the absence of any value in the dataset. The empty string indicates a string field with no content. The identification unit 225 is configured to scan the entire dataset and inspect each row and column for any invalid column values.
[0056] The pre-processing unit 235 is configured to pre-process the data to remove the identified one or more rows containing the one or more invalid column values. The pre-processing unit 235 processes the data according to the list of one or more rows provided by the parsing unit 230. The pre-processing unit 235 scans the dataset to find the one or more rows that contain the one or more invalid column values and removes the one or more rows entirely from the data frame or the dataset. By removing the one or more invalid column values, the pre-processing unit 235 ensures that the remaining data is clean and well-structured. Whether the data is being prepared for analysis, modeling, or storage, this ensures that only valid data is retained, making the downstream processes more efficient and reliable. In another embodiment, the pre-processing unit 235 is configured to restore missing values, or to delete data frames that contain the missing values.
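A minimal sketch of the identification and removal steps, in plain Python and assuming rows are simple lists (pandas' `DataFrame.dropna` covers only the NaN/null cases, so the four invalid-value cases named in the disclosure are written out explicitly here):

```python
import math

def is_invalid(value):
    """Return True for the invalid column values named in the disclosure:
    NaN, zero, null (None), or an empty string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if value == 0:
        return True
    if value == "":
        return True
    return False

def drop_invalid_rows(rows):
    """Remove every row that contains at least one invalid column value."""
    return [row for row in rows if not any(is_invalid(v) for v in row)]

data = [
    [1, "alice", 3.5],
    [2, "", 4.0],                # empty string -> row dropped
    [3, "carol", float("nan")],  # NaN -> row dropped
    [4, "dave", 0],              # zero -> row dropped
    [5, "erin", 2.5],
]
clean = drop_invalid_rows(data)
```

Only the first and last rows survive, leaving a clean, well-structured dataset for the downstream steps.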
[0057] The data source unit 315 is configured to update the data retrieved post pre-processing and store the updated data. The ML training unit 320 is configured to train one or more machine learning models on the updated data, applying various machine learning algorithms to the pre-processed data to create predictive or analytical models. Based on the nature of the data and the problem at hand, the different ML algorithms include, but are not limited to, linear regression, decision trees, neural networks, clustering algorithms, etc.
[0058] FIG. 4 is a signal flow diagram illustrating a process for normalizing the data, according to one or more embodiments of the present disclosure.
[0059] At step 405, the request for cleaning the data from the data frame that contains the one or more invalid column values is initially transmitted by the user via the UI 215.
[0060] At step 410, upon receiving the request, the retrieving unit 220 is configured to retrieve the data from the one or more data sources. In an embodiment, the one or more data sources include at least the file input, the source path, the input stream, the Hyper Text Transfer Protocol 2 (HTTP 2), the Hadoop Distributed File System (HDFS), and the Network Attached Storage (NAS). Upon retrieving the data, the retrieved data is stored in the data frame for further processing.
[0061] At step 415, upon the data being retrieved and stored in the data frame, the identification unit 225 is configured to identify the one or more rows containing the one or more invalid column values. In an embodiment, the one or more invalid column values is at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string. The NaN value indicates missing or undefined numerical data. The zero value indicates invalid or missing data in certain columns. The null value represents the absence of any value in the dataset. The empty string indicates a string field with no content. The identification unit 225 is configured to scan the entire dataset and inspect each row and column for any invalid column values.
[0062] At step 420, upon parsing the data through the script, the pre-processing unit 235 is configured to pre-process the data to remove the identified one or more rows containing the one or more invalid column values. The pre-processing unit 235 processes the data according to the list of one or more rows provided by the parsing unit 230. The pre-processing unit 235 scans the dataset to find the one or more rows that contain the one or more invalid column values and removes the one or more rows entirely from the data frame or the dataset. By removing the one or more invalid column values, the pre-processing unit 235 ensures that the remaining data is clean and well-structured. Whether the data is being prepared for analysis, modeling, or storage, this ensures that only valid data is retained, making the downstream processes more efficient and reliable.
[0063] At step 425, upon pre-processing the data, the generating unit 240 is configured to generate the normalized data based on the pre-processing of the data. The generating unit 240 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts the data to a standard format or scale, making the data easier to analyze and compare. In the present disclosure, the normalization cleans the data by filtering out the one or more invalid column values so that only the valid data is retained. Upon cleaning the data, the valid data is stored and transmitted to the ML training unit 320 (as shown in FIG. 3) to train the one or more machine learning models. The normalized data enhances the performance of the ML training model.
[0064] FIG. 5 is a flow diagram illustrating a method 500 for normalizing the data, according to one or more embodiments of the present disclosure.
[0065] At step 505, the method 500 includes the step of retrieving the data from the one or more data sources by the retrieving unit 220. In an embodiment, the one or more data sources include at least the file input, the source path, the input stream, the Hyper Text Transfer Protocol 2 (HTTP 2), the Hadoop Distributed File System (HDFS), and the Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), eXtensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads it into the memory 210 for further processing.
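As an illustrative sketch of the CSV file-input path, using only the Python standard library (the in-memory string below stands in for a file stored locally or on the server 115; the column names are hypothetical):

```python
import csv
import io

# A small in-memory CSV standing in for a file on local storage or a server.
csv_text = "id,name,score\n1,alice,3.5\n2,bob,4.0\n"

# Load the rows into memory for further processing, as the retrieving
# unit is described as doing. DictReader yields one dict per data row,
# keyed by the header line; all values are read as strings.
with io.StringIO(csv_text) as handle:
    rows = list(csv.DictReader(handle))
```

With a real file, `io.StringIO(csv_text)` would simply be replaced by `open(path, newline="")`.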
[0066] At step 510, the method 500 includes the step of identifying the one or more rows containing one or more invalid column values by the identification unit 225. In an embodiment, the one or more invalid column values is at least one of the Not a Number (NaN) value, the zero value, the null value, and the empty string. The NaN value indicates missing or undefined numerical data. The zero value indicates invalid or missing data in certain columns. The null value represents the absence of any value in the dataset. The empty string indicates a string field with no content. The identification unit 225 is configured to scan the entire dataset and inspect each row and column for any invalid column values.
[0067] The identification unit 225 is further configured to calculate the distance between each missing value and all other non-missing values of the one or more rows in the dataset. In an exemplary embodiment, a first row includes the values of 2, 5, and NaN in its columns, i.e., the missing value occurs in the third column of the first row. The identification unit 225 calculates the distance between each missing value and the non-missing values in the one or more rows, and identifies the nearest non-missing value in the one or more rows based on the calculation. In the exemplary embodiment, the nearest non-missing value in the row is 5, which is determined by comparing the computed distances. The identification unit 225 is further configured to replace the missing value of the one or more rows with the identified nearest non-missing value. In the exemplary embodiment, the identified nearest non-missing value, 5, replaces the missing value in the row.
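The nearest-value replacement in the example above can be sketched as follows; treating column-index separation as the "distance" is an assumption made for illustration, since the disclosure does not define the distance measure:

```python
import math

def impute_nearest_in_row(row):
    """Replace each missing (NaN) value with the nearest non-missing
    value in the same row, measured here by column-index distance."""
    def missing(v):
        return isinstance(v, float) and math.isnan(v)

    filled = list(row)
    for i, v in enumerate(row):
        if missing(v):
            # Distance from the missing cell to every non-missing cell.
            candidates = [(abs(j - i), row[j])
                          for j in range(len(row)) if not missing(row[j])]
            if candidates:
                # Take the value of the closest non-missing cell.
                filled[i] = min(candidates, key=lambda c: c[0])[1]
    return filled

# The exemplary row from the description: [2, 5, NaN] -> NaN replaced by 5,
# because 5 (one column away) is nearer than 2 (two columns away).
imputed = impute_nearest_in_row([2, 5, float("nan")])
```

On ties, `min` keeps the first candidate encountered; a production implementation would need to fix this tie-breaking rule explicitly.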
[0068] Upon identification of the one or more rows containing the one or more invalid column values, the parsing unit 230 is configured to parse the script associated with the retrieved data to identify the one or more rows containing the one or more invalid column values. The script is associated with the data and serves to examine the structure or content of the data. The parsing unit 230 specifically identifies the one or more rows in the data that contain the one or more invalid column values at the code level. By parsing the data through the script, the system 120 transmits the problematic one or more rows to the pre-processing unit 235 for further processing.
[0069] At step 515, the method 500 includes the step of pre-processing the data to remove the identified one or more rows containing the one or more invalid column values by the pre-processing unit 235. The pre-processing unit 235 processes the data according to the list of one or more rows provided by the parsing unit 230. The pre-processing unit 235 scans the dataset to find the one or more rows that contain the one or more invalid column values and removes the one or more rows entirely from the data frame or the dataset. By removing the one or more invalid column values, the pre-processing unit 235 ensures that the remaining data is clean and well-structured. Whether the data is being prepared for analysis, modeling, or storage, this ensures that only valid data is retained, making the downstream processes more efficient and reliable.
[0070] At step 520, the method 500 includes the step of generating the normalized data based on pre-processing of the data by the generating unit 240. The generating unit 240 is responsible for transforming the pre-processed data into a normalized format. The normalization is a process that adjusts the data to a standard format or scale, making the data easier to analyze and compare. In the present disclosure, the normalization cleans the data by filtering out the one or more invalid column values so that only the valid data is retained. Upon cleaning the data, the valid data is stored and transmitted to the ML training unit 320 to train the one or more machine learning models. The normalized data enhances the performance of the ML training model.
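For completeness, where "normalizing to a standard scale" is meant in the conventional numeric sense, a min-max rescaling of an already-cleaned column can be sketched as follows (illustrative only; in the present disclosure, normalization primarily denotes the cleaning step described above):

```python
def min_max_normalize(values):
    """Scale a numeric column into the range [0, 1] -- one common meaning
    of normalizing data to a standard scale."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

column = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_normalize(column)
```

Scaling every numeric column this way puts features on a comparable range, which typically helps distance-based and gradient-based ML training.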
[0071] In another aspect of the embodiment, a non-transitory computer-readable medium has stored thereon computer-readable instructions that are executed by a processor 205. The processor 205 is configured to retrieve data from one or more data sources. The processor 205 is configured to identify one or more rows containing one or more invalid column values. The processor 205 is configured to pre-process the data to remove the identified one or more rows containing the invalid column values. The processor 205 is configured to generate the normalized data based on pre-processing of the data.
[0072] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIGS.1-5) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0073] The present disclosure provides a technical advancement in that generating the normalized data based on the removal of the one or more invalid rows helps in increasing the quality, quantity, and diversity of the training data. Hence, the disclosed system and method ensure that the normalized data is retained for machine learning training.
[0074] The present invention offers multiple advantages over the prior art, and the advantages listed above are a few examples that emphasize some of the advantageous features. The listed advantages are to be read in a non-limiting manner.
REFERENCE NUMERALS
[0075] Environment - 100
[0076] Network - 105
[0077] User equipment - 110
[0078] Server - 115
[0079] System - 120
[0080] Processor - 205
[0081] Memory - 210
[0082] User interface - 215
[0083] Retrieving unit - 220
[0084] Identification unit - 225
[0085] Parsing unit - 230
[0086] Pre-processing unit - 235
[0087] Generating unit - 240
[0088] Storage unit - 250
[0089] Integrated system - 305
[0090] Load balancer - 310
[0091] Data source unit - 315
[0092] ML training unit - 320
CLAIMS
We Claim:
1. A method (500) of normalizing data, the method (500) comprising the steps of:
retrieving, by one or more processors (205), data from one or more data sources;
identifying, by the one or more processors (205), one or more rows containing one or more invalid column values;
pre-processing, by the one or more processors (205), the data to remove the identified one or more rows containing the one or more invalid column values; and
generating, by the one or more processors (205), the normalized data based on pre-processing of the data.
2. The method (500) as claimed in claim 1, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
3. The method (500) as claimed in claim 1, wherein the retrieved data is stored in a data frame.
4. The method (500) as claimed in claim 1, wherein the step of identifying, by the one or more processors (205), one or more rows containing one or more invalid column values, comprising the step of:
calculating a distance between each missing value and all other non-missing values of the one or more rows in a dataset;
identifying a nearest non-missing value on the one or more rows based on the calculation; and
replacing the missing value of the one or more rows with the identified nearest non-missing value.
5. The method (500) as claimed in claim 1, wherein for identifying the one or more rows, the method (500) comprises parsing, by the one or more processors (205), a script associated with the retrieved data.
6. The method (500) as claimed in claim 1, wherein the generated normalized data is stored in a storage unit (250) and used for Machine Learning (ML) training.
7. A system (120) of normalizing data, the system (120) comprising:
a retrieving unit (220) configured to retrieve data from one or more data sources;
an identification unit (225) configured to identify one or more rows containing one or more invalid column values;
a pre-processing unit (235) configured to pre-process the data to remove the identified one or more rows containing the one or more invalid column values; and
a generating unit (240) configured to generate the normalized data based on pre-processing of the data.
8. The system (120) as claimed in claim 7, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
9. The system (120) as claimed in claim 7, wherein the retrieved data is stored in a data frame.
10. The system (120) as claimed in claim 7, wherein the identification unit (225) is further configured to:
calculate a distance between each missing value and all other non-missing values of the one or more rows in a dataset;
identify a nearest non-missing value on the one or more rows based on the calculation; and
replace the missing value of the one or more rows with the identified nearest non-missing value.
11. The system (120) as claimed in claim 7, wherein the one or more invalid column values is at least one of a Not a Number (NaN) value, a zero value, a null value, and an empty string.
12. The system (120) as claimed in claim 7, comprising a parsing unit (230) configured to parse a script associated with the retrieved data to identify the one or more rows containing the one or more invalid column values.
13. The system (120) as claimed in claim 7, wherein the generated normalized data is stored in a storage unit (250) and used for Machine Learning (ML) training.