Abstract: ABSTRACT SYSTEM AND METHOD FOR NORMALIZING DATA A system (120) and a method (500) for normalizing data are disclosed. The system (120) includes a retrieving unit (220) configured to retrieve data from one or more data sources. The system (120) includes a parsing unit (225) configured to parse the retrieved data to identify a pre-processing technique. The system (120) includes an applying unit (230) configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The system (120) includes a generating unit (235) configured to generate normalized data based on applying the pre-processing technique on the retrieved data. Ref. Fig. 2
DESC:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD FOR NORMALIZING DATA
2. APPLICANT(S)
NAME NATIONALITY ADDRESS
JIO PLATFORMS LIMITED INDIAN OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3.PREAMBLE TO THE DESCRIPTION
THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.
FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication networks, and more particularly to a method and a system for normalizing data.
BACKGROUND OF THE INVENTION
[0002] With the increase in the number of users, network service provisions have been upgraded to enhance service quality and keep pace with the high demand. For various purposes, such as improving the quality of a network, managing traffic, delegating node allocation, and managing the performance of routing devices, many network elements, network functions, and micro-services are integrated into a network. Consequently, a large amount of data is generated by these services and network functions. Such data needs to be analyzed and assessed for purposes such as determining network health and security and modifying management policies.
[0003] However, the generated data may include invalid or unwanted field values, or wrongly entered data values, and these problematic field values hinder the direct use of the data for Machine Learning (ML) training. This inconsistency and noise in the data make it challenging to perform effective ML training and analysis. There is a need to clean and normalize the data so that it can be fed to the ML model for training. Presently, no mechanism is available by which data cleaning can be performed as required.
[0004] There is a requirement of a system and a method to filter out unwanted, extraneous information from required data and to include defined filtering techniques as required for the cleansing and normalization.
SUMMARY OF THE INVENTION
[0005] One or more embodiments of the present disclosure provide a method and a system for normalizing data.
[0006] In one aspect of the present invention, the method for normalizing the data is disclosed. The method includes the step of retrieving, by one or more processors, data from one or more data sources. The method includes the step of parsing, by the one or more processors, the retrieved data to identify a pre-processing technique. The pre-processing technique is at least one of a numerical operation filter and a metric conversion filter. The method includes the step of applying, by the one or more processors, the identified pre-processing technique on the retrieved data to normalize the retrieved data. The method includes the step of generating, by the one or more processors, the normalized data based on applying the pre-processing technique on the retrieved data.
[0007] In one embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Hadoop Distributed File System (HDFS), and a Network Attached Storage (NAS).
[0008] In another embodiment, the retrieved data is stored in a data frame.
[0009] In yet another embodiment, the numerical operation is at least one of equal to, greater than, and lesser than.
[0010] In yet another embodiment, the generated normalized data is stored in a storage unit and used for Machine Learning (ML) training.
[0011] In yet another embodiment, the step of parsing, by the one or more processors, the retrieved data to identify a pre-processing technique, the pre-processing technique being at least one of a numerical operation filter and a metric conversion filter, includes the step of creating custom filtering criteria through a user interface by a user. The step of parsing further includes the step of enabling the created custom filtering criteria to selectively clean or normalize the data based on one or more parameters. The one or more parameters include a range, a frequency, or patterns.
[0012] In another aspect of the present invention, the system for normalizing the data is disclosed. The system includes a retrieving unit configured to retrieve data from one or more data sources. The system includes a parsing unit configured to parse the retrieved data to identify a pre-processing technique. The pre-processing technique is at least one of a numerical operation filter and a metric conversion filter. The system includes an applying unit configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The system includes a generating unit configured to generate the normalized data based on applying the pre-processing technique on the retrieved data.
[0013] In another aspect of the present invention, a non-transitory computer-readable medium having stored thereon computer-readable instructions that are executable by a processor is disclosed. When the instructions are executed, the processor is configured to retrieve data from one or more data sources. The processor is configured to parse the retrieved data to identify a pre-processing technique. The pre-processing technique is at least one of a numerical operation filter and a metric conversion filter. The processor is configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The processor is configured to generate the normalized data based on applying the pre-processing technique on the retrieved data.
[0014] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0016] FIG. 1 is an exemplary block diagram of an environment for normalizing data, according to one or more embodiments of the present disclosure;
[0017] FIG. 2 is an exemplary block diagram of a system for normalizing the data, according to the one or more embodiments of the present disclosure;
[0018] FIG. 3 is a block diagram of an architecture that can be implemented in the system of FIG.2, according to the one or more embodiments of the present disclosure;
[0019] FIG. 4 is a signal flow diagram illustrating the normalizing of the data, according to the one or more embodiments of the present disclosure; and
[0020] FIG. 5 is a flow diagram illustrating the method for normalizing the data, according to the one or more embodiments of the present disclosure.
[0021] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0022] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0023] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure including the definitions listed here below are not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0024] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0025] FIG. 1 illustrates an exemplary block diagram of an environment 100 for normalizing data, according to one or more embodiments of the present invention. The environment 100 includes a network 105, a User Equipment (UE) 110, a server 115, and a system 120. The UE 110 enables a user to interact with the system 120 for normalizing the data. In an embodiment, the user is at least one of a network operator and a service provider. Normalizing data refers to cleaning the data by parsing, pre-processing, and removing any inaccuracies or inconsistencies from a data set. In an exemplary embodiment, an original dataset includes an employee ID, a name, a department, a salary, and a hire date. For a first employee ID, the name is john doe, the department is sales, the salary is 55000, and the hire date is 2022-01-25. For a second employee ID, the name is jane smith, the department is marketing, the salary is twenty thousand, and the hire date is 2022-05-30. The inconsistency in the dataset is that the salary of the second employee is written in words instead of as a numerical value. The normalization of the data also involves addressing any missing salary or hire date entries, either by filling the missing values with appropriate estimates or by removing those entries.
[0026] For the purpose of description and explanation, the description will be explained with respect to the UE 110, or to be more specific, with respect to a first UE 110a, a second UE 110b, and a third UE 110c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the first UE 110a, the second UE 110b, and the third UE 110c is configured to connect to the server 115 via the network 105. In an embodiment, each of the first UE 110a, the second UE 110b, and the third UE 110c is one of, but not limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more such devices, such as a smartphone, a virtual reality (VR) device, an augmented reality (AR) device, a laptop, a general-purpose computer, a desktop, a personal digital assistant, a tablet computer, a mainframe computer, or any other computing device.
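By way of a non-limiting illustration, the clean-up described for the exemplary employee dataset may be sketched in Python as follows. The field names, the small word-to-number lookup table, the third record with a missing salary, and the choice to remove (rather than estimate) incomplete entries are all assumptions made for this sketch and are not taken from the disclosure.

```python
# Hypothetical sketch: field names and the lookup table are assumptions.
WORD_SALARIES = {"twenty thousand": 20000}  # illustrative lookup only

records = [
    {"employee_id": 1, "name": "john doe", "department": "sales",
     "salary": "55000", "hire_date": "2022-01-25"},
    {"employee_id": 2, "name": "jane smith", "department": "marketing",
     "salary": "twenty thousand", "hire_date": "2022-05-30"},
    # Invented record used only to show handling of a missing salary.
    {"employee_id": 3, "name": "example person", "department": "sales",
     "salary": "", "hire_date": "2022-07-01"},
]

def to_number(raw):
    """Return the salary as an int, or None when it cannot be interpreted."""
    text = raw.strip().lower()
    if text.isdigit():
        return int(text)
    return WORD_SALARIES.get(text)

def clean(rows):
    """Convert salaries to numbers; drop rows missing a salary or hire date."""
    cleaned = []
    for row in rows:
        salary = to_number(row["salary"])
        if salary is None or not row["hire_date"]:
            continue  # removal; filling with an estimate is the alternative
        cleaned.append({**row, "salary": salary})
    return cleaned

cleaned_records = clean(records)
```

In this sketch the salary written in words becomes a numerical value and the incomplete entry is removed, leaving two records with numeric salaries.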
[0027] The network 105 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 105 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0028] The server 115 may include, by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, or some combination thereof. In an embodiment, an entity associated with the server 115 may include, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise, a defense facility, or any other facility that provides content.
[0029] The environment 100 further includes the system 120 communicably coupled to the server 115 and each of the first UE 110a, the second UE 110b, and the third UE 110c via the network 105. The system 120 is configured for normalizing the data. The system 120 is adapted to be embedded within the server 115 or to be implemented as an individual entity, as per multiple embodiments of the present invention.
[0030] Operational and construction features of the system 120 will be explained in detail with respect to the following figures.
[0031] FIG. 2 is an exemplary block diagram of a system 120 for normalizing the data, according to one or more embodiments of the present disclosure.
[0032] The system 120 includes a processor 205, a memory 210, a user interface 215, and a storage unit 240. For the purpose of description and explanation, the description will be explained with respect to one or more processors 205, or to be more specific, with respect to the processor 205, and should nowhere be construed as limiting the scope of the present disclosure. The one or more processors 205, hereinafter referred to as the processor 205, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0033] As per the illustrated embodiment, the processor 205 is configured to fetch and execute computer-readable instructions stored in the memory 210. The memory 210 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 210 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0034] The User Interface (UI) 215 includes a variety of interfaces, for example, interfaces for a Graphical User Interface (GUI), a web user interface, a Command Line Interface (CLI), and the like. The user interface 215 facilitates communication of the system 120. In one embodiment, the user interface 215 provides a communication pathway for one or more components of the system 120. Examples of the one or more components include, but are not limited to, the UE 110 and the storage unit 240. The terms “storage unit” and “database” are used interchangeably hereinafter, without limiting the scope of the disclosure.
[0035] The storage unit 240 is one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a Not only Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of storage unit 240 types are non-limiting and may not be mutually exclusive; for example, a database can be both commercial and cloud-based, or both relational and open-source.
[0036] Further, the processor 205, in an embodiment, may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 205. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 205 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for processor 205 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 210 may store instructions that, when executed by the processing resource, implement the processor 205. In such examples, the system 120 may comprise the memory 210 storing the instructions and the processing resource to execute the instructions, or the memory 210 may be separate but accessible to the system 120 and the processing resource. In other examples, the processor 205 may be implemented by electronic circuitry.
[0037] In order for the system 120 to normalize the data, the processor 205 includes a retrieving unit 220, a parsing unit 225, an applying unit 230, and a generating unit 235 communicably coupled to each other. In an embodiment, operations and functionalities of the retrieving unit 220, the parsing unit 225, the applying unit 230, and the generating unit 235 can be used in combination or interchangeably.
[0038] Initially, a request is transmitted by the user via the UI 215 for cleaning the data from a data frame by using a pre-processing technique. The data frame is a two-dimensional, tabular data structure used to store and manipulate the data in rows and columns. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
[0039] Upon receiving the request, the retrieving unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the data includes, but is not limited to, real-time data, structured data, and unstructured data. In an embodiment, the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Hadoop Distributed File System (HDFS), and a Network Attached Storage (NAS). The file input refers to reading data from files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in a CSV file, and the retrieving unit 220 retrieves the data from the file and loads it into the memory 210 for further processing.
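As a non-limiting illustration of the file input described above, the following Python sketch parses CSV or JSON text into rows held in memory. The function name load_rows and the sample column names cell_id and latency_ms are hypothetical and are used only for this example; reading from in-memory text rather than an actual file is a simplification.

```python
import csv
import io
import json

def load_rows(text, fmt):
    """Parse CSV or JSON text into a list of row dictionaries held in memory."""
    if fmt == "csv":
        # DictReader uses the first line as the column names.
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        return json.loads(text)
    raise ValueError(f"unsupported format: {fmt}")

# Hypothetical sample content standing in for a CSV file.
csv_text = "cell_id,latency_ms\nA1,12\nB2,40\n"
rows = load_rows(csv_text, "csv")
```

Values read from CSV arrive as strings in this sketch, so a later pre-processing step would still need to convert numeric fields before comparing or scaling them.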
[0040] The source path typically refers to the directory or network location where the data files are stored. The retrieving unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The retrieving unit 220 navigates to a designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the retrieving unit 220 fetches it in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the retrieving unit 220 retrieves the data from the web server using the HTTP 2. The retrieving unit 220 uses HTTP 2 to fetch the data from remote web servers or APIs.
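The source path retrieval described above, in which files matching a criterion such as .jpg are collected from a designated directory, may be sketched as follows. The function name find_files and the use of a temporary directory for the demonstration are assumptions of this sketch, not part of the disclosure.

```python
import tempfile
from pathlib import Path

def find_files(source_path, suffix=".jpg"):
    """Return the sorted files directly under source_path with the given suffix."""
    root = Path(source_path)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )

# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.JPG", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    matched_names = [p.name for p in find_files(tmp)]
```

The comparison is made on the lower-cased extension so that files named with .JPG are also matched; whether such case-insensitive matching is desired is a design choice left open here.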
[0041] The HDFS is a distributed file system used to store large datasets across multiple machines. The HDFS is commonly used in big data environments to store and retrieve large amounts of data. The retrieving unit 220 connects to the HDFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The retrieving unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the retrieving unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in at least a data frame for further processing.
[0042] Once the data has been retrieved and stored in the data frame, the parsing unit 225 is configured to parse the retrieved data to identify a pre-processing technique. In an embodiment, the pre-processing technique is at least one of a numerical operation filter and a metric conversion filter. The numerical operation filter involves one or more operations. The one or more operations include, but are not limited to, data normalization, scaling, or aggregation, which are applied to the numerical data before further processing. In an embodiment, the numerical operation is at least one of equal to, greater than, and lesser than. In this embodiment, the numerical operation refers to fundamental comparative functions applied to the data during pre-processing. Equal to compares the data elements to identify or filter values that exactly match a criterion. Greater than filters the data to identify values that exceed a threshold value. Lesser than filters the data to identify values that fall below the threshold value. In an embodiment, the criterion and the threshold value are defined by the user.
[0043] The metric conversion filter involves converting data units (e.g., from meters to kilometers, seconds to minutes, or Celsius to Fahrenheit) to ensure consistency or compatibility with a required system format. The metric conversion filter is essential for data normalization, allowing different data sources to be compared and analyzed accurately, because it ensures that all measurements are in a uniform unit, preventing errors in analysis and reporting. In an exemplary embodiment, if the data shows a distance of 1500 meters, the metric conversion filter yields 1.5 kilometers, which makes the field values easier to identify in the dataset. In another exemplary embodiment, the normal body temperature for humans is 37°C (98.6°F). If the user requires human body temperature data in Celsius but the dataset records the value in Fahrenheit, the metric conversion filter converts the value to 37°C, so that the user receives valid data in a uniform unit.
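The three comparative operations named above may be illustrated with the following minimal Python sketch. The function name numerical_filter, the operation keys, and the sample column latency_ms are hypothetical; the value passed in stands for the user-defined criterion or threshold.

```python
import operator

# The three comparisons named in the description, mapped to Python operators.
OPERATIONS = {
    "equal_to": operator.eq,
    "greater_than": operator.gt,
    "lesser_than": operator.lt,
}

def numerical_filter(rows, column, operation, value):
    """Keep the rows whose `column` satisfies `operation` against `value`."""
    compare = OPERATIONS[operation]
    return [row for row in rows if compare(row[column], value)]

# Hypothetical sample rows.
rows = [{"latency_ms": 12}, {"latency_ms": 40}, {"latency_ms": 75}]
above = numerical_filter(rows, "latency_ms", "greater_than", 40)
```

Keeping the operations in a lookup table means a further comparison could be supported by adding one entry, without changing the filter function.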
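The unit conversions given as examples above may be sketched in Python as follows. The table CONVERSIONS, the unit labels, and the function name convert are assumptions of this sketch; only the conversions named in the description are included.

```python
# Each entry maps (from_unit, to_unit) to a conversion function.
CONVERSIONS = {
    ("m", "km"): lambda v: v / 1000,
    ("s", "min"): lambda v: v / 60,
    ("C", "F"): lambda v: v * 9 / 5 + 32,
    ("F", "C"): lambda v: (v - 32) * 5 / 9,
}

def convert(value, from_unit, to_unit):
    """Convert value between units; raises KeyError for an unknown unit pair."""
    if from_unit == to_unit:
        return value
    return CONVERSIONS[(from_unit, to_unit)](value)

distance_km = convert(1500, "m", "km")
body_temp_c = round(convert(98.6, "F", "C"), 1)
```

Here 1500 meters becomes 1.5 kilometers, and 98.6°F becomes 37.0°C after rounding to one decimal place to absorb floating-point error.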
[0044] In one embodiment, the parsing unit 225 is further configured to create custom filtering criteria defined by the user through the user interface 215. The custom filtering criteria allow the users to define specific conditions to include or exclude the data based on their needs, which is particularly useful in data analysis, reporting, and database management. The user interface 215 allows the user to easily set up and adjust the filtering criteria. The system 120 enables the created custom filtering criteria to selectively clean or normalize the data based on one or more parameters. The one or more parameters include a range, a frequency, or patterns. The range refers to filtering data points within specified numerical limits. The frequency refers to identifying and managing outliers or infrequent values. The patterns refer to detecting and handling specific trends or recurring sequences in the data. The user can target specific subsets of data for cleaning or normalization, ensuring that only relevant data points are modified, which reduces noise and enhances the quality of the dataset. By applying the custom filtering, the user can remove inconsistencies and correct errors in the data, leading to a more accurate representation of the underlying information. The user can see the immediate impact of the filtering criteria, allowing for quick adjustments and a better understanding of how changes affect the dataset.
[0045] Upon parsing the retrieved data to identify the pre-processing technique, the applying unit 230 is configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The applying unit 230 is responsible for executing the identified pre-processing technique. When the data comes from different sources or has varying ranges (e.g., temperatures from different locations), normalization is necessary to bring all the data into a uniform range, which ensures that no single variable dominates the analysis merely because of its larger range. In an exemplary embodiment, the dataset contains temperature readings from different locations, with values ranging from 10°C to 100°C. Without normalization, the varying magnitude of the values affects any statistical or machine learning algorithms. The system 120 parses the data and identifies normalization as the pre-processing technique to be applied. The applying unit 230 then takes the temperature readings and scales them to a common range (e.g., 0 to 1), ensuring uniformity and easier analysis.
[0046] The user interface 215 displays, via a dashboard, data quality metrics and the impact of the cleaning techniques. The user interface 215 allows the user to observe the effects of data adjustments in real-time, thereby facilitating informed decision-making. In one embodiment, the system 120 provides the user interface 215 for defining complex filtering conditions using logical and comparison operators, along with regular expressions. The logical operators include, but are not limited to, AND, OR, and NOT, which combine multiple filtering criteria. In an exemplary embodiment, the AND operator filters the data to show only entries that simultaneously meet two or more criteria set by the user. The OR operator creates filters that include entries meeting at least one of multiple criteria set by the user. The NOT operator allows the users to exclude specific data points from the output. In another embodiment, the system 120 supports the comparison operators. The comparison operators include, but are not limited to, less than, greater than, equal to, less than or equal to, and greater than or equal to. The comparison operators enable precise filtering of numerical and categorical data, allowing the users to define conditions that suit their data analysis needs.
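The three parameters named above (a range, a frequency, and patterns) may each be illustrated by a small Python filter. The function names, the minimum-count interpretation of frequency, and the use of a regular expression for patterns are assumptions of this sketch rather than a statement of how the system 120 implements them.

```python
import re
from collections import Counter

def by_range(values, low, high):
    """Keep values within the inclusive numerical limits [low, high]."""
    return [v for v in values if low <= v <= high]

def by_frequency(values, min_count):
    """Drop infrequent values: keep those occurring at least min_count times."""
    counts = Counter(values)
    return [v for v in values if counts[v] >= min_count]

def by_pattern(values, pattern):
    """Keep strings that fully match the regular expression `pattern`."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.fullmatch(v)]

# Hypothetical sample inputs.
in_range = by_range([5, 50, 500], 10, 100)
frequent = by_frequency(["4G", "5G", "5G", "3G", "5G"], 2)
iso_dates = by_pattern(["2022-01-25", "25/01/2022"], r"\d{4}-\d{2}-\d{2}")
```

Each filter returns a new list and leaves its input unchanged, so the effect of a criterion can be compared against the original data before it is committed.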
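Scaling readings to the common range of 0 to 1 may be done, for example, with min-max scaling, in which each value x is mapped to (x - min) / (max - min). The disclosure does not name a specific scaling formula, so the choice of min-max scaling, the function name min_max_scale, and the sample readings below are assumptions of this sketch.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; avoid division by zero.
        return [0.0 for _ in values]
    span = high - low
    return [(v - low) / span for v in values]

# Hypothetical temperature readings in degrees Celsius.
readings_c = [10, 55, 100]
scaled = min_max_scale(readings_c)
```

With the sample readings, 10°C maps to 0.0, 55°C to 0.5, and 100°C to 1.0.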
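Combining filtering conditions with the logical operators AND, OR, and NOT, together with a comparison operator and a regular expression, may be sketched as follows. The helper names all_of, any_of, negate, and matches, as well as the sample node names and CPU values, are hypothetical and serve only to illustrate the composition of conditions.

```python
import re

def all_of(*conditions):
    """AND: a row passes only if every condition holds."""
    return lambda row: all(c(row) for c in conditions)

def any_of(*conditions):
    """OR: a row passes if at least one condition holds."""
    return lambda row: any(c(row) for c in conditions)

def negate(condition):
    """NOT: a row passes if the condition does not hold."""
    return lambda row: not condition(row)

def matches(column, pattern):
    """Regular-expression condition on a string column."""
    compiled = re.compile(pattern)
    return lambda row: compiled.search(row[column]) is not None

# Hypothetical sample rows.
rows = [
    {"node": "gnb-01", "cpu": 91},
    {"node": "gnb-02", "cpu": 35},
    {"node": "upf-01", "cpu": 95},
]

# Node name starts with "gnb-" AND cpu is greater than or equal to 90.
condition = all_of(matches("node", r"^gnb-"), lambda row: row["cpu"] >= 90)
selected = [row["node"] for row in rows if condition(row)]
```

Because each helper returns a condition of the same shape, conditions can be nested to any depth, for example negate(any_of(...)).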
[0047] The system 120 supports advanced transformations including, but not limited to, log transformations, power transformations and outlier capping, applicable to specific columns or based on defined conditions. The log transformations are used to stabilize variance and make data more normally distributed, which is particularly effective for skewed datasets, where the log transformation can help meet the assumptions of many statistical analyses. The power transformations can adjust the distribution of data to achieve normality. The user can apply different power levels to the data, which is helpful for reducing skewness and improving linear relationships. The outlier capping involves limiting the effect of extreme values on the dataset. The user can define thresholds to cap outliers, ensuring that the outliers do not disproportionately influence analyses of the data. The transformations can be applied to specific columns of the dataset, which allows the user to target only the relevant data points that require transformation, enhancing the flexibility in the data preprocessing steps.
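The transformations described above, applied to a specific column, may be sketched as follows. The use of log(1 + x) for the log transformation, the plain exponent for the power transformation, and the function and column names are assumptions of this sketch; the disclosure does not fix these details.

```python
import math

def log_transform(values):
    """Apply log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

def power_transform(values, exponent):
    """Raise each value to the user-chosen power."""
    return [v ** exponent for v in values]

def cap_outliers(values, lower, upper):
    """Clamp each value into the user-defined range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def apply_to_column(rows, column, transform):
    """Return new rows with `transform` applied to one column only."""
    new_values = transform([row[column] for row in rows])
    return [{**row, column: v} for row, v in zip(rows, new_values)]

# Hypothetical sample rows; only the "bytes" column is capped.
rows = [{"id": 1, "bytes": 40}, {"id": 2, "bytes": 100000}]
capped = apply_to_column(rows, "bytes", lambda vs: cap_outliers(vs, 0, 1000))
```

Only the targeted column changes; the other fields of each row are carried over unmodified.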
[0048] Upon applying the identified pre-processing technique on the retrieved data, the generating unit 235 is configured to generate the normalized data based on applying the pre-processing technique on the retrieved data. The generating unit 235 is responsible for transforming the pre-processed data into a normalized format. Normalization is a process that adjusts the data to a standard format or scale, or removes data that does not fit that format or scale, making the data easier to analyze and compare. Normalization of the data from the one or more data sources changes the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. The normalization process ensures that the normalized data is retained for ML training. Upon normalizing the data, the normalized data is stored and transmitted to the ML training unit 320 (as shown in FIG. 3) for training. In one embodiment, the raw data is also stored in a data source unit 315 (as shown in FIG. 3). The normalized data enhances the performance of the ML training unit 320.
[0049] By normalizing the data, the system 120 is advantageously able to remove irrelevant elements while retaining the raw data. Hence, the disclosed system ensures that the raw data is retained for machine learning training, which improves processing speed and reduces memory space requirements.
[0050] FIG. 3 is a block diagram of an architecture 300 that can be implemented in the system of FIG.2, according to one or more embodiments of the present disclosure. The architecture 300 of the system 120 includes an integrated system 305, a load balancer 310, and the processor 205. The processor 205 includes the parsing unit 225, the applying unit 230, a data source unit 315, and a machine learning training unit 320.
[0051] The architecture 300 of the system 120 is configured to interact with the integrated system 305 and the load balancer 310. The integrated system 305 is configured to access the data in the network 105 and is capable of interacting with the server 115 and the storage unit 240 to collect the data. In an embodiment, the integrated system 305 includes, but is not limited to, the one or more data sources from which the data can be retrieved. In an embodiment, the data can be retrieved via the file input, the source path, the input stream, the HTTP2, the HDFS and the NAS.
[0052] The load balancer 310 is configured to distribute the request traffic of the one or more data sources across the one or more processors 205. The distribution of the request traffic helps in managing and optimizing the workload, ensuring that no single processor is overwhelmed while improving overall system performance and reliability.
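By way of a non-limiting illustration, the distribution of request traffic may be sketched as follows. The specification does not state which balancing policy the load balancer 310 uses, so the round-robin policy, the function name distribute_round_robin, and the processor identifiers below are assumptions made only for this sketch.

```python
from collections import defaultdict
from itertools import cycle


def distribute_round_robin(requests, processor_ids):
    """Assign each request to a processor in rotating order (assumed policy)."""
    if not processor_ids:
        raise ValueError("at least one processor is required")
    assignment = defaultdict(list)
    rotation = cycle(processor_ids)
    for request in requests:
        # Each request goes to the next processor in the rotation.
        assignment[next(rotation)].append(request)
    return dict(assignment)


# Hypothetical request labels, one per data source type named in the description.
requests = ["file_input", "source_path", "input_stream", "http2", "hdfs", "nas"]
assignment = distribute_round_robin(
    requests, ["processor-0", "processor-1", "processor-2"]
)
for processor, assigned in assignment.items():
    print(processor, assigned)
```

With six requests and three processors, each processor receives two requests, so no single processor carries the whole load.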
[0053] Upon distribution of the data, the parsing unit 225 is configured to parse the retrieved data to identify the pre-processing technique. In an embodiment, the pre-processing technique is at least one of the numerical operation filter and the metric conversion filter. The numerical operation filter involves one or more operations. The one or more operations include, but are not limited to, data normalization, scaling, or aggregation, which are applied to the numerical data before further processing. In an embodiment, the numerical operation is at least one of equal to, greater than, and lesser than. In this embodiment, the numerical operation refers to fundamental comparative functions applied to the data during pre-processing. The equal to operation compares the data elements to identify or filter values that exactly match a criterion. The greater than operation filters the data to identify values that exceed a threshold value. The lesser than operation filters the data to identify values that fall below the threshold value. In an embodiment, the criterion and the threshold value are defined by the user. The metric conversion filter involves converting data units (e.g., from meters to kilometers, seconds to minutes, or Celsius to Fahrenheit) to ensure consistency or compatibility with a required system format.
[0054] Upon parsing the data to identify the pre-processing technique, the applying unit 230 is configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The applying unit 230 is responsible for executing the identified pre-processing technique. When the data comes from different sources or has varying ranges (e.g., temperatures from different locations), normalization is necessary to bring all the data into a uniform range, which ensures that no single variable dominates the analysis merely because it has a larger range. Accordingly, the system 120 parses the data and identifies the normalization to be performed by using the identified pre-processing technique. The applying unit 230 further takes the temperature readings and scales them to a common range (e.g., 0 to 1), ensuring uniformity and easier analysis.
[0055] The data source unit 315 is configured to update the retrieved data after pre-processing and to store the updated data. The ML training unit 320 is configured to train one or more machine learning models using the updated data, and applies various machine learning algorithms to the pre-processed data to create predictive or analytical models. Based on the nature of the data, the ML algorithms include, but are not limited to, linear regression, decision trees, neural networks, clustering algorithms, etc.
[0056] FIG. 4 is a signal flow diagram illustrating a flow for normalizing the data, according to one or more embodiments of the present disclosure.
[0057] At step 405, initially, the request is transmitted by the user via the UI 215 for cleaning the data from the data frame that contains one or more invalid column values.
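As a minimal sketch, and not as the implementation of the parsing unit 225 or the applying unit 230, the two filter types may be expressed as follows. The dictionary keys, the function names, and the unit labels are hypothetical and are chosen only to mirror the operations and conversions named above.

```python
import operator

# Comparison operations named in the description; the string keys are hypothetical.
COMPARISONS = {
    "equal_to": operator.eq,
    "greater_than": operator.gt,
    "lesser_than": operator.lt,
}

# Unit conversions given as examples in the description; unit labels are hypothetical.
CONVERSIONS = {
    ("m", "km"): lambda v: v / 1000.0,
    ("s", "min"): lambda v: v / 60.0,
    ("celsius", "fahrenheit"): lambda v: v * 9.0 / 5.0 + 32.0,
}


def numerical_operation_filter(values, operation, reference):
    """Keep only the values that satisfy the comparison against the reference."""
    compare = COMPARISONS[operation]
    return [v for v in values if compare(v, reference)]


def metric_conversion_filter(values, source_unit, target_unit):
    """Convert every value from the source unit to the target unit."""
    convert = CONVERSIONS[(source_unit, target_unit)]
    return [convert(v) for v in values]


readings = [5, 10, 15, 20]
print(numerical_operation_filter(readings, "greater_than", 10))
print(metric_conversion_filter([1500, 250], "m", "km"))
```

Here the reference value passed to numerical_operation_filter plays the role of the user-defined criterion or threshold value.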
[0058] At step 410, upon receiving the request, the retrieving unit 220 is configured to retrieve the data from the one or more data sources. In an embodiment, the one or more data sources include at least a file input, a source path, an input stream, Hyper Text Transfer Protocol 2 (HTTP 2), Hadoop Distributed File System (HDFS), and Network Attached Storage (NAS). Upon retrieving the data, the retrieved data is stored in the data frame for further processing.
[0059] At step 415, upon the data being retrieved and stored in the data frame, the parsing unit 225 is configured to parse the retrieved data to identify the pre-processing technique. In an embodiment, the pre-processing technique is at least one of the numerical operation filter and the metric conversion filter. The numerical operation filter involves one or more operations. The one or more operations include, but are not limited to, data normalization, scaling, or aggregation, which are applied to the numerical data before further processing. In an embodiment, the numerical operation is at least one of equal to, greater than, and lesser than. In this embodiment, the numerical operation refers to fundamental comparative functions applied to the data during pre-processing. The equal to operation compares the data elements to identify or filter values that exactly match a criterion. The greater than operation filters the data to identify values that exceed a threshold value. The lesser than operation filters the data to identify values that fall below the threshold value. In an embodiment, the criterion and the threshold value are defined by the user. The metric conversion filter involves converting data units (e.g., from meters to kilometers, seconds to minutes, or Celsius to Fahrenheit) to ensure consistency or compatibility with a required system format.
[0060] At step 420, upon parsing the data to identify the pre-processing technique, the applying unit 230 is configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The applying unit 230 is responsible for executing the identified pre-processing technique. When the data comes from different sources or has varying ranges (e.g., temperatures from different locations), normalization is necessary to bring all the data into a uniform range, which ensures that no single variable dominates the analysis merely because it has a larger range. Accordingly, the system 120 parses the data and identifies the normalization to be performed by using the identified pre-processing technique. The applying unit 230 further takes the temperature readings and scales them to a common range (e.g., 0 to 1), ensuring uniformity and easier analysis.
[0061] At step 425, upon application of the identified pre-processing technique to the retrieved data, the generating unit 235 is configured to generate the normalized data. The generating unit 235 is responsible for transforming the pre-processed data into the normalized format. Normalization is a process that adjusts the data, or removes parts of it, to bring it to a standard format or scale, making the data easier to analyze and compare. In particular, the data from the one or more data sources is normalized by changing the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. The normalization process ensures that the normalized data is retained for ML training. Upon normalizing the data, the normalized data is stored and transmitted to the ML training unit 320 for training. In one embodiment, the raw data is also stored in the data source unit 315. The normalized data enhances the performance of the ML training unit 320.
[0062] FIG. 5 is a flow diagram illustrating a method 500 for normalizing the data, according to one or more embodiments of the present disclosure.
[0063] At step 505, the method 500 includes the step of retrieving the data from the one or more data sources by the retrieving unit 220. In an embodiment, the one or more data sources include at least a file input, a source path, an input stream, Hyper Text Transfer Protocol 2 (HTTP 2), Hadoop Distributed File System (HDFS), and Network Attached Storage (NAS). Upon retrieving the data, the retrieved data is stored in the data frame for further processing.
[0064] At step 510, the method 500 includes the step of parsing the retrieved data to identify a pre-processing technique by the parsing unit 225. In an embodiment, the pre-processing technique is at least one of a numerical operation filter and a metric conversion filter. The numerical operation filter involves one or more operations. The one or more operations include, but are not limited to, data normalization, scaling, or aggregation, which are applied to the numerical data before further processing. In an embodiment, the numerical operation is at least one of equal to, greater than, and lesser than. In this embodiment, the numerical operation refers to fundamental comparative functions applied to the data during pre-processing. The equal to operation compares the data elements to identify or filter values that exactly match a criterion. The greater than operation filters the data to identify values that exceed a threshold value. The lesser than operation filters the data to identify values that fall below the threshold value. In an embodiment, the criterion and the threshold value are defined by the user. The metric conversion filter involves converting data units (e.g., from meters to kilometers, seconds to minutes, or Celsius to Fahrenheit) to ensure consistency or compatibility with a required system format.
[0065] In one embodiment, the parsing unit 225 is further configured to allow the user to create custom filtering criteria through the user interface 215. The user interface 215 allows the user to easily set up and adjust the filtering criteria. The system 120 enables the created custom filtering criteria to selectively clean or normalize the data based on one or more parameters. The one or more parameters include a range, a frequency, or patterns. The range refers to filtering data points within specified numerical limits. The frequency refers to identifying and managing outliers or infrequent values. The patterns refer to detecting and handling specific trends or recurring sequences in the data. The user can target specific subsets of data for cleaning or normalization, ensuring that only relevant data points are modified, which reduces noise and enhances the quality of the dataset. By applying the custom filtering, the user can remove inconsistencies and correct errors in the data, leading to a more accurate representation of the underlying information. The user can see the immediate impact of the filtering criteria, allowing for quick adjustments and a better understanding of how changes affect the dataset.
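One possible way to represent such custom filtering criteria is sketched below. The FilterCriteria class, its field names, and the choice to drop values that occur fewer than a minimum number of times are assumptions made for illustration; the specification does not prescribe how the range, frequency, or pattern parameters are encoded or enforced.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterCriteria:
    """Hypothetical container for user-defined criteria; unset fields are skipped."""
    value_range: Optional[tuple] = None  # (low, high), inclusive numerical limits
    min_frequency: Optional[int] = None  # drop values seen fewer times than this
    pattern: Optional[str] = None        # regex that str(value) must fully match


def apply_criteria(values, criteria):
    """Return the values that satisfy every criterion that has been set."""
    kept = list(values)
    if criteria.value_range is not None:
        low, high = criteria.value_range
        kept = [v for v in kept if low <= v <= high]
    if criteria.min_frequency is not None:
        counts = Counter(kept)
        kept = [v for v in kept if counts[v] >= criteria.min_frequency]
    if criteria.pattern is not None:
        regex = re.compile(criteria.pattern)
        kept = [v for v in kept if regex.fullmatch(str(v))]
    return kept


numeric = [10, 10, 20, 20, 20, 999, 35]
print(apply_criteria(numeric, FilterCriteria(value_range=(0, 100), min_frequency=2)))

labels = ["cell-01", "cell-02", "bad id", "cell-01"]
print(apply_criteria(labels, FilterCriteria(pattern=r"cell-\d+")))
```

In the first call, the range criterion removes 999 and the frequency criterion then removes 35, which occurs only once. In the second call, only labels matching the hypothetical identifier pattern are kept.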
[0066] At step 515, the method 500 includes the step of applying the identified pre-processing technique on the retrieved data to normalize the retrieved data by the applying unit 230. The applying unit 230 is responsible for executing the identified pre-processing technique. When the data comes from different sources or has varying ranges (e.g., temperatures from different locations), normalization is necessary to bring all the data into a uniform range, which ensures that no single variable dominates the analysis merely because it has a larger range. In an exemplary embodiment, the dataset contains temperature readings from different locations, with values ranging from 10°C to 100°C. Without normalization, the varying magnitude of the values affects any statistical or machine learning algorithms. Accordingly, the system 120 parses the data and identifies the normalization to be performed by using the identified pre-processing technique. The applying unit 230 further takes the temperature readings and scales them to a common range (e.g., 0 to 1), ensuring uniformity and easier analysis.
[0067] At step 520, the method 500 includes the step of generating the normalized data based on applying the pre-processing technique on the retrieved data by the generating unit 235. The generating unit 235 is responsible for transforming the pre-processed data into the normalized format. The normalization process adjusts the data, or removes parts of it, to bring it to a standard format or scale, making the data easier to analyze and compare. In particular, the data from the one or more data sources is normalized by changing the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. The normalization process ensures that the normalized data is retained for ML training. Upon normalizing the data, the normalized data is stored and transmitted to the ML training unit 320 for training. In one embodiment, the raw data is also stored in the data source unit 315. The normalized data enhances the performance of the ML training unit 320.
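The scaling of temperature readings to a common range can be illustrated with min-max scaling. The specification does not name the scaling formula, so min-max scaling is assumed here as one common way to map values onto the range 0 to 1; the function name min_max_scale and the sample readings are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if high == low:
        # All readings are identical, so there is no spread to scale.
        return [new_min for _ in values]
    span = high - low
    return [new_min + (v - low) * (new_max - new_min) / span for v in values]


# Hypothetical readings covering the 10 degrees C to 100 degrees C range of the example.
temperatures_c = [10.0, 32.5, 55.0, 100.0]
scaled = min_max_scale(temperatures_c)
print(scaled)
```

The lowest reading maps to 0 and the highest to 1, while the relative spacing between readings is preserved.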
[0068] In another aspect of the embodiment, a non-transitory computer-readable medium has stored thereon computer-readable instructions that, when executed by a processor 205, cause the processor 205 to perform the following operations. The processor 205 is configured to retrieve data from one or more data sources. The processor 205 is configured to parse the retrieved data to identify a pre-processing technique. The pre-processing technique is at least one of a numerical operation filter and a metric conversion filter. The processor 205 is configured to apply the identified pre-processing technique on the retrieved data to normalize the retrieved data. The processor 205 is configured to generate normalized data based on applying the pre-processing technique on the retrieved data.
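The four operations recited above (retrieve, parse, apply, generate) may be chained as in the following sketch. Every function name, the CSV layout, the target unit, and the rule used to select the metric conversion filter are assumptions introduced for illustration; the specification does not define how the parsing step decides on a technique.

```python
import csv
import io

TARGET_UNIT = "celsius"  # hypothetical target unit for this sketch


def retrieve_data(stream):
    """Retrieve rows from an input stream into a list of dicts (a simple data frame)."""
    return [
        {"sensor": r["sensor"], "value": float(r["value"]), "unit": r["unit"]}
        for r in csv.DictReader(stream)
    ]


def parse_data(rows):
    """Identify a pre-processing technique; this selection rule is an assumption."""
    if any(r["unit"] != TARGET_UNIT for r in rows):
        return "metric_conversion_filter"
    return None


def apply_technique(rows, technique):
    """Apply the identified technique; only Fahrenheit to Celsius is handled here."""
    if technique != "metric_conversion_filter":
        return rows
    converted = []
    for r in rows:
        value = r["value"]
        if r["unit"] == "fahrenheit":
            value = (value - 32.0) * 5.0 / 9.0
        elif r["unit"] != TARGET_UNIT:
            raise ValueError(f"unsupported unit: {r['unit']}")
        converted.append({"sensor": r["sensor"], "value": value, "unit": TARGET_UNIT})
    return converted


def generate_normalized(rows):
    """Generate normalized data by scaling the values onto the range 0 to 1."""
    if not rows:
        return []
    values = [r["value"] for r in rows]
    low = min(values)
    span = (max(values) - low) or 1.0  # avoid division by zero for constant data
    return [{**r, "normalized": (r["value"] - low) / span} for r in rows]


raw = io.StringIO(
    "sensor,value,unit\na,50,celsius\nb,212,fahrenheit\nc,10,celsius\n"
)
rows = retrieve_data(raw)
technique = parse_data(rows)
normalized = generate_normalized(apply_technique(rows, technique))
for row in normalized:
    print(row)
```

Converting units before scaling matters in this sketch: the Fahrenheit reading would otherwise dominate the range even though it describes a temperature comparable to the others.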
[0069] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIGS.1-5) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0070] The present disclosure provides a technical advancement by implementing pre-processing techniques for data cleaning and normalization, the output of which is supplied to ML models as training data that aids the desired analysis. The trained data is used for efficient information retrieval and knowledge extraction as well as enhancement of the effectiveness of the system. The present disclosure identifies the pre-processing techniques to enhance the data by applying the required operation filter, which helps ML models to understand the structure and learn patterns at a more granular level. Further, the enhanced pre-processing results in more accurate and contextually appropriate responses from the ML model.
[0071] The present invention offers multiple advantages over the prior art, and the above listed are a few examples to emphasize some of the advantageous features. The listed advantages are to be read in a non-limiting manner.
REFERENCE NUMERALS
[0072] Environment - 100
[0073] Network - 105
[0074] User equipment - 110
[0075] Server - 115
[0076] System - 120
[0077] Processor - 205
[0078] Memory - 210
[0079] User interface - 215
[0080] Retrieving unit - 220
[0081] Parsing unit - 225
[0082] Applying unit - 230
[0083] Generating unit - 235
[0084] Storage unit - 240
[0085] Integrated system - 305
[0086] Load balancer - 310
[0087] Data source unit - 315
[0088] ML training unit - 320
CLAIMS
We Claim:
1. A method (500) of normalizing data, the method (500) comprising the steps of:
retrieving, by one or more processors (205), data from one or more data sources;
parsing, by the one or more processors (205), the retrieved data to identify a pre-processing technique;
applying, by the one or more processors (205), the identified pre-processing technique on the retrieved data to normalize the retrieved data; and
generating, by the one or more processors (205), normalized data based on applying the pre-processing technique on the retrieved data.
2. The method (500) as claimed in claim 1, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Hadoop Distributed File System (HDFS), and a Network Attached Storage (NAS).
3. The method (500) as claimed in claim 1, wherein the retrieved data is stored in at least one data frame.
4. The method (500) as claimed in claim 1, wherein the pre-processing technique is at least one of a numerical operation filter and a metric conversion filter, wherein the numerical operation is at least one of equal to, greater than, and lesser than.
5. The method (500) as claimed in claim 1, wherein the generated normalized data is stored in a storage unit (240) and used for Machine Learning (ML) training.
6. The method (500) as claimed in claim 1, wherein the step of parsing, by the one or more processors (205), the retrieved data to identify a pre-processing technique, wherein the pre-processing technique is at least one of a numerical operation filter and a metric conversion filter, comprising the step of:
creating a custom filtering criteria through a user interface (215) by a user; and
enabling the created custom filtering criteria to selectively clean or normalize the data based on one or more parameters, wherein the one or more parameters include a range, a frequency, or patterns.
7. A system (120) for normalizing data, the system (120) comprising:
a retrieving unit (220) configured to retrieve, data from one or more data sources;
a parsing unit (225) configured to parse, the retrieved data to identify a pre-processing technique;
an applying unit (230) configured to apply, the identified pre-processing technique on the retrieved data to normalize the retrieved data; and
a generating unit (235) configured to generate, normalized data based on applying the pre-processing technique on the retrieved data.
8. The system (120) as claimed in claim 7, wherein the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Hadoop Distributed File System (HDFS), and a Network Attached Storage (NAS).
9. The system (120) as claimed in claim 7, wherein the retrieved data is stored in at least one data frame.
10. The system (120) as claimed in claim 7, wherein the pre-processing technique is at least one of a numerical operation filter and a metric conversion filter, wherein the numerical operation is at least one of equal to, greater than, and less than.
11. The system (120) as claimed in claim 7, wherein the generated normalized data is stored in a storage unit (240) and used for Machine Learning (ML) training.
12. The system (120) as claimed in claim 7, wherein the parsing unit (225) is further configured to:
create a custom filtering criteria through a user interface (215) by a user; and
enable the created custom filtering criteria to selectively clean or normalize the data based on one or more parameters, wherein the one or more parameters include a range, a frequency, or patterns.
| # | Name | Date |
|---|---|---|
| 1 | 202321067384-STATEMENT OF UNDERTAKING (FORM 3) [07-10-2023(online)].pdf | 2023-10-07 |
| 2 | 202321067384-PROVISIONAL SPECIFICATION [07-10-2023(online)].pdf | 2023-10-07 |
| 3 | 202321067384-FORM 1 [07-10-2023(online)].pdf | 2023-10-07 |
| 4 | 202321067384-FIGURE OF ABSTRACT [07-10-2023(online)].pdf | 2023-10-07 |
| 5 | 202321067384-DRAWINGS [07-10-2023(online)].pdf | 2023-10-07 |
| 6 | 202321067384-DECLARATION OF INVENTORSHIP (FORM 5) [07-10-2023(online)].pdf | 2023-10-07 |
| 7 | 202321067384-FORM-26 [27-11-2023(online)].pdf | 2023-11-27 |
| 8 | 202321067384-Proof of Right [12-02-2024(online)].pdf | 2024-02-12 |
| 9 | 202321067384-DRAWING [07-10-2024(online)].pdf | 2024-10-07 |
| 10 | 202321067384-COMPLETE SPECIFICATION [07-10-2024(online)].pdf | 2024-10-07 |
| 11 | Abstract.jpg | 2024-12-20 |
| 12 | 202321067384-Power of Attorney [24-01-2025(online)].pdf | 2025-01-24 |
| 13 | 202321067384-Form 1 (Submitted on date of filing) [24-01-2025(online)].pdf | 2025-01-24 |
| 14 | 202321067384-Covering Letter [24-01-2025(online)].pdf | 2025-01-24 |
| 15 | 202321067384-CERTIFIED COPIES TRANSMISSION TO IB [24-01-2025(online)].pdf | 2025-01-24 |
| 16 | 202321067384-FORM 3 [28-01-2025(online)].pdf | 2025-01-28 |