Abstract: SYSTEM AND METHOD FOR PRE-PROCESSING DATA. A system (120) and a method (500) for pre-processing data are disclosed. The system (120) includes a retrieving unit (220) configured to retrieve data from one or more data sources. The system (120) includes a selection unit (225) configured to select one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters. The system (120) includes an applying unit (230) configured to apply the one or more filters on the retrieved data to generate the filtered data. The system (120) includes a pre-processing unit (235) configured to pre-process the filtered data based on applying the one or more filters on the retrieved data. Ref. Fig. 2
DESC:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10 and rule 13)
1. TITLE OF THE INVENTION
SYSTEM AND METHOD FOR PRE-PROCESSING DATA
2. APPLICANT(S)
NAME NATIONALITY ADDRESS
JIO PLATFORMS LIMITED INDIAN OFFICE-101, SAFFRON, NR. CENTRE POINT, PANCHWATI 5 RASTA, AMBAWADI, AHMEDABAD 380006, GUJARAT, INDIA
3.PREAMBLE TO THE DESCRIPTION
THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND THE MANNER IN WHICH IT IS TO BE PERFORMED.
FIELD OF THE INVENTION
[0001] The present invention relates to the field of wireless communication networks, and more particularly to a system and a method for pre-processing data.
BACKGROUND OF THE INVENTION
[0002] With the increase in the number of users, network service providers have been implementing upgradations to enhance service quality so as to keep pace with such high demand. With the advancement of technology, there is a demand for telecommunication services to introduce up-to-date features into the scope of provision so as to enhance user experience. For this purpose, integrating Artificial Intelligence/Machine Learning (AI/ML) into various network practices, such as estimating network performance, tracking the health of a network, enhancing user-interactive features, and monitoring security, has become essential. Incorporating advanced AI/ML methodology has become a priority to keep up with the rapidly evolving telecom sector. The AI/ML incorporation is usually performed by training models with specific data sets to enable them to recognize patterns and trends and, based on these, to predict the required output. ML training on the data extracted from a data source is performed by a specifically constructed system.
[0003] However, the data source may contain information spanning a complete time range, and unnecessary data input may stress the model and cause training to fail. If optimal ML training requires data from a particular time range, this limitation is not overcome by any of the presently available mechanisms.
[0004] There is, therefore, a requirement for a system and a method to clean the data obtained from network resources based on required operational filters, most preferably based on time, date, and events, for cleaning and normalization.
SUMMARY OF THE INVENTION
[0005] One or more embodiments of the present disclosure provide a method and a system for pre-processing data.
[0006] In one aspect of the present invention, the method for pre-processing the data is disclosed. The method includes the step of retrieving, by one or more processors, data from one or more data sources. The method includes the step of selecting, by the one or more processors, one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters. The method includes the step of applying, by the one or more processors, the one or more filters on the retrieved data to generate the filtered data. The method includes the step of pre-processing, by the one or more processors, the filtered data based on applying the one or more filters on the retrieved data.
[0007] In one embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
[0008] In another embodiment, the retrieved data is stored in at least one data frame.
[0009] In yet another embodiment, selecting, by the one or more processors, one or more filters on the retrieved data is based on historical data and current data.
[0010] In yet another embodiment, the KPI parameters include bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time.
[0011] In yet another embodiment, the one or more filters include at least a date filter, wherein the date filter includes a start date and an end date.
[0012] In yet another embodiment, the step of selecting, by the one or more processors, one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters includes the step of determining a covariance matrix for multiple combinations of one or more start dates from the start date range and one or more end dates from the end date range. Further, the step of selecting, by the one or more processors, one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters includes the step of selecting at least one combination of the start date and the end date having high covariance based on the determined covariance matrix.
[0013] In yet another embodiment, the filtered data includes retrieved data within the start date and the end date.
[0014] In yet another embodiment, the pre-processed data is stored in a storage unit and is used for Machine Learning (ML) training.
[0015] In another aspect of the present invention, the system for pre-processing the data is disclosed. The system includes a retrieving unit configured to retrieve data from one or more data sources. The system includes a selection unit configured to select one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters. The system includes an applying unit configured to apply the one or more filters on the retrieved data to generate the filtered data. The system includes a pre-processing unit configured to pre-process the filtered data based on applying the one or more filters on the retrieved data.
[0016] In another aspect of the embodiment, a non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to perform operations is disclosed. The processor is configured to retrieve data from one or more data sources. The processor is configured to select one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters. The processor is configured to apply the one or more filters on the retrieved data to generate the filtered data. The processor is configured to pre-process the filtered data based on applying the one or more filters on the retrieved data.
[0017] Other features and aspects of this invention will be apparent from the following description and the accompanying drawings. The features and advantages described in this summary and in the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the relevant art, in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.
[0019] FIG. 1 is an exemplary block diagram of an environment for pre-processing data, according to one or more embodiments of the present disclosure;
[0020] FIG. 2 is an exemplary block diagram of a system for pre-processing the data, according to the one or more embodiments of the present disclosure;
[0021] FIG. 3 is a block diagram of an architecture that can be implemented in the system of FIG.2, according to the one or more embodiments of the present disclosure;
[0022] FIG. 4 is a signal flow diagram illustrating pre-processing of the data, according to the one or more embodiments of the present disclosure; and
[0023] FIG. 5 is a flow diagram illustrating a method for pre-processing the data, according to the one or more embodiments of the present disclosure.
[0024] The foregoing shall be more apparent from the following detailed description of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0025] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise.
[0026] Various modifications to the embodiment will be readily apparent to those skilled in the art, and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure, including the definitions listed herein below, is not intended to be limited to the embodiments illustrated but is to be accorded the widest scope consistent with the principles and features described herein.
[0027] A person of ordinary skill in the art will readily ascertain that the illustrated steps detailed in the figures and here below are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0028] The present invention provides a system and a method for pre-processing the data based on the one or more filters. The one or more filters include, but are not limited to, specific dates, time ranges, and event periods. The one or more filters ensure that the required data is fed to an AI/ML model, enabling smooth data analysis. Extraneous and excessive unfiltered data makes it challenging to perform effective ML training and analysis, because the unfiltered data obscures regular patterns from the AI/ML model. To solve this issue, the present system intelligently cleans and normalizes the data as a pre-processing step. The pre-processing step ensures that the data is from the required timeline/time period and is in a clean and consistent format, making it suitable for ML model training. Thus, the present system enables effective analysis and modeling of the data, solving the problem of working with unnecessary data sets in the existing architecture.
[0029] Referring to FIG. 1, FIG. 1 illustrates an exemplary block diagram of an environment 100 for pre-processing data, according to one or more embodiments of the present invention. The environment 100 includes a network 105, a User Equipment (UE) 110, a server 115, and a system 120. The UE 110 aids a user to interact with the system 120 for pre-processing the data. In an embodiment, the user is at least one of a network operator and a service provider. Pre-processing data refers to cleaning and normalizing the data by selecting and applying one or more filters on the data.
[0030] For the purpose of description and explanation, the description will be explained with respect to the UE 110, or to be more specific will be explained with respect to a first UE 110a, a second UE 110b, and a third UE 110c, and should nowhere be construed as limiting the scope of the present disclosure. Each of the UE 110 from the first UE 110a, the second UE 110b, and the third UE 110c is configured to connect to the server 115 via the network 105. In an embodiment, each of the first UE 110a, the second UE 110b, and the third UE 110c is one of, but not limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more of the above devices, such as smartphones, virtual reality (VR) devices, augmented reality (AR) devices, laptops, general-purpose computers, desktops, personal digital assistants, tablet computers, mainframe computers, or any other computing device.
[0031] The network 105 includes, by way of example but not limitation, one or more of a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a Public-Switched Telephone Network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, or some combination thereof. The network 105 may include, but is not limited to, a Third Generation (3G), a Fourth Generation (4G), a Fifth Generation (5G), a Sixth Generation (6G), a New Radio (NR), a Narrow Band Internet of Things (NB-IoT), an Open Radio Access Network (O-RAN), and the like.
[0032] The server 115 may include by way of example but not limitation, one or more of a standalone server, a server blade, a server rack, a bank of servers, a server farm, hardware supporting a part of a cloud service or system, a home server, hardware running a virtualized server, one or more processors executing code to function as a server, one or more machines performing server-side functionality as described herein, at least a portion of any of the above, or some combination thereof. In an embodiment, the entity may include, but is not limited to, a vendor, a network operator, a company, an organization, a university, a lab facility, a business enterprise, a defense facility, or any other facility that provides content.
[0033] The environment 100 further includes the system 120 communicably coupled to the server 115 and each of the first UE 110a, the second UE 110b, and the third UE 110c via the network 105. The system 120 is configured for pre-processing the data. The system 120 is adapted to be embedded within the server 115 or is embedded as the individual entity, as per multiple embodiments of the present invention.
[0034] Operational and construction features of the system 120 will be explained in detail with respect to the following figures.
[0035] FIG. 2 is an exemplary block diagram of a system 120 for pre-processing the data, according to one or more embodiments of the present disclosure.
[0036] The system 120 includes a processor 205, a memory 210, a user interface 215, and a storage unit 240. For the purpose of description and explanation, the description will be explained with respect to one or more processors 205, or to be more specific will be explained with respect to the processor 205, and should nowhere be construed as limiting the scope of the present disclosure. The one or more processors 205, hereinafter referred to as the processor 205, may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, single board computers, and/or any devices that manipulate signals based on operational instructions.
[0037] As per the illustrated embodiment, the processor 205 is configured to fetch and execute computer-readable instructions stored in the memory 210. The memory 210 may be configured to store one or more computer-readable instructions or routines in a non-transitory computer-readable storage medium, which may be fetched and executed to create or share data packets over a network service. The memory 210 may include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
[0038] The User Interface (UI) 215 includes a variety of interfaces, for example, interfaces for a Graphical User Interface (GUI), a web user interface, a Command Line Interface (CLI), and the like. The user interface 215 facilitates communication of the system 120. In one embodiment, the user interface 215 provides a communication pathway for one or more components of the system 120. Examples of the one or more components include, but are not limited to, the UE 110, and the storage unit 240. The term “storage unit” and “database” are used interchangeably hereinafter, without limiting the scope of the disclosure.
[0039] The storage unit 240 is one of, but not limited to, a centralized database, a cloud-based database, a commercial database, an open-source database, a distributed database, an end-user database, a graphical database, a Not only Structured Query Language (NoSQL) database, an object-oriented database, a personal database, an in-memory database, a document-based database, a time series database, a wide column database, a key value database, a search database, a cache database, and so forth. The foregoing examples of storage unit 240 types are non-limiting and may not be mutually exclusive, e.g., a database can be both commercial and cloud-based, or both relational and open-source, etc.
[0040] Further, the processor 205, in an embodiment, may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processor 205. In the examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processor 205 may be processor-executable instructions stored on a non-transitory machine-readable storage medium and the hardware for processor 205 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the memory 210 may store instructions that, when executed by the processing resource, implement the processor 205. In such examples, the system 120 may comprise the memory 210 storing the instructions and the processing resource to execute the instructions, or the memory 210 may be separate but accessible to the system 120 and the processing resource. In other examples, the processor 205 may be implemented by electronic circuitry.
[0041] In order for the system 120 to pre-process the data, the processor 205 includes a retrieving unit 220, a selection unit 225, an applying unit 230, and a pre-processing unit 235 communicably coupled to each other. In an embodiment, operations and functionalities of the retrieving unit 220, the selection unit 225, the applying unit 230, and the pre-processing unit 235 can be used in combination or interchangeably.
[0042] Initially, a request is transmitted by the user via the UI 215 for cleaning and normalizing the data from a data frame by applying one or more filters. The data frame is a two-dimensional, tabular data structure used to store and manipulate the data in rows and columns. The data frame allows for easy data analysis and processing, especially in machine learning tasks. Each column in the data frame represents a variable, and each row represents a data point. The data frame contains data to be used for machine learning training.
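For illustration only, the two-dimensional, column-per-variable and row-per-data-point structure of the data frame described above can be sketched in Python as a plain dictionary of column lists; the column names and values below are hypothetical and do not form part of the invention.

```python
# Minimal sketch of a data frame: each key is a column (a variable),
# each index across the column lists is a row (a data point).
data_frame = {
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "latency_ms": [12.4, 15.1, 11.8],
    "packet_loss_pct": [0.2, 0.5, 0.1],
}

def row(df, i):
    """Return the i-th row of the data frame as a dict (one data point)."""
    return {col: values[i] for col, values in df.items()}

print(row(data_frame, 1))
```

In practice a library such as pandas would provide this structure, but the dictionary form above is sufficient to illustrate the row/column semantics the specification relies on.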
[0043] Upon receiving the request, the retrieving unit 220 is configured to retrieve the data from one or more data sources. In an embodiment, the one or more data sources include at least, a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, Comma Separated Values (CSV), JavaScript Object Notation (JSON), eXtensible Markup Language (XML), or text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads it into the memory 210 for further processing.
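As an illustrative sketch of the file-input case, the following Python reads CSV content into a column-oriented structure; the CSV text and column names are hypothetical, and an in-memory stream stands in for a file stored locally or on the server 115.

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk or a server.
csv_text = "timestamp,latency_ms\n2024-01-01,12.4\n2024-01-02,15.1\n"

def retrieve_csv(stream):
    """Load CSV rows into a column-oriented structure (a simple data frame)."""
    reader = csv.DictReader(stream)
    frame = {}
    for record in reader:
        for column, value in record.items():
            frame.setdefault(column, []).append(value)
    return frame

frame = retrieve_csv(io.StringIO(csv_text))
print(frame["latency_ms"])  # ['12.4', '15.1']
```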
[0044] The source path typically refers to the directory or network location where the data files are stored. The retrieving unit 220 fetches the data by following the provided file path. In an exemplary embodiment for the source path, the system 120 stores images in a specific directory. The retrieving unit 220 navigates to a designated source path and retrieves all files that match the required criteria (e.g., .jpg images). The input stream refers to continuous data that is read in real-time from a stream of data (e.g., data being transmitted over the network 105 or generated by sensors). In an exemplary embodiment, the data is being received from an Application Programming Interface (API) or a live data stream, and the retrieving unit 220 fetches it in real-time. The HTTP 2 is a protocol used for communication over the web, which improves upon HTTP/1.1 by offering multiplexing and better performance for handling multiple requests. In an exemplary embodiment, the retrieving unit 220 retrieves the data from the web server using the HTTP 2. The retrieving unit 220 uses HTTP 2 to fetch the data from remote web servers or APIs.
[0045] The DFS is a distributed file system used to store large datasets across multiple machines. The DFS is commonly used in big data environments to store and retrieve large amounts of data. The retrieving unit 220 connects to the DFS to retrieve the files for processing. The NAS is a dedicated file storage system that provides Local Area Network (LAN) access to the data. The NAS allows multiple users or systems to access the data from a centralized storage device. The retrieving unit 220 fetches the data from a NAS device over the network 105. In an exemplary embodiment, if the data is stored on the NAS, the retrieving unit 220 fetches the data via network protocols. Upon retrieving the data from the one or more data sources, the retrieved data is stored in at least a data frame for further processing.
[0046] Upon retrieving the data and storing it in the data frame, the selection unit 225 is configured to select one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters. A KPI is a type of performance measurement. The KPIs evaluate a particular activity (such as projects, programs, products, and other initiatives), which creates an analytical basis for decision making. The KPIs help network operators to assess the efficiency, reliability, and overall health of the network 105. In an embodiment, the one or more KPI parameters include, but are not limited to, bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time. The KPIs are essential for creating effective selection filters. The KPIs help to define the focus of analysis, prioritize relevant data, and enhance the overall efficiency and effectiveness of data-driven decision-making. The selection filters are applied based on the KPIs, which enables the tracking of trends over time and the identification of emerging patterns and potential issues. In AI/ML learning, filtering data by the KPIs can enhance model training. The historical data helps the models learn the patterns that are directly tied to key performance outcomes, improving prediction accuracy.
[0047] The selection of the one or more filters on the retrieved data is based on historical data and current data. The historical data refers to records of events or metrics that have been collected over time. The historical data can span months, years, or even decades and is used to analyze trends, patterns, and changes over time. The current data refers to the most recent records or metrics collected in real-time. The current data reflects the present state of the performance metrics. The historical and current datasets are assessed to identify errors or inconsistencies in the data. The cleaning process performs data cleaning on the historical and current datasets to address the errors, missing values, and format issues. Further, a normalization process is applied to ensure the historical and current datasets are comparable, and the cleaned and normalized datasets are combined for analysis. The normalization and cleaning process ensures that the data used for KPI analysis is accurate, consistent, and reliable, ultimately leading to more informed decision-making.
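The cleaning and normalization steps described above may be sketched as follows. The fill-with-default cleaning rule, the min-max normalization, and the sample KPI values are illustrative assumptions, since the specification does not prescribe particular cleaning or normalization formulas.

```python
def clean(values, fill=0.0):
    """Replace missing entries (None) with a fill value."""
    return [fill if v is None else v for v in values]

def normalize(values):
    """Min-max normalize values into the [0, 1] range so datasets are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

historical = [10.0, None, 30.0]   # hypothetical historical KPI samples
current = [20.0, 25.0]            # hypothetical current KPI samples

# Clean each dataset, combine, and normalize for joint analysis.
combined = normalize(clean(historical) + clean(current))
print(combined)
```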
[0048] Upon selecting the one or more filters on the retrieved data based on historical data and current data, the applying unit 230 is configured to apply the one or more filters on the retrieved data to generate the filtered data. In an embodiment, the one or more filters include at least a date filter. The date filter includes a start date and an end date. The selection unit 225 is further configured to determine a covariance matrix for multiple combinations of one or more start dates from the start date range and one or more end dates from the end date range. The selection unit 225 is further configured to select at least one combination of the start date and the end date having high covariance based on the determined covariance matrix. A pre-defined number of start date and end date pairs (such as 5 pairs) having the highest covariance are derived and sent to the user interface 215 for the user to decide the appropriate dates by checking the cleanliness of the input data under consideration.
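One possible reading of the covariance-based date selection above can be sketched as follows: for each candidate (start date, end date) combination, a covariance is computed over two KPI series restricted to that window, and the pairs with the highest covariance are ranked for presentation to the user. The choice of the two series, the day-indexed dates, and all sample values are assumptions for illustration.

```python
from itertools import product

def covariance(xs, ys):
    """Sample covariance of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical daily KPI series indexed by day number.
throughput = [5, 6, 7, 9, 12, 15, 14, 13, 8, 6]
latency = [50, 48, 47, 44, 40, 36, 37, 38, 46, 49]

start_range = [0, 1, 2]   # candidate start days
end_range = [7, 8, 9]     # candidate end days

# Covariance "matrix": one entry per (start, end) combination.
matrix = {}
for start, end in product(start_range, end_range):
    window = slice(start, end + 1)
    matrix[(start, end)] = covariance(throughput[window], latency[window])

# Top pairs by absolute covariance, to be sent to the UI for final selection.
top_pairs = sorted(matrix, key=lambda k: abs(matrix[k]), reverse=True)[:5]
```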
[0049] As per the above embodiment, the date filter allows the users to specify a time range by selecting the start date and the end date to narrow down the data. In an exemplary embodiment, the user requires the bandwidth utilization reports within the start date and the end date. The start date includes January 1st and the end date includes April 30th. The bandwidth utilization reports are generated to assess whether the current bandwidth allocation meets user demand. The date filter can help the network administrators and security teams to effectively analyze data over specific time frames.
[0050] Upon applying the one or more filters on the retrieved data, the pre-processing unit 235 is configured to pre-process the filtered data. The filtered data includes the retrieved data within the start date and the end date. In an exemplary embodiment, consider a dataset containing network traffic logs for the months of January 2024 to March 2024. The date filter is applied, such that the start date is January 1st, 2024 and the end date is March 31st, 2024. The system 120 selects the traffic logs from January 1st, 2024 to March 31st, 2024. The filtered data includes only the traffic logs from the specified date range, allowing for focused analysis. The pre-processing unit 235 formats the filtered logs, aggregates traffic data, and identifies any anomalies (e.g., unusual spikes in traffic). The pre-processed data provides insights into network usage patterns during the specified period, helping in making informed decisions about resource allocation or identifying potential issues.
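A minimal sketch of applying such a date filter is shown below; the traffic-log records and field names are hypothetical.

```python
from datetime import date

def apply_date_filter(records, start, end):
    """Keep only records whose timestamp falls within [start, end]."""
    return [r for r in records if start <= r["timestamp"] <= end]

# Hypothetical traffic-log records spanning slightly beyond the window.
logs = [
    {"timestamp": date(2023, 12, 31), "bytes": 100},
    {"timestamp": date(2024, 1, 15), "bytes": 250},
    {"timestamp": date(2024, 3, 31), "bytes": 175},
    {"timestamp": date(2024, 4, 1), "bytes": 300},
]

filtered = apply_date_filter(logs, date(2024, 1, 1), date(2024, 3, 31))
print(len(filtered))  # 2
```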
[0051] In an embodiment, the pre-processing unit 235 has provision to incorporate additional data into the pre-processing steps. If more prior data is required to identify the pattern, the pre-processing unit 235 has provision to add prior data from, for example, a previous month or a previous day.
[0052] Upon pre-processing the filtered data, the pre-processed data is stored in the storage unit 240 and is used for Machine Learning (ML) training. The pre-processing is performed to enhance the data by applying the one or more filters, which helps Artificial Intelligence/Machine Learning (AI/ML) models to understand the structure and learn the patterns at a more granular level. The filtered data is provided to the AI/ML model training, which ensures that the patterns are identified and helps in generating more accurate and contextually appropriate responses.
[0053] FIG. 3 is a block diagram of an architecture 300 that can be implemented in the system of FIG.2, according to one or more embodiments of the present disclosure. The architecture 300 of the system 120 includes an integrated system 305, a load balancer 310, and the processor 205. The processor 205 includes the selection unit 225, the applying unit 230, a data source unit 315, and a machine learning training unit 320.
[0054] The architecture 300 of the system 120 is configured to interact with the integrated system 305 and the load balancer 310. The integrated system 305 is configured to access the relevant data in the network 105 and is capable of interacting with the server 115 and the storage unit 240 to collect the data from the one or more data sources. In an embodiment, the one or more data sources include, but are not limited to, the file input, the source path, the input stream, the HTTP 2, the distributed file system, and the NAS.
[0055] The load balancer 310 distributes the one or more data source request traffic across the one or more processors 205. The distribution of the one or more data source request traffic helps in managing and optimizing the workload, ensuring that no single processor is overwhelmed while improving overall system performance and reliability.
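The specification does not state which balancing policy the load balancer 310 uses; as one common choice, a round-robin distribution of data-source requests across processors can be sketched as follows (processor names are hypothetical).

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distribute incoming data-source requests across processors in turn,
    so that no single processor is overwhelmed."""

    def __init__(self, processors):
        self._cycle = cycle(processors)

    def route(self, request):
        """Assign the request to the next processor in rotation."""
        return next(self._cycle), request

balancer = RoundRobinBalancer(["proc-1", "proc-2", "proc-3"])
assigned = [balancer.route(f"req-{i}")[0] for i in range(6)]
print(assigned)
```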
[0056] Upon retrieving the data and storing it in the data frame, the selection unit 225 is configured to select one or more filters on the retrieved data based on at least the type of KPI parameters. In an embodiment, the one or more KPI parameters include, but are not limited to, bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time. The selection of the one or more filters on the retrieved data is based on historical data and current data. The historical and current datasets are assessed to identify errors or inconsistencies in the data. The cleaning process performs data cleaning on the historical and current datasets to address the errors, missing values, and format issues. Further, a normalization process is applied to ensure the historical and current datasets are comparable, and the cleaned and normalized datasets are combined for analysis. The normalization and cleaning process ensures that the data used for KPI analysis is accurate, consistent, and reliable, ultimately leading to more informed decision-making.
[0057] Upon selecting the one or more filters on the retrieved data based on historical data and current data, the applying unit 230 is configured to apply the one or more filters on the retrieved data to generate the filtered data. In an embodiment, the one or more filters include at least the date filter. The date filter includes the start date and the end date. The date filter allows the users to specify a time range by selecting the start date and the end date to narrow down the data. In an exemplary embodiment, the user requires the bandwidth utilization reports within the start date and the end date. The start date includes January 1st and the end date includes April 30th. The bandwidth utilization reports are generated to assess whether the current bandwidth allocation meets user demand. The date filter can help the network administrators and security teams to effectively analyze data over specific time frames.
[0058] The data source unit 315 is configured to store the filtered data. The data source is updated after the pre-processing, and the updated data source is stored. The ML training unit 320 is configured to train an ML model using the filtered data from the data source and applies various machine learning algorithms to the pre-processed data to create predictive or analytical models. Based on the nature of the data, the ML algorithms include, but are not limited to, linear regression, decision trees, neural networks, clustering algorithms, etc.
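As an illustration of one of the algorithms listed above, a minimal ordinary least-squares linear regression over filtered KPI data might look like the following. The hour/throughput figures are invented for the sketch; this is not the claimed training procedure.

```python
def fit_linear(xs, ys):
    # Ordinary least-squares fit of y = a*x + b, one of the ML algorithms
    # named above (linear regression).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Filtered KPI data: hour of day vs. observed throughput (hypothetical numbers).
hours = [0, 1, 2, 3, 4]
throughput = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_linear(hours, throughput)
```

The fitted slope and intercept then serve as a simple predictive model over the pre-processed data.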
[0059] FIG. 4 is a signal flow diagram illustrating a process for normalizing the data, according to one or more embodiments of the present disclosure.
[0060] At 405, the request is initially transmitted by the user via the UI 215 for cleaning and normalizing the data from the data frame by applying the one or more filters.
[0061] At 410, upon receiving the request, the retrieving unit 220 is configured to retrieve the data from the one or more data sources. In an embodiment, the one or more data sources include at least the file input, the source path, the input stream, the HTTP 2, the distributed file system, and the NAS. Upon retrieving the data, the retrieved data is stored in the data frame for further processing.
[0062] At 415, upon retrieving the data and storing it in the data frame, the selection unit 225 is configured to select the one or more filters on the retrieved data based on at least the type of KPI parameters. In an embodiment, the one or more KPI parameters include, but are not limited to, bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time. The selection of the one or more filters on the retrieved data is based on historical data and current data. The historical and current datasets are assessed to identify errors or inconsistencies in the data. The cleaning process performs data cleaning on the historical and current datasets to address the errors, missing values, and format issues. Further, a normalization process is applied to ensure the historical and current datasets are comparable, and the cleaned and normalized datasets are combined for analysis. The normalization and cleaning process ensures that the data used for KPI analysis is accurate, consistent, and reliable, ultimately leading to more informed decision-making.
[0063] At 420, upon selecting the one or more filters on the retrieved data based on historical data and current data, the applying unit 230 is configured to apply the one or more filters on the retrieved data to generate the filtered data. In an embodiment, the one or more filters include at least a date filter. The date filter includes the start date and the end date. The date filter allows the users to specify the time range by selecting the start date and the end date to narrow down the data.
[0064] At 425, upon applying the one or more filters on the retrieved data, the pre-processing unit 235 is configured to pre-process the filtered data based on applying the one or more filters on the retrieved data. The filtered data includes the retrieved data within the start date and the end date. The pre-processing unit 235 formats the filtered logs, aggregates traffic data, and identifies any anomalies (e.g., unusual spikes in traffic). The pre-processed data provides insights into network usage patterns during the specified period, helping in making informed decisions about resource allocation or identifying potential issues.
[0065] In an embodiment, the pre-processing unit 235 has provision to incorporate additional data into the pre-processing steps. If more prior data is required to identify the pattern, the pre-processing unit 235 also has provision to add prior data, such as data from a previous month or a previous day.
[0066] Upon pre-processing the filtered data, the pre-processed data is stored in the storage unit 240 and is used for Machine Learning (ML) training. The pre-processing is performed to enhance the data by applying the one or more filters, which helps Artificial Intelligence/Machine Learning (AI/ML) models to understand the structure and learn the patterns at a more granular level. The filtered data is provided for AI/ML model training, which ensures that the patterns are identified and helps in generating more accurate and contextually appropriate responses.
[0067] FIG. 5 is a flow diagram illustrating a method 500 for normalizing the data, according to one or more embodiments of the present disclosure.
[0068] At step 505, the method 500 includes the step of retrieving the data from the one or more data sources by the retrieving unit 220. In an embodiment, the one or more data sources include at least the file input, the source path, the input stream, the Hyper Text Transfer Protocol 2 (HTTP 2), the Distributed File System (DFS), and the Network Attached Storage (NAS). The file input refers to reading data from the files stored locally or on the server 115. The files can be in different formats, including, but not limited to, the Comma Separated Values (CSV), the JavaScript Object Notation (JSON), the eXtensible Markup Language (XML), or the text files. In an exemplary embodiment, the data is stored in the CSV file, and the retrieving unit 220 fetches the data for processing. The retrieving unit 220 retrieves the data from the file and loads it into the memory 210 for further processing.
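The CSV retrieval step can be sketched as below. An in-memory string stands in for the stored file, and the column names (`day`, `latency_ms`) are hypothetical; the point is only loading rows into a minimal "data frame" (a list of dicts) for further processing.

```python
import csv
import io

# A small in-memory stand-in for a CSV file stored locally or on the server.
csv_text = "day,latency_ms\n2024-01-01,35\n2024-01-02,41\n"

def retrieve(fileobj):
    # Load CSV rows into a list of dicts, a minimal data frame for further processing.
    return [dict(row) for row in csv.DictReader(fileobj)]

frame = retrieve(io.StringIO(csv_text))
```

In practice the file object would come from disk or a network mount (DFS, NAS); `io.StringIO` keeps the sketch self-contained.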
[0069] At step 510, the method 500 includes the step of selecting the one or more filters on the retrieved data based on at least the type of Key Performance Indicator (KPI) parameters by the selection unit 225. In an embodiment, the one or more KPI parameters include, but are not limited to, bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time. The selection of the one or more filters on the retrieved data is based on historical data and current data. The historical and current datasets are assessed to identify errors or inconsistencies in the data. The cleaning process performs data cleaning on the historical and current datasets to address the errors, missing values, and format issues. Further, a normalization process is applied to ensure the historical and current datasets are comparable, and the cleaned and normalized datasets are combined for analysis. The normalization and cleaning process ensures that the data used for KPI analysis is accurate, consistent, and reliable, ultimately leading to more informed decision-making.
[0070] At step 515, the method 500 includes the step of applying the one or more filters on the retrieved data to generate the filtered data by the applying unit 230 based on the historical data and the current data. In an embodiment, the one or more filters include at least a date filter. The date filter includes a start date and an end date. The selection unit 225 is further configured to determine the covariance matrix for multiple combinations of one or more start dates from the start date range and one or more end dates from the end date range. The selection unit 225 is further configured to select at least one combination of the start date and the end date having high covariance based on the determined covariance matrix. A pre-defined number of start date and end date pairs (such as 5 pairs) having the highest covariance are derived and sent to the user interface 215 for the user to decide the appropriate dates by checking the cleanliness of the input data under consideration.
[0071] The date filter allows the users to specify a time range by selecting the start date and the end date to narrow down the data. In an exemplary embodiment, the user requires the bandwidth utilization reports within the start date and the end date. The start date is January 1st and the end date is April 30th. The bandwidth utilization reports are generated to assess whether the current bandwidth allocation meets user demand. The date filter can help the network administrators and security teams to effectively analyze data over specific time frames.
[0072] At step 520, the method 500 includes the step of pre-processing the filtered data based on applying the one or more filters on the retrieved data. The filtered data includes the retrieved data within the start date and the end date. In an exemplary embodiment, the dataset contains network traffic logs for the period of January 2024 to March 2024. The date filter is applied, such that the start date is January 1st 2024 and the end date is March 31st 2024. The system 120 selects the traffic logs from January 1st 2024 to March 31st 2024. The filtered data includes only the traffic logs from the specified date range, allowing for focused analysis. The pre-processing unit 235 formats the filtered logs, aggregates traffic data, and identifies any anomalies (e.g., unusual spikes in traffic). The pre-processed data provides insights into network usage patterns during the specified period, helping in making informed decisions about resource allocation or identifying potential issues.
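The anomaly check mentioned above (flagging unusual spikes in traffic) can be sketched with a simple mean-plus-standard-deviation threshold. The threshold rule and the traffic figures are assumptions for illustration, not the claimed anomaly-detection method.

```python
from statistics import mean, stdev

def find_spikes(traffic, threshold=2.0):
    # Flag samples more than `threshold` standard deviations above the mean,
    # a simple stand-in for the anomaly check described above.
    m, s = mean(traffic), stdev(traffic)
    return [i for i, v in enumerate(traffic) if v > m + threshold * s]

# Hypothetical daily traffic totals (MB) within the filtered date range; one spike.
daily_mb = [100, 98, 103, 97, 101, 500]
spikes = find_spikes(daily_mb)
```

The returned indices point at the days whose traffic is unusually high, which the pre-processing unit could surface alongside the aggregated logs.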
[0073] In an embodiment, the pre-processing unit 235 has provision to incorporate additional data into the pre-processing steps. If more prior data is required to identify the pattern, the pre-processing unit 235 also has provision to add prior data, such as data from a previous month or a previous day.
[0074] Upon pre-processing the filtered data, the pre-processed data is stored in the storage unit 240 and is used for Machine Learning (ML) training. The pre-processing is performed to enhance the data by applying the one or more filters, which helps Artificial Intelligence/Machine Learning (AI/ML) models to understand the structure and learn the patterns at a more granular level. The filtered data is provided for AI/ML model training, which ensures that the patterns are identified and helps in generating more accurate and contextually appropriate responses.
[0075] In one or more embodiments, the AI/ML trained models do not stop at initial predictions and actions. The ML training unit is configured to continuously adapt and evolve based on changing network conditions and user demands. As current data is generated and feedback is received, the AI/ML models refine their predictions, ensuring that the system remains efficient and effective over time, with data normalization and cleaning performed each time by the present system and method.
[0076] In preferred embodiments, the method may also include various steps to collect information from network elements, such as servers and other network functions, to trigger consecutive operational procedures, and to improve the learning methodology for the ML models, and may not be considered strictly limited to the above method steps.
[0077] In another aspect of the embodiment, a non-transitory computer-readable medium has stored thereon computer-readable instructions that, when executed by a processor 205, cause the processor 205 to perform the following operations. The processor 205 is configured to retrieve data from one or more data sources. The processor 205 is configured to select one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters. The processor 205 is configured to apply the one or more filters on the retrieved data to generate the filtered data. The processor 205 is configured to pre-process the filtered data based on applying the one or more filters on the retrieved data.
[0078] A person of ordinary skill in the art will readily ascertain that the illustrated embodiments and steps in description and drawings (FIGS.1-5) are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
[0079] The present disclosure provides technical advancement for implementing the pre-processing techniques for data cleaning and normalization, which are applied to ML models for training data that aid in desired analysis. The trained data is used for efficient information retrieval and knowledge extraction as well as enhancement of the effectiveness of the system. The system pre-processes the data for the ML training, which may contain various types of fields, and filters unnecessary data by means of one or more filters which may be defined by the user as per requirements. The present disclosure identifies the pre-processing techniques to enhance the data by applying the required operation filter, which helps ML models to understand the structure and learn patterns at a more granular level. Further, the enhanced pre-processing of the present disclosure results in more accurate and contextually appropriate responses from the ML model.
[0080] The present invention offers multiple advantages over the prior art, and the above listed are a few examples to emphasize some of the advantageous features. The listed advantages are to be read in a non-limiting manner.
REFERENCE NUMERALS
[0081] Environment - 100
[0082] Network - 105
[0083] User equipment - 110
[0084] Server - 115
[0085] System - 120
[0086] Processor - 205
[0087] Memory - 210
[0088] User interface - 215
[0089] Retrieving unit - 220
[0090] Selection unit - 225
[0091] Applying unit - 230
[0092] Pre-processing unit - 235
[0093] Storage unit - 240
[0094] Integrated system - 305
[0095] Load balancer - 310
[0096] Data source unit - 315
[0097] ML training unit - 320
CLAIMS
We Claim:
1. A method (500) of pre-processing data, the method (500) comprising the steps of:
retrieving, by one or more processors (205), the data from one or more data sources;
selecting, by the one or more processors (205), one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters;
applying, by the one or more processors (205), the one or more filters on the retrieved data to generate the filtered data; and
pre-processing, by the one or more processors (205), the filtered data based on applying the one or more filters on the retrieved data.
2. The method (500) as claimed in claim 1, wherein the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
3. The method (500) as claimed in claim 1, wherein the retrieved data is stored in at least one data frame.
4. The method (500) as claimed in claim 1, wherein selecting, by the one or more processors (205), one or more filters on the retrieved data is based on historical data and current data.
5. The method (500) as claimed in claim 1, wherein the KPI parameters comprise bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time.
6. The method (500) as claimed in claim 1, wherein the one or more filters is at least one of a date filter and wherein the date filter includes a start date range and an end date range.
7. The method (500) as claimed in claim 6, wherein the step of selecting, by the one or more processors (205), one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters comprises the steps of:
determining, by the one or more processors, a covariance matrix for multiple combinations of one or more start dates from the start date range and one or more end dates from the end date range; and
selecting, by the one or more processors, at least one combination of the start date and the end date having high covariance based on the determined covariance matrix.
8. The method (500) as claimed in claim 1, wherein the filtered data comprises the retrieved data within the start date and the end date.
9. The method (500) as claimed in claim 1, wherein the pre-processed data is stored in a storage unit (240) and is used for Machine Learning (ML) training.
10. A system (120) of pre-processing data, the system (120) comprises:
a retrieving unit (220) configured to retrieve data from one or more data sources;
a selection unit (225) configured to select one or more filters on the retrieved data based on at least a type of Key Performance Indicator (KPI) parameters;
an applying unit (230) configured to apply the one or more filters on the retrieved data to generate the filtered data; and
a pre-processing unit (235) configured to pre-process the filtered data based on applying the one or more filters on the retrieved data.
11. The system (120) as claimed in claim 10, wherein the one or more data sources include at least a file input, a source path, an input stream, a Hyper Text Transfer Protocol 2 (HTTP 2), a Distributed File System (DFS), and a Network Attached Storage (NAS).
12. The system (120) as claimed in claim 10, wherein the retrieved data is stored in at least one data frame.
13. The system (120) as claimed in claim 10, wherein the selection of the one or more filters on the retrieved data is based on historical data and current data.
14. The system (120) as claimed in claim 10, wherein the KPI parameters comprise bandwidth utilization, latency, packet loss, network availability, throughput, error rate, connection time, and response time.
15. The system (120) as claimed in claim 10, wherein the one or more filters is at least one of a date filter and wherein the date filter includes a start date range and an end date range.
16. The system (120) as claimed in claim 15, wherein the selection unit (225) is further configured to:
determine a covariance matrix for multiple combinations of one or more start dates from the start date range and one or more end dates from the end date range; and
select at least one combination of the start date and the end date having high covariance based on the determined covariance matrix.
17. The system (120) as claimed in claim 10, wherein the filtered data comprises the retrieved data within the start date and the end date.
18. The system (120) as claimed in claim 10, wherein the pre-processed data is stored in a storage unit (240) and is used for Machine Learning (ML) training.