
Method And System For Prediction Of Missing Data

Abstract: Prediction of missing data is important in various fields such as finance and the share market. Existing methods for data prediction generally lack accuracy in the predicted data. A method and system for prediction of missing data has been provided. The system provides a framework for performing end to end prediction along with data exploration using machine learning techniques. The system performs statistical analysis of data, including computing correlations and outlier detection (box and whisker) and removal. Further, a large number of model-feature combinations are evaluated and the best performing models are shortlisted based on a chosen metric. A multi-dimensional grid is formed which can be configured for specific problems with minimal code changes.


Patent Information

Application #
Filing Date
29 January 2020
Publication Number
31/2021
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application
Patent Number
Legal Status
Grant Date
2024-09-26
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. SHARMA, Aanchal
Tata Consultancy Services Limited, Yantra Park,Pokharan Road Number 2, TCS Approach Rd, Thane West - 400606, Maharashtra, India
2. KUMAR, Nishant
Tata Consultancy Services Limited, Yantra Park,Pokharan Road Number 2, TCS Approach Rd, Thane West - 400606, Maharashtra, India
3. KALELE, Amit
Tata Consultancy Services Limited, Sahayadri Park, Rajiv Gandhi Infotech Park, Hinjewadi Phase 3, Pune - 411057, Maharashtra, India
4. JAIN, Anubhav
Tata Consultancy Services Limited, C-56, Noida Phase-2, Sector 80, Noida - 201305, Uttar Pradesh, India

Specification

Claims:

1. A processor implemented method for prediction of missing data, the method comprising:

receiving a plurality of data sets as an input from one or more sources, wherein the plurality of data sets is specific to a domain and for a predefined time period, wherein the plurality of data sets has a plurality of features corresponding to the domain;
preprocessing, via one or more hardware processors, the plurality of data sets to achieve preprocessed data sets, wherein the preprocessing further comprises:
computing a mean, a standard deviation and a shape of distribution of the plurality of data sets,
performing a set of statistical analysis techniques to gather insights on the plurality of data sets and removing outliers in the plurality of data sets, and
expanding and transforming the plurality of data sets to achieve better accuracy as compared to the input;
partitioning, via one or more hardware processors, the preprocessed plurality of data sets into a test set and a training set depending on a pattern of the missing data and a quantum value of the missing data;
scaling, via the one or more hardware processors, the plurality of features of the test set and the training set to a similar unit of measurements;
selecting, via the one or more hardware processors, a set of features from the scaled plurality of features using a plurality of feature engineering techniques;
building, via the one or more hardware processors, a prediction model using the selected set of features and a regression based machine learning algorithm, wherein the prediction model has a model configuration;
evaluating, via the one or more hardware processors, a plurality of model configurations achieved from a plurality of combinations of techniques used in the scaling, the plurality of feature engineering techniques and the regression based machine learning algorithm using a cross validation method to get the prediction model with a best performing model configuration;
providing, via the one or more hardware processors, a set of requirements to the prediction model with the best performing model configuration; and
predicting, via the one or more hardware processors, the missing data based on the set of requirements using the prediction model with the best performing model configuration.

2. The method of claim 1, wherein the domain is equity asset class and the plurality of features include yield on government bond, price for representative index, exchange rate, interest on currency and peer stocks.

3. The method of claim 1, wherein the plurality of data sets are evaluated on their prediction capability based on the context of the domain.

4. The method of claim 1 further comprising the step of estimating the accuracy of predicted missing data using one of mean of absolute difference (MAD), root mean square error (RMSE) or standard error (SE) method.

5. The method of claim 1 further comprising correlation analysis for computing pairwise correlation between target feature corresponding to the missing data and the selected set of features to define the relationship between them.

6. The method of claim 1, wherein the scaling is performed using at least one of a min-max scaler, a max-abs scaler, a standard scaler, a robust scaler or a normalizer method.

7. The method of claim 1, wherein the selected set of features have higher predictive power for estimating the missing data as compared to the input.

8. The method of claim 1, wherein the one or more sources comprise live data and a plurality of databases related to the domain.

9. The method of claim 1, wherein the plurality of feature engineering techniques comprises principal component analysis (PCA), polynomial expansion, a 2-layer recurrent neural network (RNN) and a Kbest technique.

10. The method of claim 1, wherein the regression based machine learning algorithm comprises at least one of a k-nearest neighbours (KNN) algorithm, a random forest, decision trees, or a Ridge and Lasso technique.

11. A system for prediction of missing data, the system comprises:

an input/output interface for receiving a plurality of data sets as an input from one or more sources, wherein the plurality of data sets is specific to a domain and for a predefined time period, wherein the plurality of data sets has a plurality of features corresponding to the domain;
one or more hardware processors;
a memory in communication with the one or more hardware processors, wherein the memory further comprises:
a preprocessor for preprocessing the plurality of data sets to achieve preprocessed data sets, wherein the preprocessing further comprises:
computing a mean, a standard deviation and a shape of distribution of the plurality of data sets,
performing a set of statistical analysis techniques to gather insights on the plurality of data sets and removing outliers in the plurality of data sets, and
expanding and transforming the plurality of data sets to achieve better accuracy as compared to the input;
a partitioning module for partitioning the preprocessed plurality of data sets into a test set and a training set depending on a pattern of the missing data and a quantum value of the missing data;
a scaler for scaling the plurality of features of the test set and the training set to a similar unit of measurements;
a feature selection module for selecting a set of features from the scaled plurality of features using a plurality of feature engineering techniques;
a prediction model building module for building a prediction model using the selected set of features and a regression based machine learning algorithm, wherein the prediction model has a model configuration;
an evaluation module for evaluating a plurality of model configurations achieved from a plurality of combinations of techniques used in the scaling, the plurality of feature engineering techniques and the regression based machine learning algorithm using a cross validation method to get the prediction model with a best performing model configuration; and
a data prediction module for predicting the missing data based on a set of requirements using the prediction model with the best performing model configuration.
Description:

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of the invention:
METHOD AND SYSTEM FOR PREDICTION OF MISSING DATA

Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The embodiments herein generally relate to the field of prediction of missing data. More particularly, but not specifically, the present disclosure provides a method and system to provide a framework for end to end prediction of missing data along with data exploration using machine learning techniques.

BACKGROUND
[002] Various institutions, such as banks and finance companies, rely on time series data for various operations such as modelling. In this domain, the time series data is defined as the market prices of the risk factors involved over a large period of time. The value of the time series data is exposed to market risk, which is expected to be quantified using metrics such as value at risk (VAR) and expected shortfall (ES). Modelling such risk measures mandates the availability of historical time series market data.
[003] Institutions are grappling with data quality issues, since such historical time series market data is found to contain missing values due to liquidity issues in financial markets. While institutions apply certain traditional methods for treating the missing data, there is scope for machine learning based solutions to provide more accurate results. Such a potential improvement in accuracy may result in reduced disparity between actual and predicted VAR, leading to fewer back testing exceptions. As per some of the newer regulations, like the fundamental review of the trading book (FRTB), such a reduction in model risk can have huge capital implications for banks.
[004] Presently, traditional methods such as the roll over method and waterfall proxy methods are employed by banks to fill the missing data for assessing market risk. A few other methods have explored machine learning but still require further improvement. The significance of machine learning methods for back filling missing data to yield more accurate results is especially high when a large number of data points are missing, as is the case in circumstances like a demerger from a parent entity or an IPO.
SUMMARY
[005] The following presents a simplified summary of some embodiments of the disclosure to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below.
[006] In view of the foregoing, an embodiment herein provides a system for prediction of missing data. The system comprises an input/output interface, one or more hardware processors, and a memory. The input/output interface receives a plurality of data sets as an input from one or more sources, wherein the plurality of data sets is specific to a domain and for a predefined time period, wherein the plurality of data sets has a plurality of features corresponding to the domain. The memory is in communication with the one or more hardware processors. The memory further comprises a preprocessor, a partitioning module, a scaler, a feature selection module, a prediction model building module, an evaluation module and a data prediction module. The preprocessor preprocesses the plurality of data sets to achieve preprocessed data sets, wherein the preprocessing further comprises: computing a mean, a standard deviation and a shape of distribution of the plurality of data sets, performing a set of statistical analysis techniques to gather insights on the plurality of data sets and removing outliers in the plurality of data sets, and expanding and transforming the plurality of data sets to achieve better accuracy as compared to the input. The partitioning module partitions the preprocessed plurality of data sets into a test set and a training set depending on a pattern of the missing data and a quantum value of the missing data. The scaler scales the plurality of features of the test set and the training set to a similar unit of measurements. The feature selection module selects a set of features from the scaled plurality of features using a plurality of feature engineering techniques. The prediction model building module builds a prediction model using the selected set of features and a regression based machine learning algorithm, wherein the prediction model has a model configuration.
The evaluation module evaluates a plurality of model configurations achieved from a plurality of combinations of techniques used in the scaling, the plurality of feature engineering techniques and the regression based machine learning algorithm using a cross validation method to get the prediction model with a best performing model configuration. The data prediction module predicts the missing data based on a set of requirements using the prediction model with the best performing model configuration.
[007] In another aspect, the embodiment herein provides a method for prediction of missing data. Initially, a plurality of data sets is received as an input from one or more sources. The plurality of data sets is specific to a domain and for a predefined time period, wherein the plurality of data sets has a plurality of features corresponding to the domain. In the next step, the plurality of data sets is preprocessed to achieve preprocessed data sets, wherein the preprocessing further comprises: computing a mean, a standard deviation and a shape of distribution of the plurality of data sets, performing a set of statistical analysis techniques to gather insights on the plurality of data sets and removing outliers in the plurality of data sets, and expanding and transforming the plurality of data sets to achieve better accuracy as compared to the input. In the next step, the preprocessed plurality of data sets is partitioned into a test set and a training set depending on a pattern of the missing data and a quantum value of the missing data. Further, the plurality of features of the test set and the training set is scaled to a similar unit of measurements. In the next step, a set of features is selected from the scaled plurality of features using a plurality of feature engineering techniques. Further, a prediction model is built using the selected set of features and a regression based machine learning algorithm, wherein the prediction model has a model configuration. Further, a plurality of model configurations achieved from a plurality of combinations of techniques used in the scaling, the plurality of feature engineering techniques and the regression based machine learning algorithm is evaluated using a cross validation method to get the prediction model with a best performing model configuration. In the next step, a set of requirements is provided to the prediction model with the best performing model configuration.
And finally, the missing data is predicted based on the set of requirements using the prediction model with the best performing model configuration.
[008] It should be appreciated by those skilled in the art that any block diagram herein represents conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer-readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.

BRIEF DESCRIPTION OF THE DRAWINGS
[009] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[010] FIG. 1 shows a block diagram of a system for prediction of missing data according to an embodiment of the present disclosure.
[011] FIG. 2 shows an architectural view of the system and approach for prediction of missing data according to an embodiment of the disclosure.
[012] FIGS. 3A-3B show a flowchart illustrating the steps involved in prediction of missing data according to an embodiment of the present disclosure.
[013] FIGS. 4A-4B show a detailed flowchart illustrating steps involved in prediction step according to an embodiment of the disclosure.
[014] FIGS. 5A-5C show a detailed flowchart illustrating steps involved in training step according to an embodiment of the disclosure.
[015] FIGS. 6A, 6B and 6C show graphical representation of comparative analysis of error metrics for test scenario according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
[016] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[017] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 6C, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[018] According to an embodiment of the disclosure, a system 100 for prediction of missing data is shown in the block diagram of FIG. 1. An architectural overview of the system 100 is shown in FIG. 2. The system 100 is configured to provide a framework for performing end to end prediction along with data exploration using machine learning techniques. The system 100 is configured to perform statistical analysis of data, including computing correlations and outlier detection (box and whisker) and removal. Further, the system 100 evaluates a large number of model-feature combinations and shortlists the best performing models based on a chosen metric. A multi-dimensional grid is formed which can be configured for specific problems with minimal code changes.
[019] According to an embodiment of the disclosure, the system 100 comprises an input/output interface 102, one or more hardware processors 104 and a memory 106 in communication with the one or more hardware processors 104, as shown in the block diagram of FIG. 1. The one or more hardware processors 104 work in communication with the memory 106. The one or more hardware processors 104 are configured to execute a plurality of algorithms stored in the memory 106. The memory 106 further includes a plurality of modules for performing various functions. The memory 106 comprises a preprocessor 108, a partitioning module 110, a scaler 112, a feature selection module 114, a prediction model building module 116, an evaluation module 118 and a data prediction module 120. The memory 106 may further comprise other modules for performing certain functions.
[020] According to an embodiment of the disclosure, the input/output interface 102 (I/O interface) is configured to receive a plurality of data sets as input data from one or more sources, wherein the plurality of data sets is specific to a domain and for a predefined time period. The plurality of data sets has a plurality of features corresponding to the domain. It should be appreciated that the plurality of data sets may also be referred to as historical data. The one or more sources comprise live data, data from databases etc. The I/O interface 102 is accessible to the user via a smartphone, laptop or desktop configuration, thus giving the user the freedom to interact with the system 100 from anywhere at any time. The I/O interface 102 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a camera device, and a printer. The I/O interface 102 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite.
[021] In an example the input data is gathered by collecting available historical data set of an equity asset class, of an organization or an entity in given geography. A plurality of input features are selected from the input data. The parameters considered for input feature selection depend on the asset class, geography as well as a sector from which the security is issued.
[022] Various parameters are considered for the collation of the historical window. The input features are selected based on evaluation of their predictive power in the context of domain considerations such as the asset class, geography and sector of the output feature.
[023] According to an embodiment of the disclosure, the memory 106 comprises the preprocessor 108. The preprocessor 108 is configured to preprocess the plurality of data sets to achieve preprocessed data sets. The plurality of data sets is analysed to derive output in the form of descriptive statistics for each of the plurality of data sets. Further, a correlation analysis using boxplot graphs is performed and outliers are removed. Various methods are used for the pre-processing of data sets.
[024] According to an embodiment of the disclosure, the descriptive statistics are obtained by computing key statistics that summarize the central tendency (or mean), dispersion (standard deviation) and shape of a data set's distribution (through quartile analysis). Further, the correlation analysis is performed by computing the pairwise correlation between a set of target features and the input set of features to define the relationship between them. The output of this step is therefore a correlation matrix. Three variants of correlation are computed: the Pearson coefficient, the Kendall coefficient and the Spearman coefficient.
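By way of a non-limiting sketch, the descriptive statistics and the three correlation variants can be computed with pandas. The feature names ("libor", "target") and values below are illustrative, not data from the disclosure:

```python
import pandas as pd

# Illustrative two-feature frame; names and values are examples only.
df = pd.DataFrame({
    "libor":  [0.041, 0.043, 0.045, 0.044, 0.048],
    "target": [1.10, 1.15, 1.22, 1.20, 1.31],
})

# Central tendency, dispersion and quartiles (count, mean, std, min/25%/50%/75%/max).
stats = df["libor"].describe()

# The three correlation variants named above; each call yields a correlation matrix.
pearson  = df.corr(method="pearson")
kendall  = df.corr(method="kendall")
spearman = df.corr(method="spearman")
```

Each matrix holds the pairwise correlation between the target feature and every input feature, as described above.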
[025] According to an embodiment of the disclosure, the preprocessing also comprises outlier detection. Outlier detection involves identification of extreme values that deviate from the majority of the plurality of data sets. The outliers are detected based on statistical properties of the input data. The process adopted for outlier identification is as follows: first, a quartile analysis of the data is performed; next, an inter-quartile distance is determined, followed by deriving upper and lower limits based on the inter-quartile distance. Finally, all the values less than the lower limit or greater than the upper limit are flagged as outliers. The output is an overall count of outliers for each input feature, the count of repetition of each value for a given risk factor, as well as the number of occurrences of repeated values. For example, if a value of 0.045 is repeated thrice for the LIBOR feature, then the detected outlier count would have the description: (0.045, 3). Additionally, outliers are also graphically depicted through box-and-whisker plots.
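The inter-quartile rule described above can be sketched as follows; the multiplier of 1.5 is the conventional box-and-whisker choice and is an assumption, since the disclosure does not state it:

```python
import numpy as np

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], per the quartile
    analysis described above. k=1.5 is an assumed conventional default."""
    q1, q3 = np.percentile(values, [25, 75])   # quartile analysis
    iqr = q3 - q1                              # inter-quartile distance
    lower, upper = q1 - k * iqr, q3 + k * iqr  # derived limits
    return [v for v in values if v < lower or v > upper]

# Illustrative risk-factor series with one extreme value.
data = [0.044, 0.045, 0.045, 0.045, 0.046, 0.044, 0.30]
outliers = flag_outliers(data)
```

Here only the extreme value 0.30 falls outside the derived limits and is flagged.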
[026] According to an embodiment of the disclosure, the memory 106 also comprises the partitioning module 110. The partitioning module 110 is configured to partition the preprocessed plurality of data sets into a test set and a training set depending on a pattern of the missing data and a quantum value of the missing data. Once the data exploration is done and actionable insights (if any) performed on the data set, the preprocessed plurality of data sets is partitioned into test and training sets depending upon the pattern in which data is found to be missing and the quantum of values missing.
[027] The partitioning of data can be explained with the help of the following examples. Let us assume the overall data set is collected with a regression window of Jan 2010 to Jan 2016. There are at least four possibilities, as explained below:
• If a few values of data are missing randomly. For example, if data is found to be missing for 4 dates only - 31-Jan-2010, 15-Feb-2010, 6-Mar-2010 and 12-Apr-2010 - then the test set would comprise the data of input features for these 4 dates only, since the target feature would have to be predicted for these 4 days. The training set would comprise the remainder of the data set.
• If a few values of data are missing consecutively. For example, if data is found to be missing for 10 consecutive dates only - say from 31-Jan-2010 to 9-Feb-2010 - then the test set would comprise the data of input features for these 10 consecutive dates, since the target feature would have to be predicted for these 10 days. The training set would comprise the remainder of the data set.
• If a larger quantum of values is missing due to an IPO scenario. Suppose a company got publicly listed in 2015, but a bank requires the stock price of the company from 2014 onwards for computation of market risk specific metrics. Then the test set would comprise the data of input features for the year 2014-2015, since the target feature would have to be predicted for that 1 year. The training set would comprise the remainder of the data set.
• If a larger quantum of values is missing due to a demerger scenario. Suppose a company demerged from its parent company in 2014, but a bank requires the stock price of the company from 2014 onwards for computation of market risk specific metrics. Then the test set would comprise the data of input features for the year 2014-2015, since the target feature would have to be predicted for that 1 year. The training set would comprise the remainder of the data set.
[028] According to an embodiment of the disclosure, the memory 106 comprises the scaler 112. The scaler 112 is configured to scale the plurality of features of the test set and the training set to a similar unit of measurements. Scaling is performed to eliminate the bias that may emerge due to different units of measurement of each of the plurality of features. Additionally, the machine learning output may also suffer in the event that a feature has a variance that is orders of magnitude larger than the others. The scaling is performed using at least one of the min-max scaler, max-abs scaler, standard scaler, robust scaler or normalizer method. The use of any other method for scaling is well within the scope of this disclosure.
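The listed scalers correspond directly to scikit-learn's preprocessing classes, which can illustrate the step; the training/test values are made up. Note the scaler is fitted on the training set only and then reused on the test set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different units of measurement (illustrative values).
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test  = np.array([[1.5, 150.0]])

# Any of MinMaxScaler, MaxAbsScaler, StandardScaler, RobustScaler or
# Normalizer could be substituted here.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)  # fit parameters from training data
X_test_s  = scaler.transform(X_test)       # apply the same parameters to test data
```

After scaling, both features lie on a similar unit of measurement, eliminating the variance bias described above.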
[029] According to an embodiment of the disclosure, the memory 106 comprises the feature selection module 114. The feature selection module 114 is configured to select a set of features from the scaled plurality of features using a plurality of feature engineering techniques. Initially, a few distinguishing features are obtained from the scaled plurality of features. These distinguishing features are then used to select the set of features. The selected set of features has a higher predictive power.
[030] Various techniques can be used for distinguishing the features. One is the principal component analysis (PCA) technique, which orthogonally transforms the correlated input features into linearly uncorrelated features. In another example, polynomial expansion can also be used. Sometimes, the machine learning framework might give a better output if the input features are polynomially expanded. This higher accuracy is the upshot of non-linear relationships between the input features.
[031] The feature selection module 114 further uses either regression score based feature selection or other supervised learning method based feature elimination to obtain important features.
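The feature engineering techniques above (polynomial expansion, PCA, and regression-score based selection) can be sketched with scikit-learn on synthetic data; the data and dimensions are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Synthetic target: only feature 0 carries predictive power.
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# Polynomial expansion: adds squared and interaction terms.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# PCA: orthogonal transform into linearly uncorrelated components.
X_pca = PCA(n_components=3).fit_transform(X)

# Regression-score based selection (a KBest technique).
selector = SelectKBest(score_func=f_regression, k=2)
X_best = selector.fit_transform(X, y)
```

With this construction, the selector retains feature 0, the one with the highest predictive power for the target, mirroring the module's purpose.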
[032] According to an embodiment of the disclosure, the memory 106 comprises the prediction model building module 116. The prediction model building module 116 is configured to build a prediction model using the selected set of features and a regression based machine learning algorithm. The prediction model has a model configuration. The regression based machine learning (ML) algorithm is one of k-nearest neighbours (KNN), Random Forest, Decision Trees, and Ridge and Lasso (each with its own set of hyper-parameters).
[033] According to an embodiment of the disclosure, the memory 106 comprises the evaluation module 118. The evaluation module 118 is configured to evaluate a plurality of model configurations achieved from a plurality of combinations of techniques used in the scaling, the plurality of feature engineering techniques and the regression based machine learning algorithm using a cross validation method to get the prediction model with a best performing model configuration.
[034] The selected set of features is fed as input into various regression based machine learning (ML) algorithms such as k-nearest neighbours (KNN), Random Forest, Decision Trees, and Ridge and Lasso (each with its own set of hyper-parameters). Each model configuration results from a combination of the various scaling, feature engineering and regression algorithms, each with their own set of hyper-parameters. This results in a very high number of model configurations to be validated. Therefore, a grid search mechanism is utilized to evaluate the model performance of multiple configurations and select the best performing configuration for prediction.
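The multi-dimensional grid over scalers, feature selectors and regressors can be sketched with a scikit-learn Pipeline and GridSearchCV evaluated by cross validation; the grid values and synthetic data are illustrative, not the configurations of the disclosure:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=80)

pipe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectKBest(score_func=f_regression)),
                 ("model", Ridge())])

# One grid axis per stage: scaler choice, number of selected features,
# regressor choice and its hyper-parameters.
grid = [
    {"scale": [StandardScaler(), MinMaxScaler()],
     "select__k": [2, 3],
     "model": [Ridge()], "model__alpha": [0.1, 1.0]},
    {"scale": [StandardScaler()],
     "select__k": [2, 3],
     "model": [KNeighborsRegressor()], "model__n_neighbors": [3, 5]},
]

search = GridSearchCV(pipe, grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best = search.best_estimator_   # the best performing model configuration
```

Extending the grid to a new problem is then a matter of adding entries, which reflects the "minimal code changes" property described above.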
[035] According to an embodiment of the disclosure, the memory 106 comprises the data prediction module 120. The data prediction module 120 is configured to predict the missing data based on a set of requirements using the prediction model with the best performing model configuration.
[036] According to an embodiment of the disclosure, the system 100 may also be configured to estimate the accuracy of the predicted missing data using one of the mean of absolute difference (MAD), root mean square error (RMSE) or standard error (SE) methods. The output of this step is therefore error estimates like MAD, RMSE and SE. This is followed by a comparative analysis of the ML predicted output vs. the output of traditional methods, to assess if the ML algorithms provide a more accurate result.
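A minimal sketch of the three error estimates, assuming their standard textbook definitions since the disclosure does not spell out the formulas:

```python
import numpy as np

def error_metrics(actual, predicted):
    """MAD, RMSE and SE of the prediction errors (standard definitions
    assumed; SE is taken as the standard error of the mean error)."""
    diff = np.asarray(actual) - np.asarray(predicted)
    mad  = np.mean(np.abs(diff))                    # mean of absolute difference
    rmse = np.sqrt(np.mean(diff ** 2))              # root mean square error
    se   = np.std(diff, ddof=1) / np.sqrt(len(diff))  # standard error
    return mad, rmse, se

# Illustrative actual vs. predicted values.
mad, rmse, se = error_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

The same three numbers can then be computed for a traditional method's output to perform the comparative analysis described above.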
[037] In operation, a flowchart 200 illustrating a method for prediction of missing data is shown in FIGS. 3A-3B. Initially, at step 202, the plurality of data sets is received as an input from one or more sources. The one or more sources may include a database or any other live data. The plurality of data sets is specific to a domain such as equity, finance etc., and is for a predefined time period. The plurality of data sets has the plurality of features corresponding to the domain. At step 204, the plurality of data sets is preprocessed to achieve the preprocessed data sets. The preprocessing comprises various steps. Preprocessing may include computing a mean, a standard deviation and a shape of distribution of the plurality of data sets. Preprocessing further includes performing a set of statistical analysis techniques to gather insights on the plurality of data sets and removing outliers in the plurality of data sets. Preprocessing may further include expanding and transforming the plurality of data sets to achieve better accuracy as compared to the input.
[038] In the next step 206, the preprocessed plurality of data sets is partitioned into a test set and a training set depending on a pattern of the missing data and a quantum value of the missing data, as explained in the example above. At step 208, the plurality of features of the test set and the training set is scaled to a similar unit of measurement. At step 210, a set of features is selected from the scaled plurality of features using a plurality of feature engineering techniques.
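Steps 206–210 can be sketched with scikit-learn; the scaler, the `SelectKBest` selector and the synthetic data are illustrative assumptions, standing in for whichever scaling and feature engineering techniques a specific deployment configures.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.05, size=60)

# Step 206: partition first, then fit the scaler on the training set only
# so that test-set statistics do not leak into training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 208: scale all features to a similar unit of measurement.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Step 210: one possible feature selection technique — keep the k features
# most correlated with the target.
selector = SelectKBest(f_regression, k=2).fit(X_tr_s, y_tr)
chosen = selector.get_support(indices=True)
```

Here only columns 0 and 2 carry signal, so the selector should recover them.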
[039] In the next step 212, the prediction model is built using the selected set of features and a regression based machine learning algorithm. The prediction model always has a model configuration. At step 214, the plurality of model configurations obtained from a plurality of combinations of techniques used in the scaling, the plurality of feature engineering techniques and the regression based machine learning algorithm is evaluated using a cross validation method to get the prediction model with a best performing model configuration. Further, at step 216, a set of requirements is provided to the prediction model with the best performing model configuration. And finally, at step 218, the missing data is predicted based on the set of requirements using the prediction model with the best performing model configuration.
[040] According to an embodiment of the disclosure, FIG. 4A-4B shows a detailed flowchart 300 illustrating steps involved in the prediction step. At step 302, the test data set, list of features, label/output, saved model and configuration are read as input. At step 304, the features [x] and label/output [y] data sets are separated from the input. At step 306, the feature expansion status is read from the loaded configuration. At step 308, it is checked whether the features have been expanded or not. If yes, then at step 310, a flag is read from the loaded configuration.
[041] Further, at step 312, the flag value is checked to determine whether it is “1”, “2” or “3”. Depending on the value of the flag, an expanded feature generation method is selected. If the flag value is “1”, a polynomial transformation is applied. If the flag value is “2”, a deep neural network is applied. If the flag value is “3”, gradient boosting is applied. At step 314, expanded features [x_new] are generated using the selected method. At step 316, a normalizer is read from the loaded configuration. At step 318, the normalizer is applied. At step 320, the saved model is applied. And finally, at step 322, a predicted data set is generated and the results are saved.
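The flag-based dispatch at steps 312–314 can be sketched as follows. The patent does not say how the neural network or gradient boosting models produce expanded features, so appending their fitted predictions as an extra feature column is an assumption made for illustration; `expand_features` and its signature are likewise hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor

def expand_features(X, y, flag):
    """Generate expanded features [x_new] according to the configuration flag."""
    if flag == 1:  # polynomial transformation
        return PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    if flag == 2:  # deep neural network: append its fitted prediction (assumed)
        mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500,
                           random_state=0).fit(X, y)
        return np.column_stack([X, mlp.predict(X)])
    if flag == 3:  # gradient boosting: append its fitted prediction (assumed)
        gb = GradientBoostingRegressor(random_state=0).fit(X, y)
        return np.column_stack([X, gb.predict(X)])
    raise ValueError("flag must be 1, 2 or 3")

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + X[:, 1]
```

For two input features, flag “1” yields five columns (x1, x2, x1², x1·x2, x2²), while flags “2” and “3” yield the original columns plus one generated column.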
[042] According to an embodiment of the disclosure, FIG. 5A-5C shows a detailed flowchart 400 illustrating steps involved in the training step. At step 402, time series data is provided as input. At step 404, it is checked whether NaN (Not a Number) values need to be removed from the input or not. If true, then at step 406, records with NaN are detected and the corresponding records are deleted. If false, then at step 408, it is checked whether outliers need to be removed or not. If true, then at step 410, outliers are detected and the corresponding records are deleted. At step 412, the output is the filtered time series data.
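Steps 404–412 can be sketched with pandas; the price series below and the 1.5×IQR fences are illustrative assumptions.

```python
import numpy as np
import pandas as pd

ts = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6, freq="D"),
    "price": [100.0, 101.5, np.nan, 99.8, 500.0, 100.7],
})

# Step 406: detect records with NaN and delete them.
ts = ts.dropna(subset=["price"]).reset_index(drop=True)

# Step 410: detect outliers via box-and-whisker fences and delete them.
q1, q3 = ts["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = ts["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Step 412: the filtered time series data.
filtered = ts[mask].reset_index(drop=True)
```

The NaN row and the 500.0 spike are both dropped, leaving four clean records.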
[043] At step 414, it is checked whether shuffling is enabled or not, and depending on the shuffle, the data set is split into train and test data sets either randomly or consecutively. At step 416, the train data set, list of features and label/output are read. At step 418, the features [x] and label/output [y] data sets are separated from the read data set. At step 420, it is checked whether feature expansion needs to be done or not. If true, a flag value is provided as user input.
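The shuffle-dependent split at step 414 maps directly onto scikit-learn's `train_test_split`; with `shuffle=False` the test set is the consecutive tail of the (time-ordered) data, while `shuffle=True` samples it randomly. The toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 time-ordered records
y = np.arange(10)

# shuffle=False -> consecutive split (preserves time order);
# shuffle=True  -> random split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)
```

With `shuffle=False` and `test_size=0.3`, the last three records form the test set.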
[044] Further, at step 422, the flag value is checked to determine whether it is “1”, “2” or “3”. Depending on the value of the flag, an expanded feature generation method is selected. If the flag value is “1”, a polynomial transformation is applied. If the flag value is “2”, a deep neural network is applied. If the flag value is “3”, gradient boosting is applied. At step 424, expanded features [x_new] are generated using the selected method. At step 426, a normalizer is applied. At step 428, hyper-parameters are set for the transformer and the transformer is applied. At step 430, hyper-parameters for the model are set and cross validation is performed on the model and data. At step 432, a score is generated and the best score of all scores, along with the corresponding model, is updated. And finally, at step 434, it is checked whether the list of normalizers, the list of transformers and the list of models have been exhausted (achieved a “NULL” value). If not, the steps 426 to 432 are repeated; otherwise, the method is stopped.
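The loop over normalizers, transformers and models with best-score tracking (steps 426–434) can be written as an explicit triple loop. The particular normalizers, transformers, models and synthetic data below are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=50)

best = {"score": -np.inf, "config": None}
for norm in (StandardScaler(), MinMaxScaler()):            # list of normalizers
    for transform in (PolynomialFeatures(1), PolynomialFeatures(2)):  # transformers
        for model in (Ridge(alpha=0.1), Lasso(alpha=0.01)):           # models
            pipe = make_pipeline(norm, transform, model)
            # Step 430: cross validation on the model and data.
            score = cross_val_score(pipe, X, y, cv=3, scoring="r2").mean()
            # Step 432: keep the best score and corresponding model.
            if score > best["score"]:
                best = {"score": score, "config": pipe}
```

Since the target depends on a squared term, a degree-2 configuration should win here; the loop terminates naturally once every (normalizer, transformer, model) combination has been tried.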
[045] According to an embodiment of the disclosure, the system 100 can also be explained with the help of the following example as shown in FIGS. 6A, 6B and 6C. Test scenarios were identified that comprised varying combinations of data missing in sequence and/or at random. From a regulatory perspective, a preliminary filter was first performed to screen out the model configurations for which the standard deviation of the predicted results is greater than the standard deviation of the original data. Error estimates were subsequently evaluated on the filtered result set to statistically ascertain the accuracy of the predicted results.
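The preliminary volatility filter can be sketched as below. Reading the filter as discarding configurations whose predictions are more volatile than the original data is an interpretation of the paragraph above, and the function name and data are illustrative.

```python
import numpy as np

def volatility_filter(original, predictions_by_config):
    """Keep only configurations whose predicted series does not exceed the
    standard deviation of the original data (assumed filter direction)."""
    ref = np.std(original, ddof=1)
    return {name: pred for name, pred in predictions_by_config.items()
            if np.std(pred, ddof=1) <= ref}

original = np.array([10.0, 10.2, 9.9, 10.1, 10.0])
configs = {
    "ridge": np.array([10.0, 10.1, 9.95, 10.05, 10.0]),   # similar volatility
    "noisy": np.array([8.0, 12.5, 9.0, 11.5, 10.0]),      # inflated volatility
}
kept = volatility_filter(original, configs)
```

Error estimates (MAD, RMSE, SE) would then be computed only on the configurations that survive this filter.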
[046] The test results reveal that the machine learning models yield consistently more accurate results, as evidenced by lower error metrics such as root mean square error (RMSE), mean absolute difference (MAD) and standard error (SE) estimates, compared to the traditional technique of data roll-over across all the model configurations. The predicted MAD/RMSE/SE is the error estimate of the result obtained via machine learning techniques, while the price MAD/RMSE/SE and return MAD/RMSE/SE are the error estimates for the results obtained using traditional methods.
[047] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[048] The embodiments of the present disclosure herein solve the problems of inaccurate and incomplete prediction of missing data. The disclosure provides a method and system for prediction of missing data using machine learning techniques.
[049] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computers like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[050] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[051] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[052] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[053] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

Documents

Application Documents

# Name Date
1 202021003988-STATEMENT OF UNDERTAKING (FORM 3) [29-01-2020(online)].pdf 2020-01-29
2 202021003988-REQUEST FOR EXAMINATION (FORM-18) [29-01-2020(online)].pdf 2020-01-29
3 202021003988-FORM 18 [29-01-2020(online)].pdf 2020-01-29
4 202021003988-FORM 1 [29-01-2020(online)].pdf 2020-01-29
5 202021003988-FIGURE OF ABSTRACT [29-01-2020(online)].jpg 2020-01-29
6 202021003988-DRAWINGS [29-01-2020(online)].pdf 2020-01-29
7 202021003988-COMPLETE SPECIFICATION [29-01-2020(online)].pdf 2020-01-29
8 Abstract1.jpg 2020-02-04
9 202021003988-Proof of Right [19-02-2020(online)].pdf 2020-02-19
10 202021003988-FORM-26 [12-11-2020(online)].pdf 2020-11-12
11 202021003988-FER.pdf 2021-10-28
12 202021003988-OTHERS [19-01-2022(online)].pdf 2022-01-19
13 202021003988-FER_SER_REPLY [19-01-2022(online)].pdf 2022-01-19
14 202021003988-COMPLETE SPECIFICATION [19-01-2022(online)].pdf 2022-01-19
15 202021003988-CLAIMS [19-01-2022(online)].pdf 2022-01-19
16 202021003988-PatentCertificate26-09-2024.pdf 2024-09-26
17 202021003988-IntimationOfGrant26-09-2024.pdf 2024-09-26

Search Strategy

1 SearchHistoryE_26-10-2021.pdf

ERegister / Renewals

3rd: 26 Dec 2024

From 29/01/2022 - To 29/01/2023

4th: 26 Dec 2024

From 29/01/2023 - To 29/01/2024

5th: 26 Dec 2024

From 29/01/2024 - To 29/01/2025

6th: 26 Dec 2024

From 29/01/2025 - To 29/01/2026