
Methods And Systems For Determining Missing Data In Imbalanced Datasets Using Automated Predicting Functions

Abstract: This disclosure relates to determining missing data in imbalanced datasets using automated predicting functions. State-of-the-art methods are time bound, meaning they depend on time series data, involve high computation cost, do not factor in correlations between features, and may also end up introducing bias in the data. The present disclosure provides a high-performance, faster, and efficient method for determining missing data in a plurality of imbalanced datasets by first determining dependencies across a plurality of features in the plurality of imbalanced datasets and then automatically generating a set of predictive functions using machine learning models. The automatically generated set of predictive functions provides good coverage of the plurality of imbalanced datasets and is applied to a plurality of generative models. The output of the method of the present disclosure is a set of probabilistic labels that are assigned to one or more missing values in the plurality of imbalanced datasets.


Patent Information

Application #
201921050826
Filing Date
09 December 2019
Publication Number
24/2021
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
ip@legasis.in
Parent Application
Patent Number
Legal Status
Grant Date
2024-05-16
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. KUNDE, Shruti
Tata Consultancy Services Limited, Olympus A, Hiranandani Estate, Patlipada, off Ghodbunder Road, Thane (W) - 400607, Maharashtra, India
2. MISHRA, Mayank
Tata Consultancy Services Limited, Olympus A, Hiranandani Estate, Patlipada, off Ghodbunder Road, Thane (W) - 400607, Maharashtra, India
3. NAMBIAR, Manoj
Tata Consultancy Services Limited, Olympus A, Hiranandani Estate, Patlipada, off Ghodbunder Road, Thane (W) - 400607, Maharashtra, India
4. PANDIT, Amey
Tata Consultancy Services Limited, Olympus A, Hiranandani Estate, Patlipada, off Ghodbunder Road, Thane (W) - 400607, Maharashtra, India
5. SHROFF, Gautam
Tata Consultancy Services Limited, Block C, Kings Canyon, ASF Insignia, Gurgaon - Faridabad Road, Gawal Pahari, Gurgaon - 122003, Haryana, India
6. GUPTA, Shashank
315B, Shanti Nagar, Durgapura, Jaipur - 302018, Rajasthan, India

Specification

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
METHODS AND SYSTEMS FOR DETERMINING MISSING DATA IN IMBALANCED DATASETS USING AUTOMATED PREDICTING FUNCTIONS

Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application claims priority from Indian provisional patent application no. 201921050826, filed on December 09, 2019. The entire contents of the aforementioned application are incorporated herein by reference.

TECHNICAL FIELD
The disclosure herein generally relates to determining missing data in imbalanced datasets, and, more particularly, to methods and systems for determining missing data in imbalanced datasets using automated predicting functions.

BACKGROUND
With advances in technology, machine learning, classification, and data science-based models have been widely used in applications across multiple domains. These models are built on many real-world datasets that contain missing data, such as missing feature values, for varied reasons. These missing feature values are usually represented by placeholders such as blanks or NaNs. It is well known that completeness of a dataset is one factor which can significantly affect the quality of output of a machine learning model. When these missing feature values are not handled appropriately during model building, the model is not considered reliable, and any decision based on the model may be prone to error and may further incur loss for an organization.
There exist multiple techniques for handling imbalanced datasets that contain missing data. A few conventional methods delete rows that have missing values. However, deleting rows risks losing data points that may carry valuable information. Further, some conventional methods impute values from the existing part of the imbalanced datasets. However, imputation of missing data may not efficiently handle the very large and distributed data sources that are now encountered in practice. Other conventional methods ignore the missing values and let machine learning algorithms handle the problem. It can be inferred from the above that different machine learning algorithms handle missing values in different ways. Some machine learning algorithms provide only basic solutions, such as replacing missing values with the mean or median of all values in a column, with the most frequent value, or with zeros, and may end up introducing bias in the data.
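By way of illustration, the following is a minimal sketch (not part of the original disclosure; column names and values are hypothetical) of the conventional imputation baselines criticized above, using scikit-learn's SimpleImputer:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Hour_of_day": [13, 19, np.nan, 22],
                   "Network": ["WIFI", np.nan, "4G", "WIFI"]})

# mean imputation for a numeric column (can bias skewed distributions)
df["Hour_of_day"] = SimpleImputer(strategy="mean").fit_transform(df[["Hour_of_day"]]).ravel()
# most-frequent imputation for a categorical column (can over-represent the majority class)
df["Network"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["Network"]]).ravel()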
SUMMARY
[005] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method, comprising: receiving, via one or more hardware processors, a plurality of imbalanced datasets pertaining to one or more domains; performing an exploratory data analysis (EDA) mechanism on the plurality of imbalanced datasets to determine a plurality of features corresponding to the plurality of imbalanced datasets; determining, based on the EDA mechanism, a correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features; and obtaining, based on the identified one or more correlations and the identified one or more dependencies, a subset of features from the plurality of features required to predict one or more missing values pertaining to a first test feature. In an embodiment, the subset of features from the plurality of features is determined based on correlation values obtained from the correlation matrix. The method further comprises automatically generating, using at least one machine learning model from a plurality of machine learning models comprised in the memory, a set of predictive functions. In an embodiment, the at least one machine learning model from the plurality of machine learning models to generate the set of predictive functions is selected based on one or more model parameters. In an embodiment, the one or more model parameters include accuracy, precision, recall, and F1 score. In an embodiment, the method further comprises applying the set of predictive functions on the plurality of imbalanced datasets to generate a sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets, wherein the generated sparse matrix is further split to generate a plurality of subsets of the sparse matrix; obtaining a set of probabilistic labels for each data point of each of the plurality of subsets of the sparse matrix by inputting each of the plurality of subsets of the sparse matrix to at least one of a plurality of generative models; concatenating each set of the probabilistic labels obtained for each data point of each subset of the sparse matrix to obtain a final set of probabilistic labels for each of the plurality of imbalanced datasets; and incrementally predicting, based on the final set of probabilistic labels, one or more missing values of one or more test features of the plurality of imbalanced datasets. In an embodiment, the first test feature is distinct from the one or more test features. In an embodiment, the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets enables parallel and faster processing and helps in achieving high performance.
[006] In another aspect, there is provided a system comprising: one or more data storage devices (102) operatively coupled to one or more hardware processors (104) and configured to store instructions configured for execution via the one or more hardware processors to: receive, via the one or more hardware processors, a plurality of imbalanced datasets pertaining to one or more domains; perform an exploratory data analysis (EDA) mechanism on the plurality of imbalanced datasets to determine a plurality of features corresponding to the plurality of imbalanced datasets; determine, based on the EDA mechanism, a correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features; and obtain, based on the identified one or more correlations and the identified one or more dependencies, a subset of features from the plurality of features required to predict one or more missing values pertaining to a first test feature. In an embodiment, the subset of features from the plurality of features is determined based on correlation values obtained from the correlation matrix. The one or more hardware processors are further configured by the instructions to automatically generate, using at least one machine learning model from a plurality of machine learning models comprised in the memory, a set of predictive functions. In an embodiment, the at least one machine learning model from the plurality of machine learning models to generate the set of predictive functions is selected based on one or more model parameters. In an embodiment, the one or more model parameters include accuracy, precision, recall, and F1 score. In an embodiment, the one or more hardware processors are further configured by the instructions to apply the set of predictive functions on the plurality of imbalanced datasets to generate a sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets, wherein the generated sparse matrix is further split to generate a plurality of subsets of the sparse matrix; obtain a set of probabilistic labels for each data point of each of the plurality of subsets of the sparse matrix by inputting each of the plurality of subsets of the sparse matrix to at least one of a plurality of generative models; concatenate each set of the probabilistic labels obtained for each data point of each subset of the sparse matrix to obtain a final set of probabilistic labels for each of the plurality of imbalanced datasets; and incrementally predict, based on the final set of probabilistic labels, one or more missing values of one or more test features of the plurality of imbalanced datasets. In an embodiment, the first test feature is distinct from the one or more test features. In an embodiment, the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets enables parallel and faster processing and helps in achieving high performance.
In yet another aspect, there are provided one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause: receiving, via the one or more hardware processors, a plurality of imbalanced datasets pertaining to one or more domains; performing an exploratory data analysis (EDA) mechanism on the plurality of imbalanced datasets to determine a plurality of features corresponding to the plurality of imbalanced datasets; determining, based on the EDA mechanism, a correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features; and obtaining, based on the identified one or more correlations and the identified one or more dependencies, a subset of features from the plurality of features required to predict one or more missing values pertaining to a first test feature. In an embodiment, the subset of features from the plurality of features is determined based on correlation values obtained from the correlation matrix. The one or more instructions when executed by the one or more hardware processors further cause automatically generating, using at least one machine learning model from a plurality of machine learning models comprised in the memory, a set of predictive functions. In an embodiment, the at least one machine learning model from the plurality of machine learning models to generate the set of predictive functions is selected based on one or more model parameters. In an embodiment, the one or more model parameters include accuracy, precision, recall, and F1 score. In an embodiment, the one or more instructions when executed by the one or more hardware processors further cause applying the set of predictive functions on the plurality of imbalanced datasets to generate a sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets, wherein the generated sparse matrix is further split to generate a plurality of subsets of the sparse matrix; obtaining a set of probabilistic labels for each data point of each of the plurality of subsets of the sparse matrix by inputting each of the plurality of subsets of the sparse matrix to at least one of a plurality of generative models; concatenating each set of the probabilistic labels obtained for each data point of each subset of the sparse matrix to obtain a final set of probabilistic labels for each of the plurality of imbalanced datasets; and incrementally predicting, based on the final set of probabilistic labels, one or more missing values of one or more test features of the plurality of imbalanced datasets. In an embodiment, the first test feature is distinct from the one or more test features. In an embodiment, the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets enables parallel and faster processing and helps in achieving high performance.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG.1 illustrates an exemplary block diagram of a system for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure.
FIG.2 illustrates an exemplary flow diagram of a processor implemented method for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure.
FIG.3 shows an example of a correlation matrix and corresponding code for determining the correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features, in accordance with some embodiments of the present disclosure.
FIGS. 4A and 4B illustrate a flow chart of an example describing the method for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure.
FIG. 5 illustrates an example of n-ary classification for generating the set of probabilistic labels, showing the top four classes of the 'OS_version' feature of an example imbalanced dataset and depicting the frequency of each class in sequence, in accordance with some embodiments of the present disclosure.
FIG. 6 shows an example of a set of probabilistic labels generated in the case of both binary and n-ary classification, in accordance with some embodiments of the present disclosure.
FIGS. 7A through 7C illustrate graphs depicting experimental results for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
The embodiments herein provide methods and systems for determining missing data in imbalanced datasets using automated predicting functions. The typical interpretation of results obtained from conventional missing data prediction methods has been modified to provide faster and more efficient missing data prediction in imbalanced datasets. State-of-the-art methods are time bound, meaning they depend on time series data, involve high computation cost, do not factor in correlations between features, and may also end up introducing bias in the data. For example, a process of generating data based on time stamp sequences in a dataset may introduce new rows following the time stamp sequences, but it does not capture any dependencies across other features. This may lead to bias in the overall dataset.
The present disclosure has addressed these issues by determining missing data in imbalanced datasets using automated predicting functions. Although the technical problem was realized in many state-of-the-art methods, it may be understood by a person skilled in the art that the present disclosure provides a high-performance method for determining missing data in the imbalanced datasets by first determining dependencies across a plurality of features in the imbalanced datasets and then automatically generating a set of predictive functions using machine learning models. The automatically generated set of predictive functions provides good coverage of the imbalanced datasets. The output of the method of the present disclosure is a set of probabilistic labels that are assigned to the missing values in the imbalanced datasets. The method of the present disclosure is applicable to both categorical and numerical data.
Referring now to the drawings, and more particularly to FIGS. 1 through 7C, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG.1 illustrates an exemplary block diagram of a system 100 for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, one or more modules (not shown) of the system 100 can be stored in the memory 102. The one or more modules (not shown) of the system 100 stored in the memory 102 may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular (abstract) data types. In an embodiment, the memory 102 includes a data repository 108 for storing data processed, received, and generated as output(s) by the system 100.
The data repository 108, amongst other things, includes a system database and other data. In an embodiment, the data repository 108 may be external (not shown) to the system 100 and accessed through the I/O interfaces 106. The memory 102 may further comprise information pertaining to input(s)/output(s) of each step performed by the processor 104 of the system 100 and methods of the present disclosure. In an embodiment, the system database stores information being processed at each step of the proposed methodology. The other data may include, data generated as a result of the execution of the one or more modules (not shown) of the system 100 stored in the memory 102. The generated data may be further learnt to provide improved learning in the next iterations to output desired results with improved accuracy.
FIG. 2 illustrates an exemplary flow diagram of a processor implemented method for determining missing data in imbalanced datasets using automated predicting functions using the system of FIG. 1, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more data storage devices or memory 102 operatively coupled to the one or more processors 104 and is configured to store instructions configured for execution of steps of the method 200 by the one or more processors 104. The steps of the method 200 will now be explained in detail with reference to the components of the system 100 of FIG. 1. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
In accordance with an embodiment of the present disclosure, the one or more processors 104 are configured to receive, at step 202, a plurality of imbalanced datasets pertaining to one or more domains. In an embodiment, the plurality of imbalanced datasets can be received from publicly available sources such as the Internet. Further, the one or more domains may include, but are not limited to, retail, finance, aerospace, healthcare, insurance, manufacturing, telecom, embedded systems, and hospitality. In an embodiment, a non-limiting example of an imbalanced dataset in the retail domain is considered. The example imbalanced dataset is represented by Table 1 provided below:
App_code Model Network OS_version Dow Hour_of_day Is_conversion
27909 SM-G925L WIFI 6.0.1 1 13 -1
3268 SM-920S WIFI 6.0.1 2 19 -1
3268 SM-920S WIFI 6.0.1 2 20 -1
16827 LG-F200K WIFI 4.1.2 2 22 -1
22444 SM-G720N WIFI 4.4.4 4 18 -1
3268 SM-920S WIFI 6.0.1 4 18 -1
27899 SM-T255S WIFI 4.3 5 19 -1
3268 SM-920S WIFI 6.0.1 5 19 -1
16827 LG-F200K WIFI 4.1.2 5 23 -1
23715 LG-F460L 4G 6 6 21 -1
16827 LG-F200K WIFI 4.1.2 6 22 -1
14651 SM-920S 4G 6.0.1 6 23 -1
16827 LG-F200K WIFI 4.1.2 7 22 -1
14651 SM-920S WIFI 6.0.1 8 13 -1
Table 1
Further, at step 204, the one or more hardware processors 104 are configured to perform an exploratory data analysis (EDA) mechanism on the plurality of imbalanced datasets to determine a plurality of features corresponding to the plurality of imbalanced datasets. In an embodiment, the EDA mechanism analyzes datasets by summarizing their main characteristics with visualizations. The EDA mechanism also helps in finding details such as the number of columns and other metadata of the datasets, which further helps to gauge the size and other properties, such as the range of values in the columns of the dataset. In an embodiment of the present disclosure, the columns of the plurality of imbalanced datasets indicate the plurality of features. As depicted in Table 1, App_code, Model, Network, OS_version, Dow, Hour_of_day, and Is_conversion represent the plurality of features comprised in the example imbalanced dataset. Here, the App_code feature indicates that every application used by a user is identified by a unique app code, the Model feature depicts the model number of the user's device, the OS_version feature depicts the version of the operating system on the user's device, the Dow feature depicts the day of the week on which a transaction is done, the Hour_of_day feature depicts the hour of the day when the transaction is done, and the Is_conversion feature depicts whether an order is placed for a product by the user or not. The column for the Is_conversion feature comprises the values '1' or '-1', wherein '1' denotes that an order was placed and '-1' denotes that the order was not placed.
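By way of illustration, a minimal EDA sketch (not part of the original disclosure; retail_transactions.csv is a hypothetical file holding the Table 1 columns) might be:

import pandas as pd

df = pd.read_csv("retail_transactions.csv")   # hypothetical file with the Table 1 columns
print(df.shape)                                # number of rows and columns
df.info()                                      # column names, dtypes, non-null counts
print(df.describe(include="all"))              # ranges and other column metadata
print(df["Is_conversion"].value_counts())      # reveals the degree of class imbalance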
In accordance with an embodiment of the present disclosure, at step 206, the one or more processors 104 are configured to determine, based on the EDA mechanism, a correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features. FIG. 3 shows an example of a correlation matrix and corresponding code for determining the correlation matrix to identify the one or more correlations and the one or more dependencies among the plurality of features, in accordance with some embodiments of the present disclosure. In an embodiment, the correlation matrix indicates which features are most relevant. The closer a value in the correlation matrix is to 1, the higher the correlation between the features being compared.
In accordance with an embodiment of the present disclosure, at step 208, the one or more processors 104 are configured to obtain, based on the identified one or more correlations and the identified one or more dependencies, a subset of features from the plurality of features required to predict one or more missing values pertaining to a first test feature. In an embodiment, the subset of features from the plurality of features is determined based on correlation values obtained from the correlation matrix. In an embodiment, the Is_conversion feature comprised in the example imbalanced dataset is assumed to be the first test feature, and correlations and dependencies are identified between the Is_conversion feature and the other features comprised in the example imbalanced dataset. In an embodiment, the correlation values obtained from the correlation matrix are indicative of a positive correlation or a negative correlation, wherein a positive correlation helps in identifying the features that are most relevant for predicting the feature being labeled (Is_conversion, in this case). It can be seen in FIG. 3 that the features 'network', 'dow', and 'hour_of_day' show a positive correlation with the Is_conversion feature and are thus comprised in the obtained subset of features. This implies that these features are best suited for predicting the Is_conversion feature. In another embodiment, it is observed that in imbalanced datasets some features are densely populated and some features are sparsely populated. Thus, the correlation matrix is used to determine the subset of features from the plurality of features which can be used to predict unknown values. The correlation matrix is created based on the EDA, which indicates dependencies and correlations between columns of the plurality of imbalanced datasets. In an embodiment, another non-limiting example of an imbalanced dataset is considered, which comprises features such as 'Age', 'Title', 'City', and 'Profession'. If the 'Age' feature is assumed to be the first test feature and the values of the 'Age' column are to be predicted, the correlation matrix helps in indicating whether the 'Age' feature is more dependent on 'Title' and 'Profession' and less dependent on 'City', based on the available data.
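The following is a minimal sketch, under the same hypothetical file assumption as the EDA sketch above, of how the correlation matrix of step 206 and the feature subset of step 208 might be derived (FIG. 3 shows the code actually used in the disclosure):

import pandas as pd

df = pd.read_csv("retail_transactions.csv")            # hypothetical file, as above
encoded = df.copy()
for col in encoded.select_dtypes(include="object"):    # integer-encode categorical features
    encoded[col] = encoded[col].astype("category").cat.codes

corr = encoded.corr()                                  # the correlation matrix
target_corr = corr["Is_conversion"].drop("Is_conversion")
subset = target_corr[target_corr > 0].index.tolist()   # keep positively correlated features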
In accordance with an embodiment of the present disclosure, the one or more processors 104 are configured to automatically generate a set of predictive functions, at step 210, using at least one machine learning model from a plurality of machine learning models comprised in the memory. It is to be understood by a person having ordinary skill in the art that, in conventional approaches, predictive functions are not generated automatically using machine learning models. In an embodiment, the plurality of machine learning models may include, but is not limited to, a bagging classifier (e.g., random forest), a boosting classifier (e.g., AdaBoost), a perceptron, a decision tree, stochastic gradient descent (SGD), and the like. In an embodiment, the at least one machine learning model from the plurality of machine learning models is applied on the plurality of features comprised in the plurality of imbalanced datasets to generate the set of predictive functions. FIGS. 4A and 4B illustrate a flow chart of an example describing the method for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure. As can be seen from FIGS. 4A and 4B, a set of predictive functions, namely PF1, PF2, PF3, PF4, and PF5, is automatically generated. In an embodiment, the at least one machine learning model from the plurality of machine learning models to generate the set of predictive functions is selected based on one or more model parameters. Here, the one or more model parameters include accuracy, precision, recall, and F1 score. Table 2 provides values of the one or more model parameters for the plurality of machine learning models.
Classifier Model parameter Value
Decision tree Accuracy 0.417159210758
Recall 0.417159210758
Precision 0.417159210758
F1 score 0.417159210758
Bagging Classifier Accuracy 0.418002394973
Recall 0.418002394973
Precision 0.418002394973
F1 score 0.418002394973
Perceptron Accuracy 0.117159132072
Recall 0.117159132072
Precision 0.117159132072
F1 score 0.117159132072
SGD Accuracy 0.217129410761
Recall 0.217129410761
Precision 0.217129410761
F1 score 0.217129410761
Boosting Classifier Accuracy 0.317156210734
Recall 0.257156210734
Precision 0.257156210734
F1 score 0.257156210734
Table 2
For example, as depicted in Table 2, the value of the one or more model parameters for the decision tree and the bagging classifier is higher than for the perceptron, SGD, and the boosting classifier. In such a scenario, only the decision tree and the bagging classifier are selected for automatically generating the predictive functions (also referred to as labelling functions). A pseudo code is provided below as an illustrative example to show how the bagging classifier and the boosting classifier are applied on the plurality of features comprised in the example imbalanced dataset shown in Table 1 to generate automated predictive functions:
import time
import pickle as cp   # 'cp' is presumably a pickle-compatible module (e.g., pickle or cloudpickle)
import pandas as pd

# (excerpt from a larger model-selection chain)
if modelName == 'bagging_random_forest':
    from sklearn.ensemble import BaggingClassifier
    from sklearn.ensemble import RandomForestClassifier
    model = BaggingClassifier(RandomForestClassifier())

elif modelName == 'adaboost':
    from sklearn.ensemble import AdaBoostClassifier
    model = AdaBoostClassifier()

# train the selected classifier and record the training time
start = time.time()
model.fit(train, train_y)
stop = time.time()

# reload the persisted model and predict labels for the dev set
model = cp.load(open(model_path + modelName + model_extn + ".cp", "rb"))
df = pd.DataFrame()
df[modelName] = model.predict(dev)
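Complementing the excerpt above, the following is a minimal, self-contained sketch (on synthetic data, not the disclosure's datasets) of how the one or more model parameters of Table 2 might be computed to select classifiers; note that micro-averaged precision, recall, and F1 coincide with accuracy for single-label classification, which may explain the (nearly) identical values per classifier in Table 2:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)  # imbalanced classes
train, dev, train_y, dev_y = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for name, model in {"decision_tree": DecisionTreeClassifier(),
                    "bagging_random_forest": BaggingClassifier()}.items():
    pred = model.fit(train, train_y).predict(dev)
    scores[name] = {"accuracy": accuracy_score(dev_y, pred),
                    "precision": precision_score(dev_y, pred, average="micro"),
                    "recall": recall_score(dev_y, pred, average="micro"),
                    "f1": f1_score(dev_y, pred, average="micro")}
# models whose scores exceed the rest are retained to generate the predictive functions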

In accordance with an embodiment of the present disclosure, at step 212, the one or more processors 104 are configured to apply the set of predictive functions on the plurality of imbalanced datasets to generate a sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets. An example of the generated sparse matrix is shown in FIGS. 4A and 4B, where 1 and -1 represent a positive and a negative label assigned by a predictive function to a data point, and 0 represents that the corresponding predictive function is not able to clearly determine a label for the corresponding data point. In an embodiment, each column of the generated sparse matrix represents a predictive function and each row represents a data point of each of the plurality of imbalanced datasets. Here, the plurality of data points are represented by C_1, C_2, …, C_n. In an embodiment, the sparse matrix indicates the output of each predictive function on each data point of the plurality of imbalanced datasets. In an embodiment, the generated sparse matrix is further split to generate a plurality of subsets of the sparse matrix. As can be seen in FIGS. 4A and 4B, the generated sparse matrix is split into three parts to generate three subsets of the sparse matrix, wherein each subset of the sparse matrix comprises three rows.
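A minimal sketch of this step, with hypothetical predictive functions emitting 1 (positive), -1 (negative), or 0 (abstain), follows:

import numpy as np

def pf_evening(row):                       # hypothetical predictive function, like PF1
    return 1 if row["Hour_of_day"] >= 18 else 0

def pf_network(row):                       # hypothetical predictive function, like PF2
    return -1 if row["Network"] != "WIFI" else 0

predictive_functions = [pf_evening, pf_network]
data_points = [{"Hour_of_day": 13, "Network": "WIFI"},   # C_1
               {"Hour_of_day": 19, "Network": "4G"},     # C_2
               {"Hour_of_day": 22, "Network": "WIFI"}]   # C_3

S = np.array([[pf(c) for pf in predictive_functions] for c in data_points])
subsets = np.array_split(S, 3)             # S_1, S_2, ... for parallel labelling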
In accordance with an embodiment of the present disclosure, at step 214, the one or more processors 104 are configured to obtain a set of probabilistic labels for each data point of each of the plurality of subsets of the sparse matrix by inputting each of the plurality of subsets of the sparse matrix to at least one of a plurality of generative models. In an embodiment, for a given observation variable X and a target variable Y, a generative model refers to a statistical model of the joint probability distribution on X × Y, P(X, Y). In an embodiment, a probabilistic label is also referred to as a marginal probability, wherein the marginal probability refers to the probability with which a data point belongs to a particular class. In an embodiment, the generative models could be binary classifiers or n-ary classifiers. In the case of binary classifiers, the possible values of the probabilistic labels include only '0' and '1'. For example, in the case of the 'Is_conversion' feature of the example imbalanced dataset shown in Table 1, the value '0' indicates that an order is not placed by the user and the value '1' indicates that the order is placed by the user. However, in the case of n-ary classifiers, a data point may belong to any of a number of classes, such as class 1, class 2, class 3, and so on. Further, in the case of n-ary classification, all entries of a specific feature comprised in an imbalanced dataset are derived, and then the top n classes with the maximum number of rows in the imbalanced dataset are determined. These n classes are used for the prediction of one or more missing values of the specific feature, and the remaining entries in the dataset are classified as 'others'. For example, in the case of the 'OS_version' feature of the example imbalanced dataset shown in Table 1, the top four classes with the maximum number of rows in the example imbalanced dataset are considered. FIG. 5 shows the top four classes of 'OS_version' by depicting the frequency of each class in sequence, in accordance with some embodiments of the present disclosure. As can be seen in FIG. 5, class 1.0 represents OS version 6.0.1, class 2.0 represents OS version 4.4.2, and so on. Further, class 1.0 has 23329 entries, corresponding to the highest frequency in the example imbalanced dataset. FIG. 6 shows an example of the set of probabilistic labels generated in the case of both binary and n-ary classification, in accordance with some embodiments of the present disclosure.
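Since Snorkel is the generative model used in the experiments below, a minimal sketch of this step with Snorkel's LabelModel (assuming Snorkel ≥ 0.9 is installed) follows; note that Snorkel's convention (abstain = -1, classes = 0..k-1) differs from the {1, 0, -1} sparse matrix above, so entries are remapped first:

import numpy as np
from snorkel.labeling.model import LabelModel

S_1 = np.array([[ 1,  0, -1],      # one subset of the sparse matrix
                [ 0,  1,  1],
                [-1, -1,  0]])
# remap: 0 (abstain) -> -1, -1 (negative) -> 0, 1 (positive) -> 1
L = np.select([S_1 == 0, S_1 == -1, S_1 == 1], [-1, 0, 1])

label_model = LabelModel(cardinality=2, verbose=False)   # generative model G_1
label_model.fit(L_train=L, n_epochs=500, seed=0)
probs = label_model.predict_proba(L)   # marginal probabilities [P_1, P_2, P_3]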
In an embodiment, the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets enables parallel and faster processing and helps in achieving high performance. For example, if the sparse matrix is denoted by S, the plurality of subsets of the sparse matrix are denoted by S_1, S_2, …, S_n, and the plurality of generative models are denoted by G_1, G_2, …, G_n, then S_1 is applied to generative model G_1 to provide the set of probabilistic labels [P_1, P_2, P_3], wherein P_1 refers to the probabilistic label assigned to data point C_1, P_2 refers to the probabilistic label assigned to data point C_2, and P_3 refers to the probabilistic label assigned to data point C_3, respectively. Similarly, S_2 is applied to generative model G_2, S_3 is applied to generative model G_3, and so on, to provide the sets of probabilistic labels [P_4, P_5, P_6] and [P_7, P_8, P_9], respectively. In an embodiment, instead of inputting the entire generated sparse matrix S to a single generative model from the plurality of generative models, each subset of the sparse matrix is inputted to and processed independently by each generative model from the plurality of generative models running in parallel. This helps in achieving high performance with faster processing. In an embodiment, output is obtained by providing minimal information to each of the plurality of generative models, resulting in reduced processing time and lower consumption of resources.
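A minimal sketch of this parallel scheme, with a trivial stand-in for the generative models G_1, G_2, …, is:

import numpy as np
from multiprocessing import Pool

def label_subset(subset):
    # stand-in for fitting a generative model G_i on subset S_i and
    # returning its probabilistic labels; here, a simple vote fraction
    return (subset == 1).mean(axis=1)

if __name__ == "__main__":
    S = np.random.choice([-1, 0, 1], size=(9, 5))   # sparse matrix S for 9 data points
    subsets = np.array_split(S, 3)                  # S_1, S_2, S_3
    with Pool(processes=3) as pool:
        partial_labels = pool.map(label_subset, subsets)   # one subset per worker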
In accordance with an embodiment of the present disclosure, at step 216, the one or more processors 104 are configured to concatenate each set of the probabilistic labels obtained for each data point of each subset of the sparse matrix to obtain a final set of probabilistic labels for each of the plurality of imbalanced datasets. For example, if 'm' refers to the number of data points of each subset of the sparse matrix and F_1, F_2, F_3, …, F_m refer to the probabilistic labels assigned to each data point of each subset of the sparse matrix, then the final set of probabilistic labels, also referred to as the concatenated output, is represented as F_n = [F_1, F_2, F_3, …, F_m]. In an embodiment, upon obtaining the final set of probabilistic labels, one or more thresholding techniques are used to generate specific labels for each data point of each of the plurality of imbalanced datasets. For example, in the case of binary probabilistic labels, if the threshold value of the probabilistic label is set to '0.5' and the determined value of the probabilistic label is greater than '0.5', then the value of the probabilistic label is considered '1', otherwise '-1'. The final set of probabilistic labels is assigned to the one or more missing values of the first test feature after thresholding.
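A minimal sketch of the concatenation and thresholding, with illustrative probabilities, is:

import numpy as np

partial_labels = [np.array([0.7, 0.2, 0.9]),   # from generative model G_1
                  np.array([0.4, 0.8, 0.1]),   # from generative model G_2
                  np.array([0.6, 0.5, 0.3])]   # from generative model G_3

final_probs = np.concatenate(partial_labels)        # final set of probabilistic labels F_n
final_labels = np.where(final_probs > 0.5, 1, -1)   # threshold binary marginals at 0.5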
In accordance with an embodiment of the present disclosure, at step 218, the one or more processors 104 are configured to incrementally predict, based on the final set of probabilistic labels, one or more missing values of one or more test features of the plurality of imbalanced datasets. In an embodiment, the first test feature is distinct from the one or more test features. For example, as depicted in Table 1, the 'Is_conversion' feature was considered as the first test feature and one or more missing values of the 'Is_conversion' column were predicted. In a similar way, one or more missing values of other features, such as the 'App_code', 'Model', 'Network', 'OS_version', 'Dow', and 'Hour_of_day' features, are predicted incrementally using the proposed method. Similarly, as explained above in another example of an imbalanced dataset, the 'Age' feature was considered as the first test feature and one or more missing values of the 'Age' column were predicted. In a similar way, one or more missing values of the other features, namely the 'Title', 'City', and 'Profession' features, are predicted using the same method.
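A minimal structural sketch of the incremental prediction, where predict_missing is a hypothetical stand-in for steps 204 through 216 applied with the given feature as the current test feature, is:

import pandas as pd

def predict_missing(frame, feature):
    # stand-in for steps 204-216: the real pipeline would fill the missing
    # values of frame[feature] using the final set of probabilistic labels
    return frame

df = pd.read_csv("retail_transactions.csv")   # hypothetical file, as above
for feature in ["Is_conversion", "App_code", "Model", "Network",
                "OS_version", "Dow", "Hour_of_day"]:
    df = predict_missing(df, feature)   # values filled earlier aid later predictions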
EXPERIMENTAL OBSERVATIONS
FIGS. 7A through 7C illustrate graphs depicting experimental results for determining missing data in imbalanced datasets using automated predicting functions, in accordance with some embodiments of the present disclosure. As can be seen in FIGS. 7A and 7B, a performance comparison of a generative model used in the present disclosure (Snorkel, a generative model known in the art) with other known-in-the-art models, such as XGBoost and neural networks, is provided. It is depicted in FIGS. 7A and 7B that the log loss values shown by Snorkel are lower than the log loss values of XGBoost and the neural networks.
FIG. 7C illustrates the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of the plurality of data points of the plurality of imbalanced datasets, which enables parallel and faster processing and helps in achieving high performance. As can be seen in FIG. 7C, datasets from three domains, namely Retail, Hospitality, and News, are considered. In this experiment, large input files were divided into multiple chunks, wherein each chunk represents a subset of the sparse matrix that was given as input to a generative model. It is observed from FIG. 7C that the time required for generating the set of probabilistic labels decreases when the large input file(s) is/are divided.
The present disclosure thus provides a high-performance method for faster and efficient missing data prediction in imbalanced datasets by enabling parallel processing of distributed input files with low log losses.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:

1. A processor implemented method, comprising:
receiving (202), via one or more hardware processors, a plurality of imbalanced datasets pertaining to one or more domains;
performing (204) an exploratory data analysis (EDA) mechanism on the plurality of imbalanced datasets to determine a plurality of features corresponding to the plurality of imbalanced datasets;
determining (206), based on the EDA mechanism, a correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features;
obtaining (208), based on the identified one or more correlations and the identified one or more dependencies, a subset of features from the plurality of features required to predict one or more missing values pertaining to a first test feature;
automatically generating (210), using at least one machine learning model from a plurality of machine learning models comprised in the memory, a set of predictive functions;
applying (212) the set of predictive functions on the plurality of imbalanced datasets to generate a sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets, wherein the generated sparse matrix is further split to generate a plurality of subsets of the sparse matrix;
obtaining (214) a set of probabilistic labels for each data point of each of the plurality of subsets of the sparse matrix by inputting each of the plurality of subsets of the sparse matrix to at least one of a plurality of generative models;
concatenating (216), each set of the probabilistic labels obtained for each data point of each subset of the sparse matrix to obtain a final set of probabilistic labels for each of the plurality of imbalanced datasets; and
incrementally predicting (218), based on the final set of probabilistic labels, one or more missing values of one or more test features of the plurality of imbalanced datasets.

2. The processor implemented method as claimed in claim 1, wherein the subset of features from the plurality of features are determined based on correlation values obtained from the correlation matrix.

3. The processor implemented method as claimed in claim 1, wherein the at least one machine learning model from the plurality of machine learning models to generate the set of predictive functions is selected based on one or more model parameters.

4. The processor implemented method as claimed in claim 3, wherein the one or more model parameters include accuracy, precision, recall, and F1 score.

5. The processor implemented method as claimed in claim 1, wherein the first test feature is distinct from the one or more test features.

6. The processor implemented method as claimed in claim 1, wherein the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets enables parallel and faster processing and helps in achieving high performance.

7. A system (100), comprising:
one or more data storage devices (102) operatively coupled to one or more hardware processors (104) and configured to store instructions configured for execution via the one or more hardware processors to:
receive, a plurality of imbalanced datasets pertaining to one or more domains;
perform an exploratory data analysis (EDA) mechanism on the plurality of imbalanced datasets to determine a plurality of features corresponding to the plurality of imbalanced datasets;
determine, based on the EDA mechanism, a correlation matrix to identify one or more correlations and one or more dependencies among the plurality of features;
obtain, based on the identified one or more correlations and the identified one or more dependencies, a subset of features from the plurality of features required to predict one or more missing values pertaining to a first test feature;
automatically generate, using at least one machine learning model from a plurality of machine learning models comprised in the memory, a set of predictive functions;
apply the set of predictive functions on the plurality of imbalanced datasets to generate a sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets, wherein the generated sparse matrix is further split to generate a plurality of subsets of the sparse matrix;
obtain a set of probabilistic labels for each data point of each of the plurality of subsets of the sparse matrix by inputting each of the plurality of subsets of the sparse matrix to at least one of a plurality of generative models;
concatenate, each set of the probabilistic labels obtained for each data point of each subset of the sparse matrix to obtain a final set of probabilistic labels for each of the plurality of imbalanced datasets; and
incrementally predict, based on the final set of probabilistic labels, one or more missing values of one or more test features of the plurality of imbalanced datasets.

8. The system as claimed in claim 7, wherein the subset of features from the plurality of features are determined based on correlation values obtained from the correlation matrix.

9. The system as claimed in claim 7, wherein the at least one machine learning model from the plurality of machine learning models to generate the set of predictive functions is selected based on one or more model parameters.

10. The system as claimed in claim 9, wherein the one or more model parameters include accuracy, precision, recall, and F1 score.

11. The system as claimed in claim 7, wherein the first test feature is distinct from the one or more test features.

12. The system as claimed in claim 7, wherein the step of generating the plurality of subsets of the sparse matrix by splitting the sparse matrix for each of a plurality of data points of the plurality of imbalanced datasets enables parallel and faster processing and helps in achieving high performance.

Documents

Application Documents

# Name Date
1 201921050826-STATEMENT OF UNDERTAKING (FORM 3) [09-12-2019(online)].pdf 2019-12-09
2 201921050826-PROVISIONAL SPECIFICATION [09-12-2019(online)].pdf 2019-12-09
3 201921050826-FORM 1 [09-12-2019(online)].pdf 2019-12-09
4 201921050826-DRAWINGS [09-12-2019(online)].pdf 2019-12-09
5 201921050826-FORM-26 [07-02-2020(online)].pdf 2020-02-07
6 201921050826-Proof of Right [19-02-2020(online)].pdf 2020-02-19
7 201921050826-FORM 3 [26-03-2020(online)].pdf 2020-03-26
8 201921050826-FORM 18 [26-03-2020(online)].pdf 2020-03-26
9 201921050826-ENDORSEMENT BY INVENTORS [26-03-2020(online)].pdf 2020-03-26
10 201921050826-DRAWING [26-03-2020(online)].pdf 2020-03-26
11 201921050826-COMPLETE SPECIFICATION [26-03-2020(online)].pdf 2020-03-26
12 Abstract1.jpg 2020-08-13
13 201921050826-FER.pdf 2021-10-19
14 201921050826-OTHERS [23-11-2021(online)].pdf 2021-11-23
15 201921050826-OTHERS [23-11-2021(online)]-1.pdf 2021-11-23
16 201921050826-FER_SER_REPLY [23-11-2021(online)].pdf 2021-11-23
17 201921050826-COMPLETE SPECIFICATION [23-11-2021(online)].pdf 2021-11-23
18 201921050826-CLAIMS [23-11-2021(online)].pdf 2021-11-23
19 201921050826-ABSTRACT [23-11-2021(online)].pdf 2021-11-23
20 201921050826-PatentCertificate16-05-2024.pdf 2024-05-16
21 201921050826-IntimationOfGrant16-05-2024.pdf 2024-05-16

Search Strategy

1 searhE_26-07-2021.pdf

ERegister / Renewals

3rd: 16 Aug 2024

From 09/12/2021 - To 09/12/2022

4th: 16 Aug 2024

From 09/12/2022 - To 09/12/2023

5th: 16 Aug 2024

From 09/12/2023 - To 09/12/2024

6th: 09 Dec 2024

From 09/12/2024 - To 09/12/2025