Abstract: Methods and server systems for training a machine learning (ML) model to detect noisy labels. A method performed by a server system includes accessing a training dataset and a first latent space from a database. The training dataset includes a set of labeled data points. The method includes generating a plurality of second features based on a subset of second labeled data points. The method includes generating a second latent space based on the plurality of second features. The method includes training a second ML model based on performing operations. The operations include initializing the second ML model based on second model parameters and determining a second reconstruction loss. Then, the operations include computing a Z-score probability based on the first latent space and the second latent space and generating a second loss function based on the second reconstruction loss and the Z-score probability. Then, the operations include computing a second latent space loss based on the second loss function and optimizing one or more second model parameters based on back-propagating the second latent space loss.
Description: The present disclosure relates to artificial intelligence-based processing systems and, more particularly, to electronic methods and complex processing systems for training an Artificial Intelligence (AI) or Machine Learning (ML) model to detect noisy labels in a dataset.
BACKGROUND
With the recent growth in the field of Artificial Intelligence (AI) and Machine Learning (ML) model development, these models have proven to be a crucial tool for various industries and businesses for performing a variety of tasks with substantial accuracy. A few examples of the variety of tasks that are being performed successfully by various entities include image recognition, fraud detection, computer vision, natural language processing, data analytics, and the like. As may be understood, AI or ML models, especially supervised models, require a vast amount of labeled data for their training or learning process. The term ‘labeled data’ refers to a dataset where each data point is associated with one or more labels that indicate a ground truth or outcome for that data point. In other words, labeled data provides information about what the data represents, allowing AI or ML models to learn from the data and make predictions or classifications. Therefore, it is understood that accurately labeled data is crucial for the successful training and deployment of AI or ML models.
However, obtaining accurately labeled data can be a challenging and resource-intensive task. For instance, labeled data sourced from human annotators may include errors due to subjective judgment, misinterpretation of guidelines, or fatigue, thereby introducing noise in the labeled data. Further, labels sourced from different sources for the same data point may lead to the introduction of noise as well. Labeled data containing such errors may be called noisy labeled data. As a consequence of the presence of noisy labeled data points in labeled data, a range of problems and limitations arise in the development and deployment of AI or ML models, posing a significant challenge to the learning of such models. For example, in the financial domain, during fraud detection, various entities such as an issuing bank, an acquiring bank, a payment processor, or a merchant may label the same transaction (i.e., a data point) differently (such as, the transaction may be labeled as either a fraud or a non-fraud transaction). This labeling variation can result in ambiguity for that transaction within a labeled dataset, thereby making the label associated with the transaction noisy.
As may be understood, if a labeled dataset including such noisy labels is used to train an AI or ML model, it renders the training process sub-optimal. Such AI or ML models generally have poor performance, reduced classification accuracy, and other related problems. Additionally, for imbalanced datasets where a particular class of labels is in the majority, the presence of noisy labels can significantly hamper the performance of the model as well. Further, noisy labels can amplify overfitting, as models may attempt to fit the noise in the training data rather than capture the underlying patterns. In an exemplary scenario of transaction processing and labeling, a few fraudulent transactions may be labeled as non-fraudulent transactions due to human error during a manual label assignment or automatic annotation process. In such a scenario, if a fraud detection model is trained on such incorrectly labeled transactions, the trained model may let many fraudulent transactions go unmonitored and undetected, leading to significant losses.
Thus, there exists a technological need for a more robust, automated, and efficient approach to detect and correct noisy labels in large and complex datasets. Such methods should be capable of discerning between accurate and inaccurate labels, even in the presence of substantial noise.
SUMMARY
Various embodiments of the present disclosure provide methods and systems for training an AI or ML model to detect noisy labels in a dataset.
In an embodiment, a computer-implemented method for training a Machine Learning (ML) model to detect noisy labels in a dataset is disclosed. The computer-implemented method performed by a server system includes accessing a training dataset and a first latent space from a database associated with the server system. Herein, the training dataset includes a set of labeled data points. Further, the set of labeled data points includes a subset of first labeled data points and a subset of second labeled data points. Further, the method includes generating a plurality of second features based, at least in part, on the subset of second labeled data points. Further, the method includes generating a second latent space based, at least in part, on the plurality of second features. Additionally, the method includes training a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion.
The second set of operations further includes initializing the second ML model based, at least in part, on one or more second model parameters. The second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of second features. The second set of operations further includes computing a Z-score probability based, at least in part, on the first latent space and the second latent space. Herein, the Z-score probability indicates a probability of the second latent space being in the first latent space. The second set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. The second set of operations further includes computing a second latent space loss based, at least in part, on the second loss function. The second set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
In another embodiment, a computer-implemented method for training a Machine Learning (ML) model to detect noisy labels in a dataset is disclosed. The computer-implemented method performed by a server system includes accessing a transaction dataset and a non-fraud latent space from a database associated with the server system. Further, the transaction dataset includes a set of labeled data points. Herein, the set of labeled data points further includes a subset of non-fraud labeled data points and a subset of fraud labeled data points. Further, the method includes generating a plurality of fraud features based, at least in part, on the subset of fraud labeled data points. Further, the method includes generating a fraud latent space based, at least in part, on the plurality of fraud features. Additionally, the method includes training a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion.
The second set of operations further includes initializing the second ML model based, at least in part, on one or more second model parameters. The second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of fraud features. The second set of operations further includes computing a Z-score probability based, at least in part, on the non-fraud latent space and the fraud latent space. Herein, the Z-score probability indicates a probability of the fraud latent space being in the non-fraud latent space. The second set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. The second set of operations further includes computing a second latent space loss based, at least in part, on the second loss function. The second set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
In another embodiment, a server system is disclosed. The server system includes a communication interface and a memory including executable instructions. The server system also includes a processor communicably coupled to the memory. The processor is configured to execute the instructions to cause the server system, at least in part, to access a training dataset and a first latent space from a database associated with the server system. The training dataset includes a set of labeled data points, wherein the set of labeled data points includes a subset of first labeled data points and a subset of second labeled data points. The system is further caused to generate a plurality of second features based, at least in part, on the subset of second labeled data points. The system is further caused to generate a second latent space based, at least in part, on the plurality of second features. The system is further caused to train a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion.
Herein, the second set of operations includes initializing the second ML model based, at least in part, on one or more second model parameters. The second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of second features. The second set of operations further includes computing a Z-score probability based, at least in part, on the first latent space and the second latent space. Herein, the Z-score probability indicates a probability of the second latent space being in the first latent space. The second set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. The second set of operations further includes computing a second latent space loss based, at least in part, on the second loss function. The second set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
In yet another embodiment, a non-transitory computer-readable storage medium is disclosed. The non-transitory computer-readable storage medium includes computer-executable instructions that, when executed by at least a processor of a server system, cause the server system to perform a method. The method includes accessing a training dataset and a first latent space from a database associated with the server system. Herein, the training dataset includes a set of labeled data points. Further, the set of labeled data points includes a subset of first labeled data points and a subset of second labeled data points. Further, the method includes generating a plurality of second features based, at least in part, on the subset of second labeled data points. Further, the method includes generating a second latent space based, at least in part, on the plurality of second features. Additionally, the method includes training a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion.
The second set of operations further includes initializing the second ML model based, at least in part, on one or more second model parameters. The second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of second features. The second set of operations further includes computing a Z-score probability based, at least in part, on the first latent space and the second latent space. Herein, the Z-score probability indicates a probability of the second latent space being in the first latent space. The second set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. The second set of operations further includes computing a second latent space loss based, at least in part, on the second loss function. The second set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF THE FIGURES
For a more complete understanding of example embodiments of the present technology, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1A illustrates an exemplary representation of an environment related to at least some example embodiments of the present disclosure;
FIG. 1B illustrates an exemplary representation of a specific environment related to at least some example embodiments of the present disclosure;
FIG. 2 illustrates a simplified block diagram of a server system, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates an exemplary representation of an architecture of a Variational Autoencoder (VAE) Model, in accordance with an example of the present disclosure;
FIG. 4 illustrates an exemplary representation of an architecture of an Autoencoder Model, in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a process flow diagram depicting a method for detecting noisy labels while training a first machine learning model, in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a process flow diagram depicting a method for detecting noisy labels while training a second machine learning model, in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a process flow diagram depicting a method for detecting noisy labels in transaction data, in accordance with an embodiment of the present disclosure; and
FIG. 8 illustrates a process flow diagram depicting a method for detecting noisy labels in transaction data, in accordance with an embodiment of the present disclosure.
The drawings referred to in this description are not to be understood as being drawn to scale except if specifically noted, and such drawings are only exemplary in nature.
DETAILED DESCRIPTION
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearance of the phrase “in an embodiment” in various places in the specification is not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present disclosure. Similarly, although many of the features of the present disclosure are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present disclosure is set forth without any loss of generality to, and without imposing limitations upon, the present disclosure.
Embodiments of the present disclosure may be embodied as an apparatus, a system, a method, or a computer program product. Accordingly, embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit”, “engine”, “module”, or “system”. Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable storage media having computer-readable program code embodied thereon.
The terms “account holder”, “user”, “cardholder”, “consumer”, and “buyer” are used interchangeably throughout the description and refer to a person who has a payment account or a payment card (e.g., credit card, debit card, etc.) associated with the payment account, that will be used by them at a merchant to perform a payment transaction. The payment account may be opened via an issuing bank or an issuer server.
The term “merchant”, used throughout the description generally refers to a seller, a retailer, a purchase location, an organization, or any other entity that is in the business of selling goods or providing services, and it can refer to either a single business location or a chain of business locations of the same entity.
The terms “payment network” and “card network” are used interchangeably throughout the description and refer to a network or collection of systems used for the transfer of funds through the use of cash substitutes. Payment networks may use a variety of different protocols and procedures to process the transfer of money for various types of transactions. Payment networks are companies that connect an issuing bank with an acquiring bank to facilitate an online payment. Transactions that may be performed via a payment network may include product or service purchases, credit purchases, debit transactions, fund transfers, account withdrawals, etc. Payment networks may be configured to perform transactions via cash substitutes that may include payment cards, letters of credit, checks, financial accounts, etc. Examples of networks or systems configured to perform or function as payment networks include those operated by Mastercard®.
The term “payment card”, used throughout the description, refers to a physical or virtual card linked with a financial or payment account that may be presented to a merchant or any such facility to fund a financial transaction via the associated payment account. Examples of the payment card include, but are not limited to, debit cards, credit cards, prepaid cards, virtual payment numbers, virtual card numbers, forex cards, charge cards, e-wallet cards, and stored-value cards. A payment card may be a physical card that may be presented to the merchant for funding the payment. Alternatively, or additionally, the payment card may be embodied in the form of data stored in a user device, where the data is associated with a payment account such that the data can be used to process the financial transaction between the payment account and a merchant's financial account.
The term “payment account”, used throughout the description refers to a financial account that is used to fund a financial transaction. Examples of the financial account include, but are not limited to, a savings account, a credit account, a checking account, and a virtual payment account. The financial account may be associated with an entity such as an individual person, a family, a commercial entity, a company, a corporation, a governmental entity, a non-profit organization, and the like. In some scenarios, the financial account may be a virtual or temporary payment account that can be mapped or linked to a primary financial account, such as those accounts managed by payment wallet service providers, and the like.
The terms “payment transaction”, “financial transaction”, “event”, and “transaction” are used interchangeably throughout the description and refer to a transaction or transfer of payment of a certain amount being initiated by the cardholder. More specifically, they refer to electronic financial transactions including, for example, online payment, payment at a terminal (e.g., Point of Sale (POS) terminal), and the like. Generally, a payment transaction is performed between two entities, such as a buyer and a seller. It is to be noted that a payment transaction is followed by a payment transfer of a transaction amount (i.e., monetary value) from one entity (e.g., issuing bank associated with the buyer) to another entity (e.g., acquiring bank associated with the seller), in exchange for goods or services.
OVERVIEW
Various embodiments of the present disclosure provide methods, systems, user devices, and computer program products for training an ML model to detect noisy labels in a dataset.
In an embodiment, the server system is a payment server associated with a payment network that is configured to access a training dataset and a first latent space from a database associated with the server system. Herein, the training dataset includes a set of labeled data points, wherein the set of labeled data points includes a subset of first labeled data points and a subset of second labeled data points.
In another embodiment, the server system is configured to generate a plurality of first features based, at least in part, on the subset of first labeled data points. Further, the server system is configured to generate a first latent space based, at least in part, on the plurality of first features. Further, the server system is configured to train a first ML model based, at least in part, on performing a first set of operations iteratively until the performance of the first ML model converges to a first predefined criterion. In an embodiment, the first set of operations includes initializing the first ML model based, at least in part, on one or more first model parameters. Herein, the first ML model includes a plurality of first encoder layers and a plurality of first decoder layers. The first set of operations further includes determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of first features. The first set of operations further includes optimizing the one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
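The first set of operations described above can be illustrated with a minimal sketch. This is not the claimed implementation: the encoder and decoder are reduced to single linear layers (rather than the pluralities of layers described above), mean-squared error is assumed as the first loss function, and convergence checking is simplified to a fixed iteration budget.

```python
import numpy as np

def train_first_model(X, latent_dim=2, lr=0.1, max_iter=2000):
    """Sketch of the first set of operations: initialize model parameters,
    determine a reconstruction loss, and back-propagate it to optimize
    the parameters. Returns the parameters and the first latent space."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))   # stand-in first encoder
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))   # stand-in first decoder
    losses = []
    for _ in range(max_iter):
        Z = X @ W_enc                          # first latent space
        X_hat = Z @ W_dec                      # reconstructed first features
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))  # first reconstruction loss (MSE)
        # back-propagate the first reconstruction loss through both layers
        g = 2.0 * err / err.size
        g_dec = Z.T @ g
        g_enc = X.T @ (g @ W_dec.T)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    return W_enc, W_dec, X @ W_enc, losses
```

The returned latent matrix serves as the first latent space against which the second model's Z-score probability is later computed.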
In another embodiment, the server system is configured to generate a plurality of second features based, at least in part, on the subset of second labeled data points. Further, the server system is configured to generate a second latent space based, at least in part, on the plurality of second features. Additionally, the server system is configured to train a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion. The second set of operations includes initializing the second ML model based, at least in part, on one or more second model parameters. The second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of second features. In an embodiment, determining the second reconstruction loss further includes generating a set of masked second features based, at least in part, on a subset of second features from the plurality of second features. Further, the determining process includes computing, via the second ML model, a set of reconstructed second features based, at least in part, on the set of masked second features. Further, the determining process includes determining the second reconstruction loss based, at least in part, on comparing the plurality of second features and the set of reconstructed second features.
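The masked-feature determination of the second reconstruction loss can be sketched as follows. Here `model_fn` is a hypothetical stand-in for the second ML model's forward pass, and both zero-masking and mean-squared error are assumptions made for illustration; the disclosure fixes neither choice.

```python
import numpy as np

def masked_reconstruction_loss(features, model_fn, mask_idx):
    """Generate a set of masked second features, compute the reconstructed
    second features via the model, and compare the reconstruction against
    the original (unmasked) second features."""
    masked = np.array(features, dtype=float)
    masked[:, mask_idx] = 0.0                 # set of masked second features
    reconstructed = model_fn(masked)          # set of reconstructed second features
    return float(np.mean((reconstructed - features) ** 2))
```

With an identity `model_fn`, the loss reduces to the average energy in the masked columns, which is why a model that learns to infer masked features from unmasked ones drives this loss down.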
Further, the second set of operations includes computing a Z-score probability based, at least in part, on the first latent space and the second latent space, such that the Z-score probability indicates a probability of the second latent space being in the first latent space. In an embodiment, computing the Z-score probability further includes computing a mean of the first latent space and a standard deviation of the first latent space. Further, the computing process includes computing, via the second ML model, the Z-score probability based, at least in part, on the second reconstruction loss, the mean of the first latent space, and the standard deviation of the first latent space. In an example, the first latent space and the second latent space are dimensionally identical.
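One way the Z-score probability could be realized is sketched below, under an explicitly flagged assumption: each dimension of the first latent space is treated as normally distributed, and the two-sided tail probability of a point from the second latent space is averaged across dimensions. The disclosure specifies the mean and standard deviation inputs but not the exact formula, so this is an illustrative choice, not the claimed computation.

```python
import numpy as np
from math import erf, sqrt

def z_score_probability(first_latent, second_point):
    """Probability that `second_point` lies within the first latent space."""
    mu = first_latent.mean(axis=0)            # mean of the first latent space
    sigma = first_latent.std(axis=0) + 1e-8   # standard deviation (eps avoids /0)
    z = (np.asarray(second_point) - mu) / sigma   # per-dimension Z-score
    # Two-sided normal tail probability per dimension: near 1 when the point
    # is central in the first latent space, near 0 when it lies far outside.
    tails = [1.0 - erf(abs(zi) / sqrt(2.0)) for zi in z]
    return float(np.mean(tails))
```

Dimensional identity of the two latent spaces, noted above, is what makes this per-dimension comparison well defined.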
The set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. The set of operations further includes computing a second latent space loss based, at least in part, on the second loss function. The set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss. In an example embodiment, the first ML model is a Variational Autoencoder and the second ML model is an Autoencoder.
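The second loss function can be sketched as a weighted combination of the second reconstruction loss and the Z-score probability; the additive form and the weight `alpha` are assumptions made for illustration, as the disclosure states only that the loss is generated from these two quantities.

```python
def second_loss_function(reconstruction_loss, z_score_probability, alpha=1.0):
    """Second latent space loss: penalize poor reconstruction of the second
    features AND overlap of the second latent space with the first latent
    space (a high Z-score probability), pushing the two spaces apart."""
    return reconstruction_loss + alpha * z_score_probability
```

The scalar returned here corresponds to the second latent space loss that is back-propagated to optimize the one or more second model parameters.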
In an embodiment, the server system is configured to access a noisy dataset from the database. Herein, the noisy dataset includes a set of candidate labeled data points. Further, the server system is configured to perform a set of operations for each candidate labeled data point from the set of candidate labeled data points. The set of operations includes computing, via the second ML model, the Z-score probability. The set of operations further includes classifying each candidate labeled data point as a noisy data point based, at least in part, on the Z-score probability being higher than a predefined threshold. The set of operations further includes re-labeling each candidate labeled data point based, at least in part, on the classification step.
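The per-candidate detection and re-labeling steps can be sketched as below. `z_prob_fn` is a hypothetical callable wrapping the trained second ML model's Z-score probability computation, and flipping a binary label is an assumed re-labeling rule; the disclosure states only that re-labeling follows the classification step.

```python
def detect_and_relabel(candidates, z_prob_fn, threshold=0.95):
    """For each candidate labeled data point, classify it as noisy when its
    Z-score probability exceeds the predefined threshold, then re-label it."""
    results = []
    for features, label in candidates:
        p = z_prob_fn(features)
        # High probability of belonging to the first latent space means the
        # point's second-class label is likely wrong, i.e., noisy.
        is_noisy = p > threshold
        new_label = (1 - label) if is_noisy else label   # assumed binary flip
        results.append((features, new_label, is_noisy))
    return results
```

In the fraud-detection setting described later, this corresponds to flagging a fraud-labeled transaction whose latent representation sits squarely inside the non-fraud latent space.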
In another embodiment, the server system is configured to access a transaction dataset and a non-fraud latent space from a database associated with the server system. Further, the transaction dataset includes a set of labeled data points, wherein the set of labeled data points includes a subset of non-fraud labeled data points and a subset of fraud labeled data points. Further, the server system is configured to generate a plurality of fraud features based, at least in part, on the subset of fraud labeled data points. Further, the server system is configured to generate a fraud latent space based, at least in part, on the plurality of fraud features. Additionally, the server system is configured to train a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion. The second set of operations further includes initializing the second ML model based, at least in part, on one or more second model parameters. The second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of fraud features. The second set of operations further includes computing a Z-score probability based, at least in part, on the non-fraud latent space and the fraud latent space. Herein, the Z-score probability indicates a probability of the fraud latent space being in the non-fraud latent space. The second set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. The second set of operations further includes computing a second latent space loss based, at least in part, on the second loss function.
The second set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
In an embodiment, the server system is further configured to generate a plurality of non-fraud features based, at least in part, on the subset of non-fraud labeled data points. Further, the server system is configured to generate a non-fraud latent space based, at least in part, on the plurality of non-fraud features. Further, the server system is configured to train a first ML model based, at least in part, on performing a first set of operations iteratively until the performance of the first ML model converges to a first predefined criterion. The first set of operations includes initializing the first ML model based, at least in part, on one or more first model parameters, wherein the first ML model includes a plurality of first encoder layers and a plurality of first decoder layers. The first set of operations further includes determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of non-fraud features. The first set of operations further includes optimizing the one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
Various embodiments of the present disclosure are described hereinafter with reference to FIGS. 1A to 8.
FIG. 1A illustrates an exemplary representation of an environment 100 related to at least some embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, training an AI or ML model for detecting noisy labels, re-labeling noisy data points with the correct labels, and the like.
The environment 100 generally includes a plurality of components such as a server system 102 and a database 104 each coupled to, and in communication with (and/or with access to) a network 110. It is noted that various other components may also be present in the environment 100 for facilitating the training of AI or ML models by performing a set of operations.
In an embodiment, the network 110 may include, without limitation, a Light Fidelity (Li-Fi) network, a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an Infrared (IR) network, a Radio Frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts or users illustrated in FIG. 1A and FIG. 1B, or any combination thereof.
Various entities/components in the environment 100 may connect to the network 110 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, future communication protocols or any combination thereof. For example, the network 110 may include multiple different networks, such as a private network made accessible by the server system 102 and a public network (e.g., the Internet, etc.) through which the server system 102 may communicate with the database 104.
It is noted that, as described earlier, conventional ML techniques suffer from various disadvantages during the training process due to the following scenarios: imbalanced training datasets, data intensiveness, false positives/negatives, domain specificity, and the like. In other words, during the training process, when training datasets of varying natures and complexity are used to train a single model, they may require additional labeled data or access to a clean, expert-labeled dataset for training. Acquiring such data can be costly and time-consuming, and implementing these techniques is impractical for some applications. Additionally, noisy label detection techniques are often designed for specific domains or types of noise, a limitation termed ‘domain specificity’. As a result, these techniques may not be readily applicable to all types of noisy data, limiting their generalizability across different domains. As may be understood, the term ‘domain specificity’ refers to any technique that is specific to a domain.
On the other hand, the term ‘false positives/negatives’ refers to incorrectly flagging some clean samples as noisy, or failing to detect certain noisy samples, using noisy label detection algorithms. In an instance, certain data points may be incorrectly labeled as noisy. As may be understood, these false positives can be problematic in the context of noisy label detection because they lead to unnecessary data removal, correction, or other actions, which can degrade the quality of the dataset and potentially harm the performance of machine learning models. For instance, false positives can result in the removal or correction of perfectly valid and accurate data points. In other words, this can reduce the amount of data available for training, potentially leading to underfitting and reduced model performance.
The above-mentioned technical problem among other problems is addressed by one or more embodiments implemented by the server system 102 of the present disclosure. In one embodiment, the server system 102 is configured to perform one or more of the operations described herein.
In one embodiment, the server system 102 can be a standalone component (acting as a hub) connected to any entity that is capable of operating the server system 102 for training an AI or ML model. In some embodiments, the server system 102 may be associated with the database 104. In other scenarios, the database 104 may be incorporated in the server system 102, may be an individual entity connected to the server system 102, or may be a database stored in cloud storage. In one embodiment, the database 104 may store the first ML model 106, the second ML model 108, and other necessary machine instructions required for implementing the various functionalities of the server system 102, such as firmware data, operating system, and the like.
In another embodiment, the database 104 may store a training dataset, a first machine learning (ML) model 106, and a second ML model 108. In an embodiment, the first ML model 106 and the second ML model 108 are AI or ML based models that are trained or learned together to detect noisy labels. In other words, the aim behind training both of these models is to share information or insights learned from different operations performed by each individual model to improve the overall performance of the second ML model 108. In a non-limiting example, the first machine learning model 106 is a Variational Autoencoder model and the second machine learning model 108 is an Autoencoder model.
In one embodiment, the first ML model 106 is trained using a training dataset that includes a set of labeled data points. The labeled data points used for training the first ML model 106 are clean. In other words, the labeled data points do not include noisy data points. It is noted that in some embodiments, the training data may be present in a training dataset that is accessed from the database 104 as well. The architecture of the first ML model 106 may include a plurality of first encoder layers and a plurality of first decoder layers. As may be understood, the plurality of first encoder layers may be configured to transform input data in the form of input features to feature representations or embeddings in a first latent space. On the other hand, the plurality of first decoder layers is configured to mask a few features from the input features and reconstruct these masked features to learn inferences from the training data. As such, during the learning process, the first ML model is configured to minimize the dissimilarity between the original and reconstructed data, enabling it to encompass or understand the underlying data distribution and generate new samples that conform to the same distribution. It is noted that the first ML model 106 encodes the input features of clean data samples as distributions over the latent space. In other words, the first ML model 106 is fed with clean data samples to determine input features that may be useful for learning by the first ML model 106. In various non-limiting examples, these encoder and decoder layers may be implemented using a variety of neural network layers such as, but not limited to, convolutional layers, recurrent layers, transformer layers, dense layers, etc., among other suitable neural networks.
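In a non-limiting illustration, the encode-sample-decode flow described above may be sketched as follows. The function names, layer sizes, and random weights are illustrative assumptions and not part of the disclosure; a real implementation would use trained, multi-layer networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_enc):
    # Toy single-layer encoder producing the mean and log-variance that
    # parameterize a distribution over the first latent space.
    h = x @ w_enc                        # shape: (n_samples, 2 * latent_dim)
    mu, log_var = np.split(h, 2, axis=1)
    return mu, log_var

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps (the VAE reparameterization trick), so the
    # model encodes each sample as a distribution rather than a point.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z, w_dec):
    # Toy single-layer decoder reconstructing the input features from z.
    return z @ w_dec

# Four clean samples with six input features and a 2-dimensional latent space.
x = rng.standard_normal((4, 6))
w_enc = rng.standard_normal((6, 4)) * 0.1   # 4 columns = 2 * latent_dim
w_dec = rng.standard_normal((2, 6)) * 0.1
mu, log_var = encode(x, w_enc)
z = reparameterize(mu, log_var)
x_hat = decode(z, w_dec)
```

During training, the dissimilarity between `x` and `x_hat` would be minimized so that the latent distributions capture the underlying data distribution of the clean samples.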
In an embodiment, during the training of the first ML model 106, the model is fed with a subset of the first labeled data points. It is understood that the first labeled data points may be any data sample with an assigned label. To that end, the first ML model 106 may be trained with different labeled data samples based on a desired application for the first ML model. In various non-limiting examples, the first ML model 106 may be used in the financial domain, medical domain, network domain, marketing domain, etc., among other suitable domains. It is understood that the first labeled data points are the data points with the first classification. For example, in the financial domain, when the task is to perform fraud detection, the subset of first labeled data points may include but is not limited to non-fraud data points, which refer to the transactions labeled as non-fraud transactions. On the other hand, for the same scenario, the subset of second labeled data points may be data points labeled as fraud transactions. As may be understood, by training the first ML model 106 using the first labeled data points, the model is able to extract a regularized latent space distribution for the first labeled data points. In a non-limiting example, the first ML model 106 may be fed with non-fraud data points to extract a regularized latent space distribution for the non-fraud data points. It is noted that the first ML model 106 is a variational autoencoder that uses a first loss function during the model training process. In an instance, the first loss function may include a reconstruction error and a Kullback–Leibler (KL) divergence function.
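As a hedged sketch of such a first loss function, the reconstruction error and the KL divergence term may be combined as follows. Mean squared error is an assumed choice of reconstruction error, and the standard-normal prior is the usual VAE convention; the disclosure does not fix either choice.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction error: mean squared error between inputs and outputs.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Closed-form KL divergence between N(mu, sigma^2) and a standard normal
    # N(0, I), which regularizes the latent space of the variational
    # autoencoder toward the prior.
    kl = -0.5 * np.mean(np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1))
    return recon + kl
```

When the reconstruction is perfect and the encoder outputs the prior exactly (`mu = 0`, `log_var = 0`), both terms vanish and the loss is zero.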
In an embodiment, the second ML model 108 is trained using inferences from the first ML model 106 in order to detect noisy data labels within the training dataset. In a non-limiting implementation, the second ML model 108 may be an AI or ML model generated based on an Autoencoder model. In an instance, the second ML model 108 is trained on the second labeled data points. It is noted that the architecture of the second ML model 108 is similar to that of the first ML model 106 so that the embeddings and latent space generated by both of these models are compatible with each other. To that end, the architecture of the second ML model 108 includes a plurality of second encoder layers and a plurality of second decoder layers. It is understood that the second labeled data points are the data points with the second classification. For example, in the financial domain, the subset of second labeled data points may include but is not limited to fraud data points, which refer to the transactions labeled as fraud. As may be understood, the second ML model is trained on the second labeled data points to extract a regularized latent space distribution for the second labeled data points. For example, by using the fraud data points, the second ML model 108 extracts a regularized latent space distribution for the transactions labeled as fraudulent. It is noted that the second ML model 108 uses the outputs from the first ML model 106, which enables the model to identify noisy data points that are wrongly labeled.
As may be understood, since the architectures of the first ML model 106 and the second ML model 108 are similar, the regularized latent space distribution of the second labeled data points generated by the second ML model 108 has dimensions identical to those of the regularized latent space distribution of the first labeled data points. Further, the second ML model 108 is trained using a second loss function. In an instance, the second loss function may include a reconstruction loss and a latent space loss. It is understood that the second loss function is a modified loss function. Further, the second loss function is used to distinguish between the first labeled data points and the second labeled data points. In particular, the second loss function may be generated based on a second reconstruction loss and a Z-score probability. In an embodiment, the Z-score probability indicates a probability of the second latent space being in the first latent space. In other words, the Z-score probability for the training data being considered describes, in probabilistic terms, whether there is an intersection between the first latent space and the second latent space.
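One non-limiting way to realize such a Z-score probability is sketched below: each latent dimension of the first latent space is assumed Gaussian with a reference mean and standard deviation, and the two-sided tail probability of a second-space embedding is averaged across dimensions. The Gaussian assumption and the averaging rule are illustrative; the disclosure does not fix the exact form.

```python
import numpy as np
from math import erf, sqrt

def z_score_probability(z_point, mu_ref, sigma_ref):
    # Per-dimension z-scores of a second (e.g., fraud) latent embedding
    # against the mean/std of the first (e.g., non-fraud) latent space.
    z = np.abs((np.asarray(z_point) - mu_ref) / sigma_ref)
    # Two-sided Gaussian tail probability per dimension, averaged: a value
    # near 1 means the embedding is indistinguishable from the first latent
    # space, while a value near 0 means it lies well outside it.
    tails = [2.0 * (1.0 - 0.5 * (1.0 + erf(zi / sqrt(2.0)))) for zi in z]
    return float(np.mean(tails))
```

For example, an embedding located exactly at the reference mean yields a probability of 1.0, whereas an embedding several standard deviations away yields a probability near 0.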
It is understood that while training any AI or ML model, the training dataset is split into training data, testing data, and validation data. In other words, the predictions made by a model in terms of predicted values for an outcome can easily be validated using the training dataset that has the actual value for the actual outcome. This actual value is also known as a target value. In other words, the loss functions are configured to determine a loss value or a set of loss values between the predicted value and the target value. As may be understood, this loss value or the set of loss values may be fed back to the second ML model 108 to adjust its operating parameters during the model training process. This helps to reduce the loss in the next training cycle of the second ML model 108. As may be understood, this process may be repeated a number of times until either the set of loss values vanishes or, more likely, the set of loss values saturates or becomes stagnant between subsequent training cycles. In various non-limiting examples, the loss functions may include but are not limited to feature reconstruction loss and the like.
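The repeat-until-saturation behavior described above may be sketched as a generic training loop with a tolerance-based stopping rule; the tolerance, iteration cap, and function name are illustrative assumptions.

```python
def train_until_converged(step_fn, tol=1e-4, max_iters=1000):
    # Run training cycles until the loss stops improving between subsequent
    # cycles (i.e., saturates) or an iteration cap is reached.
    prev = float("inf")
    loss = prev
    for i in range(max_iters):
        loss = step_fn(i)            # one training cycle, returning its loss
        if abs(prev - loss) < tol:   # loss stagnant between cycles: stop
            break
        prev = loss
    return loss
```

In practice, `step_fn` would perform a forward pass, loss computation, and back-propagation; here any callable returning a scalar loss can be supplied.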
In an embodiment, various optimization algorithms may be used by the second ML model 108 for optimizing/adjusting its model parameters to reduce the loss values received from the loss functions. In various non-limiting examples, the various optimization algorithms may include gradient descent, Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (ADAM), Root Mean Square Propagation (RMSprop), Adaptive Gradient Algorithm (ADAGRAD), etc., among other suitable algorithms.
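As one non-limiting example of such an optimizer, a single ADAM update may be sketched as follows. The decay rates `b1`/`b2` and `eps` follow the commonly used defaults; the learning rate and the toy quadratic objective are chosen purely for illustration.

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=0.02, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (m) and its square (v).
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g ** 2
    # Bias-corrected estimates (t is the 1-indexed step number).
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    # Parameter update scaled by the adaptive per-parameter step size.
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

# Toy usage: minimize f(p) = p^2 (gradient 2p) starting from p = 1.0.
p, m, v = np.array(1.0), np.array(0.0), np.array(0.0)
for t in range(1, 301):
    p, m, v = adam_step(p, 2.0 * p, m, v, t)
```

The same update rule applies element-wise to the weight and bias arrays of the second ML model 108, with the gradient supplied by back-propagation.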
In an embodiment, the database 104 may be a data repository designed to efficiently store and manage information. In some embodiments, the database 104 may be integrated into the server system 102. For example, the server system 102 may include one or more hard disk drives as the database 104. In various non-limiting examples, the database 104 may include one or more hard disk drives (HDD), solid-state drives (SSD), an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a redundant array of independent disks (RAID) controller, a storage area network (SAN) adapter, a network adapter, and/or any component providing the server system 102 with access to the database 104. In one implementation, the database 104 may be viewed, accessed, amended, updated, and/or deleted by an administrator (not shown) associated with the server system 102 through a database management system (DBMS) or relational database management system (RDBMS) present within the database 104. In another embodiment, the database 104 may further include a noisy dataset. In a non-limiting example, the noisy dataset further includes a set of second labeled data points. In an instance, the set of second labeled data points may include a plurality of correctly labeled data points and a plurality of data points with noisy labels.
In an embodiment, the server system 102 is configured to access the training dataset for training the first ML model 106 and the second ML model 108 from the database 104 associated with the server system. In various non-limiting examples, the training dataset may include information related to a plurality of entities that may be used to train the first ML model 106, and the second ML model 108. In other words, the training dataset includes information that can be used by the first ML model 106, and the second ML model 108 to learn and make predictions. For instance, in the financial domain, information may be related to a plurality of entities such as a plurality of transactions, a plurality of cardholders, a plurality of merchants, a plurality of acquirers, a plurality of issuers, a plurality of historical transactions, and so on.
In another embodiment, the server system 102 is configured to generate a plurality of second features based on the subset of second labeled data points. Then, the server system 102 is configured to generate a second latent space based, at least in part, on the plurality of second features. Further, the server system is configured to train the second ML model 108 based, at least in part, on performing a set of operations for a plurality of iterations until the performance of the second ML model converges to predefined criteria (described earlier). In a particular non-limiting implementation, the set of operations may include initializing the second ML model 108 based, at least in part, on one or more second model parameters. It may be understood that the one or more second model parameters may refer to the weights and biases associated with the AI or ML model. More specifically, for the initialization of the second ML model 108 for a first iteration of the plurality of iterations, the server system 102 may be configured to initiate the second ML model 108 based, at least in part, on one or more initial model parameters. Then, the server system 102 is configured to determine a second reconstruction loss based, at least in part, on the plurality of second features. Then, the server system is configured to compute a Z-score probability based on the first latent space and the second latent space. Thereafter, a second loss function is generated based on the second reconstruction loss and the Z-score probability. Using this second loss function, the server system 102 computes a second latent space loss. Further, the server system 102 optimizes the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
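The combination of the second reconstruction loss and the Z-score probability into a second loss may be sketched as follows. The additive combination, the mean-squared reconstruction term, and the `weight` hyperparameter are illustrative assumptions; the disclosure does not fix the exact combination rule.

```python
import numpy as np

def second_loss(x, x_hat, z_prob, weight=1.0):
    # Second reconstruction loss: mean squared error over the plurality of
    # second (e.g., fraud) features and their reconstructions.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Penalize embeddings whose Z-score probability indicates they likely
    # fall inside the first (e.g., non-fraud) latent space, pushing the two
    # latent spaces apart during training.
    return recon + weight * z_prob
```

The scalar returned here plays the role of the second latent space loss that is back-propagated to optimize the one or more second model parameters.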
In another embodiment, the server system 102 is configured to access a noisy dataset from the database. As described earlier, the noisy dataset includes a set of candidate labeled data points. Further, the server system 102 is configured to perform for each candidate labeled data point from the set of candidate labeled data points: (1) computing via the second ML model, the Z-score probability, (2) classifying each candidate labeled data point as a noisy data point based, at least in part, on the Z-score probability being higher than a predefined threshold, and (3) re-labeling each candidate labeled data point based, at least in part, on the classification step. As may be appreciated, this aspect of the present disclosure allows the server system to utilize the learned/trained second ML model to detect noisy labeled data points during deployment. Further, upon detection, these noisy labeled data points may either be re-labeled with the correct label or deleted, as per requirements.
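The deployment-time classification and re-labeling steps above may be sketched as follows. The threshold value and the label names are illustrative assumptions, with candidate points assumed to carry a ‘fraud’ label as in the running example.

```python
def detect_noisy_labels(z_probs, threshold=0.5):
    # For each candidate labeled data point, a high Z-score probability means
    # its embedding lies in the non-fraud latent space, so the fraud label is
    # flagged as noisy and a corrected label is proposed; otherwise the
    # original label is kept.
    results = []
    for p in z_probs:
        noisy = p > threshold
        results.append({
            "z_prob": p,
            "noisy": noisy,
            "proposed_label": "non-fraud" if noisy else "fraud",
        })
    return results
```

Downstream, the flagged data points may either be re-labeled with the proposed label or deleted from the dataset, as per requirements.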
It should be understood that the server system 102 is a separate part of the environment 100, and may operate apart from (but still in communication with, for example, via the network 110) any third-party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 102 may be incorporated, in whole or in part, into one or more parts of the environment 100.
It is pertinent to note that the various embodiments of the present disclosure have been described herein with respect to examples from the financial domain, and it should be noted the various embodiments of the present disclosure can be applied to a wide variety of applications as well and the same will be covered within the scope of the present disclosure as well. For instance, for recommender systems, the plurality of entities may be users and items.
The number and arrangement of systems, devices, and/or networks shown in FIG. 1A are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1A. Furthermore, two or more systems or devices as shown in FIG. 1A may be implemented within a single system or device, or a single system or device shown in FIG. 1A may be implemented as multiple, distributed systems or devices. In addition, the server system 102 should be understood to be embodied in at least one computing device in communication with the network 110, which may be specifically configured, via executable instructions, to perform steps as described herein, and/or embodied in at least one non-transitory computer-readable medium.
FIG. 1B illustrates an exemplary representation of an environment 118 related to at least some embodiments of the present disclosure. Although the environment 118 is presented in one arrangement, other embodiments may include the parts of the environment 118 (or other parts) arranged otherwise depending on, for example, training an ML model for detecting noisy labels, classifying data points with noisy labels with correct labels, and the like. The environment 118 generally includes various entities such as a server system 102, a cardholder 120, a merchant 130, an issuer server 124, an acquirer server 126, a payment network 132 including a payment server 134, and a database 128, each coupled to, and in communication with (and/or with access to) a network 110. The network 110 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among the entities illustrated in FIG. 1B, or any combination thereof.
Various entities in the environment 118 may connect to the network 110 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, any combination thereof or any future communication protocols. For example, the network 110 may include multiple different networks, such as a private network or a public network (e.g., the Internet, etc.) through which the server system 102 and the payment server 134 may communicate.
In an embodiment, the cardholder (e.g., the cardholder 120) may be any individual, representative of a corporate entity, non-profit organization, or any other person that is presenting payment account details during an electronic payment transaction. The cardholder (e.g., the cardholder 120) may have a payment account issued by an issuing bank (not shown in figures). Further, the cardholder 120 may be provided a payment card with financial or other account information encoded onto the payment card such that the cardholder (i.e., the cardholder 120) may use the payment card to initiate and complete a payment transaction using a bank account at the issuing bank.
The cardholder 120 may use their corresponding electronic device 122 to access a mobile application or a website associated with the issuing bank, or any third-party payment application. In various non-limiting examples, the electronic device 122 may refer to any electronic device such as, but not limited to, personal computers (PCs), tablet devices, Personal Digital Assistants (PDAs), voice-activated assistants, Virtual Reality (VR) devices, smartphones, and laptops.
In an embodiment, the merchant 130 may include retail shops, restaurants, supermarkets or establishments, government and/or private agencies, or any such places equipped with POS terminals, where a cardholder such as a cardholder 120 visits for performing the financial transaction in exchange for any goods and/or services or any financial transactions.
In one scenario, the cardholder 120 may use their corresponding payment accounts to conduct payment transactions with the merchant 130. The cardholder 120 may enter payment account details on their electronic device 122 such as a mobile device to perform an online payment transaction. In another example, the cardholder 120 may utilize a payment card to perform an offline payment transaction. Generally, the term “payment transaction” refers to an agreement that is carried out between a buyer and a seller to exchange assets as a form of payment (e.g., cash, currency, etc.). For example, the cardholder 120 may enter details of the payment card on an e-commerce platform to buy goods or products. In an example, the cardholder (e.g., the cardholder 120) may transact at the merchant 130.
In one embodiment, the cardholder 120 may be associated with the issuer server 124. In one embodiment, the issuer server 124 is associated with a financial institution normally called an “issuer bank”, “issuing bank” or simply “issuer”, in which a cardholder (e.g., the cardholder 120) may have the payment account. The issuer also issues a payment card, such as a credit card or a debit card, and provides microfinance banking services (e.g., payment transactions using credit/debit cards) for processing electronic payment transactions to the cardholder (e.g., the cardholder 120).
In an embodiment, the merchant 130 is associated with the acquirer server 126. In an embodiment, each merchant (e.g., the merchant 130) is associated with an acquirer server 126. In one embodiment, the acquirer server 126 is associated with a financial institution (e.g., a bank) that processes financial transactions for the merchant 130. This can be an institution that facilitates the processing of payment transactions for physical stores, merchants (e.g., the merchant 130), or institutions that own platforms that make either online purchases or purchases made via software applications possible (e.g., shopping cart platform providers and in-app payment processing providers). The terms “acquirer”, “acquiring bank”, or “acquirer server” will be used interchangeably herein.
In an exemplary scenario, the plurality of transactions performed between different cardholders or merchants may be labeled with one or more labels. These labels may include a fraudulent transaction label (or fraud label), a non-fraudulent transaction label (or non-fraud label), a first-party fraud label, a third-party fraud label, and so on. As may be understood, these labels are generally assigned to this plurality of transactions by a variety of entities such as the issuer, the acquirer, the merchant, or even some third-party entity. In some instances, different labels may be assigned to the same payment transaction by these different entities, leading to the labels associated with the transaction becoming noisy. In another instance, no entity might report the transaction as fraud, even if the transaction was indeed fraudulent. Similarly, human error while assigning labels manually to the plurality of transactions may also cause the labels to become noisy. As described earlier, if such noisy labeled data is used to train or learn an AI or ML model, the model thus learned will have sub-optimal performance. This occurs since the model will learn incorrect inferences from the mislabeled or noisy data, leading to improper predictions while the model is deployed. To that end, it is crucial to refine or clean these noisy labels before training any model for performing downstream tasks.
The above-mentioned technical problem among other problems is addressed by one or more embodiments implemented by the server system 102 of the present disclosure. In one embodiment, the server system 102 is configured to perform one or more of the operations described herein. It is understood that since the operation of server system 102 has been described in detail with reference to FIG. 1A, the same has not been explained again.
In an embodiment, the server system 102 may be configured to access a transaction dataset and a non-fraud latent space from the database associated with the server system 102. As may be appreciated, the process for generating the non-fraud latent space has already been described earlier in the present disclosure. Further, the transaction dataset includes a set of labeled data points, wherein the set of labeled data points includes a subset of non-fraud labeled data points and a subset of fraud labeled data points. Further, the server system 102 is configured to generate a plurality of fraud features based, at least in part, on the subset of fraud labeled data points. Further, the server system 102 is configured to generate a fraud latent space based, at least in part, on the plurality of fraud features. Additionally, the server system is configured to train an AI or ML model such as the second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion. The second set of operations further includes initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model including a plurality of second encoder layers and a plurality of second decoder layers. The second set of operations further includes determining a second reconstruction loss based, at least in part, on the plurality of fraud features. The second set of operations further includes computing a Z-score probability based, at least in part, on the non-fraud latent space and the fraud latent space. Herein, the Z-score probability indicates a probability of the fraud latent space being in the non-fraud latent space. The second set of operations further includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability.
The second set of operations further includes computing a second latent space loss based, at least in part, on the second loss function. The second set of operations further includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
FIG. 2 illustrates a simplified block diagram of a server system 200, in accordance with an embodiment of the present disclosure. It is noted that the server system 200 is identical to the server system 102 of FIG. 1A and FIG. 1B. In various implementations, the server system 200 may be implemented within a third-party server based on an application or industry for which the AI or ML model is being trained. In some embodiments, the server system 200 is embodied as a cloud-based and/or Software as a Service (SaaS) based architecture.
The server system 200 includes a computer system 202 and a database 204. It is noted that the database 204 is identical to the database 104 of FIG. 1A and the database 128 of FIG. 1B. The computer system 202 includes at least one processor 206 for executing instructions, a memory 208, a communication interface 210, a user interface 212, and a storage interface 214 that communicate with each other via a bus 216.
In some embodiments, the database 204 is integrated within the computer system 202. For example, the computer system 202 may include one or more hard disk drives as the database 204. The user interface 212 is any component capable of providing an administrator (not shown) of the server system 200, the ability to interact with the server system 200. This user interface 212 may be a GUI or Human Machine Interface (HMI) that can be used by the administrator (not shown) to configure the various operational parameters of the server system 200. The storage interface 214 is any component capable of providing the processor 206 with access to the database 204. The storage interface 214 may include, for example, an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a RAID controller, a SAN adapter, a network adapter, and/or any component providing the processor 206 with access to the database 204. In one non-limiting example, the database 204 is configured to store a training dataset 228, a first ML model 230, a second ML model 232, and the like. It is noted that the first ML model 230 and the second ML model 232 are identical to the first ML model 106 and the second ML model 108 of FIG. 1A, respectively.
The processor 206 includes suitable logic, circuitry, and/or interfaces to execute operations for training an AI or ML model in order to detect noisy labels, and the like. In other words, the processor 206 includes suitable logic, circuitry, and/or interfaces to execute operations for training the first ML model 106 and the second ML model 108 along with the detection of noisy labels in a noisy dataset. Examples of the processor 206 include but are not limited to, an Application-Specific Integrated Circuit (ASIC) processor, a Reduced Instruction Set Computing (RISC) processor, a Graphical Processing Unit (GPU), a Complex Instruction Set Computing (CISC) processor, a Field-Programmable Gate Array (FPGA), and the like.
The memory 208 includes suitable logic, circuitry, and/or interfaces to store a set of computer-readable instructions for performing various operations described herein. Examples of the memory 208 include a Random-Access Memory (RAM), a Read-Only Memory (ROM), a removable storage drive, a Hard Disk Drive (HDD), and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 208 in the server system 200, as described herein. In another embodiment, the memory 208 may be realized in the form of a database server or a cloud storage working in conjunction with the server system 200, without departing from the scope of the present disclosure.
The processor 206 is operatively coupled to the communication interface 210 such that the processor 206 is capable of communicating with a remote device 218 such as a third-party server (not shown), or communicating with any entity connected to the network 108 (as shown in FIG. 1). Herein, the third-party server may be any computing server that uses the server system 200 to train an AI or ML model.
It is noted that the server system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is noted that the server system 200 may include fewer or more components than those depicted in FIG. 2.
In one implementation, the processor 206 includes a latent space generation module 220, a model generation module 222, a loss computation module 224, and a model optimization module 226. It should be noted that components, described herein, such as the latent space generation module 220, the model generation module 222, the loss computation module 224, and the model optimization module 226 can be configured in a variety of ways, including electronic circuitries, digital arithmetic, and logic blocks, and memory systems in combination with software, firmware, and embedded technologies.
In an embodiment, the latent space generation module 220 includes suitable logic and/or interfaces for accessing a training dataset 228 from the database 204. In various non-limiting examples, the training dataset 228 may include information related to a plurality of entities and a plurality of transactions. In various non-limiting examples pertaining to the financial domain, within the payment ecosystem, the plurality of entities may include a plurality of cardholders, a plurality of merchants, a plurality of issuer servers, and a plurality of acquirer servers. Further, the information related to these entities may include information related to a plurality of historical payment transactions performed by the plurality of cardholders with the plurality of merchants. Further, the information related to the various entities may include the classification of transactions within the training dataset. For instance, the plurality of transactions within the training dataset may be classified and labeled as ‘fraud’ and ‘non-fraud’ transactions. It is noted that although the aforementioned non-limiting example is specific to the financial industry or payment ecosystem, the various operations of the present disclosure are not limited to the same. To that end, the training dataset 228 can be configured to include different information specific to any field of operation. Therefore, it is understood that the various embodiments of the present disclosure apply to a variety of different fields of operation, and the same is covered within the scope of the present disclosure.
Returning to the previous example, the training dataset 228 may include information related to a plurality of historical payment transactions performed within a predetermined interval of time (e.g., 6 months, 12 months, 24 months, etc.) and further classified and labeled as ‘fraud’ or ‘non-fraud’ transactions by various entities as mentioned earlier such as the merchant 116, the issuer server 118, the acquirer server 120. In some other non-limiting examples, the training dataset 228 includes information related to at least merchant name identifier, unique merchant identifier, timestamp information (i.e., transaction date/time), geo-location related data (i.e., latitude and longitude of the cardholder/merchant), Merchant Category Code (MCC), merchant industry, merchant super industry, information related to payment instruments involved in the set of historical payment transactions, cardholder identifier, Permanent Account Number (PAN), merchant name, country code, transaction identifier, transaction amount, transaction label, i.e., either fraud or non-fraud label, and the like.
In one example, the training dataset 228 may define a relationship between each of the plurality of entities. In a non-limiting example, a relationship between a cardholder account and a merchant account may be defined by a transaction performed between them. For instance, when a cardholder purchases an item from a merchant, a relationship is said to be established.
In another embodiment, the training dataset 228 may include information related to past payment transactions such as transaction date, transaction time, geo-location of a transaction, transaction amount, transaction marker (e.g., fraudulent or non-fraudulent), and the like. In yet another embodiment, the training dataset 228 may include information related to a plurality of acquirer servers such as the date of merchant registration with the acquirer server, amount of payment transactions performed at the acquirer server in a day, number of payment transactions performed at the acquirer server in a day, maximum transaction amount, minimum transaction amount, number of fraudulent merchants or non-fraudulent merchants registered with the acquirer server, and the like.
In another embodiment, the latent space generation module 220 is configured to generate a plurality of first features based, at least in part, on the subset of first labeled data points. For example, in the financial domain, the first labeled data points may be the ‘non-fraud’ labeled data points. Then, the latent space generation module 220 is configured to generate an embedding for each data point of the first labeled data points in a first latent space. Similarly, the latent space generation module 220 is configured to generate a plurality of second features based, at least in part, on the subset of second labeled data points. For example, in the financial domain, the second labeled data points may be the ‘fraud’ labeled data points. Then, the latent space generation module 220 is configured to generate an embedding for each data point of the second labeled data points in a second latent space. In an instance, the embeddings may be generated by the first ML model or the second ML model, respectively. It is understood that since the architecture of the first ML model and the second ML model are similar, the first latent space and second latent space generated by the same will be identical in dimensions as well.
It is understood that the latent space refers to a lower-dimensional space that captures essential features or representations (i.e., embeddings) of different data samples. In another embodiment, the set of first features, the first latent space, the set of second features, and the second latent space may be stored within the database such that they can be accessed at any point by the server system 200.
In various non-limiting examples, the latent space generation module 220, the model generation module 222, the loss computation module 224, and the model optimization module 226 are communicably connected to each other, and the processor 206 is configured to utilize these modules to perform the set of operations described herein.
In an embodiment, the model generation module 222 includes suitable logic and/or interfaces for creating, training, and fine-tuning the first ML model 230 and the second ML model 232 such that they can make intelligent decisions or predictions based on the training data. It is noted that the process of model generation includes the first step of learning or training the model on relevant data. This data can come from various sources, including databases such as the database 204, training datasets accumulated from various sources/entities such as the training dataset 228, previous user interactions, and the like; an explanation regarding the same is not provided here for the sake of brevity. Further, raw data from the various sources is preprocessed, which involves tasks like handling missing values, scaling features, encoding categorical variables, and splitting the data into training and testing sets. Thereafter, the process of model generation involves feature engineering, which is the process of selecting, transforming, or creating new features (input variables) from the available training dataset. Further, the machine learning model is selected and trained, wherein the model learns to identify patterns and relationships within the data. It is to be noted that model generation is often an iterative process. As more training data, such as the training dataset 228, is collected and additional insights are gained, the models can be retrained and improved to maintain or enhance their predictive accuracy. It is noted that the process for training the ML models is described in detail with reference to FIGS. 3 and 4 later in the present disclosure.
In an embodiment, the loss computation module 224 includes suitable logic and/or interfaces for computing the loss associated with the first ML model 230 and the second ML model 232. In particular, for computing the loss associated with the first ML model 230, a set of first features is generated based on the subset of first labeled transactions. For example, for computing the loss associated with the first ML model 230, a set of non-fraud features is generated based on the subset of non-fraud labeled transactions in the training dataset. Further, a plurality of first embeddings for the first latent space are generated based on the set of first features. The first ML model is then configured to mask a few features from the set of first features. Thereafter, the first ML model is configured to predict/recreate/reconstruct the masked features based on the remaining first features. Finally, a comparison is performed between the reconstructed features and the actual features that were masked to determine a first reconstruction loss. In various instances, a first loss function is used to compute the first reconstruction loss.
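The masking-and-comparison procedure described above can be sketched as follows. This is a minimal, non-limiting illustration in which all function and variable names are hypothetical; the disclosure does not prescribe a particular implementation:

```python
import numpy as np

def masked_reconstruction_loss(features, reconstructed, mask):
    """Mean-squared error restricted to the masked positions, so that
    only the features the model had to predict contribute to the loss."""
    diff = (features - reconstructed) * mask
    return float(np.sum(diff ** 2) / np.sum(mask))

# Toy data: 2 samples with 3 first features each; the last feature of
# each sample is masked and must be reconstructed from the rest.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_hat = np.array([[1.0, 2.0, 2.5],
                  [4.0, 5.0, 6.5]])
mask = np.array([[False, False, True],
                 [False, False, True]])
loss = masked_reconstruction_loss(X, X_hat, mask)  # (0.5**2 + 0.5**2) / 2 = 0.25
```

Averaging only over the masked entries keeps unmasked (trivially copied) features from diluting the loss.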
Similarly, in another embodiment, the loss computation module 224 is configured to compute the loss associated with the second ML model 232. In particular, a set of second features is generated based on the subset of second labeled transactions. For example, for computing the loss associated with the second ML model 232, a set of fraud features is generated based on the subset of fraud labeled transactions in the training dataset. Further, a plurality of second embeddings for the second latent space is generated based on the set of second features. The second ML model 232 is then configured to mask a few features from the set of second features. Thereafter, the second ML model 232 is configured to predict/recreate/reconstruct the masked features based on the remaining second features. Finally, a comparison is performed between the reconstructed features and the actual features that were masked to determine a second reconstruction loss. Further, a Z-score probability indicating a probability of the second latent space being in the first latent space is computed by the loss computation module 224 based, at least in part, on the first latent space and the second latent space. This aspect has been described in detail later in the present disclosure with reference to FIG. 4. Thereafter, the loss computation module 224 is configured to generate a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability. In various instances, the second loss function is used to compute a second latent space loss.
In another embodiment, the model optimization module 226 includes suitable logic and/or interfaces for optimizing the one or more model parameters associated with the first ML model 230 and the second ML model 232 by back-propagating the first latent space loss and the second latent space loss, respectively. It is to be noted that model optimization through backpropagation of the loss is a fundamental concept in the training of neural networks, especially in deep learning. It is a crucial step that allows a neural network to learn from a training dataset such as the training dataset 228 and adjust its model parameters to make better predictions. Additionally, the essence of model optimization occurs during backpropagation. It involves the calculation of gradients of a loss function, such as the first loss function and the second loss function, with respect to the model's parameters. Gradients essentially represent how much the loss would change if each parameter were adjusted slightly. Further, the above set of operations is iteratively repeated till the performance of each ML model converges to the respective predefined criteria. It is noted that the predefined criteria may refer to a point in the iterative process where the values of the calculated first latent space loss and the second latent space loss either minimize or saturate (i.e., stop or effectively cease to decrease with successive iterations).
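The gradient-based update described above can be illustrated with a one-parameter sketch. Forward finite differences stand in for backpropagation purely for illustration, and all names are hypothetical:

```python
import numpy as np

def gradient_step(params, loss_fn, lr=0.1, eps=1e-6):
    """One gradient-descent update: estimate d(loss)/d(param) by forward
    finite differences (a stand-in for backpropagation) and move each
    parameter slightly against its gradient."""
    grads = np.zeros_like(params)
    base = loss_fn(params)
    for i in range(len(params)):
        p = params.copy()
        p[i] += eps
        grads[i] = (loss_fn(p) - base) / eps
    return params - lr * grads

# Minimizing loss(w) = (w - 3)^2: repeated steps converge toward w = 3,
# the point where the loss saturates (the 'predefined criteria').
w = np.array([0.0])
for _ in range(100):
    w = gradient_step(w, lambda p: float((p[0] - 3.0) ** 2))
```

In an actual neural network the gradients would be obtained analytically by backpropagation rather than by finite differences, but the update rule is the same.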
Once the training of the second ML model 232 is completed, the second ML model 232 can be deployed to determine noisy labels from a noisy dataset. In an embodiment, the model generation module 222 is configured to access the noisy dataset from the database. It is understood that the noisy dataset may be any dataset that includes noisy labels. Then, the model generation module 222 is configured to perform a set of steps for each candidate labeled data point from the set of labeled data points present in the noisy dataset. The steps may include (1) computing, via the second ML model, the Z-score probability, (2) classifying each candidate labeled data point as a noisy data point based, at least in part, on the Z-score probability being higher than a predefined threshold, and (3) re-labeling each candidate labeled data point based, at least in part, on the classification step. This aspect allows for refining noisy data points into correctly labeled data points. As may be understood, once the noisy data points are refined, they can be used to train any AI or ML model for any downstream task such that those models will have improved learning and higher performance.
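The three steps above may be sketched as follows for binary labels (0 = non-fraud, 1 = fraud). The threshold value, probabilities, and names are illustrative assumptions only:

```python
import numpy as np

def detect_noisy_labels(z_score_probs, threshold):
    """Step (2): flag a candidate data point as noisy when its two-tailed
    Z-score probability (step (1), computed via the second ML model)
    exceeds the predefined threshold."""
    return np.asarray(z_score_probs) > threshold

# Four candidates all reported as 'non-fraud' (label 0).
labels = np.array([0, 0, 0, 0])
probs = np.array([0.02, 0.91, 0.10, 0.85])       # from the second ML model
noisy = detect_noisy_labels(probs, threshold=0.5)
relabeled = np.where(noisy, 1 - labels, labels)  # step (3): flip noisy labels
```

Points with a high Z-score probability are flagged and re-labeled; the rest keep their reported labels.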
FIG. 3 illustrates an exemplary representation 300 of an architecture of a first machine learning model, in accordance with an example of the present disclosure.
As described earlier, the first machine learning model may be implemented as a Variational Auto Encoder (VAE) model. In this implementation, the VAE model is trained on a subset of first labeled data points from the training dataset. For the sake of explanation, the subset of first labeled data points is assumed to be a set of non-fraud labeled data points. At first, the VAE model may be initialized based on first model parameters (such as weights, biases, etc., among other neural network parameters) such that the VAE model may include a plurality of first encoder layers and a plurality of first decoder layers. As illustrated, the plurality of first encoder layers is depicted as a neural network encoder 304, and the plurality of first decoder layers is depicted as a neural network decoder 310 in FIG. 3. The neural network encoder 304 is often referred to as the recognition network or inference network. The neural network encoder 304 is configured to take the set of non-fraud labeled data points as the input data 302. More specifically, the set of non-fraud labeled data points is used to generate a plurality of first features.
Then, the VAE model is configured to learn feature representations or embeddings for the first features and map these representations to a latent space representation termed the first latent space. In particular, the neural network encoder 304 is responsible for encoding the first features into a lower-dimensional, continuous latent space, where each dimension of the latent space corresponds to a feature or property of each first feature. For example, with reference to the financial domain, when the first ML model 106 (which is a variational autoencoder) is trained on the training dataset that corresponds to the subset of non-fraud labeled data points, the VAE model generates embeddings for the non-fraud labeled data points in the non-fraud latent space, where each dimension of the non-fraud latent space corresponds to a feature or property of the non-fraud labeled data points. It is understood that the VAE model is used herein as the first ML model instead of an autoencoder because the input features are encoded as a distribution over the latent space, thereby allowing the model to generate accurate first latent space (i.e., non-fraud latent space) distribution parameters.
In an embodiment, the neural network encoder 304 may typically include several fully connected (dense) layers or convolutional layers, depending on the type of data being processed. These layers of the neural network encoder 304 are configured to gradually reduce the spatial or feature dimensions and increase the depth to capture the hierarchical features of the set of non-fraud labeled data points. Further, the final layer of the neural network encoder 304 outputs two vectors for each data point, namely, the mean of the first latent space distribution, mathematically represented as µ(h(nF)), and the standard deviation of the first latent space distribution, mathematically represented as s(h(nF)). More specifically, these vectors are used to parameterize the Gaussian distribution, mathematically represented as N(µ, s), in the first latent space. It is to be noted that, mathematically, h(nF) is used to represent the first latent space encoding of the input features obtained from the encoder architecture.
In an embodiment, the first latent space is generated based, at least in part, on the plurality of first features. Further, the training of the first ML model 106 is based, at least in part, on performing a first set of operations iteratively till the performance of the first ML model 106 converges to a first predefined criteria. In an embodiment, the first set of operations includes initializing the first ML model 106 based, at least in part, on one or more first model parameters, the first ML model including a plurality of first encoder layers and a plurality of first decoder layers. The first set of operations further includes determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of first features. The first set of operations further includes optimizing one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
Additionally, the architecture of the VAE model includes a neural network decoder 310. The neural network decoder 310 takes samples from the latent space and reconstructs the original input data. It is responsible for generating data that closely resembles the training examples. The reconstructed input is shown at 312 in FIG. 3. Additionally, every point from the first latent space is sampled from the first latent space distribution, the sampled point is decoded, and the reconstruction error is computed.
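The per-point sampling step above may be sketched with the reparameterization trick commonly used in VAEs (an implementation assumption; the disclosure does not name the sampling scheme, and all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, sigma):
    """Draw a latent point from N(mu, sigma) as mu + sigma * eps with
    eps ~ N(0, 1), keeping the sampling step differentiable with
    respect to the encoder outputs mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.array([0.0, 1.0])      # µ(h(nF)) for one data point
sigma = np.array([0.1, 0.2])   # s(h(nF)) for the same point
z = sample_latent(mu, sigma)   # passed to the decoder to reconstruct X
```

The sampled point z is what the neural network decoder 310 decodes; the reconstruction error is then computed against the original input.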
In an embodiment, the first reconstruction loss is determined by generating a set of masked first features based, at least in part, on a subset of first features from the plurality of first features. Further, a set of reconstructed first features is computed, via the first ML model 106, based, at least in part, on the set of masked first features. Further, the first reconstruction loss is determined based, at least in part, on comparing the plurality of first features and the set of reconstructed first features.
Additionally, a first reconstruction loss is computed mathematically as ||X-D(h(nF))||^2.
Here, X is the N-dimensional feature vector of the first samples and D(h(nF)) is the N-dimensional reconstructed first feature vector, wherein N is a natural number.
The variational autoencoder model 300 also uses the Kullback–Leibler (KL) divergence function to compute a KL divergence loss. These two losses are combined to define a first loss function computed using the first ML model 106. In an example, the first loss function may be defined by the following Eqn. 1:
First loss function = min{Reconstruction loss + KL divergence loss}
= min{||X-D(h(nF))||^2 + KL[N(µ(h(nF)), s(h(nF))), N(0, 1)]} … Eqn. 1
Here, KL[N(µ(h(nF)),s(h(nF))),N(0,1)] is the KL divergence loss wherein, N (µ, s) is the Gaussian distribution with mean µ and standard deviation s and N (0, 1) is the standard normal distribution.
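Eqn. 1 can be sketched numerically using the closed-form KL divergence between a diagonal Gaussian and the standard normal, KL = 0.5 * Σ(s^2 + µ^2 - 1 - ln s^2). The code below is an illustrative assumption about how the terms combine, not the disclosed implementation:

```python
import numpy as np

def first_loss(X, X_rec, mu, sigma):
    """Eqn. 1: squared reconstruction error ||X - D(h(nF))||^2 plus the
    closed-form KL[N(mu, sigma), N(0, 1)]
      = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))."""
    recon = float(np.sum((X - X_rec) ** 2))
    kl = 0.5 * float(np.sum(sigma ** 2 + mu ** 2 - 1.0 - np.log(sigma ** 2)))
    return recon + kl

# With a perfect reconstruction and mu = 0, sigma = 1, both terms vanish.
X = np.array([[0.5, -0.5]])
loss = first_loss(X, X, np.zeros(2), np.ones(2))  # 0.0
```

Minimizing the KL term pulls the encoder's distribution toward N(0, 1), while the reconstruction term keeps the decoded output close to the input.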
FIG. 4 illustrates an exemplary representation of an architecture of a second ML model, in accordance with an embodiment of the present disclosure.
As described earlier, the second ML model may be implemented as an Auto Encoder (AE) model. In this implementation, the AE model is trained on a subset of second labeled data points from the training dataset. For the sake of explanation, the subset of second labeled data points is assumed to be a set of fraud labeled data points.
Herein, the AE model is trained on a subset of second labeled data points from the training dataset, enabling the model to identify second labeled data points that might be wrongly labeled as first labeled data points in any candidate noisy dataset. In other words, the AE model is configured with the objective or task to identify fraud labeled data points that might be wrongly labeled as non-fraud in the noisy dataset. At first, the AE model may be initialized based on second model parameters (such as weights, biases, etc., among other neural network parameters) such that the AE model may include a plurality of second encoder layers and a plurality of second decoder layers. It is noted that the architecture of the AE model of FIG. 4 is similar to the VAE of FIG. 3, with a similar number of layers and neurons in each layer of the encoder layers and the decoder layers.
As illustrated, the plurality of second encoder layers is depicted as a neural network encoder 404, and the plurality of second decoder layers is depicted as a neural network decoder 408 in FIG. 4. The neural network encoder 404 is often referred to as the recognition network or inference network. The neural network encoder 404 is configured to take the set of fraud labeled data points as the input data 402. More specifically, the set of fraud labeled data points is used to generate a plurality of second features. Then, the AE model is configured to learn feature representations or embeddings for the second features and map these representations to a latent space representation termed the second latent space.
In particular, the neural network encoder 404 is responsible for encoding the second features into a lower-dimensional, continuous latent space, where each dimension of the latent space corresponds to a feature or property of each second feature. For example, with reference to the financial domain when the second ML model (which is an autoencoder) is trained on the training dataset that corresponds to the subset of fraud labeled data points, the AE model generates embeddings for the fraud labeled data points in the fraud latent space where each dimension of the fraud latent space corresponds to a feature or property of the fraud labeled data points.
In an embodiment, the AE model is trained based on performing a second set of operations iteratively till the performance of the second ML model converges to a second predefined criteria.
In a non-limiting implementation, the second set of operations includes initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model includes a plurality of second encoder layers and a plurality of second decoder layers. Further, the second set of operations includes determining a second reconstruction loss based, at least in part, on the plurality of second features. Further, the second set of operations includes computing a Z-score probability based, at least in part, on the first latent space and the second latent space, the Z-score probability indicating a probability of the second latent space being in the first latent space. Further, the second set of operations includes generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability and computing a second latent space loss based, at least in part, on the second loss function.
Furthermore, the second set of operations includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss. The objective of this autoencoder 400, with its modified loss function, is to accurately distinguish between the first labeled data points and the second labeled data points, thereby enabling it to effectively detect noisy samples. The AE model uses the clean second labeled data points as the input. It also uses a second loss function which includes the second reconstruction loss and the second latent space loss. The second latent space loss is generated using the z-score probability that the second latent space lies in the first latent space distribution. In a non-limiting example, the following Eqn. 2 may be used to denote the second loss function:
min{||X-D(h(F))||^2 + 2*P(Z >= |(h(F)-µ(h(nF)))/s(h(nF))|)} … Eqn. 2
wherein, X is the N-dimensional features of the subset of second labeled data points, D(h(F)) is the N-dimensional reconstructed plurality of second features, h(F) is the second latent space of the plurality of second features, µ(h(nF)) is the mean of the first latent space distribution, s(h(nF)) is the standard deviation of the first latent space distribution, and 2*P(Z >= |(h(F)-µ(h(nF)))/s(h(nF))|) is the second latent space loss.
In an embodiment, the second reconstruction loss is the mean-squared error between the feature and the reconstructed feature obtained from the decoder. Additionally, the second latent space loss is the z-score probability that the second latent space lies in the first latent space distribution N(µ(h(nF)), s(h(nF))) obtained from the Variational Autoencoder model, i.e., the first ML model. The ‘latent space loss’ is a probability term that represents the two-tailed z-score probability for a normal distribution (see, graph 412) for each latent space vector. The latent space loss provides the probability that the second latent space obtained during training the AE model is close to the first latent space distribution obtained from the VAE model. Hence, it is the objective of this model to minimize this probability so that the trained autoencoder can accurately reconstruct a test second sample without knowing its reported label.
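A numerical sketch of Eqn. 2 follows; the two-tailed probability is computed with the complementary error function, and averaging the per-dimension probabilities is an assumption made here for illustration (the disclosure treats the latent vector as a whole), with all names hypothetical:

```python
import math
import numpy as np

def two_tailed_prob(z):
    """2 * P(Z >= |z|) for a standard normal Z, i.e. erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def second_loss(X, X_rec, h_F, mu_nF, sigma_nF):
    """Eqn. 2: second reconstruction loss plus the second latent space
    loss, the probability that the fraud latent vector h_F lies in the
    non-fraud latent distribution N(mu_nF, sigma_nF)."""
    recon = float(np.sum((X - X_rec) ** 2))
    z = (h_F - mu_nF) / sigma_nF                    # Eqn. 3, per dimension
    latent = float(np.mean([two_tailed_prob(v) for v in z]))
    return recon + latent

# A fraud latent vector far from the non-fraud distribution makes the
# latent term nearly zero, so the loss reduces to the reconstruction term.
X = np.array([[0.2, -0.1]])
loss = second_loss(X, X, np.array([6.0]), np.array([0.0]), np.array([1.0]))
```

Minimizing this loss pushes the fraud latent space away from the non-fraud distribution while keeping reconstructions of clean fraud samples accurate.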
Further, during training, the above second loss function is used in backpropagation. The second loss obtained from the second loss function is minimized in the gradient descent direction to update the weights. This results in an effective second latent space that is distinctly different from the first latent space, ensuring better discrimination between the two.
The two-tailed z-score probability is calculated by converting the latent space vector into a z-score using the mean and standard deviation vectors of the first latent space, wherein the z-score can be defined as:
z-score = (h(F)-µ(h(nF)))/s(h(nF)) … Eqn. 3
Here, h(F) is the m-dimensional second latent space of the input feature, µ(h(nF)) is the mean of the first latent space distribution, and s(h(nF)) is the standard deviation of the first latent space distribution.
Further, in an embodiment, the z-score probability is the area covered by the normal distribution curve on both sides of the tail for a given z-score (see, graph 412). Due to the symmetric nature of the normal distribution, this probability is twice the right-tail area, i.e., z-score probability = 2 * P(Z >= |z-score|), where Z is a standard normal random variable. The two-tailed probability can also be calculated from a statistical z-score table.
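The two-tailed probability can be computed directly from the normal distribution rather than looked up in a z-score table; the sketch below checks the familiar table checkpoints (names are illustrative):

```python
import math

def two_tailed_prob(z):
    """Two-tailed probability 2 * P(Z >= |z-score|) for a standard normal
    Z, computed as erfc(|z| / sqrt(2)) = 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Familiar z-table checkpoints: z = 1.96 -> ~0.05 and z = 2.576 -> ~0.01.
p_196 = round(two_tailed_prob(1.96), 3)    # 0.05
p_258 = round(two_tailed_prob(2.576), 3)   # 0.01
```

This agrees with the right-tail areas read off a statistical z-score table, doubled for the two-tailed case.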
In an additional embodiment, the AE model trained as explained with reference to FIG. 4 using the second loss function is employed to detect noisily labeled first labeled data points in any candidate dataset. For example, in the financial domain, this model identifies samples that are fraudulent but have been wrongly reported as non-fraud in the dataset. Such instances will be easily reconstructed by the autoencoder 400, and their reconstruction error will align with the reconstruction error distribution of the clean second (i.e., fraud) labeled data points.
In a non-limiting example, the non-fraud instances in the dataset are reconstructed by the fraud autoencoder. Genuine non-fraud samples will not be accurately reconstructed by the autoencoder, leading to very high reconstruction errors. However, samples that are fraudulent but mislabeled as non-fraud will be easily reconstructed by the autoencoder. A noise score is assigned to each non-fraud sample by calculating the two-tailed Z-score probability, aiding in identifying potential fraudulent instances.
If ê is the reconstruction error of a given sample after reconstructing it using the fraud autoencoder (i.e., the second ML model 108), the noise score is a two-tailed Z-score probability that is computed as:
2*P(Z >= |(ê-µ(e(F)))/s(e(F))|) … Eqn. 4
wherein, Z is a standard normal random variable, ê is the reconstruction error, µ(e(F)) is the mean of the reconstruction error of clean fraud samples, and s(e(F)) is the standard deviation of the reconstruction error of clean fraud samples.
In an embodiment, the noisy samples detected among the first labeled data points are identified by their very high z-score probability. The top n% of first labeled data points, sorted in descending order based on their z-score probabilities, can be considered noisy samples. The threshold for the z-score probability or the choice of the specific percentage of first labeled data points (n %) can be adjusted as needed.
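The top-n% selection described above may be sketched as follows; the percentage, scores, and names are illustrative assumptions:

```python
import numpy as np

def top_n_percent_noisy(noise_scores, n_percent):
    """Return indices of the top n% of candidates by noise score
    (two-tailed Z-score probability from Eqn. 4), sorted in descending
    order; these are flagged as potentially noisy first labeled points."""
    scores = np.asarray(noise_scores)
    k = max(1, int(len(scores) * n_percent / 100))
    return np.argsort(-scores)[:k]

scores = [0.02, 0.91, 0.10, 0.85, 0.01]              # noise scores, Eqn. 4
flagged = top_n_percent_noisy(scores, n_percent=40)  # top 2 of 5 candidates
```

Adjusting n_percent (or an absolute probability threshold) trades off how aggressively mislabeled samples are flagged.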
FIG. 5 illustrates a process flow diagram depicting a method 500 for noisy label detection while training a machine learning model such as the second ML model 108, in accordance with an embodiment of the present disclosure. The method 500 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 500 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 500, and combinations of operations in the method 500 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 500. The process flow starts at operation 502.
At 502, the method 500 includes accessing, by a server system such as server system 200 of FIG. 2, a training dataset and a first latent space from a database 204 associated with the server system, the training dataset 228 including a set of labeled data points wherein, the set of labeled data points includes a subset of first labeled data points and a subset of second labeled data points.
At 504, the method 500 includes generating, by the server system 200, a plurality of first features based, at least in part, on the subset of first labeled data points.
At 506, the method 500 includes generating, by the server system 200, a first latent space based, at least in part, on the plurality of first features.
At 508, the method 500 includes training, by the server system 200, a first ML model based, at least in part, on performing a first set of operations iteratively until the performance of the first ML model converges to a first predefined criterion.
In an implementation, the first set of operations includes sub-operations 508A-508C. It is noted that the first predefined criterion may refer to a point in the iterative process where the value of the first reconstruction loss minimizes or saturates (i.e., stops or effectively ceases to decrease with successive iterations).
At 508A, the method 500 includes initializing the first ML model based, at least in part, on one or more first model parameters, the first ML model including a plurality of first encoder layers and a plurality of first decoder layers.
At 508B, the method 500 includes determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of first features.
At 508C, the method 500 includes optimizing the one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
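Sub-operations 508A-508C can be sketched as a minimal linear autoencoder trained by back-propagating the first reconstruction loss (an illustrative numpy sketch under simplifying assumptions: a single linear encoder layer and decoder layer, mean per-sample squared error, and a simple loss-saturation stopping criterion; not the claimed architecture):

```python
import numpy as np

def train_first_autoencoder(X, d=3, lr=0.05, max_steps=2000, tol=1e-9, seed=0):
    """Minimal linear autoencoder: one encoder layer and one decoder layer,
    trained by back-propagating the first reconstruction loss (508A-508C).
    All names and hyperparameters here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    # 508A: initialize the first model parameters
    W_enc = rng.normal(scale=0.1, size=(N, d))
    W_dec = rng.normal(scale=0.1, size=(d, N))
    losses = []
    for _ in range(max_steps):
        H = X @ W_enc                       # first latent space h(nF)
        X_hat = H @ W_dec                   # reconstruction D(h(nF))
        err = X_hat - X
        # 508B: first reconstruction loss (mean per-sample squared error)
        loss = float(np.mean(np.sum(err ** 2, axis=1)))
        losses.append(loss)
        # 508C: back-propagate the reconstruction loss through both layers
        grad_out = 2.0 * err / n
        grad_W_dec = H.T @ grad_out
        grad_W_enc = X.T @ (grad_out @ W_dec.T)
        W_dec -= lr * grad_W_dec
        W_enc -= lr * grad_W_enc
        # First predefined criterion: loss effectively stops decreasing
        if len(losses) > 1 and losses[-2] - losses[-1] < tol:
            break
    return W_enc, W_dec, losses
```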
FIG. 6 illustrates a process flow diagram depicting a method 600 for noisy label detection while training a machine learning model such as the second ML model 108, in accordance with an embodiment of the present disclosure. The method 600 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 600 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 600, and combinations of operations in the method 600 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 600. The process flow starts at operation 602.
At 602, the method 600 includes accessing, by a server system such as the server system 200 of FIG. 2, a training dataset 228 and a first latent space from a database 204 associated with the server system 200, the training dataset 228 including a set of labeled data points, wherein the set of labeled data points includes a subset of first labeled data points and a subset of second labeled data points.
At 604, the method 600 includes generating, by the server system such as the server system 200, a plurality of second features based, at least in part, on the subset of second labeled data points.
At 606, the method 600 includes generating, by the server system 200, a second latent space based, at least in part, on the plurality of second features.
At 608, the method 600 includes training, by the server system 200, a second ML model 108 based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion.
In an implementation, the second set of operations includes sub-operations 608A-608F. It is noted that the second predefined criterion may refer to a point in the iterative process where the value of the second latent space loss minimizes or saturates (i.e., stops or effectively ceases to decrease with successive iterations).
At 608A, the method 600 includes initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model including a plurality of second encoder layers and a plurality of second decoder layers.
At 608B, the method 600 includes determining a second reconstruction loss based, at least in part, on the plurality of second features.
At 608C, the method 600 includes computing a Z-score probability based, at least in part, on the first latent space and the second latent space, the Z-score probability indicating a probability of the second latent space being in the first latent space.
At 608D, the method 600 includes generating a second loss function based, at least in part, on the second reconstruction loss and the z-score probability.
At 608E, the method 600 includes computing a second latent space loss based, at least in part, on the second loss function.
At 608F, the method 600 includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
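The second loss function of sub-operations 608B-608E can be sketched as the sum of a reconstruction term and a z-score probability term, following the expression min{||X-D(h(F))||^2 + 2*P(Z=|(h(F)-µ(h(nF)))/(s(h(nF)))|)} (a hedged sketch; summarizing the latent vector h(F) by its mean and using the two-tailed tail probability are illustrative simplifications):

```python
import math

def second_loss(x, x_hat, h_f, mu_nf, sigma_nf):
    """Combined second loss for one sample: reconstruction term plus
    second latent space loss. Summarizing the latent vector h(F) by its
    mean is an illustrative assumption."""
    # Second reconstruction loss ||X - D(h(F))||^2
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Z-score of the latent summary against the first latent space statistics
    h = sum(h_f) / len(h_f)
    z = (h - mu_nf) / sigma_nf
    # Second latent space loss: two-tailed probability 2*P(Z >= |z|)
    latent_loss = math.erfc(abs(z) / math.sqrt(2.0))
    return recon + latent_loss
```

Minimizing this combined loss simultaneously improves reconstruction of the second (fraud) features and drives their latent representation away from the first (non-fraud) latent space.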
FIG. 7 illustrates a process flow diagram depicting a method 700 for training a machine learning model for detecting noisy labels in transaction data, in accordance with an embodiment of the present disclosure. The method 700 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 700 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 700, and combinations of operations in the method 700 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 700. The process flow starts at operation 702.
At 702, the method 700 includes accessing, by a server system, a transaction dataset and a non-fraud latent space from a database such as the database 204 associated with the server system 200, the transaction dataset including a set of labeled data points, wherein the set of labeled data points includes a subset of non-fraud labeled data points and a subset of fraud labeled data points.
At 704, the method 700 includes generating, by the server system, a plurality of fraud features based, at least in part, on the subset of fraud labeled data points.
At 706, the method 700 includes generating, by the server system, a fraud latent space based, at least in part, on the plurality of fraud features.
At 708, the method 700 includes training, by the server system, a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion. The second set of operations may include sub-operations 708A-708F given below:
At 708A, the method 700 includes initializing the second ML model 108 based, at least in part, on one or more second model parameters, the second ML model 108 including a plurality of second encoder layers and a plurality of second decoder layers.
At 708B, the method 700 includes determining a second reconstruction loss based, at least in part, on the plurality of fraud features.
At 708C, the method 700 includes computing a Z-score probability based, at least in part, on the non-fraud latent space and the fraud latent space, the Z-score probability indicating a probability of the fraud latent space being in the non-fraud latent space.
At 708D, the method 700 includes generating a second loss function based, at least in part, on the second reconstruction loss and the z-score probability.
At 708E, the method 700 includes computing a second latent space loss based, at least in part, on the second loss function.
At 708F, the method 700 includes optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
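Since the second latent space loss 2*P(Z = |z|) must be back-propagated at 708F, its derivative is useful in closed form. A minimal sketch, assuming a scalar latent summary h and the two-tailed tail probability as the loss (both illustrative assumptions, not the claimed implementation):

```python
import math

def latent_loss_and_grad(h, mu_nf, sigma_nf):
    """Second latent space loss 2*P(Z >= |z|) with z = (h - mu)/s, and its
    derivative with respect to the scalar latent summary h."""
    z = (h - mu_nf) / sigma_nf
    loss = math.erfc(abs(z) / math.sqrt(2.0))
    # d/dh erfc(|z|/sqrt(2)) = -sqrt(2/pi) * exp(-z^2/2) * sign(z) / s
    sign = 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)
    grad = -math.sqrt(2.0 / math.pi) * math.exp(-z * z / 2.0) * sign / sigma_nf
    return loss, grad
```

Gradient descent on this term pushes the fraud latent summary away from the non-fraud latent mean, which is the stated intent of the penalty.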
FIG. 8 illustrates a process flow diagram depicting a method for detecting noisy labels in transaction data, in accordance with an embodiment of the present disclosure. The method 800 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 800 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 800, and combinations of operations in the method 800 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 800. The process flow starts at operation 802.
At 802, the method 800 includes accessing, by the server system 200, a noisy dataset from the database, the noisy dataset including a set of candidate labeled data points. It is to be noted that candidate labeled data points are new data points for which detection of noisy labels is to be done. In other words, the process includes reconstructing the non-fraud candidate training samples by feeding them into the fraud autoencoder model and computing the reconstruction error.
At 804, the method 800 includes performing, by the server system 200, for each candidate labeled data point from the set of candidate labeled data points a set of operations. The set of operations may include sub-operations 804A-804C given below:
At 804A, the method 800 includes computing, via the second ML model, the Z-score probability. In other words, the two-tailed z-score probability of the reconstruction error is computed using the reconstruction error distribution parameters (µ(e(F)), s(e(F))) of the fraud autoencoder model.
At 804B, the method 800 includes classifying each candidate labeled data point as a noisy data point based, at least in part, on the Z-score probability being higher than a predefined threshold.
At 804C, the method 800 includes re-labeling each candidate labeled data point based, at least in part, on the classification step. It is noted that, in an example embodiment, the detected noisy instances are the non-fraud samples with the highest z-score probabilities. Additionally, the output can be filtered by treating the top N% of non-fraud samples as noisy samples.
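Sub-operations 804A-804C can be sketched end-to-end as follows (a hypothetical helper; the tuple layout, threshold value, and label strings are illustrative assumptions):

```python
import math

def detect_and_relabel(candidates, mu_e, sigma_e, threshold=0.9):
    """For each (id, reconstruction_error, label) candidate, compute the
    two-tailed z-score probability against the fraud autoencoder's
    reconstruction-error distribution (mu(e(F)), s(e(F))), then flag and
    re-label candidates whose probability exceeds the threshold."""
    relabeled = []
    for cid, err, label in candidates:
        z = (err - mu_e) / sigma_e
        p = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * P(Z >= |z|)
        if p > threshold and label == "non-fraud":
            # The sample reconstructs like a clean fraud sample, so its
            # non-fraud label is likely noisy: flip it
            label = "fraud"
        relabeled.append((cid, p, label))
    return relabeled
```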
The disclosed method with reference to FIG. 5-8, or one or more operations of the server system 200 may be implemented using software including computer-executable instructions stored on one or more computer-readable media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (e.g., DRAM or SRAM), or nonvolatile memory or storage components (e.g., hard drives or solid-state nonvolatile memory components, such as Flash memory components) and executed on a computer (e.g., any suitable computer, such as a laptop computer, netbook, Web book, tablet computing device, smartphone, or other mobile computing devices). Such software may be executed, for example, on a single local computer or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a remote web-based server, a client-server network (such as a cloud computing network), or other such networks) using one or more network computers. Additionally, any of the intermediate or final data created and used during the implementation of the disclosed methods or systems may also be stored on one or more computer-readable media (e.g., non-transitory computer-readable media) and are considered to be within the scope of the disclosed technology. Furthermore, any of the software-based embodiments may be uploaded, downloaded, or remotely accessed through a suitable communication means. Such a suitable communication means includes, for example, the Internet, the World Wide Web (WWW), an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
Although the invention has been described with reference to specific exemplary embodiments, it is noted that various modifications and changes may be made to these embodiments without departing from the broad spirit and scope of the invention. For example, the various operations, blocks, etc., described herein may be enabled and operated using hardware circuitry (for example, Complementary Metal Oxide Semiconductor (CMOS) based logic circuitry), firmware, software, and/or any combination of hardware, firmware, and/or software (for example, embodied in a machine-readable medium). For example, the apparatuses and methods may be embodied using transistors, logic gates, and electrical circuits (for example, Application Specific Integrated Circuit (ASIC) circuitry and/or Digital Signal Processor (DSP) circuitry).
Particularly, the server system 200 and its various components may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry). Various embodiments of the invention may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause the processor or the computer to perform one or more operations. A computer-readable medium storing, embodying, or encoded with a computer program, or similar language, may be embodied as a tangible data storage device storing one or more software programs that are configured to cause the processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein. In some embodiments, the computer programs may be stored and provided to a computer using any type of non-transitory computer-readable media. Non-transitory computer-readable media includes any type of tangible storage media. Examples of non-transitory computer-readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g., magneto-optical disks), Compact Disc Read-Only Memory (CD-ROM), Compact Disc Recordable (CD-R), Compact Disc Rewritable (CD-R/W), Digital Versatile Disc (DVD), BLU-RAY® Disc (BD), and semiconductor memories (such as mask ROM, programmable ROM (PROM), erasable PROM (EPROM), flash memory, Random Access Memory (RAM), etc.). Additionally, a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. In some embodiments, the computer programs may be provided to a computer using any type of transitory computer-readable media.
Examples of transitory computer-readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer-readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Various embodiments of the invention, as discussed above, may be practiced with steps and/or operations in a different order, and/or with hardware elements in configurations, which are different than those which, are disclosed. Therefore, although the invention has been described based on these exemplary embodiments, it is noted that certain modifications, variations, and alternative constructions may be apparent and well within the spirit and scope of the invention.
Although various exemplary embodiments of the invention are described herein in a language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
Claims:
1. A computer-implemented method, comprising:
accessing, by a server system, a training dataset and a first latent space from a database associated with the server system, the training dataset comprising a set of labeled data points, wherein the set of labeled data points comprises a subset of first labeled data points and a subset of second labeled data points;
generating, by the server system, a plurality of second features based, at least in part, on the subset of second labeled data points;
generating, by the server system, a second latent space based, at least in part, on the plurality of second features; and
training, by the server system, a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion, the second set of operations comprising:
initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model comprising a plurality of second encoder layers and a plurality of second decoder layers;
determining a second reconstruction loss based, at least in part, on the plurality of second features;
computing a Z-score probability based, at least in part, on the first latent space and the second latent space, the Z-score probability indicating a probability of the second latent space being in the first latent space;
generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability;
computing a second latent space loss based, at least in part, on the second loss function; and
optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
2. The computer-implemented method as claimed in claim 1, further comprising:
generating, by the server system, a plurality of first features based, at least in part, on the subset of first labeled data points;
generating, by the server system, a first latent space based, at least in part, on the plurality of first features; and
training, by the server system, a first ML model based, at least in part, on performing a first set of operations iteratively until the performance of the first ML model converges to a first predefined criterion, the first set of operations comprising:
initializing the first ML model based, at least in part, on one or more first model parameters, the first ML model comprising a plurality of first encoder layers and a plurality of first decoder layers;
determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of first features; and
optimizing the one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
3. The computer-implemented method as claimed in claim 1, wherein computing the Z-score probability further comprises:
computing, by the server system, a mean of the first latent space and a standard deviation of first latent space; and
computing, by the server system via the second ML model, the Z-score probability based, at least in part, on the second reconstruction loss, the mean of first latent space and the standard deviation of first latent space.
4. The computer-implemented method as claimed in claim 1, wherein the Z-score is computed as:
Z = (h(F)-µ(h(nF)))/(s(h(nF)))
wherein, h(F) is the second latent space of the plurality of second features, µ(h(nF)) is the mean of first latent space distribution, s(h(nF)) is a standard deviation of the first latent space distribution.
5. The computer-implemented method as claimed in claim 1, wherein the second loss function is computed as:
min{||X-D(h(F))||^2+2*P(Z=|(h(F)-µ(h(nF)))/(s(h(nF)))|)}
wherein, X is the N-dimensional features of the subset of second labeled data points, D(h(F)) is the N-dimensional reconstructed plurality of second features, h(F) is the second latent space of the plurality of second features, µ(h(nF)) is the mean of the first latent space distribution, s(h(nF)) is a standard deviation of the first latent space distribution, and 2*P(Z=|(h(F)-µ(h(nF)))/(s(h(nF)))|) is the second latent space loss.
6. The computer-implemented method as claimed in claim 1, wherein determining the second reconstruction loss further comprises:
generating, by the server system, a set of masked second features based, at least in part, on a subset of second features from the plurality of second features;
computing, via the second ML model, a set of reconstructed second features based, at least in part, on the set of masked second features; and
determining, by the server system, the second reconstruction loss based, at least in part, on comparing the plurality of second features and the set of reconstructed second features.
7. The computer-implemented method as claimed in claim 1, further comprising:
accessing, by the server system, a noisy dataset from the database, the noisy dataset comprising a set of candidate labeled data points; and
performing, by the server system, for each candidate labeled data point from the set of candidate labeled data points:
computing via the second ML model, the Z-score probability;
classifying each candidate labeled data point as a noisy data point based, at least in part, on the Z-score probability being higher than a predefined threshold; and
re-labeling each candidate labeled data point based, at least in part, on the classification step.
8. The computer-implemented method as claimed in claim 1, wherein the first latent space and the second latent space are dimensionally identical.
9. The computer-implemented method as claimed in claim 1, wherein the first ML model is a Variational Autoencoder and the second ML model is an Autoencoder.
10. A computer-implemented method, comprising:
accessing, by a server system, a transaction dataset and a non-fraud latent space from a database associated with the server system, the transaction dataset comprising a set of labeled data points, wherein the set of labeled data points comprises a subset of non-fraud labeled data points and a subset of fraud labeled data points;
generating, by the server system, a plurality of fraud features based, at least in part, on the subset of fraud labeled data points;
generating, by the server system, a fraud latent space based, at least in part, on the plurality of fraud features; and
training, by the server system, a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion, the second set of operations comprising:
initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model comprising a plurality of second encoder layers and a plurality of second decoder layers;
determining a second reconstruction loss based, at least in part, on the plurality of fraud features;
computing a Z-score probability based, at least in part, on the non-fraud latent space and the fraud latent space, the Z-score probability indicating a probability of the fraud latent space being in the non-fraud latent space;
generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability;
computing a second latent space loss based, at least in part, on the second loss function; and
optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
11. The computer-implemented method as claimed in claim 10, further comprising:
generating, by the server system, a plurality of non-fraud features based, at least in part, on the subset of non-fraud labeled data points;
generating, by the server system, a non-fraud latent space based, at least in part, on the plurality of non-fraud features; and
training, by the server system, a first ML model based, at least in part, on performing a first set of operations iteratively until the performance of the first ML model converges to a first predefined criterion, the first set of operations comprising:
initializing the first ML model based, at least in part, on one or more first model parameters, the first ML model comprising a plurality of first encoder layers and a plurality of first decoder layers;
determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of non-fraud features; and
optimizing the one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
12. A server system, comprising:
a memory configured to store instructions;
a communication interface; and
a processor in communication with the memory and the communication interface, the processor configured to execute the instructions stored in the memory and thereby cause the server system to perform, at least in part, to:
access a training dataset and a first latent space from a database associated with the server system, the training dataset comprising a set of labeled data points, wherein the set of labeled data points comprises a subset of first labeled data points and a subset of second labeled data points;
generate a plurality of second features based, at least in part, on the subset of second labeled data points;
generate a second latent space based, at least in part, on the plurality of second features; and
train a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion, the second set of operations comprising:
initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model comprising a plurality of second encoder layers and a plurality of second decoder layers;
determining a second reconstruction loss based, at least in part, on the plurality of second features;
computing a Z-score probability based, at least in part, on the first latent space and the second latent space, the Z-score probability indicating a probability of the second latent space being in the first latent space;
generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability;
computing a second latent space loss based, at least in part, on the second loss function; and
optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
13. The server system as claimed in claim 12, wherein the server system is further caused, at least in part, to:
generate a plurality of first features based, at least in part, on the subset of first labeled data points;
generate a first latent space based, at least in part, on the plurality of first features; and
train a first ML model based, at least in part, on performing a first set of operations iteratively until the performance of the first ML model converges to a first predefined criterion, the first set of operations comprising:
initializing the first ML model based, at least in part, on one or more first model parameters, the first ML model comprising a plurality of first encoder layers and a plurality of first decoder layers;
determining a first reconstruction loss based, at least in part, on a first loss function and the plurality of first features; and
optimizing the one or more first model parameters based, at least in part, on back-propagating the first reconstruction loss.
14. The server system as claimed in claim 12, wherein to compute the Z-score probability, the server system is further caused, at least in part, to:
compute a mean of the first latent space and a standard deviation of first latent space; and
compute via the second ML model, the Z-score probability based, at least in part, on the second reconstruction loss, the mean of first latent space and the standard deviation of first latent space.
15. The server system as claimed in claim 12, wherein the Z-score is computed as:
Z = (h(F)-µ(h(nF)))/(s(h(nF)))
wherein, h(F) is the second latent space of the plurality of second features, µ(h(nF)) is the mean of first latent space distribution, s(h(nF)) is a standard deviation of the first latent space distribution.
16. The server system as claimed in claim 12, wherein the second loss function is computed as:
min{||X-D(h(F))||^2+2*P(Z=|(h(F)-µ(h(nF)))/(s(h(nF)))|)}
wherein, X is the N-dimensional features of the subset of second labeled data points, D(h(F)) is the N-dimensional reconstructed plurality of second features, h(F) is the second latent space of the plurality of second features, µ(h(nF)) is the mean of the first latent space distribution, s(h(nF)) is a standard deviation of the first latent space distribution, and 2*P(Z=|(h(F)-µ(h(nF)))/(s(h(nF)))|) is the second latent space loss.
17. The server system as claimed in claim 12, wherein to determine the second reconstruction loss, the server system is further caused, at least in part, to:
generate a set of masked second features based, at least in part, on a subset of second features from the plurality of second features;
compute, via the second ML model, a set of reconstructed second features based, at least in part, on the set of masked second features; and
determine the second reconstruction loss based, at least in part, on comparing the plurality of second features and the set of reconstructed second features.
18. The server system as claimed in claim 12, wherein the first latent space and the second latent space are dimensionally identical.
19. The server system as claimed in claim 12, wherein the server system is further caused, at least in part, to:
access a noisy dataset from the database, the noisy dataset comprising a set of candidate labeled data points; and
perform for each candidate labeled data point from the set of candidate labeled data points:
computing via the second ML model, the Z-score probability;
classifying each candidate labeled data point as a noisy data point based, at least in part, on the Z-score probability being higher than a predefined threshold; and
re-labeling each candidate labeled data point based, at least in part, on the classification step.
20. A non-transitory computer-readable storage medium comprising computer-executable instructions that, when executed by at least a processor of a server system, cause the server system to perform a method comprising:
accessing a training dataset and a first latent space from a database associated with the server system, the training dataset comprising a set of labeled data points, wherein the set of labeled data points comprises a subset of first labeled data points and a subset of second labeled data points;
generating a plurality of second features based, at least in part, on the subset of second labeled data points;
generating a second latent space based, at least in part, on the plurality of second features; and
training a second ML model based, at least in part, on performing a second set of operations iteratively until the performance of the second ML model converges to a second predefined criterion, the second set of operations comprising:
initializing the second ML model based, at least in part, on one or more second model parameters, the second ML model comprising a plurality of second encoder layers and a plurality of second decoder layers;
determining a second reconstruction loss based, at least in part, on the plurality of second features;
computing a Z-score probability based, at least in part, on the first latent space and the second latent space, the Z-score probability indicating a probability of the second latent space being in the first latent space;
generating a second loss function based, at least in part, on the second reconstruction loss and the Z-score probability;
computing a second latent space loss based, at least in part, on the second loss function; and
optimizing the one or more second model parameters based, at least in part, on back-propagating the second latent space loss.
| # | Name | Date |
|---|---|---|
| 1 | 202341076902-STATEMENT OF UNDERTAKING (FORM 3) [10-11-2023(online)].pdf | 2023-11-10 |
| 2 | 202341076902-POWER OF AUTHORITY [10-11-2023(online)].pdf | 2023-11-10 |
| 3 | 202341076902-FORM 1 [10-11-2023(online)].pdf | 2023-11-10 |
| 4 | 202341076902-FIGURE OF ABSTRACT [10-11-2023(online)].pdf | 2023-11-10 |
| 5 | 202341076902-DRAWINGS [10-11-2023(online)].pdf | 2023-11-10 |
| 6 | 202341076902-DECLARATION OF INVENTORSHIP (FORM 5) [10-11-2023(online)].pdf | 2023-11-10 |
| 7 | 202341076902-COMPLETE SPECIFICATION [10-11-2023(online)].pdf | 2023-11-10 |
| 8 | 202341076902-Proof of Right [12-03-2024(online)].pdf | 2024-03-12 |