
Methods And Systems For Detecting Out Of Distribution Data Points In An Imbalanced Dataset

Abstract: Embodiments provide methods and systems for detecting Out-of-Distribution (OOD) data points in an imbalanced dataset. The method includes accessing the imbalanced dataset and performing for each data point: generating task-specific feature(s) from the imbalanced dataset, and generating category-related latent representation(s) from the task-specific feature(s). The method includes generating category distribution cluster(s) in a vector space based on the category-related latent representation(s) and categories in training dataset. The method includes computing, via a machine learning (ML) model, a relative confidence score for each data point. The method includes classifying the data points to be included in the category distribution cluster(s) based on the relative confidence score of each data point being at least equal to a predefined threshold score. Alternatively, the method includes classifying the data points to be included in the OOD category based on the relative confidence score of each data point being lower than the predefined threshold score.


Patent Information

Application #:
Filing Date: 12 September 2023
Publication Number: 11/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

MASTERCARD INTERNATIONAL INCORPORATED
2000 Purchase Street, Purchase, NY 10577, United States of America

Inventors

1. Priyanka Chudasama
A-1/375, dal mill road, Uttam Nagar, New Delhi 110059, Delhi, India
2. Deepak Chaurasiya
289 D, Humayunpur South Aryanagar, Gorakhpur 273001, Uttar Pradesh, India
3. Aakarsh Malhotra
BB 10C, DDA Flats Munirka, New Delhi 110067, Delhi, India
4. Alok Singh
T2-505, Valley View Estate Apartment Gwal Pahadi, Gurgaon 122011, Haryana, India

Specification

Description: The present disclosure relates to artificial intelligence-based processing systems and, more particularly, to electronic methods and complex processing systems for detecting Out-of-Distribution data points in an imbalanced dataset.
BACKGROUND
In modern times, as the world has become more data-driven, one of the most popular and commonly performed tasks is classification. The classification task generally includes classifying data into several groups and has a wide range of applications in the industry such as speech recognition, image classification, fraud detection, medical diagnostic testing, email spam detection, etc. Different types of classification tasks include binary classification, multi-class classification, multi-label classification, and imbalanced classification. As may be understood, a machine learning (ML)-based classification model has to be trained before it may perform a classification task on a dataset, and the dataset has to be processed or refined to remove data points with vague labels. Imbalanced classification is a classification task in which the distribution of labels across the data points of a dataset is skewed. Therefore, before performing classification, the imbalanced label distribution within the data points of the imbalanced dataset has to be processed or refined to create a balanced label distribution within the data points. To that end, several conventional approaches may perform this refinement process.
With the rapid advancement in the field of Artificial Intelligence (AI) and Machine Learning (ML), it has become crucial for software developers to improve the performance of their AI or ML models to ensure that any predictions made by such models are accurate. As may be understood, the performance of any AI or ML model directly correlates to the training and testing process followed while designing these models. It is noted that before any AI or ML model is trained or tested using a dataset, this dataset has to be refined or processed to ensure that the AI or ML model is able to clearly learn from the various data points in this dataset. To that end, data pre-processing is considered to be a crucial stage in the development of AI or ML models since this stage acts as the foundation of the model training and testing process.
As may be understood, for supervised learning-based models, the dataset may be processed by removing any unlabeled data points to generate a labeled dataset where each data point is labeled. However, it is noted that labels for a data point may be derived from a variety of sources, which in turn creates ambiguity with regard to the correct label for that data point. For instance, in the financial domain in a chargeback scenario, a transaction performed by a cardholder may be labeled as third-party fraud by an issuing bank while the same transaction may be labeled as first-party fraud by an acquiring bank. Thus, the same data point (i.e., the transaction) is labeled with different labels, which leads to ambiguity in the correct label for this data point. This problem is exacerbated when many data sources are used for labeling the data points in a dataset. This ambiguity or vagueness in the labels corresponding to a plurality of data points in a dataset, when used for training any AI or ML model, will lead to poor performance for that model. To that end, various techniques have been developed to refine or process a dataset to remove such vague data points for improving the performance of AI or ML models.
One such technique includes performing Out-of-Distribution (OOD) detection for the plurality of data points in the dataset to detect data points that may not belong to a predefined class. An example of this approach is known as the Deep Multi-class Data Description (MCDD) technique, which optimizes Deep Neural Networks (DNNs) so that the latent representations of data points in the same class (i.e., In-Distribution (ID) data samples) gather together, forming an independent sphere of minimum volume such that the OOD data points fall outside the sphere. However, this approach is only applicable to datasets where the label distribution for the classes is uniform. Due to this limitation, this approach fails to provide appropriate results in the case of imbalanced label distribution in the input dataset. In other words, this approach is unable to detect OOD data points when the input dataset is an imbalanced dataset. In addition, it is noted that the probabilities of a data point belonging to multiple classes do not sum to unity, due to which cross-entropy does not act as a push loss in this approach, which is undesirable. Further, the confidence parameter used by this approach is unbounded, which makes it difficult to detect an OOD data point.
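To make the MCDD limitation concrete, the sketch below is an illustrative reconstruction (not code from the disclosure): each class is treated as a hypersphere with a center and radius in latent space, and a point falling outside every sphere is flagged as OOD. The centers and radii are hypothetical placeholders.

```python
import numpy as np

def is_ood_mcdd(latent, centers, radii):
    """Return True if `latent` lies outside every class hypersphere."""
    for center, radius in zip(centers, radii):
        if np.linalg.norm(latent - center) <= radius:
            return False  # inside at least one class sphere -> in-distribution
    return True

# Hypothetical class spheres in a 2-D latent space.
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
radii = [1.0, 1.0]
print(is_ood_mcdd(np.array([0.2, 0.1]), centers, radii))  # False (in-distribution)
print(is_ood_mcdd(np.array([3.0, 2.0]), centers, radii))  # True (out-of-distribution)
```

Note that nothing in this membership test accounts for skewed class sizes, which is why such an approach degrades on imbalanced datasets.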
Thus, there exists a technological need for more efficient methods and systems that can address the above-mentioned technical problems and provide for refining an imbalanced dataset for a specific task and detecting Out-of-Distribution data points in the imbalanced dataset.
SUMMARY
Various embodiments of the present disclosure provide methods and systems for refining an imbalanced dataset and detecting Out-of-Distribution data points in the imbalanced dataset.
In an embodiment, a computer-implemented method for refining an imbalanced dataset for a specific task and detecting Out-of-Distribution (OOD) data points in the imbalanced dataset is disclosed. The computer-implemented method performed by a server system includes accessing the imbalanced dataset from a database associated with the server system. Herein, the imbalanced dataset includes a plurality of data points related to a plurality of entities. Each data point of the plurality of data points is labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced. The method further includes performing for each data point of the plurality of data points: generating one or more task-specific features from the imbalanced dataset, and generating a category-related latent representation from the one or more task-specific features. Further, the method includes generating one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space. Furthermore, the method includes computing, via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters. 
The method includes classifying the plurality of data points to be included in an OOD category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.
In another embodiment, a server system is disclosed. The server system includes a communication interface and a memory including executable instructions. The server system also includes a processor communicably coupled to the memory. The processor is configured to execute the instructions to cause the server system, at least in part, to access an imbalanced dataset from a database associated with the server system. Herein, the imbalanced dataset includes a plurality of data points related to a plurality of entities. Each data point of the plurality of data points is labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced. The server system is further caused to perform for each data point of the plurality of data points to: generate one or more task-specific features from the imbalanced dataset, and generate a category-related latent representation from the one or more task-specific features. Further, the server system is caused to generate one or more category distribution clusters in a vector space based, at least in part, on the plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space. Furthermore, the server system is caused to compute via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. 
Herein, the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters. The server system is caused to classify the plurality of data points to be included in an Out-of-Distribution (OOD) category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.
In yet another embodiment, a computer-implemented method for refining an imbalanced dataset for fraud detection in payment transactions is disclosed. The computer-implemented method performed by a server system includes accessing the imbalanced dataset from a database associated with the server system. Herein, the imbalanced dataset includes a plurality of payment transactions related to a network of a plurality of cardholders and a plurality of merchants. Each payment transaction of the plurality of payment transactions is labeled with at least one category label from a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced. The method further includes performing for each payment transaction of the plurality of payment transactions: generating one or more task-specific features from the imbalanced dataset, and generating a category-related latent representation from the one or more task-specific features. Further, the method includes generating one or more category distribution clusters in a vector space based, at least in part, on the plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space. Furthermore, the method includes computing, via a machine learning (ML) model, a relative confidence score corresponding to each payment transaction of the plurality of payment transactions based, at least in part, on a relative distance of the category-related latent representation corresponding to each payment transaction from center positions of the one or more category distribution clusters. 
Herein, the relative confidence score for an individual payment transaction indicates if the individual payment transaction from the plurality of payment transactions belongs to a particular category distribution cluster from one of the one or more category distribution clusters. The method includes classifying the plurality of payment transactions to be included in an Out-of-Distribution (OOD) category based, at least in part, on the relative confidence score of each payment transaction of the plurality of payment transactions being lower than a predefined threshold score.
In yet another embodiment, a non-transitory computer-readable storage medium is disclosed. The non-transitory computer-readable storage medium includes computer-executable instructions that, when executed by at least a processor of a server system, cause a server system to perform a method. The method includes accessing an imbalanced dataset from a database associated with the server system. Herein, the imbalanced dataset includes a plurality of data points related to a plurality of entities. Each data point of the plurality of data points is labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced. The method further includes performing for each data point of the plurality of data points: generating one or more task-specific features from the imbalanced dataset, and generating a category-related latent representation from the one or more task-specific features. Further, the method includes generating one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations, and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space. Furthermore, the method includes computing, via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters. 
The method further includes classifying the plurality of data points to be included in one of the one or more categories based, at least in part, on the relative confidence score of each data point of the plurality of data points being at least equal to a predefined threshold score. The method also includes classifying the plurality of data points to be included in an Out-of-Distribution (OOD) category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.
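Across all of the embodiments above, the final classification step reduces to the same two-way rule: a data point joins a category when its relative confidence score is at least equal to the predefined threshold, and falls into the OOD category otherwise. A minimal sketch, assuming already-computed confidence scores and nearest-cluster assignments (all values hypothetical):

```python
def assign(scores, best_clusters, threshold=0.5):
    """Cluster index when the score meets the threshold, else 'OOD'."""
    return [c if s >= threshold else "OOD"
            for s, c in zip(scores, best_clusters)]

# Hypothetical relative confidence scores and nearest clusters.
print(assign(scores=[0.91, 0.52, 0.07], best_clusters=[0, 1, 1]))
# -> [0, 1, 'OOD']
```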

BRIEF DESCRIPTION OF THE FIGURES
For a more complete understanding of example embodiments of the present technology, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1 illustrates an exemplary representation of an environment related to at least some example embodiments of the present disclosure;
FIG. 2 illustrates a simplified block diagram of a server system, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram representation of a process flow of training of a machine learning (ML) model, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a graphical representation of intermediate stages implemented in an example vector space for computing a relative confidence score and classifying each data point of the plurality of data points of an example imbalanced dataset, in accordance with an embodiment of the present disclosure;
FIG. 5A illustrates a graphical representation of a particular implementation of classifying the plurality of data points based on the relative confidence score, in accordance with an example embodiment of the present disclosure;
FIG. 5B illustrates a graphical representation of another implementation of classifying the plurality of data points based on the relative confidence score, in accordance with another example embodiment of the present disclosure;
FIG. 6 illustrates a block diagram representation of an environment related to at least some example embodiments of the present disclosure;
FIG. 7 illustrates a graphical representation of an example implementation of classifying one or more data points in the environment of FIG. 6 for comparative study with a conventional approach, in accordance with an embodiment of the present disclosure;
FIG. 8 illustrates a process flow diagram depicting a method for refining an imbalanced dataset for a specific task and detecting Out-of-Distribution (OOD) data points from the imbalanced dataset, in accordance with an embodiment of the present disclosure;
FIGS. 9A and 9B, collectively, illustrate a detailed process flow diagram depicting a method for refining an imbalanced dataset for a specific task and detecting OOD data points from the imbalanced dataset, in accordance with an embodiment of the present disclosure; and
FIG. 10 illustrates a process flow diagram depicting a method for refining an imbalanced dataset for fraud detection and detecting OOD payment transactions from the imbalanced dataset, in accordance with an embodiment of the present disclosure.
The drawings referred to in this description are not to be understood as being drawn to scale except if specifically noted, and such drawings are only exemplary in nature.

DETAILED DESCRIPTION
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in an embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present disclosure. Similarly, although many of the features of the present disclosure are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present disclosure is set forth without any loss of generality to, and without imposing limitations upon, the present disclosure.
Embodiments of the present disclosure may be embodied as an apparatus, a system, a method, or a computer program product. Accordingly, embodiments of the present disclosure may take the form of an entire hardware embodiment, an entire software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit”, “engine”, “module”, or “system”. Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable storage media having computer-readable program code embodied thereon.
The terms “imbalanced data” and “imbalanced dataset” may have been used interchangeably throughout the description and may refer to a classification dataset associated with a classification problem with skewed class proportions. It may be noted that classes that make up a large portion of a dataset are called majority classes and those that make up a smaller proportion are minority classes. For example, in the payment industry, the majority of implementations have a highly imbalanced label distribution, such as in the case of fraud detection, where the majority of payment transactions may be legitimate transactions and hence may be termed a majority class. The rest of the transactions may either belong to a Third-Party Fraud (TPF) class or a First-Party Fraud (FPF) class. Moreover, the FPF class may have only a handful of labeled data points, and hence its cluster is a small cluster, making the FPF class a minority class. Another example is rare disease diagnosis: in the case of cancer, rarely occurring types may include anal cancer, stomach cancer, and laryngeal cancer. Therefore, an anal cancer class, a stomach cancer class, and a laryngeal cancer class, each having clusters of patients that may have suffered from such cancers, may be considered to be minority classes, while other classes such as bladder cancer, lung cancer, kidney cancer, etc., that are observed in many patients may belong to a majority class.
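As a toy illustration of such skewed class proportions in the fraud-detection example (the counts below are invented for illustration, not taken from the disclosure):

```python
from collections import Counter

# Invented label counts: 'legit' is the majority class; TPF and
# especially FPF are minority classes.
labels = ["legit"] * 9500 + ["TPF"] * 450 + ["FPF"] * 50
counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} ({100 * n / total:.1f}%)")
# -> legit: 9500 (95.0%)
# -> TPF: 450 (4.5%)
# -> FPF: 50 (0.5%)
```

A model trained naively on such a distribution sees roughly 190 legitimate transactions for every FPF example, which is what the refinement described herein must contend with.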
The term “data classification” may have been used throughout the description and may refer to a supervised and/or unsupervised learning concept in machine learning (ML) that groups data into classes. This concept is mostly used to predict a class label to be assigned to the given input data. Examples of a classification problem may include document classification, sentiment analysis, spam filtering, customer behavior classification, image classification, credit card fraud detection, etc. In contrast, the term “imbalanced data classification” may refer to a supervised learning concept in ML that groups imbalanced data into classes. Examples of imbalanced data classification may include rare disease diagnosis, fraud detection, outlier detection, etc.
The terms “Out-of-Distribution”, “Out-of-Distribution data points”, “Out-of-Distribution dataset”, “outlier distribution”, “outlier distribution data points”, and “outlier distribution dataset” may have been used interchangeably throughout the description and may refer to a class of data that is not well represented in the training dataset while training an ML model but that can nevertheless appear in the real-time dataset on which the ML model may be applied.
OVERVIEW
Various embodiments of the present disclosure provide methods, systems, electronic devices, and computer program products for refining an imbalanced dataset for a specific task and detecting Out-of-Distribution (OOD) data points in the imbalanced dataset. In an embodiment, the present disclosure describes a server system for refining the imbalanced dataset for a specific task and detecting the OOD data points in the imbalanced dataset. The server system includes a processor and a memory. In an embodiment, the server system is configured to access the imbalanced dataset from a database associated with the server system. Herein, the imbalanced dataset includes a plurality of data points related to a plurality of entities. Each data point of the plurality of data points is labeled with at least one category label, and with a label distribution being imbalanced.
In an embodiment, the server system is further configured to perform for each data point of the plurality of data points to: generate one or more task-specific features from the imbalanced dataset, and generate a category-related latent representation from the one or more task-specific features.
Further, the server system is configured to generate one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations, and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space.
Furthermore, the server system is configured to compute via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters.
It may be noted that for the server system to be able to use the ML model, the server system may have to generate or train the ML model for accurately classifying the plurality of data points either under the one or more category distribution clusters or an OOD category based on the relative confidence score computed for the corresponding plurality of data points. Therefore, in an embodiment, the server system may be configured to generate the ML model based, at least in part, on performing a set of operations iteratively until the performance of the ML model converges to predefined criteria. Herein, the set of operations may include initializing the ML model based, at least in part, on one or more model parameters. Further, the server system may be configured to compute, via the ML model, one or more loss function values for each data point of the plurality of data points for reducing intra-class cluster distance and increasing inter-class cluster distance, based, at least in part, on using one or more loss functions and the relative confidence score. The server system may be further configured to compute a net loss function value for each data point of the plurality of data points based, at least in part, on aggregating the one or more loss function values. Lastly, the server system may be configured to optimize the one or more model parameters associated with the ML model based, at least in part, on back-propagating the net loss function value.
In a non-limiting example, the predefined criteria may be a saturation of the net loss function value. Herein, the net loss function value may get saturated after a plurality of iterations of the set of operations is performed. Further, in an embodiment, the one or more loss functions may include at least one pull loss to reduce the intra-class cluster distance, and at least two push losses to increase the inter-class cluster distance and to configure an imbalanced setting. In a non-limiting example, the ML model may be a Gaussian discriminator-based model.
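The iterative training operations above can be sketched as follows. The disclosure does not give closed-form expressions for the pull and push losses, so the squared-distance pull and hinge-margin pushes below are assumptions, and for brevity the update step optimizes only the cluster centers rather than full DNN parameters:

```python
import numpy as np

def losses(z, y, centers, margin=4.0):
    """Assumed loss forms: one pull loss and two push losses."""
    pull = np.sum((z - centers[y]) ** 2)                 # shrink own cluster
    others = [c for k, c in enumerate(centers) if k != y]
    # push 1: keep other-class centers at least `margin` away from z
    push_inter = sum(max(0.0, margin - np.linalg.norm(z - c)) for c in others)
    # push 2: keep cluster centers mutually separated (imbalanced-setting push)
    push_centers = sum(max(0.0, margin - np.linalg.norm(centers[y] - c))
                       for c in others)
    return pull, push_inter, push_centers

def train_step(z, y, centers, lr=0.1):
    pull, p1, p2 = losses(z, y, centers)
    net = pull + p1 + p2                                  # aggregate net loss
    # gradient step on the pull term w.r.t. the assigned center:
    # d/dc ||z - c||^2 = -2 (z - c), so move the center toward z
    centers[y] = centers[y] + lr * 2.0 * (z - centers[y])
    return net

centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
z, y = np.array([1.0, 1.0]), 0
before = train_step(z, y, centers)
after = train_step(z, y, centers)
print(before > after)  # net loss decreases as the center moves toward z
```

In the full method, these steps would repeat over all data points until the net loss saturates, matching the predefined convergence criteria mentioned above.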
Upon generating the ML model, the server system may use the ML model for computing the relative confidence score. Further, in an embodiment, for computing the relative confidence score, the server system may be configured to determine, via the ML model, the center position corresponding to each category distribution cluster of the one or more category distribution clusters based, at least in part, on one or more decision boundaries surrounding each category distribution cluster of the one or more category distribution clusters. Herein, in an embodiment, the one or more decision boundaries may include one or more spherical decision boundaries.
The server system may be further configured to determine, via the ML model, a distance of the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters. Further, the server system may be configured to compute a radius of each category distribution cluster of the one or more category distribution clusters based, at least in part, on the corresponding center positions and the corresponding one or more decision boundaries of the corresponding category distribution clusters. Furthermore, the server system may be configured to compute the relative distance for the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters based, at least in part, on the radius and the distance. Finally, the server system may be configured to compute, via the ML model, the relative confidence score for each data point of the plurality of data points based, at least in part, on the relative distance and predefined confidence computation criteria.
Later, in an embodiment, the server system is configured to classify the plurality of data points to be included in one of the one or more categories based, at least in part, on the relative confidence score of each data point of the plurality of data points being at least equal to a predefined threshold score. Alternatively, the server system is configured to classify the plurality of data points to be included in the OOD category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than the predefined threshold score.
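The confidence computation and classification steps above can be mirrored in a short sketch. The radius estimate (largest member distance from the center, used here as a proxy for the spherical decision boundary) and the bounded criterion min(1, exp(1 − relative distance)) are both assumptions, since the disclosure leaves the confidence computation criteria predefined but unspecified:

```python
import numpy as np

def cluster_stats(latents_by_cluster):
    """Center and radius per cluster; radius from the farthest member."""
    centers = [np.mean(m, axis=0) for m in latents_by_cluster]
    radii = [max(np.linalg.norm(x - c) for x in m)
             for m, c in zip(latents_by_cluster, centers)]
    return centers, radii

def classify_point(z, centers, radii, threshold=0.5):
    distances = [np.linalg.norm(z - c) for c in centers]  # step: distance
    rel = [d / r for d, r in zip(distances, radii)]       # step: relative distance
    k = int(np.argmin(rel))
    confidence = min(1.0, np.exp(1.0 - rel[k]))           # bounded in [0, 1]
    return (k, confidence) if confidence >= threshold else ("OOD", confidence)

# Hypothetical in-distribution latents for two category clusters.
members = [np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]),
           np.array([[8.0, 8.0], [10.0, 8.0], [8.0, 10.0]])]
centers, radii = cluster_stats(members)
print(classify_point(np.array([1.0, 1.0]), centers, radii))   # near cluster 0
print(classify_point(np.array([5.0, -5.0]), centers, radii))  # far from both -> OOD
```

Because the score is scaled by each cluster's own radius, a small minority-class cluster and a large majority-class cluster are judged on comparable terms, which is the key difference from the unbounded confidence parameter criticized in the background.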
Various embodiments of the present disclosure offer multiple advantages and technical effects. For instance, the present disclosure provides an AI/ML-based approach for automatically predicting which labels are right and which cannot be used for AI/ML model training, i.e., detecting Out-of-Distribution (OOD) data points. Herein, it may be noted that the data samples with labels that cannot be used for model training may be vaguely labeled, and the approach proposed in the present disclosure detects such data samples. Further, it may be noted that data samples that are vaguely labeled may be removed from the training dataset to eliminate any kind of vagueness from the training dataset. Therefore, in the future, the models trained using such a dataset (a balanced and refined dataset that is free from vaguely labeled data points) will have higher performance. In other words, the present disclosure is capable of detecting OOD data points from an imbalanced dataset and hence provides more precise results in cases such as fraud detection, rare disease diagnosis, spam filtering, and the like, which have an imbalanced distribution of labels in the input dataset.
Further, in the case of fraud detection in the payment industry, the present disclosure solves the problem of wrongly classifying non-First-Party Fraud (FPF) data points within the FPF class due to the imbalanced distribution of labels in the input dataset by removing non-FPF data points from the FPF class, thereby refining the input dataset. This is achieved by bounding the confidence parameter upon setting a threshold for the distribution of each class cluster represented in a vector space. In addition, it may be noted that the better and cleaner labels obtained upon refining an imbalanced dataset using the present disclosure assist in improved learning for fraud classification, resulting in better performance for AI/ML models such as Third-Party Fraud (TPF) classification models, FPF models, Decision Intelligence (DI) score models, and the like during deployment. This in turn has a positive impact on merchants and acquirers in that they can take upfront actions for likely FPF transactions so that they are not held liable for any later disputes/chargebacks in the name of TPF. This also has a positive impact on issuers in that it helps the issuers to take right and timely decisions. Also, in the case of disputes, the liability could otherwise lie with the issuers, and the present disclosure thereby helps the issuers to avoid unnecessary disputes.
Various example embodiments of the present disclosure are described hereinafter with reference to FIGS. 1 to 11.
FIG. 1 illustrates a block diagram representation of an environment 100 related to at least some example embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, refining an imbalanced dataset for a specific task, detecting Out-of-Distribution (OOD) data points in the imbalanced dataset, and removing the OOD data points from the imbalanced dataset to generate a refined dataset.
The environment 100 generally includes a plurality of components such as a server system 102, a plurality of entities 104(1), 104(2), … 104(N), where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as a plurality of entities 104 or entities 104), a plurality of data sources 106(1), 106(2), … 106(N) where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as a plurality of data sources 106 or data sources 106), and a database 108 each coupled to, and in communication with (and/or with access to) a network 110. Herein, the network 110 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts or users illustrated in FIG. 1, or any combination thereof.
Various components in the environment 100 may connect to the network 110 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, New Radio (NR) communication protocols, any future communication protocol, or any combination thereof. In some instances, the network 110 may utilize a secure protocol (e.g., Hypertext Transfer Protocol Secure (HTTPS), Secure Sockets Layer (SSL), and/or any other protocol or set of protocols) for communicating with the various components depicted in FIG. 1.
In an embodiment, the entities 104 may include a person, an object, a place or location, an institution, an organization, a group of institutions/organizations, systems, set-ups, establishments, or the like that have a self-contained existence. For instance, in the financial domain, the entities 104 may refer to any or all cardholders, merchants, issuers, acquirers, banks, financial institutions, regulatory institutions, third-party organizations, and the like. Similarly, in the medical domain, the entities 104 may refer to any or all patients, medical practitioners, doctors, nurses, departments, pharmaceuticals, and the like. Therefore, it may be noted that entities 104 as mentioned in the present disclosure may vary for different applications or tasks, and the same would be covered within the scope of the present disclosure. In some instances, entities 104 may also be referred to as task-specific entities.
In another embodiment, the data sources 106 may correspond to various data sources that may be associated with the entities 104 and are responsible for collecting and collating data related to the entities 104. Herein, the data sources 106 may act as a data provider for the server system 102 and a data collector for the entities 104. In an embodiment, the data sources 106 may collect data from the entities 104 via the network 110. It may be noted that the data sources 106 may be independent institutions that are independent of the entities 104 or may be associated with the entities 104. Further, in an embodiment, the data sources 106 may be local storage units or cloud/remote storage units. In another embodiment, the data sources 106 may be owned by third-party organizations.
In a non-limiting example, each data source of the data sources 106 may collect and store data pertaining to a particular or specific task. In other words, each data source collects and stores task-specific data. For instance, the data source 106(1) may be a data source for collecting and storing payment transaction-specific data. To that end, the data source 106(1) may be an issuer server associated with an issuing bank, an acquirer server associated with an acquiring bank, a payment server associated with a payment network, or the like. In the present example, the payment transaction-specific data may include data related to a plurality of payment transactions performed between multiple cardholders and multiple merchants using multiple payment means. In another instance, the data source 106(2) may be a medical-specific data repository associated with a hospital server that collects and stores a plurality of patient-related details, medical authorities-related details, staff details, medical students’ details, and the like.
It may be noted that, in an embodiment, the data may be collected from the one or more entities 104(1)-104(N) via the data sources 106 for performing a specific task. Herein, as the data is collected from different data sources (e.g., the data sources 106), the collected data may have data points having multiple and different labels, thereby making the collected data vague. In a non-limiting example, the task may include speech recognition, image classification, fraud detection, performing medical diagnosis, email spam detection, or the like. As may be understood, the data includes a plurality of data points that may be used for training an AI or ML model for performing the task. Herein, each data point of the plurality of data points may be pre-labeled based on task-specific historic activities and hence belongs to a specific category or class. For instance, by using pre-classified images of dogs and cats, an AI/ML model may be trained to perform the task of automatically classifying newly fed photos of cats and dogs for use in computer vision applications.
Further, in the financial domain, if a financial dataset is provided as input to the corresponding model, and if the model is trained using a conventional method such as Deep Multi-class Data Description (Deep MCDD), then a transaction that remains unrecognized by the model gets classified as an OOD sample. However, in a chargeback scenario, if fraud detection is to be performed on an imbalanced dataset, i.e., a dataset whose distribution includes a large number of representations labeled as Legitimate transactions, a small number of representations labeled as First-party Frauds (FPFs), and a moderate number of representations labeled as Third-party Frauds (TPFs), then one or more representations of an FPF can wrongly get classified in either the Legitimate transaction class or the TPF class depending upon its distance from each class cluster. This is because the Deep MCDD approach does not take into consideration the imbalanced distribution of the labels in the representations during the training phase. Therefore, it may be understood that the imbalanced dataset includes vaguely labeled data points, and data points that belong to different classes may be distributed in a latent space by forming clusters, each having a different area of distribution.
Therefore, to address the above-mentioned technical problem and to detect OOD data points in an imbalanced dataset for a specific task with greater preciseness and accuracy, one or more embodiments of the server system 102 proposed in the present disclosure are configured to perform one or more operations described herein.
In one embodiment, the server system 102 is configured to facilitate entities 104 in refining the imbalanced dataset for a specific task, detecting OOD data points in the imbalanced dataset using various AI/ML models, and removing these OOD data points from the imbalanced dataset to generate a refined dataset. In some embodiments, the server system 102 may be deployed as a standalone server or may be implemented in the cloud as software as a service (SaaS). The server system 102 may be configured to provide or host a software application on one or more electronic devices used by some of the entities 104 for refining the imbalanced dataset.
It should be understood that the server system 102 is a separate part of the environment 100, and may operate apart from (but still in communication with, for example, via the network 110) any third-party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 102 may be incorporated, in whole or in part, into one or more parts of the environment 100.
As may be understood, for the server system 102 to be able to facilitate such a feature, the server system 102 needs to collect an imbalanced dataset from the entities 104. Therefore, in an embodiment, the data sources 106 may collect the imbalanced dataset from the entities 104, and the server system 102 collects the imbalanced dataset from the data sources 106 via the network 110. In another embodiment, the server system 102 may collect the imbalanced dataset from the data sources 106 directly via the network 110. Further, in an embodiment, the environment 100 may include a database 108 coupled with the server system 102. The database 108 may be incorporated in the server system 102, may be an individual entity connected to the server system 102, or may be a database stored in cloud storage.
In various non-limiting examples, the database 108 may include one or more hard disk drives (HDD), solid-state drives (SSD), an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a redundant array of independent disks (RAID) controller, a storage area network (SAN) adapter, a network adapter, and/or any component providing the server system 102 with access to the database 108. In one implementation, the database 108 may be viewed, accessed, amended, updated, and/or deleted by an administrator (not shown) associated with the server system 102 through a database management system (DBMS) or relational database management system (RDBMS) present within the database 108 (not shown).
In an embodiment, the database 108 stores an imbalanced dataset 112 as shown in FIG. 1. Further, the server system 102 is configured to access the imbalanced dataset 112 from the database 108. Herein, the imbalanced dataset 112 may include a plurality of data points related to the entities 104. Further, it may be noted that each data point of the data points may be labeled with at least one category label for a training dataset extracted from the imbalanced dataset 112, and with the label distribution being imbalanced. Herein, the at least one category label may be specific to a particular category under which the data points may have been classified. For example, in the case of an image classification task of cats and dogs, the at least one category label may include ‘cat’ and ‘dog’. In an embodiment, the database 108 may also store the various AI/ML models such as an ML model 114. An example of the ML model 114 may be a Gaussian discriminator-based model. In some scenarios, the ML model 114 may be generated by the server system 102. This process for generating the ML model 114 is described in the present disclosure with reference to FIG. 3.
In some embodiments, the database 108 may further store information associated with the entities 104. Herein, the information may include, but is not limited to, entity personal details, entity professional details, entity business details, entity activities-related details, task-related details, and the like.
Further, the server system 102 is configured to perform for each data point of the data points to: (1) generate one or more task-specific features from the imbalanced dataset 112, and (2) generate a category-related latent representation from the one or more task-specific features. It may be noted that the one or more task-specific features may be specific to a task that needs to be performed by the server system 102. For example, in the case of an image classification task, the one or more task-specific features may include edges, color, shape, a count of pixels, and the like.
Moreover, in another embodiment, the one or more task-specific features may be generated by the server system 102 by transforming and processing an input dataset, such as the imbalanced dataset 112, to create meaningful and informative features for improving the model's performance and capturing relevant patterns. For example, in the case of an image classification task, feature generation may involve extracting features from the raw pixel values, such as by using techniques like edge detection, color histograms, or convolutional neural networks (CNNs), to obtain more meaningful representations of the images.
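For instance, the color-histogram features mentioned above might be computed along the following lines. This is a minimal sketch only; the bin count, channel layout, and normalization choice are assumptions, not part of the claimed method.

```python
import numpy as np

def color_histogram_features(image, bins=8):
    """Generate task-specific features for an image classification task
    by summarizing raw pixel values as a per-channel color histogram.

    image : array of shape (H, W, 3) with values in [0, 255]
    Returns a normalized feature vector of length 3 * bins.
    """
    features = []
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel],
                               bins=bins, range=(0, 256))
        features.append(hist)
    vec = np.concatenate(features).astype(float)
    return vec / vec.sum()  # normalize so features are image-size invariant
```

The resulting fixed-length vector is the kind of meaningful, informative representation that could then feed the latent-representation stage described next.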
Upon obtaining the one or more task-specific features, a plurality of category-related latent representations for the plurality of data points may have to be generated for dimensionality reduction. It may be noted that the plurality of category-related latent representations may capture essential information and patterns of the imbalanced dataset 112 i.e., the one or more task-specific features, which can be used to perform the task. In some embodiments, the generation of the plurality of category-related latent representations may not be needed, however, in some other embodiments, it may be needed to further improve the model’s performance.
Furthermore, in another embodiment, the server system 102 is configured to generate one or more category distribution clusters in a vector space based, at least in part, on the plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space. It may be noted that when the plurality of category-related latent representations is generated for the plurality of data points, representations that have identical features get positioned close to each other in the vector space or a latent space. In other words, if two data points have similar representations, they will be positioned close to each other in the vector space and might be clustered in the same cluster. For example, two representations that belong to the same class or category get positioned close to each other, and two representations that belong to different classes get positioned distant from each other. Such positioning in the vector space leads to the formation of the one or more category distribution clusters in the vector space. Therefore, it may be noted that each category distribution cluster of the one or more category distribution clusters belongs to a different category or class and is distributed distinctly in the vector space.
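One way a center position could be obtained for each category's cluster from labeled latent representations is the class-wise mean, sketched below. This is an illustrative choice for clarity; the disclosure does not fix a particular centering rule, and the function name is an assumption.

```python
import numpy as np

def category_centers(latents, labels):
    """Group category-related latent representations by their training
    label and return one center position per category.

    latents : array of shape (n, d) of latent representations
    labels  : length-n sequence of category labels from the training dataset
    Returns a dict mapping each category to the mean of its cluster.
    """
    centers = {}
    for category in set(labels):
        # Select the representations belonging to this category's cluster
        mask = np.array([lbl == category for lbl in labels])
        centers[category] = latents[mask].mean(axis=0)
    return centers
```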
Moreover, the server system 102 is configured to compute, via the ML model 114, a relative confidence score corresponding to each data point of the data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual data point may indicate or represent whether the individual data point from the data points belongs to a particular category distribution cluster of the one or more category distribution clusters.
For example, a relative confidence score computed by the server system 102 for a test image in the image classification task indicates whether the test image belongs to a cat category or a dog category, using the ML model 114 that is trained to differentiate between cats and dogs and, in case a new image appears that is neither a cat nor a dog, to classify it as an OOD image. Herein, the relative confidence score is dependent on a relative distance of a category-related latent representation corresponding to the test image from a cat category distribution cluster and a dog category distribution cluster in the vector space. In other words, the relative confidence score is dependent on a relative distance of the category-related latent representation corresponding to the test image from the center positions of the one or more category distribution clusters in the vector space.
In one embodiment, the server system 102 is configured to classify the data points to be included in the one or more category distribution clusters based, at least in part, on the relative confidence score of each data point of the data points being at least equal to a predefined threshold score. Herein, it may be noted that each of the one or more category distribution clusters may belong to a different category, each category being task specific. In another embodiment, the server system 102 is configured to classify the data points to be included in an OOD category based, at least in part, on the relative confidence score of each data point of the data points being lower than the predefined threshold score.
For example, if the relative confidence score corresponding to the test image is greater than or equal to the predefined threshold score, then the test image is classified to be included either in the cat category or the dog category based on a relative distance of the category-related latent representation corresponding to the test image from the cat category distribution cluster and the dog category distribution cluster. Herein, if the category-related latent representation corresponding to the test image is close to the cat category distribution cluster in comparison to the dog category distribution cluster, then the test image is classified under the cat category, otherwise in the dog category. Alternatively, if the relative confidence score corresponding to the test image is less than the predefined threshold score, then the test image is classified to be included in the OOD category.
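The thresholding logic of the cat/dog example above can be sketched as follows. The threshold value, score pairing, and the "OOD" label string are illustrative assumptions introduced here, not claimed specifics.

```python
def classify_with_threshold(scores, threshold=0.5):
    """Assign each data point to its nearest category when its relative
    confidence score is at least equal to the threshold, and to the OOD
    category otherwise.

    scores : list of (nearest_category, relative_confidence) pairs,
             where nearest_category is the cluster whose center is
             closest to the data point's latent representation
    Returns the predicted category label for each data point.
    """
    predictions = []
    for category, confidence in scores:
        if confidence >= threshold:   # at least equal: in-distribution
            predictions.append(category)
        else:                         # below threshold: out-of-distribution
            predictions.append("OOD")
    return predictions
```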
It may be noted that the process of the classification of the data points based on the relative confidence score and the process of deciding the predefined threshold score is explained in the present disclosure with reference to FIG. 4.
The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. In addition, the server system 102 should be understood to be embodied in at least one computing device in communication with the network 110, which may be specifically configured, via executable instructions, to perform steps as described herein, and/or embodied in at least one non-transitory computer-readable media.
FIG. 2 illustrates a simplified block diagram of a server system 200, in accordance with an embodiment of the present disclosure. The server system 200 is identical to the server system 102 of FIG. 1. In some embodiments, the server system 200 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture.
As depicted, the server system 200 includes a computer system 202 and a database 204. The computer system 202 includes at least one processor 206 for executing instructions, a memory 208, a communication interface 210, a user interface 212, and a storage interface 214. The one or more components of the computer system 202 communicate with each other via a bus 216. The components of the server system 200 provided herein may not be exhaustive and the server system 200 may include more or fewer components than those depicted in FIG. 2. Further, two or more components depicted in FIG. 2 may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities.
In some embodiments, the database 204 is integrated into the computer system 202. In one non-limiting example, the database 204 is configured to store an imbalanced dataset 218 and a Machine Learning (ML) model 220. It is noted that the imbalanced dataset 218 and the ML model 220 are identical to the imbalanced dataset 112 and the ML model 114 of FIG. 1.
Further, the computer system 202 may include one or more hard disk drives as the database 204. The user interface 212 is an interface such as a Human Machine Interface (HMI) or a software application that allows users such as an administrator to interact with and control the server system 200 or one or more parameters associated with the server system 200. It may be noted that the user interface 212 may be composed of several components that vary based on the complexity and purpose of the application. Examples of components of the user interface 212 may include visual elements, controls, navigation, feedback and alerts, user input and interaction, responsive design, user assistance and help, accessibility features, and the like. More specifically, these components may correspond to icons, layout, color schemes, buttons, sliders, dropdown menus, tabs, links, error/success messages, mouse and touch interactions, keyboard shortcuts, tooltips, screen readers, and the like.
The storage interface 214 is any component capable of providing the processor 206 access to the database 204. The storage interface 214 may include, for example, an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a RAID controller, a SAN adapter, a network adapter, and/or any component providing the processor 206 with access to the database 204.
The processor 206 includes suitable logic, circuitry, and/or interfaces to execute various operations such as (1) computing a relative confidence score for data points, (2) classifying the data points for refining the imbalanced dataset 218 for a specific task, (3) detecting Out-of-Distribution data points, (4) generating a balanced dataset, and the like. Examples of the processor 206 include, but are not limited to, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computing (RISC) processor, a graphical processing unit (GPU), a complex instruction set computing (CISC) processor, a field-programmable gate array (FPGA), and the like.
The memory 208 includes suitable logic, circuitry, and/or interfaces to store a set of computer-readable instructions for performing various operations. Examples of the memory 208 include a random-access memory (RAM), a read-only memory (ROM), a removable storage drive, a hard disk drive (HDD), and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 208 in the server system 200, as described herein. In another embodiment, the memory 208 may be realized in the form of a database server or a cloud storage working in conjunction with the server system 200, without departing from the scope of the present disclosure.
The processor 206 is operatively coupled to the communication interface 210, such that the processor 206 is capable of communicating with a remote device 222 such as the entities 104, the data sources 106, or communicating with any component connected to the network 110 (as shown in FIG. 1).
It is noted that the server system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is noted that the server system 200 may include fewer or more components than those depicted in FIG. 2. It should be noted that the server system 200 is identical to the server system 102 described in reference to FIG. 1.
In one implementation, the processor 206 includes a data pre-processing module 224, a training module 226, a score computation module 228, and a classification module 230. It should be noted that components, described herein, such as the data pre-processing module 224, the training module 226, the score computation module 228, and the classification module 230 can be configured in a variety of ways, including electronic circuitries, digital arithmetic, and logic blocks, and memory systems in combination with software, firmware, and embedded technologies.
In an embodiment, the data pre-processing module 224 includes suitable logic and/or interfaces for accessing the imbalanced dataset 218 from the database 204 associated with the server system 200. Herein, the imbalanced dataset 218 may include a plurality of data points related to a plurality of entities (e.g., the entities 104). Further, each data point of the plurality of data points is labeled with at least one category label for a training dataset extracted from the imbalanced dataset 218, and with a label distribution being imbalanced. Herein, the at least one category label may be task specific. For instance, for an image classification task for classifying cats and dogs images from an input set of images, the category labels may include ‘cat’ and ‘dog’.
In a non-limiting example, the imbalanced dataset 218 may include data having an imbalanced distribution of labels assigned to its data points. In an embodiment, the data present in the imbalanced dataset 218 may have data points that are labeled or unlabeled. Further, the labeled dataset may include data points that are labeled with a specific label, wherein the label indicates a category to which the corresponding data point belongs. In some embodiments, the labeled dataset may have data points with uniformly distributed labels. In some other embodiments, the labeled dataset may have data points with non-uniformly distributed labels, and hence such a dataset may be considered the imbalanced dataset 218.
For instance, for the image classification task, the imbalanced dataset 218 may include a collection of labeled images with an imbalanced label distribution for the classification of the images. In another instance, for fraud detection in payment transactions, the imbalanced dataset 218 may include a plurality of historical transactions with an imbalanced label distribution for fraud and non-fraud labels. In yet another instance, for medical diagnostic testing, the imbalanced dataset 218 may include details related to multiple patients with imbalanced label distribution for labels ‘yes’ or ‘no’. Therefore, it may be understood that the imbalanced dataset 218 may vary based on the task that needs to be performed. Herein, the embodiments of the present invention may be employed in a variety of data processing or refining tasks.
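A dataset's label distribution can be inspected for such imbalance along the following lines. The imbalance-ratio cutoff is an assumption chosen purely for illustration; the disclosure does not prescribe a specific imbalance test.

```python
from collections import Counter

def label_distribution(labels, imbalance_ratio=5.0):
    """Summarize a dataset's label distribution and flag it as imbalanced
    when the majority class outnumbers the minority class by more than
    the given ratio (an illustrative criterion)."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, (majority / minority) > imbalance_ratio
```

For example, a transaction dataset with many Legitimate labels, a moderate number of TPF labels, and very few FPF labels would be flagged as imbalanced under this criterion.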
The data pre-processing module 224 may further be configured to perform for each data point of the data points: (1) generating the one or more task-specific features from the imbalanced dataset 218, and (2) generating a category-related latent representation from the one or more task-specific features.
It may be understood that the one or more task-specific features thus generated may correspond to insights, useful information, and relevant patterns that are obtained upon preprocessing the imbalanced dataset 218 for improving the model’s performance. In a non-limiting example, preprocessing the imbalanced dataset 218 or each data point of the imbalanced dataset 218 may include performing several operations on the imbalanced dataset 218 to make the imbalanced dataset 218 suitable for training. For instance, the operations may include removing noise, handling missing values, normalizing or scaling data, analyzing characteristics of the data, and converting the imbalanced dataset 218 into a format that the ML model 220 can process.
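The preprocessing operations listed above (handling missing values and normalizing or scaling the data) might be sketched as follows for a numeric dataset. The mean-imputation and standardization choices are assumptions for illustration, not the disclosed method.

```python
import numpy as np

def preprocess(data):
    """Prepare a raw numeric dataset for training: handle missing
    values, then normalize each feature to zero mean and unit scale.

    data : array of shape (n, d) that may contain NaNs
    """
    data = data.astype(float).copy()
    # Handle missing values: replace NaNs with the per-column mean
    col_means = np.nanmean(data, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(data))
    data[nan_rows, nan_cols] = col_means[nan_cols]
    # Normalize/scale each column (guard against zero variance)
    std = data.std(axis=0)
    std[std == 0] = 1.0
    return (data - data.mean(axis=0)) / std
```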
For example, for an image classification task, the one or more task-specific features may include features of an image such as, but are not limited to, edges, color, shape, a count of pixels, an image size, etc. In another instance, for fraud detection in payment transactions, the one or more task-specific features may include a transaction amount, a transaction frequency, a fraud rate, a dispute rate, a Decision Intelligence (DI) score, etc. In yet another instance, for medical diagnostic testing, the one or more task-specific features may include a type of disease, a count of patients that suffered from a particular disease in a particular hospital, a period for which the disease remained, a frequency of re-appearance of the disease with a particular time span, etc.
In some embodiments, the one or more task-specific features may be directly provided to the ML model 220 for training the ML model 220 to perform the task. In some other embodiments, the generation of the plurality of category-related latent representations for the plurality of data points may be needed for dimensionality reduction. Further, it may be noted that whether the steps of generating the one or more task-specific features and the plurality of category-related latent representations are performed depends on the type of the ML model 220 being used.
Further, as the plurality of category-related latent representations may be generated in a vector space, the plurality of category-related latent representations may get clustered as per their classes. Therefore, it may be understood that the data pre-processing module 224 is configured to generate one or more category distribution clusters in the vector space based, at least in part, on the plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space.
In an embodiment, the plurality of category-related latent representations may be generated as a part of a preprocessing step. In another embodiment, the plurality of category-related latent representations may be generated using the ML model 220. In various non-limiting examples, the data pre-processing module 224 may utilize any feature or embedding generation approach such as, but not limited to, one-hot embedding, entity embeddings, and the like to generate the one or more task-specific features and/or category-related latent representations. It is understood that such feature and embedding generation techniques are already known in the art, and therefore the same are not explained here for the sake of brevity.
It may be noted that the imbalanced dataset 218 may be split into at least a training dataset and a testing dataset. Further, the objective of the present disclosure is to accurately classify data points from the testing dataset of the imbalanced dataset 218 into one or more predetermined categories for performing a specific task. It may be noted that the one or more predetermined categories may be determined based on the at least one category label associated with the plurality of data points. Therefore, an AI/ML model needs to be trained and used by the server system 200, for the server system 200 to be able to achieve the objective of the present disclosure. It may be noted that the plurality of category-related latent representations or the one or more task-specific features may be provided as input to the training module 226.
In an embodiment, the training module 226 includes suitable logic and/or interfaces for generating the ML model 220 based, at least in part, on performing a set of operations iteratively till the performance of the ML model 220 converges to predefined criteria. The set of operations may include: (1) initializing the ML model 220, (2) computing, via the ML model 220, one or more loss function values and a net loss function value, and (3) optimizing one or more model parameters by back-propagating the net loss function value. In an embodiment, the predefined criteria correspond to a saturation of the net loss function value. Herein, the net loss function value is saturated after a plurality of iterations of the set of operations is performed. In a non-limiting example, the ML model 220 may be a Gaussian discriminator-based model. Herein, the process of the generation of the ML model 220 is explained with reference to FIG. 3.
In an embodiment, the score computation module 228 includes suitable logic and/or interfaces for computing a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. In an embodiment, the relative confidence score may be computed via the ML model 220. Herein, the relative confidence score for an individual data point indicates or represents if the individual data point from the plurality of data points belongs to a particular category distribution cluster from the one or more category distribution clusters.
As used herein, the term “confidence score” refers to a parameter that is used to indicate a likelihood of a respective data point being assigned to a particular class. In other words, the confidence score may represent the degree of confidence of the ML model 220 regarding a particular prediction (i.e., the classification of the data point into a particular class or OOD), and how sure the machine learning model is that the respective intent has been correctly assigned by the model. In a particular implementation, the confidence score can have a value between 0 and 1, depending on how the neural network used for training the model works. In contrast, the term “relative confidence score” refers to a confidence score indicating the degree of confidence that one data point has been assigned correctly in comparison to other data points in a given vector space.
In an embodiment, for the score computation module 228 to be able to compute the relative confidence score, the score computation module 228 may have to perform a set of intermediate steps. Thus, the score computation module 228 may be configured to determine the center position corresponding to each category distribution cluster of the one or more category distribution clusters based, at least in part, on one or more decision boundaries surrounding each category distribution cluster of the one or more category distribution clusters. Herein, in an embodiment, the one or more decision boundaries may include one or more spherical decision boundaries. In an embodiment, the score computation module 228 may determine the center position via the ML model 220.
It may be noted that while training the ML model 220, the one or more decision boundaries surrounding the one or more category distribution clusters may be determined. Further, as may be understood, the predictions made by the ML model 220 regarding the classification of the data points may have probability values. Herein, it is known that these probability values are high at the center position of the one or more category distribution clusters and decrease as one moves away from the center position and toward the one or more decision boundaries. Therefore, based on the probability values and the positioning of the decision boundaries, the score computation module 228 may determine the center position corresponding to each category distribution cluster of the one or more category distribution clusters using the ML model 220.
The score computation module 228 may further be configured to compute a radius of each category distribution cluster of the one or more category distribution clusters based, at least in part, on the corresponding center positions and the corresponding one or more decision boundaries of the corresponding category distribution clusters.
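One plausible reading of this step can be sketched as follows, under the assumption that a cluster's center is the mean of its latent representations and that the radius follows Eqn. 2 (R_k = a_k · s_k). The value a_k = 2.0 is fixed here purely for illustration, since a_k is a learnable parameter in the disclosure:

```python
import numpy as np

def cluster_center_and_radius(latents, a_k=2.0):
    """Center = mean of the class's latent representations (an assumption);
    radius per Eqn. 2: R_k = a_k * s_k, where s_k is read here as the pooled
    (isotropic) standard deviation of the cluster."""
    center = latents.mean(axis=0)
    s_k = np.sqrt(((latents - center) ** 2).mean())  # pooled scalar spread
    return center, a_k * s_k
```

For a tightly packed minority-class cluster, s_k is small and so is the radius, which is what later allows distances to be normalized per class.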
Upon obtaining values for parameters such as the center position and the radius for each category distribution cluster of the one or more category distribution clusters in the vector space, the data points in the testing dataset may have to be classified. Further, as may be understood, as the category-related latent representations corresponding to the data points become more distant from the center positions, the probability value of the predictions made by the ML model 220 decreases. Thus, it may be noted that the distance between the category-related latent representation in the vector space and the center positions of the one or more category distribution clusters is another important parameter. The score computation module 228 may thus be configured to determine, via the ML model 220, the distance of the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters.
However, since the testing dataset is taken from the imbalanced dataset 218 which is also imbalanced, the exact distance value may not provide accurate results. Therefore, the distance value may be normalized using the radius value of the one or more category distribution clusters. Herein, the process of the normalization of the distance is explained in the present disclosure with reference to FIG. 4. Therefore, it may be noted that the score computation module 228 may be configured to compute the relative distance for the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters based, at least in part, on the radius and the distance.
The score computation module 228 may further be configured to compute, via the ML model 220, the relative confidence score for each data point of the plurality of data points based, at least in part, on the relative distance and predefined confidence computation criteria.
In an embodiment, the predefined confidence computation criteria may correspond to a relative confidence score computation formula which is stated by the following Equation 1:
Conf_k(x) = exp((-D_k(x) + R_k)/R_k) - 1 … Eqn. 1
Herein, ‘Conf_k(x)’ refers to the relative confidence score for a sample ‘x’ that corresponds to a category-related latent representation of a data point. For instance, while using an already trained ML model (e.g., the ML model 220) for determining a category of a data point from a testing dataset provided to the ML model 220, ‘x’ corresponds to a feature vector (also termed a category-related latent representation) of the corresponding data point. Further, ‘k’ corresponds to a count of categories or the category distribution clusters generated while training the ML model 220 for a training dataset. Further, ‘D_k(x)’ refers to the distance of the sample ‘x’ from the k-th class cluster and ‘R_k’ refers to the radius of the k-th class cluster (hereinafter, interchangeably also referred to as ‘category distribution cluster’). Herein, ‘R_k’ has the following formula:
R_k=a_k s_k … Eqn. 2

Herein, ‘a_k’ refers to a learnable parameter and ‘s_k’ refers to the standard deviation of the k-th cluster. From Eqn. 1, it may be understood that the distance to each class or category cluster is normalized such that the maximum confidence is given when the sample ‘x’ is at the center position of the category cluster of its category. It is noted that the confidence decreases as a data point moves toward the cluster boundary of the distribution cluster. It may be noted that this normalization tackles the class imbalance issue.
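The relative confidence computation of Eqn. 1 can be sketched as follows; this assumes Euclidean distances and the reading that the “- 1” sits outside the exponential, so the score turns negative once a sample lies outside a cluster's radius:

```python
import numpy as np

def relative_confidence(x, centers, radii):
    """Eqn. 1: Conf_k(x) = exp((R_k - D_k(x)) / R_k) - 1.
    Maximal (e - 1) at a cluster center, zero on the boundary D_k = R_k,
    and negative once the sample lies outside the cluster radius."""
    d = np.linalg.norm(centers - x, axis=1)  # D_k(x) for every cluster k
    return np.exp((radii - d) / radii) - 1.0
```

Because each distance is divided by its own cluster's radius, a sample near a small minority-class cluster is judged on the same scale as one near a large majority-class cluster.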
Upon obtaining the relative confidence score for each data point in the testing dataset, the corresponding data point may have to be classified. Therefore, the relative confidence scores for each data point are provided as input to the classification module 230.
In an embodiment, the classification module 230 includes suitable logic and/or interfaces for classifying the plurality of data points to be included in the one or more category distribution clusters based, at least in part, on the relative confidence score of each data point of the plurality of data points being at least equal to a predefined threshold score. In another embodiment, the classification module 230 is further configured to classify the plurality of data points to be included in an OOD category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than the predefined threshold score.
Herein, in a particular instance, the predefined threshold score may be computed using the following formula:
class assignment = { argmax_k(Conf_k(x)), if Conf_k(x) ≥ 0; else mark it OOD } … Eqn. 3
As used herein, the function ‘Argmax’ refers to an operation that computes the argument that gives the maximum value from a target function. As may be understood, Argmax is most commonly used in machine learning for finding the class with the largest predicted probability. Therefore, it may be understood that, if the relative confidence score for a data point is greater than or equal to 0, then a largest predicted probability value may be selected and a category of that particular data point may be found. Alternatively, if the relative confidence score for the data point is less than 0, then the corresponding data point may be marked with an OOD label, thereby declaring that the data point does not belong to any of the one or more predetermined categories.
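A minimal sketch of this class-assignment rule (Eqn. 3); the sentinel value -1 for the OOD category is a hypothetical choice, not part of the disclosure:

```python
import numpy as np

OOD = -1  # hypothetical sentinel label for the OOD category

def assign_class(conf):
    """Eqn. 3: pick argmax_k(Conf_k(x)) when the best relative confidence
    score is >= 0; otherwise mark the data point as OOD."""
    conf = np.asarray(conf)
    return int(conf.argmax()) if conf.max() >= 0 else OOD
```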
In an alternative embodiment, it may be noted that, because of the normalization of the distance using the radius value, the predefined threshold score may also be approximated by the radius value instead of ‘0’. Herein, instead of the relative confidence score, a confidence score may be compared with the radius value. In such an embodiment, the confidence score may be approximated using the following Equation:
Score_k = -D_k + a·R_k … Eqn. 4
Herein, the terminologies remain the same as mentioned above; however, the term ‘Score_k’ may correspond to the confidence score. Therefore, based on Equations 2, 3, and 4, it may be understood that if the relative confidence score for a data point is greater than or equal to 0, or the confidence score is greater than or equal to the radius (R_k), then a largest predicted probability value may be selected and a category of that particular data point may be found. Alternatively, if the relative confidence score for the data point is less than 0, or the confidence score is less than the radius (R_k), then the corresponding data point may be marked with an OOD label, thereby declaring that the data point does not belong to any of the one or more predetermined categories.
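This alternative thresholding can be sketched as follows. The value a = 2 is an illustrative assumption (a is a learnable parameter in the disclosure), chosen because it makes the condition Score_k ≥ R_k reduce to D_k(x) ≤ R_k, i.e., the sample lying inside the cluster:

```python
import numpy as np

OOD = -1  # hypothetical sentinel label for the OOD category

def score_assign(x, centers, radii, a=2.0):
    """Eqn. 4: Score_k = -D_k(x) + a*R_k, thresholded at the radius R_k.
    With a = 2 (illustrative), Score_k >= R_k is equivalent to D_k(x) <= R_k."""
    d = np.linalg.norm(centers - x, axis=1)
    score = -d + a * radii
    inside = score >= radii
    return int(score.argmax()) if inside.any() else OOD
```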
FIG. 3 illustrates a flow diagram representation 300 of a process flow of training of the ML model (e.g., the ML model 220), in accordance with an embodiment of the present disclosure. In an embodiment, the server system 200 may generate the ML model 220 via the training module 226. As may be understood the training module 226 may be configured to perform the set of operations iteratively till the performance of the ML model 220 converges to the predefined criteria. Upon performing the set of operations, the ML model 220 may get trained to accurately classify data points from a testing dataset of the imbalanced dataset 218 under one or more predetermined categories. It may be noted that the one or more predetermined categories may include task-specific categories. In addition, in an embodiment, the one or more predetermined categories may also include an OOD category. Moreover, it may be noted that the one or more predetermined categories can be associated with the at least one category label that is also assigned to the one or more category distribution clusters. Herein, the one or more category distribution clusters are generated at the time of pre-processing of the training dataset of the imbalanced dataset 218 in the vector space.
As mentioned above, the ML model 220 may be a Gaussian discriminator-based model. As used herein, the term “Gaussian discriminator-based model” refers to a generative model that aims to learn and generate samples from a given data distribution. It is a variant of the Generative Adversarial Network (GAN) architecture, where the discriminator is modified to predict a probability distribution rather than a binary output.
In a specific embodiment, the set of operations may include model initialization 302, loss function value computation 304, net loss computation 306, and backpropagation 308. Upon performing these steps, the training module 226 may check whether a performance of the ML model 220 converges to the predefined criteria or not (see, 310). If the performance of the ML model 220 converges to the predefined criteria, the training process of the ML model 220 stops (see, 312). Alternatively, if the performance of the ML model 220 does not converge to the predefined criteria, the set of operations repeat and are iteratively performed via the ML model 220 as shown in FIG. 3. Therefore, it may be understood that the ML model 220 performs the set of operations as mentioned above until the performance of the ML model 220 converges to the predefined criteria.
Further, in the case of the ML model 220 being the Gaussian discriminator-based model, the ML model 220 may also include a generator 314 and a discriminator 316 as shown in FIG. 3. In an embodiment, the operation of model initialization 302 may refer to a step of initializing the ML model 220 using one or more model parameters. In a non-limiting example, the one or more model parameters may include weights and biases of the generator 314 and the discriminator 316, fixing a count of iterations or epochs, and the like.
In an embodiment, the operation of the loss function value computation 304 may refer to an operation of computing, via the ML model 220, one or more loss function values for each data point of the plurality of data points for reducing intra-class cluster distance and increasing inter-class cluster distance, based at least on using one or more loss functions and the relative confidence score.
In a non-limiting example, it may be noted that intermediate steps for generating the one or more loss function values may include training of the generator 314 and the discriminator 316. In an embodiment, the training of the generator 314 or a generative learning approach may include facilitating the generator 314 to capture a distribution of each category separately. For example, the generator 314 may analyze the distribution of infected patients and the distribution of healthy patients separately and learn each distribution's features separately. When a new example is introduced, it is compared to both distributions, and the class whose distribution the example resembles the most may be assigned to it.
Further, the training of the discriminator 316 may include an operation of trying to find decision boundaries between different categories during the training process or a learning process. For example, given a classification problem to predict whether a patient has malaria or not, the discriminator 316 creates a classification boundary (or a decision boundary) to separate the two types of patients, and when a new example is introduced, it is checked on which side of the boundary the example lies in order to classify it. In combination, for capturing a distribution of each data point of the data points, the ML model 220 fits a Gaussian distribution to every category of the data points separately. It may be noted that the probability of a prediction, in this case, will be high if it lies near the center position of a contour corresponding to its category and decreases as one moves away from the center position of the contour.
It is noted that for each training sample ‘x’, the distances of the sample ‘x’ from the center positions of the category distribution clusters, in case there exist ‘k’ of the category distribution clusters, may include D1, D2, … Dk. Herein, the ML model 220 may assume that for each category ‘k’ the underlying distribution is an isotropic Gaussian, i.e., N(µ_k, s_k^2 I). Herein, ‘s_k’ is the standard deviation and ‘µ_k’ is the mean of the kth cluster.
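A sketch of fitting the isotropic Gaussian N(µ_k, s_k² I) described above to one class's latent representations; reading s_k as a single pooled standard deviation shared by all dimensions is one reasonable interpretation of the isotropic assumption:

```python
import numpy as np

def fit_isotropic_gaussian(latents):
    """Fit N(mu_k, s_k^2 I): mu_k is the per-dimension mean and s_k a single
    pooled standard deviation shared by all dimensions (isotropic assumption)."""
    mu = latents.mean(axis=0)
    s = np.sqrt(((latents - mu) ** 2).mean())
    return mu, s

def log_density(x, mu, s):
    """Log-probability of x under N(mu, s^2 I); highest at the center mu and
    decreasing with distance, matching the contour description above."""
    d = x.shape[-1]
    sq = ((x - mu) ** 2).sum()
    return -0.5 * sq / s**2 - d * np.log(s * np.sqrt(2 * np.pi))
```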
Further, the one or more loss function values may be computed using the one or more loss functions. In a non-limiting example, the one or more loss functions may include at least one pull loss to reduce the intra-class cluster distance, and at least two push losses to increase the inter-class cluster distance. Herein, in a particular instance, for computing the one or more loss functions, the following Equations may be used:
L_Pull = Σ_(x∈B) (D_k(x) · Y_x^k) / |B| … Eqn. 5
L_(Push_1) = -log(Conf_k(x)) … Eqn. 6
L_(Push_2) = Σ_(x∈B) (1 - Y_x^k) / ((D_k(x) + 0.001) · |B|) … Eqn. 7
Herein, ‘L_Pull’ refers to one of the three losses defined in the proposed approach to reduce intra-cluster distance. Further, ‘L_(Push_1)’ and ‘L_(Push_2)’ refer to the other two losses that help in increasing inter-cluster distance. Further, ‘D_k(x)’ refers to the distance of the sample x from the k^th class cluster, and ‘Y_x’ refers to a one-hot label vector for the training data point ‘x’. In other words, if the training data has k classes or categories, then ‘Y_x’ is a k-dimensional vector with all zeros except for a one at the index corresponding to the true class label y. Further, ‘Y_x^k’ refers to the k^th entry of the Y_x vector. Furthermore, ‘B’ refers to a batch of training data sent at a time into the ML model 220.
Therefore, upon using the above-mentioned loss functions, the ML model 220 may compute the one or more loss function values of each data point of the data points. In an embodiment, upon computing the one or more loss function values, a net loss function value may be computed by aggregating the one or more loss function values that correspond to the step of net loss computation 306. Therefore, the net loss function value may be computed using the following formula:
L = L_Pull + L_(Push_1) + L_(Push_2) … Eqn. 8
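A simplified, hedged reading of Eqns. 5 through 8 for a single batch is sketched below. The clipping inside the logarithm is an added numerical-stability assumption (the relative confidence can be negative early in training), not part of the disclosure:

```python
import numpy as np

def batch_losses(D, Y, R, eps=0.001):
    """Simplified reading of Eqns. 5-8 for one batch.
    D: (B, K) distances of each sample to each cluster center.
    Y: (B, K) one-hot labels.  R: (K,) cluster radii.
    Returns the net loss L = L_Pull + L_Push1 + L_Push2 (Eqn. 8)."""
    B = D.shape[0]
    conf = np.exp((R - D) / R) - 1.0                         # Eqn. 1 per cluster
    l_pull = (D * Y).sum() / B                               # Eqn. 5: shrink true-class distance
    true_conf = (conf * Y).sum(axis=1)
    l_push1 = -np.log(np.clip(true_conf, eps, None)).mean()  # Eqn. 6 (clipped for stability)
    l_push2 = ((1.0 - Y) / (D + eps)).sum() / B              # Eqn. 7: repel wrong clusters
    return l_pull + l_push1 + l_push2                        # Eqn. 8
```

A sample close to its own cluster and far from the others yields a lower net loss than one near a wrong cluster, which is the direction backpropagation pushes the model parameters.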
Herein, ‘L’ refers to the net loss function for which the net loss function value may be computed using Equation 8. Upon computing the net loss function value, it is backpropagated to the ML model 220 for optimizing the one or more model parameters. This operation is performed to reduce the net loss function value. Lastly, the performance of the ML model 220 is evaluated and checked to determine whether it has converged to the predefined criteria. The performance evaluation provides information about how well the ML model 220 generates synthetic data samples that resemble the real data distribution. Then, the ML model 220 is saved and can be used for classifying the data points.
Moreover, it may be noted that the set of operations is iteratively performed till the performance of the ML model converges to the predefined criteria. In a non-limiting example, the predefined criteria may be a saturation of the net loss function value. Herein, it may be noted that the net loss function value is saturated after a plurality of iterations of the set of operations is performed. Saturation may refer to a stage in the model training process after a certain number of iterations where the net loss function value becomes constant, i.e., the difference in the net loss function value for an iteration and its subsequent iteration becomes the same or negligible.
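Saturation as described above could be checked as follows; the tolerance and patience values are illustrative assumptions, since the disclosure only requires that the change in net loss between iterations become negligible:

```python
def has_saturated(loss_history, tol=1e-4, patience=3):
    """Convergence check: the net loss is 'saturated' when the change between
    consecutive iterations stays below tol for `patience` iterations in a row."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < tol for i in range(patience))
```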
Therefore, in a non-limiting example, the pseudo-code for the implementation of the ML model 220 for classifying the data points either in the one or more category distribution clusters or the OOD category may be as follows:
Divide the training data into batches to be fed into the ML model;
For (x, y) ∈ B (batch), transmit B through a neural network of the ML model while constraining the embeddings distribution from the network to be Gaussian;
Calculate all three losses for each batch to propagate it back to the network, wherein two losses are designed to increase the inter-class distance and one loss is for decreasing intra-class distance;
For each x ∈ B, predict a label based on the argument of maximum confidence, i.e., the Conf_k(x) value for each class k; and
During training, a threshold is set on the confidence Conf_k(x); if Conf_k(x) is less than the threshold, the data point is marked as OOD, else the argument of the maximum of Conf_k(x) is assigned.

FIG. 4 illustrates a graphical representation 400 of intermediate stages 402, 404, 406, and 408 implemented in an example vector space for computing a relative confidence score and classifying each data point of the plurality of data points of an example imbalanced dataset (e.g., the imbalanced dataset 218), in accordance with an embodiment of the present disclosure. In an embodiment, the server system 200 may compute the relative confidence score using the score computation module 228. Further, the server system 200 may classify the data points using the classification module 230.
As may be noted, a distribution of the plurality of category-related latent representations corresponding to the plurality of data points in vector space is depicted. It may also be observed that the plurality of category-related latent representations is clustered in one or more category distribution clusters 410, 412, and 414, and representations that do not belong to any of the one or more category distribution clusters 410, 412, and 414 may be outside of the corresponding one or more category distribution clusters 410, 412, and 414. It may be noted that three category distribution clusters i.e., 410, 412, and 414 are considered for the sake of explanation, however, there could be any number of category distribution clusters. At stage 402, the distribution shows an imbalanced distribution of the one or more category-related latent representations in the vector space as a spread of each category distribution cluster in the vector space is different, as depicted.
Further, it may be noted that the plurality of category-related latent representations that are provided to the ML model 220 may have a distribution as shown in FIG. 4 at stage 402. As may be understood that during training of the ML model 220, the one or more decision boundaries 416 as shown in FIG. 4 at stage 404 may be found based on the distribution and formation of the one or more category distribution clusters 410, 412, and 414 in the example vector space.
Moving ahead, it may be known that predictions made by the ML model 220 about the classification of the data points may be in the form of probability values. Herein, it is understood that the probability values are high at the center positions of the one or more category distribution clusters 410, 412, and 414 and decrease as one moves away from the center position and toward the one or more decision boundaries 416; in other words, a higher probability value means that a data point lies closer to the center position of a particular cluster. Therefore, based on the probability values and the positioning of the one or more decision boundaries 416, the server system 200, via the score computation module 228, may determine the center positions C1, C2, and C3 corresponding to each category distribution cluster of the one or more category distribution clusters 410, 412, and 414, respectively, using the ML model 220.
Similarly, the server system 200 may compute radii R1, R2, and R3 of each category distribution cluster of the one or more category distribution clusters 410, 412, and 414 respectively based, at least in part, on the corresponding center positions C1, C2, and C3 and the corresponding one or more decision boundaries 416 of the corresponding category distribution clusters 410, 412, and 414 using the score computation module 228 at stage 406.
In a non-limiting example, for every training sample while training and every testing sample while testing i.e., ‘x’, a distance D1, D2, and D3 of the sample ‘x’ from center positions C1, C2, and C3 of each of the category distribution clusters 410, 412, and 414 may have to be found as shown in FIG. 4 at stage 408. Therefore, the server system 200 using the score computation module 228 may determine the distance D1, D2, and D3 of the category-related latent representation (e.g., the sample ‘x’) corresponding to each data point from the center positions C1, C2, and C3 of the one or more category distribution clusters 410, 412, and 414, respectively. Herein, the distances computed may not provide accurate results because of the imbalanced nature of the distribution of the labels of the one or more category distribution clusters 410, 412, and 414. Therefore, the distances D1, D2, and D3 have to be normalized using corresponding radii R1, R2, and R3 of the corresponding one or more category distribution clusters 410, 412, and 414 respectively. Upon normalization, updated distance values may be referred to as relative distances which are used for computing the relative confidence score. Therefore, the server system 200 using the score generation module 228 may compute the relative confidence score for each data point of the plurality of data points based, at least in part, on the relative distance and the predefined confidence computation criteria.
In a non-limiting example, it may be noted that, in order to tackle imbalanced data herein, the server system 200 normalizes the distances based on a per-class radius using Eqn. 1, instead of using just the distances D1, D2, and D3 from each class center C1, C2, and C3 to predict the class labels (which works when class labels are equally distributed). Then, the relative confidence score may act as input to the classification module 230. The server system 200 via the classification module 230 may classify the data points either under the one or more category distribution clusters 410, 412, and 414, or under the OOD category based on the relative confidence score and the predefined threshold score. Herein, the predefined threshold score may be obtained using Eqn. 3. In the current scenario, as the sample ‘x’ is outside the one or more category distribution clusters 410, 412, and 414, the data point corresponding to the sample ‘x’ may have to be classified under the OOD category. Therefore, a value for the relative confidence score may have been obtained such that the condition of the relative confidence score being less than 0 is matched, and the data point corresponding to the sample ‘x’ is thus classified under the OOD category. Further, it may be noted that upon training and testing the ML model 220 for its functionality, the ML model 220 can be saved and used for classifying real-time data points for performing a specific task.
FIG. 5A illustrates a graphical representation 500 of a particular implementation of classifying the plurality of data points based on the relative confidence score, in accordance with the particular implementation of the present disclosure. It may be noted that a distribution of the plurality of category-related latent representations corresponding to the plurality of data points in an example vector space may be as shown in FIG. 5A. Herein, it may be observed that the plurality of category-related latent representations are clustered in one or more category distribution clusters 502, 504, and 506, and representations that do not belong to any of the one or more category distribution clusters 502, 504, and 506 may be outside of the corresponding one or more category distribution clusters 502, 504, and 506. It may be noted that three category distribution clusters i.e., 502, 504, and 506 are considered for the sake of explanation, however, there could be any number of category distribution clusters.
In the current particular implementation, it may be observed that the distribution is an imbalanced distribution of the plurality of category-related latent representations in the particular vector space as a spread of each category distribution cluster in the particular vector space is different as shown in FIG. 5A. Further, the server system 200 using the ML model 220 may obtain center positions C1, C2, and C3, and radii R1, R2, and R3 corresponding to each category distribution cluster of the one or more category distribution clusters 502, 504, and 506 respectively.
In a non-limiting example and the current implementation, a test sample 508 may be considered to be within the category distribution cluster 502. Therefore, Eqn. 1 may be used for computing a value for the relative confidence score. Herein, a value of -D_k(x) + R_k may be obtained, wherein k = 1, i.e., a value for -D_1(test sample 508) + R_1 is obtained. As per Eqn. 1 and Eqn. 2, the value of this factor may be positive for the test sample 508, which may therefore be declared within the cluster 502 by labeling the test sample 508 with a label associated with the cluster 502. However, the values of -D_2(test sample 508) + R_2 and -D_3(test sample 508) + R_3 may be negative, thereby declaring that the test sample 508 does not belong to the clusters 504 and 506. For instance, if the labels assigned to the clusters 502, 504, and 506 are green, blue, and yellow, respectively, then in the above-mentioned implementation, the test sample 508 may get labeled as ‘green’.
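The FIG. 5A scenario can be reproduced numerically with hypothetical centers, radii, and labels standing in for the clusters 502, 504, and 506 (all coordinates below are invented for illustration):

```python
import numpy as np

# Hypothetical centers and radii standing in for clusters 502, 504, and 506.
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
radii = np.array([2.0, 1.0, 1.5])
labels = ["green", "blue", "yellow"]

x = np.array([0.5, 0.5])                               # test sample inside cluster 502
margin = -np.linalg.norm(centers - x, axis=1) + radii  # -D_k(x) + R_k for each k
conf = np.exp(margin / radii) - 1.0                    # Eqn. 1
label = labels[int(conf.argmax())] if conf.max() >= 0 else "OOD"
```

Only the first margin is positive, so the sample is labeled ‘green’; moving x far from all three centers would make every margin negative and produce the OOD label, matching the FIG. 5B case below.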
FIG. 5B illustrates a graphical representation 520 of a particular implementation of classifying the plurality of data points based on the relative confidence score, in accordance with another implementation of the present disclosure. It may be noted that the distribution shown in FIG. 5B may be similar to that shown in FIG. 5A. Therefore, the one or more category distribution clusters are the same as the one or more category distribution clusters 502, 504, and 506. However, in this particular implementation, a test sample 522 may be considered. Herein, the test sample 522, as shown in FIG. 5B, is considered to be outside each of the one or more category distribution clusters 502, 504, and 506. Thus, a value of -D_k(x) + R_k in Eqn. 1 for k being 1, 2, and 3 would be negative. Herein, according to the class assignment as defined in Eqn. 3, the test sample 522 will be marked as OOD.
FIG. 6 illustrates a block diagram representation of an environment 600 in accordance with a particular embodiment of the present disclosure. Although the environment 600 is presented in one arrangement, other embodiments may include the parts of the environment 600 (or other parts) arranged otherwise depending on, for example, refining the imbalanced dataset of payment transactions for fraud detection, detecting Out-of-Distribution (OOD) data points in the imbalanced dataset of the payment transactions, and generating a refined dataset. Also, it may be noted that the environment 600 is an example implementation for the environment 100 shown in FIG. 1, with the difference being the entities 104 corresponding to a network of merchants and cardholders and the data points corresponding to the payment transactions performed between the corresponding merchants and the cardholders in a payment network.
Thus, the environment 600 generally includes a plurality of components such as a server system 602, a plurality of cardholders 604(1), 604(2), … 604(N), where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as plurality of cardholders 604 or cardholders 604) associated with a plurality of cardholder devices 606(1), 606(2), … 606(N), where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as plurality of cardholder devices 606 or cardholder devices 606), a plurality of merchants 608(1), 608(2), … 608(N) where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as plurality of merchants 608 or merchants 608), a plurality of issuer servers 610(1), 610(2), … 610(N) where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as plurality of issuer servers 610 or issuer servers 610), a plurality of acquirer servers 612(1), 612(2), … 612(N) where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as plurality of acquirer servers 612 or acquirer servers 612), a payment network 614 including a payment server 616, and a database 618, each coupled to, and in communication with (and/or with access to) a network 620. In an embodiment, the environment 600 further includes a plurality of data sources 622(1), 622(2), … 622(N) where ‘N’ is a non-zero Natural number (collectively, interchangeably referred to hereinafter as a plurality of data sources 622 or data sources 622). However, it may be noted that the plurality of components of the environment 600 is substantially similar to the corresponding components of the environment 100 of FIG. 1.
As used herein, the term “cardholder” refers to a person who has a payment account or a payment card (e.g., credit card, debit card, etc.) associated with the payment account, which is used to perform a payment transaction with a merchant. The payment account may be opened via an issuing bank or an issuer server. Similarly, as used herein, the term “merchant” refers to a seller, a retailer, a purchase location, an organization, or any other entity that is in the business of selling goods or providing services, and it can refer to either a single business location or a chain of business locations of the same entity. Further, as used herein, the term “payment network” refers to a network or collection of systems used for the transfer of funds through the use of cash substitutes. Payment networks are companies that connect an issuing bank with an acquiring bank to facilitate online payment. Examples of networks or systems configured to perform as payment networks include those operated by Mastercard®.
In an example, the cardholders 604 may use their corresponding cardholder devices 606, which are electronic devices, for accessing a mobile application or a website associated with the issuing bank, or any third-party payment application, to perform a payment transaction. In various non-limiting examples, electronic devices may refer to any electronic devices such as, but not limited to, personal computers (PCs), tablet devices, Personal Digital Assistants (PDAs), voice-activated assistants, Virtual Reality (VR) devices, smartphones, laptops, and the like.
Usually, in the payment industry, to predict the data points, i.e., the payment transactions that do not fall in any class within the imbalanced dataset of a payment network (e.g., Mastercard®), the present disclosure introduces a new category of Out-of-Distribution (OOD). Hence, if the server system 602 sees a data point, i.e., a payment transaction, that is not from the training distribution that was used while training the ML model 220 used by the server system 602, the corresponding payment transaction can be labeled with the OOD category. During testing, the proposed method assigns a relative confidence score with every class prediction.
As may be understood, in the transaction space, the space of fraudulent transactions is highly imbalanced. Furthermore, if all the disputed transactions are considered, then they can be roughly divided into three categories: first-party fraud (FPF), third-party fraud (TPF), and legitimate dispute. Therefore, labels that may be associated with the one or more category distribution clusters in the vector space may include one or more of the above-mentioned three categories.
Conventionally, there exist mixed labels for FPF and TPF transactions. Thus, the ML model 220 can be trained to learn the distribution of FPF and TPF transactions. If the ML model 220 identifies any payment transaction that does not belong to either of the categories, then the ML model 220 can declare that the corresponding payment transaction belongs to the OOD category. Herein, the ML model 220 is trained by performing various operations and is used by the server system 602 for refining the imbalanced dataset of the payment transactions for fraud detection by more accurately detecting OOD samples. It may be noted that these operations are explained above with reference to the previously described figures and are not repeated here for the sake of brevity.
FIG. 7 illustrates a graphical representation 700 of an example implementation of classifying one or more data points in the environment 600 of FIG. 6 for a comparative study with a conventional approach, in accordance with an embodiment of the present disclosure. Herein, since the environment 600 is for a payment industry or a payment network 614 (e.g., Mastercard®), the one or more data points may correspond to one or more payment transactions.
In the example implementation, the imbalanced dataset 624 may include information related to a plurality of payment transactions that take place within the payment network 614. As may be understood, labels associated with payment transactions are generally highly imbalanced; thus, the information related to the plurality of payment transactions is said to be the imbalanced dataset 624. Herein, the objective of the example implementation is to train the ML model 626 so that a real-time payment transaction can be predicted to be either an authorized legitimate transaction (hereinafter interchangeably referred to as Legit-Auth), a third-party fraud (TPF), or a first-party fraud (FPF).
Further, the server system 602 may generate the one or more task-specific features for each payment transaction. For example, the one or more task-specific features that may be considered for fraud detection may include features such as a Decision Intelligence (DI) score, a dispute rate, and the like. Further, the server system 602 may generate a category-related latent representation corresponding to each payment transaction based on the one or more task-specific features.
Later, the ML model 626 is trained based on the one or more category-related latent representations. While training, the one or more category distribution clusters are generated in a vector space. As shown in FIG. 7, the vector space may be plotted graphically as DI score versus dispute rate. Herein, a cluster that is positioned such that both the DI score and the dispute rate are low represents all authorized legitimate transactions (i.e., Legit-Auth). As these transactions usually tend to have low DI scores and low dispute rates, they form the major chunk of the entire transaction space. Further, a cluster that is positioned far away from the Legit-Auth cluster represents TPF transactions. These transactions typically have high DI scores and high dispute rates. Finally, a handful of FPF-labeled transactions may be shown at a position next to the Legit-Auth cluster. These transactions have an average dispute rate but typically a low DI score.
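A minimal, non-authoritative sketch of generating such category distribution clusters may read as follows, assuming each cluster is placed around the mean of its category's points; the (DI score, dispute rate) values below are hypothetical.

```python
def cluster_centers(features, labels):
    """Group training points by category label and place each category
    distribution cluster around the mean (center position) of its points."""
    groups = {}
    for point, label in zip(features, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))
        for label, pts in groups.items()
    }

# Hypothetical (DI score, dispute rate) pairs for two training categories
features = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (0.8, 0.8)]
labels = ["Legit-Auth", "Legit-Auth", "TPF", "TPF"]
print(cluster_centers(features, labels))
```

The Legit-Auth center lands at low DI score and low dispute rate, while the TPF center lands at high values of both, mirroring the cluster positions described for FIG. 7.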
It may be noted that traditional deep learning (DL)/ML models assume uniform label distribution and try to transform the payment transactions into a vector space where classes can be separated easily. In such a scenario, since approximately the same number of samples is present per class, the model is not biased towards any one class and can create confident separation boundaries. An example of the conventional approach may include Deep Multi-class Data Description (Deep MCDD).
However, in payment networks such as Mastercard®, the majority of implementations generally have a highly imbalanced label distribution. Herein, the DL/ML model may have to be tuned for highly imbalanced data. Moreover, the FPF class has only a handful of labeled data points; its clusters are therefore small and not generalizable. Thus, for a data point like ‘x’, it may be difficult to say whether it is an FPF or a Legit-Auth transaction. Such unconfident samples outside the clusters may have to be labeled as outlier/Out-of-Distribution (OOD) samples.
Therefore, the training objective of the server system 602 essentially has two parts: (i) a pull loss to reduce the intra-class cluster distance, and (ii) a push loss to increase the inter-class cluster distance. Furthermore, as the training data has an uneven, particularly highly imbalanced, class distribution, the loss function should be formulated so that it is unbiased with respect to cluster size. While testing, the ML model 626 should be able to filter out transactions that do not belong to any of the training classes; essentially, the ML model 626 should be able to filter out those testing samples that do not come from the training distribution, that is, Out-of-Distribution (OOD) samples. Hence, alongside the distance from the cluster center, a radius R of each cluster is also encoded to see where the classes end. Anything beyond the radii can be labeled as OOD. Also, based on how close a point is to the cluster center, a relative confidence score is assigned. Further, at the time of training, the one or more decision boundaries 702 differentiating the different clusters from each other are also found. It may be noted that the sample ‘x’ may correspond to either X1, X2, X3, X4, X5, or X6. Herein, X1 may correspond to the center of the Legit-Auth cluster/category, X2 may correspond to a position between the center and the boundary of the Legit-Auth category, X3 may correspond to the boundary of the Legit-Auth category, X4 may correspond to a position between the boundaries of the Legit-Auth category and the FPF category, X5 may correspond to the FPF boundary, and X6 may correspond to a position inside the FPF category cluster. Herein, the proposed method using the ML model 626 is applied to the payment transactions, and the results are compared with a conventional approach such as Deep MCDD. The following table shows the comparative study results:
Case | Where is x? | Deep MCDD class prediction | Deep MCDD confidence | Proposed method class prediction | Proposed method confidence
1 | Center of Legit-Auth | Legit-Auth | Very high | Legit-Auth | Very high
2 | Between center and boundary of Legit-Auth | Legit-Auth | High | Legit-Auth | High
3 | At boundary of Legit-Auth | FPF | Low (wrong prediction) | Legit-Auth | Medium (correct prediction)
4 | Between boundaries of Legit-Auth and FPF | FPF | Low | OOD | Low (Legit-Auth)
5 | At FPF boundary | FPF | High | FPF | Medium
6 | Inside FPF | FPF | High | FPF | High
Table 1: Comparative study results for a conventional approach and a proposed method
From Table 1, it may be clear that for a data point like X4, as shown in FIG. 7, it may be difficult to say with certainty whether it is an FPF or a Legit-Auth transaction while testing. This indicates a need to detect points that do not confidently lie inside any of the clusters. Thus, outliers/Out-of-Distribution (OOD) points need to be detected during the testing phase using the ML model 626.
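The two-part pull/push training objective described above with reference to FIG. 7 may be sketched as follows. The per-class averaging (one way to keep the loss unbiased with respect to cluster size) and the `margin` hinge on center distances are illustrative assumptions, not the exact disclosed formulation.

```python
import math

def pull_push_loss(latents, labels, centers, margin=1.0):
    """Pull loss: distance of each sample to its own class center,
    averaged per class first so that large clusters do not dominate.
    Push loss: hinge penalty when two class centers lie closer than
    `margin` (an assumed hyperparameter)."""
    classes = sorted(centers)
    per_class = {c: [] for c in classes}
    for z, y in zip(latents, labels):
        per_class[y].append(math.dist(z, centers[y]))  # intra-class distances
    pull = sum(sum(d) / len(d) for d in per_class.values() if d) / len(classes)
    push = 0.0
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            push += max(0.0, margin - math.dist(centers[a], centers[b]))
    return pull + push
```

Minimizing the pull term shrinks each cluster around its center, while the push term keeps centers at least `margin` apart; normalizing each class's contribution by its own sample count prevents the majority (Legit-Auth) cluster from dominating the objective.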
In a non-limiting example, Table 2 depicted below shows a performance comparison of the proposed method with the conventional Deep MCDD model.
Performance on transaction dataset | Class | Proposed method | Deep MCDD
Accuracy (overall) | - | 96.76% | 96.13%
Precision | FPF | 93.95% | 90.52%
Precision | TPF | 97.08% | 97.06%
Recall | FPF | 78.55% | 76.28%
Recall | TPF | 99.29% | 98.89%
Table 2: Performance comparison results for payment transactions
In the table shown above (Table 2), class-wise precision and recall are shown for data points whose actual labels were present and were one among TPF and FPF. Further, after comparing both models’ performance, it may be noted that the proposed method performs better than the conventional Deep MCDD model.
In another embodiment, another comparative study may be made on a different imbalanced dataset, such as a gas-sensor dataset that is publicly available in the UCI Machine Learning Repository. The following Table 3 shows a comparative study of the performance results of applying the proposed method and the conventional Deep MCDD model on the gas-sensor dataset.
Class label | No. of samples | Proposed method accuracy | Proposed method f1_score | Deep MCDD accuracy | Deep MCDD f1_score
OOD class | 2564 | 85.936073 | 87.246377 | 54.726027 | 38.320373
Class 1 | 468 | 99.452055 | 97.413793 | 99.429224 | 97.337593
Class 2 | 263 | 99.200913 | 92.900609 | 99.611872 | 96.673190
Class 3 | 310 | 89.041096 | 54.110899 | 55.136986 | 23.807677
Class 4 | 482 | 99.497717 | 97.684211 | 99.315068 | 96.951220
Class 5 | 293 | 98.333333 | 85.769981 | 98.767123 | 89.849624
Overall | 4380 | 90.718219 | 85.854311 | 70.079778 | 73.823280
Table 3: Performance comparison results for the Gas-sensor dataset
Herein, as shown in Table 3, the gas-sensor dataset has 6 classes, out of which class 1 to class 5 may be considered as in-distribution classes used for training, while the test data includes all 6 classes. Further, the task of both the proposed method and the conventional Deep MCDD model is to label each data point in the test data as either class 1, class 2, class 3, class 4, class 5, or OOD. Further, Table 3 shows the class-wise accuracy and f1-score for all 6 classes as well as the overall accuracy and f1-score for the entire test data.
As may be seen from Table 3, the proposed method provides approximately 20% higher accuracy and approximately 12% higher f1-score than the conventional approach. To that end, the proposed approach is better at capturing OOD data points as well as at classifying in-distribution data points. Further, it may be noted that the proposed method is defined for tabular datasets; however, as long as data samples can be converted to tabular datapoints, the present approach may be applied to image datasets as well. For an image dataset, each image scan is sent in after flattening the entire image.
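The image-flattening step mentioned above may be sketched as follows; this is a trivial illustration, and the 2x2 grayscale values are hypothetical.

```python
def flatten_image(image):
    """Flatten a 2-D image (a list of pixel rows) into a single flat
    feature vector, i.e., one tabular datapoint."""
    return [pixel for row in image for pixel in row]

image = [[0.1, 0.2],
         [0.3, 0.4]]            # a hypothetical 2x2 grayscale scan
print(flatten_image(image))     # -> [0.1, 0.2, 0.3, 0.4]
```

Each flattened image row can then be treated exactly like any other tabular data point in the pipeline described above.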
FIG. 8 illustrates a process flow diagram depicting a method 800 for refining an imbalanced dataset for a specific task and detecting Out-of-Distribution (OOD) data points from the imbalanced dataset, in accordance with an embodiment of the present disclosure. The method 800 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 800 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 800, and combinations of operations in the method 800 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 800. The process flow starts at operation 802.
At 802, the method 800 includes accessing, by a server system (e.g., the server system 200), the imbalanced dataset (e.g., the imbalanced dataset 218) from a database (e.g., the database 204) associated with the server system 200. Herein, the imbalanced dataset 218 may include a plurality of data points related to a plurality of entities (e.g., the entities 104). Each data point of the plurality of data points is labeled with at least one category label, and with a label distribution being imbalanced.
At 804, the method 800 includes performing, by the server system 200, for each data point of the plurality of data points: sub-steps 804A and 804B.
At 804A, the method 800 includes generating one or more task-specific features from the imbalanced dataset 218.
At 804B, the method 800 includes generating a category-related latent representation from the one or more task-specific features.
At 806, the method 800 includes generating, by the server system 200, one or more category distribution clusters in a vector space based, at least in part, on the plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space.
At 808, the method 800 includes computing, by the server system 200 via a machine learning (ML) model 220, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters.
At 810, the method 800 includes classifying the plurality of data points to be included in an Out-of-Distribution (OOD) category, based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.
FIGS. 9A and 9B collectively illustrate a detailed process flow diagram depicting a method 900 for refining an imbalanced dataset (e.g., the imbalanced dataset 218) for a specific task and detecting OOD data points from the imbalanced dataset 218, in accordance with an embodiment of the present disclosure. The method 900 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 900 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 900, and combinations of operations in the method 900 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 900. The process flow starts at operation 902.
At 902, the method 900 includes accessing, by a server system (e.g., the server system 200), the imbalanced dataset (e.g., the imbalanced dataset 218) from a database (e.g., the database 204) associated with the server system 200. Herein, the imbalanced dataset 218 may include a plurality of data points related to a plurality of entities (e.g., the entities 104). Each data point of the plurality of data points is labeled with at least one category label, and with a label distribution being imbalanced.
At 904, the method 900 includes performing, by the server system 200, for each data point of the plurality of data points: sub-steps 904A and 904B.
At 904A, the method 900 includes generating one or more task-specific features from the imbalanced dataset 218.
At 904B, the method 900 includes generating a category-related latent representation from the one or more task-specific features.
At 906, the method 900 includes generating, by the server system 200, one or more category distribution clusters in a vector space based, at least in part, on the category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space.
At 908, the method 900 includes computing, by the server system 200 via a machine learning (ML) model 220, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters.
In an embodiment, at 908, to compute the relative confidence score, the process flow moves to sub steps 908A followed by sub-steps 908B, 908C, 908D, and 908E.
At 908A, the method 900 includes determining, by the server system 200 via the ML model 220, the center position corresponding to each category distribution cluster of the one or more category distribution clusters based, at least in part, on one or more decision boundaries (e.g., the one or more decision boundaries 416) surrounding each category distribution cluster of the one or more category distribution clusters.
At 908B, the method 900 includes computing, by the server system 200, a radius of each category distribution cluster of the one or more category distribution clusters based, at least in part, on the corresponding center positions and the corresponding one or more decision boundaries 416 of the corresponding category distribution clusters.
At 908C, the method 900 includes determining, by the server system 200 via the ML model 220, a distance of the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters.
At 908D, the method 900 includes computing, by the server system 200, the relative distance for the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters based, at least in part, on the radius and the distance.
At 908E, the method 900 includes computing, by the server system 200 via the ML model 220, the relative confidence score for each data point of the plurality of data points based, at least in part, on the relative distance and predefined confidence computation criteria.
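Sub-steps 908A through 908E may be sketched as follows. The Euclidean distance and the linear mapping from relative distance to confidence are illustrative assumptions standing in for the predefined confidence computation criteria, which are not specified here.

```python
import math

def relative_confidence(z, center, boundary_point):
    """Compute a relative confidence score for a latent representation z.

    908B: radius = distance from the cluster center to a point on its
          decision boundary.
    908C: distance = distance of z from the cluster center.
    908D: relative distance = distance / radius.
    908E: map the relative distance to a score in [0, 1]
          (1 at the center, 0 at or beyond the boundary); an
          illustrative stand-in for the predefined criteria.
    """
    radius = math.dist(center, boundary_point)     # 908B
    distance = math.dist(z, center)                # 908C
    relative_distance = distance / radius          # 908D
    return max(0.0, 1.0 - relative_distance)       # 908E

# A latent point halfway between a cluster center and its boundary
print(relative_confidence((0.0, 1.0), (0.0, 0.0), (0.0, 2.0)))  # -> 0.5
```

Comparing the resulting score against the predefined threshold score of step 910 then decides between in-distribution assignment (912) and the OOD category (914).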
At 910, the method 900 includes identifying, by the server system 200, if the relative confidence score of each data point of the plurality of data points is at least equal to a predefined threshold score. Herein, if the relative confidence score of each data point of the plurality of data points is at least equal to the predefined threshold score, then the process flow moves to step 912, otherwise, the process flow moves to step 914.
At step 912, the method 900 includes classifying the plurality of data points to be included in one of the one or more categories.
At step 914, the method 900 includes classifying the plurality of data points to be included in an Out-of-Distribution (OOD) category.
FIG. 10 illustrates a process flow diagram depicting a method 1000 for refining an imbalanced dataset for fraud detection and detecting OOD payment transactions from the imbalanced dataset, in accordance with an embodiment of the present disclosure. The method 1000 depicted in the flow diagram may be executed by, for example, the server system 602. The sequence of operations of the method 1000 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 1000, and combinations of operations in the method 1000 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 1000. The process flow starts at operation 1002.
At 1002, the method 1000 includes accessing, by a server system (e.g., the server system 602), the imbalanced dataset (e.g., the imbalanced dataset 624) from a database (e.g., the database 618) associated with the server system 602. Herein, the imbalanced dataset 624 may include information related to a plurality of payment transactions related to a network of a plurality of cardholders (e.g., the cardholders 604) and a plurality of merchants (e.g., the merchants 608). Each data point of the plurality of data points is labeled with at least one category label, and with a label distribution being imbalanced.
At 1004, the method 1000 includes performing, by the server system 602, for each payment transaction of the plurality of payment transactions: sub-steps 1004A and 1004B.
At 1004A, the method 1000 includes generating one or more task-specific features from the imbalanced dataset 624.
At 1004B, the method 1000 includes generating a category-related latent representation from the one or more task-specific features.
At 1006, the method 1000 includes generating, by the server system 602, one or more category distribution clusters in a vector space based, at least in part, on the plurality of category-related latent representations and one or more categories in the training dataset. Herein, each category distribution cluster of the one or more category distribution clusters is clustered around a corresponding center position in the vector space.
At 1008, the method 1000 includes computing, by the server system 602 via a machine learning (ML) model 626, a relative confidence score corresponding to each payment transaction of the plurality of payment transactions based, at least in part, on a relative distance of the category-related latent representation corresponding to each payment transaction from center positions of the one or more category distribution clusters. Herein, the relative confidence score for an individual payment transaction indicates if the individual payment transaction from the plurality of payment transactions belongs to a particular category distribution cluster from one of the one or more category distribution clusters.
At 1010, the method 1000 includes classifying the plurality of payment transactions to be included in an Out-of-Distribution (OOD) category based, at least in part, on the relative confidence score of each payment transaction of the plurality of payment transactions being lower than a predefined threshold score.
Although the invention has been described with reference to specific exemplary embodiments, it is noted that various modifications and changes may be made to these embodiments without departing from the broad scope of the invention. For example, the various operations, blocks, etc., described herein may be enabled and operated using hardware circuitry (for example, complementary metal oxide semiconductor (CMOS) based logic circuitry), firmware, software, and/or any combination of hardware, firmware, and/or software (for example, embodied in a machine-readable medium). For example, the apparatuses and methods may be embodied using transistors, logic gates, and electrical circuits (for example, application-specific integrated circuit (ASIC) circuitry and/or in Digital Signal Processor (DSP) circuitry).
Particularly, the server system 200 and its various components may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry). Various embodiments of the invention may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause a processor or the computer to perform one or more operations. A computer-readable medium storing, embodying, or encoded with a computer program, or similar language, may be embodied as a tangible data storage device storing one or more software programs that are configured to cause a processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein. In some embodiments, the computer programs may be stored and provided to a computer using any type of non-transitory computer-readable media. Non-transitory computer-readable media includes any type of tangible storage media. Examples of non-transitory computer-readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g. magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), DVD (Digital Versatile Disc), BD (BLU-RAY® Disc), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash memory, RAM (random access memory), etc.). Additionally, a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. In some embodiments, the computer programs may be provided to a computer using any type of transitory computer-readable media. 
Examples of transitory computer-readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer-readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Various embodiments of the invention, as discussed above, may be practiced with steps and/or operations in a different order, and/or with hardware elements in configurations, which are different from those which, are disclosed. Therefore, although the invention has been described based on these exemplary embodiments, it is noted that certain modifications, variations, and alternative constructions may be apparent and well within the scope of the invention.
Although various exemplary embodiments of the invention are described herein in a language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
Claims:
1. A computer-implemented method for refining an imbalanced dataset for a specific task, the computer-implemented method comprising:
accessing, by a server system, the imbalanced dataset from a database associated with the server system, the imbalanced dataset comprising a plurality of data points related to a plurality of entities, each data point of the plurality of data points being labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced;
performing, by the server system, for each data point of the plurality of data points:
generating one or more task-specific features from the imbalanced dataset, and
generating a category-related latent representation from the one or more task-specific features;
generating, by the server system, one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations and one or more categories in the training dataset, each category distribution cluster of the one or more category distribution clusters being clustered around a corresponding center position in the vector space;
computing, by the server system via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters, wherein the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters; and
classifying, by the server system, the plurality of data points to be included in an Out-of-Distribution (OOD) category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.
2. The computer-implemented method as claimed in claim 1, further comprising:
classifying, by the server system, the plurality of data points to be included in one of the one or more categories based, at least in part, on the relative confidence score of each data point of the plurality of data points being at least equal to the predefined threshold score.
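Claims 1 and 2 together describe a two-way split: each data point is assigned to one of the category distribution clusters when its confidence score is at least the predefined threshold, and to the OOD category otherwise. The following is a minimal illustrative sketch, not the claimed implementation; the mean-based centers, the inverse-distance confidence proxy, and all function names are assumptions:

```python
import numpy as np

def generate_clusters(latents, labels):
    # One category distribution cluster per label: its center position
    # in the vector space is the mean of that category's latents.
    return {c: latents[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify_point(z, centers, threshold):
    # Assign z to its nearest cluster center; fall back to the OOD
    # category when the confidence score is below the threshold.
    dists = {c: float(np.linalg.norm(z - mu)) for c, mu in centers.items()}
    best = min(dists, key=dists.get)
    confidence = 1.0 / (1.0 + dists[best])  # illustrative distance-based proxy
    return (best if confidence >= threshold else "OOD"), confidence
```

A point sitting at a cluster center scores a confidence of 1.0, while a point far from every center falls below the threshold and is routed to the OOD category.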
3. The computer-implemented method as claimed in claim 1, further comprising:
generating, by the server system, the ML model based, at least in part, on performing a set of operations iteratively until performance of the ML model converges to predefined criteria, the set of operations comprising:
initializing the ML model based, at least in part, on one or more model parameters;
computing via the ML model, one or more loss function values for each data point of the plurality of data points for reducing an intra-class cluster distance and increasing inter-class cluster distance based, at least in part, on using one or more loss functions and the relative confidence score;
computing a net loss function value for each data point of the plurality of data points based, at least in part, on aggregating the one or more loss function values; and
optimizing the one or more model parameters associated with the ML model based, at least in part, on back-propagating the net loss function value.
4. The computer-implemented method as claimed in claim 3, wherein the one or more loss functions comprise at least one pull loss and at least two push losses, the at least one pull loss being adapted to reduce the intra-class cluster distance and the at least two push losses being adapted to increase an inter-class cluster distance and to configure an imbalanced setting.
5. The computer-implemented method as claimed in claim 3, wherein the predefined criteria is a saturation of the net loss function value, the net loss function value being saturated after a plurality of iterations of the set of operations is performed.
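Claims 3 through 5 describe iterative training with one pull loss that tightens each cluster and at least two push losses that separate clusters and account for the imbalanced setting. The hinge margins, the inverse-frequency weighting, and all function names below are illustrative assumptions rather than the claimed loss functions:

```python
import numpy as np

def pull_loss(latents, labels, centers):
    # Pull loss: mean squared distance of each latent to its own
    # category center, reducing the intra-class cluster distance.
    return float(np.mean([np.sum((z - centers[c]) ** 2)
                          for z, c in zip(latents, labels)]))

def push_loss_centers(centers, margin=4.0):
    # Push loss 1: hinge on pairwise center distances, increasing the
    # inter-class cluster distance up to the margin.
    cs = list(centers.values())
    gaps = [max(0.0, margin - float(np.linalg.norm(a - b)))
            for i, a in enumerate(cs) for b in cs[i + 1:]]
    return float(np.mean(gaps)) if gaps else 0.0

def push_loss_imbalance(latents, labels, centers, margin=2.0):
    # Push loss 2: push each latent away from other categories' centers,
    # weighted inversely by class frequency (illustrative imbalance handling).
    counts = {c: int(np.sum(labels == c)) for c in centers}
    total = 0.0
    for z, c in zip(latents, labels):
        for k, mu in centers.items():
            if k != c:
                total += (1.0 / counts[c]) * max(
                    0.0, margin - float(np.linalg.norm(z - mu)))
    return total / len(latents)

def net_loss(latents, labels, centers):
    # Net loss aggregates the pull and push terms; in training, this
    # aggregate would be back-propagated to optimize model parameters.
    return (pull_loss(latents, labels, centers)
            + push_loss_centers(centers)
            + push_loss_imbalance(latents, labels, centers))
```

When latents coincide with their own centers and the centers are farther apart than both margins, every term is zero, which corresponds to the saturation criterion of claim 5.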
6. The computer-implemented method as claimed in claim 1, wherein computing the relative confidence score comprises:
determining, by the server system via the ML model, the center position corresponding to each category distribution cluster of the one or more category
distribution clusters based, at least in part, on one or more decision boundaries surrounding each category distribution cluster of the one or more category distribution clusters;
computing, by the server system, a radius of each category distribution cluster of the one or more category distribution clusters based, at least in part, on the corresponding center positions and the corresponding one or more decision boundaries of the corresponding category distribution clusters;
determining, by the server system via the ML model, a distance of the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters;
computing, by the server system, the relative distance for the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters based, at least in part, on the radius and the distance; and
computing, by the server system via the ML model, the relative confidence score for each data point of the plurality of data points based, at least in part, on the relative distance and predefined confidence computation criteria.
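Claim 6 computes a relative distance by normalizing each point's distance to a cluster center by that cluster's radius. One way to sketch this is below, taking the maximum training distance to the center as a stand-in for the decision boundary and mapping the smallest relative distance into a (0, 1] confidence; both choices are assumptions, not the claimed computation criteria:

```python
import numpy as np

def fit_clusters(latents, labels):
    # Center and radius per category distribution cluster. The radius is
    # taken as the maximum training distance to the center, standing in
    # for the decision boundary surrounding the cluster (an assumption).
    centers, radii = {}, {}
    for c in np.unique(labels):
        pts = latents[labels == c]
        centers[c] = pts.mean(axis=0)
        radii[c] = float(np.linalg.norm(pts - centers[c], axis=1).max()) + 1e-9
    return centers, radii

def relative_confidence(z, centers, radii):
    # Relative distance = distance to center / cluster radius; the
    # confidence score maps the smallest relative distance into (0, 1].
    rel = {c: float(np.linalg.norm(z - centers[c])) / radii[c] for c in centers}
    best = min(rel, key=rel.get)
    return best, 1.0 / (1.0 + rel[best])
```

Dividing by the radius makes the score comparable across tight and loose clusters: a point one radius away from any center scores 0.5 regardless of that cluster's absolute size.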

7. The computer-implemented method as claimed in claim 1, wherein the ML model comprises a Gaussian discriminator-based model.
8. The computer-implemented method as claimed in claim 1, further comprising: removing, by the server system, the OOD data points from the imbalanced dataset to generate a refined dataset.
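Claims 7 and 8 mention a Gaussian discriminator-based model and removal of the OOD data points to generate a refined dataset. A common Gaussian-discriminant formulation scores each point by its minimum squared Mahalanobis distance to the class-conditional means under a shared covariance; the sketch below assumes that formulation (it is not stated in the claims) and uses the negated distance as the confidence:

```python
import numpy as np

def fit_gaussian_discriminator(latents, labels):
    # Class-conditional Gaussian means with a shared covariance, as in
    # Mahalanobis-distance-based OOD scoring (an assumed formulation).
    classes = np.unique(labels)
    means = {c: latents[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([latents[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(latents.shape[1])
    return means, np.linalg.inv(cov)

def confidence(z, means, cov_inv):
    # Negative minimum squared Mahalanobis distance to any class mean.
    return -min(float((z - mu) @ cov_inv @ (z - mu)) for mu in means.values())

def refine(latents, labels, threshold):
    # Remove OOD points (confidence below threshold), yielding the
    # refined dataset of claim 8.
    means, cov_inv = fit_gaussian_discriminator(latents, labels)
    keep = np.array([confidence(z, means, cov_inv) >= threshold
                     for z in latents])
    return latents[keep], labels[keep]
```

A far-away point has a large Mahalanobis distance to every class mean, so its confidence falls below the threshold and it is dropped from the refined dataset.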
9. A server system, comprising:
a communication interface;
a memory comprising executable instructions; and
a processor communicably coupled to the communication interface and the memory, the processor configured to cause the server system to at least:
access an imbalanced dataset from a database associated with the server system, the imbalanced dataset comprising a plurality of data points related to a plurality of entities, each data point of the plurality of data points being labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced;
perform, for each data point of the plurality of data points:
generate one or more task-specific features from the imbalanced dataset, and
generate a category-related latent representation from the one or more task-specific features;
generate one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations and one or more categories in the training dataset, each category distribution cluster of the one or more category distribution clusters being clustered around a corresponding center position in the vector space;
compute via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters, wherein the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters; and
classify the plurality of data points to be included in an Out-of-Distribution category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.
10. The server system as claimed in claim 9, wherein the server system is further caused to: classify the plurality of data points to be included in one of the one or more categories based, at least in part, on the relative confidence score of each data point of the plurality of data points being at least equal to the predefined threshold score.
11. The server system as claimed in claim 9, wherein the server system is further caused to:
generate the ML model based, at least in part, on performing a set of operations iteratively until performance of the ML model converges to predefined criteria, the set of operations comprising:
initialize the ML model based, at least in part, on one or more model parameters;
compute, via the ML model, one or more loss function values for each data point of the plurality of data points for reducing the intra-class cluster distance and increasing the inter-class cluster distance based, at least in part, on using one or more loss functions and the relative confidence score;
compute a net loss function value for each data point of the plurality of data points based, at least in part, on aggregating the one or more loss function values; and
optimize the one or more model parameters associated with the ML model based, at least in part, on back-propagating the net loss function value.
12. The server system as claimed in claim 11, wherein the predefined criteria is a saturation of the net loss function value, the net loss function value being saturated after a plurality of iterations of the set of operations is performed.
13. The server system as claimed in claim 11, wherein the one or more loss functions comprise at least one pull loss to reduce the intra-class cluster distance, and at least two push losses to increase the inter-class cluster distance and to configure an imbalanced setting.
14. The server system as claimed in claim 10, wherein to compute the relative confidence score, the server system is further caused to:
determine, via the ML model, the center position corresponding to each category distribution cluster of the one or more category distribution clusters based, at least in part, on one or more decision boundaries surrounding each category distribution cluster of the one or more category distribution clusters;
determine, via the ML model, a distance of the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters;
compute a radius of each category distribution cluster of the one or more category distribution clusters based, at least in part, on the corresponding center positions and the corresponding one or more decision boundaries of the corresponding category distribution clusters;
compute the relative distance for the category-related latent representation corresponding to each data point from the center positions of the one or more category distribution clusters based, at least in part, on the radius and the distance; and
compute, via the ML model, the relative confidence score for each data point of the plurality of data points based, at least in part, on the relative distance and predefined confidence computation criteria.

15. The server system as claimed in claim 9, wherein the ML model comprises a Gaussian discriminator-based model.
16. The server system as claimed in claim 9, wherein the server system is further caused to remove the OOD data points from the imbalanced dataset to generate a refined dataset.

17. A computer-implemented method for refining an imbalanced dataset for fraud detection in payment transactions, the computer-implemented method comprising:
accessing, by a server system, the imbalanced dataset from a database associated with the server system, the imbalanced dataset comprising information related to a plurality of payment transactions performed by a plurality of cardholders and a plurality of merchants, each payment transaction of the plurality of payment transactions being labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced;
performing, by the server system, for each payment transaction of the plurality of payment transactions:
generating one or more payment-specific features from the imbalanced dataset; and
generating a category-related latent representation from the one or more payment-specific features;
generating, by the server system, one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations and one or more categories in the training dataset, each category distribution cluster of the one or more category distribution clusters being clustered around a corresponding center position in the vector space;
computing, by the server system via a machine learning (ML) model, a relative confidence score corresponding to each payment transaction of the plurality of payment transactions based, at least in part, on a relative distance of the category-related latent representation corresponding to each payment transaction from center positions of the one or more category distribution clusters, wherein the relative confidence score for an individual payment transaction indicates if the individual payment transaction from the plurality of payment transactions belongs to a particular category distribution cluster from one of the one or more category distribution clusters; and
classifying, by the server system, the plurality of payment transactions to be included in an Out-of-Distribution category based, at least in part, on the relative confidence score of each payment transaction of the plurality of payment transactions being lower than a predefined threshold score.
18. The computer-implemented method as claimed in claim 17, wherein classifying the plurality of payment transactions to be included in one of the one or more categories comprises classifying the plurality of payment transactions to be included in one of a First-Party Fraud (FPF), a Third-Party Fraud (TPF), and a Legitimate dispute.
19. The computer-implemented method as claimed in claim 17, wherein the server system is a payment server associated with a payment network.
20. A non-transitory computer-readable storage medium comprising computer-executable instructions that, when executed by at least a processor of a server system, cause the server system to perform a method comprising:
accessing an imbalanced dataset from a database associated with the server system, the imbalanced dataset comprising a plurality of data points related to a plurality of entities, each data point of the plurality of data points being labeled with at least one category label for a training dataset extracted from the imbalanced dataset, and with a label distribution being imbalanced;
performing for each data point of the plurality of data points:
generating one or more task-specific features from the imbalanced dataset, and
generating a category-related latent representation from the one or more task-specific features;
generating one or more category distribution clusters in a vector space based, at least in part, on a plurality of category-related latent representations and one or more categories in the training dataset, each category distribution cluster of the one or more category distribution clusters being clustered around a corresponding center position in the vector space;
computing, via a machine learning (ML) model, a relative confidence score corresponding to each data point of the plurality of data points based, at least in part, on a relative distance of the category-related latent representation corresponding to each data point from center positions of the one or more category distribution clusters, wherein the relative confidence score for an individual data point indicates if the individual data point from the plurality of data points belongs to a particular category distribution cluster from one of the one or more category distribution clusters; and
classifying the plurality of data points to be included in an Out-of-Distribution category based, at least in part, on the relative confidence score of each data point of the plurality of data points being lower than a predefined threshold score.

Documents

Application Documents

# Name Date
1 202341061378-STATEMENT OF UNDERTAKING (FORM 3) [12-09-2023(online)].pdf 2023-09-12
2 202341061378-POWER OF AUTHORITY [12-09-2023(online)].pdf 2023-09-12
3 202341061378-FORM 1 [12-09-2023(online)].pdf 2023-09-12
4 202341061378-FIGURE OF ABSTRACT [12-09-2023(online)].pdf 2023-09-12
5 202341061378-DRAWINGS [12-09-2023(online)].pdf 2023-09-12
6 202341061378-DECLARATION OF INVENTORSHIP (FORM 5) [12-09-2023(online)].pdf 2023-09-12
7 202341061378-COMPLETE SPECIFICATION [12-09-2023(online)].pdf 2023-09-12
8 202341061378-Proof of Right [24-11-2023(online)].pdf 2023-11-24