
Methods And Systems For Training A Student Machine Learning Model With Graph Structure Information

Abstract: Methods and server systems for training a student Machine Learning (ML) model with graph structure information are described herein. A method performed by a server system includes accessing a homogeneous graph including a set of nodes. The set of nodes includes a set of labeled nodes and a set of unlabeled nodes such that each node of the set of nodes is associated with features. The method includes generating, by a teacher ML model, a soft label for each unlabeled node based on the features corresponding to each unlabeled node. The method includes generating an adjacency matrix based on the set of nodes. The method includes extracting a local adjacency matrix from the adjacency matrix. The method includes training a student ML model based on performing a first and a second set of operations iteratively until a performance of the student ML model reaches a first predefined criterion and a second predefined criterion, respectively. (To be published with FIG. 3a)


Patent Information

Application #
Filing Date
02 January 2024
Publication Number
27/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

MASTERCARD INTERNATIONAL INCORPORATED
2000 Purchase Street, Purchase, NY 10577, United States of America

Inventors

1. Akshay Sethi
A-80, Meera Bagh, Paschim Vihar, New Delhi, Delhi 110087, India
2. Sanjay Kumar Patnala
Second floor, building number-15, J-8, DLF Phase-2, Gurgaon 122002, Haryana, India
3. Siddhartha Asthana
7/108 Malviya Nagar, New Delhi, Delhi 110017, India
4. Sonia Gupta
D-801, Emaar Emerald Estate, Gurgaon, Haryana 122001, India
5. Sumedh B G
#1616, 6th cross , 6th main, 2nd stage Vijayanagar, Mysore 570017, Karnataka, India

Specification

Description: METHODS AND SYSTEMS FOR TRAINING A STUDENT MACHINE LEARNING MODEL WITH GRAPH STRUCTURE INFORMATION

TECHNICAL FIELD
The present disclosure relates to artificial intelligence-based processing systems and, more particularly, to electronic methods and complex processing systems for training a student machine learning model with graph structure information.
BACKGROUND
In the Artificial Intelligence (AI) or Machine Learning (ML) domain, various datasets can often be converted to homogeneous graphs so that they can be analyzed to learn insights from the dataset, thus enabling a task to be performed based on these insights. The term ‘homogeneous graph’ refers to a versatile graph structure that represents relationships between a set of nodes of the same type. Conventionally, Graph Neural Networks (GNNs) are used to learn insights from homogeneous graphs. With recent developments in GNNs, the handling of the graph-structured data of homogeneous graphs has improved significantly. This improvement has translated into improved performance on a wide range of graph-based tasks or downstream tasks such as, but not limited to, recommendations on social networks, fraud detection in payment networks, disease diagnosis using patient similarity graphs in the medical sector, and so on. This performance of GNNs stems from an architecture that relies on the communication and integration of node representations from their immediate and extended surroundings within the homogeneous graphs.
Despite these advantages of GNNs, their use in broad, real-world contexts is still limited due to various technical challenges. Such technical challenges arise from the extensive computing requirements of GNNs while processing large graphs, which makes their operation time-consuming as well. The time-consuming nature of GNNs renders them ineffective for tasks where swift responses are required. The discrepancy between the theoretical capabilities of GNNs and their practical applicability in industry arises mainly from the inherent challenges in scaling and implementing these networks, which is a consequence of their dependency on data. The primary factor contributing to the latency experienced with GNNs during inference time is the requirement to access the features of a concerned node and the structural properties of the nodes in the neighborhood of the concerned node. In particular, to predict an output corresponding to a concerned node, node features from the adjacent nodes are accessed and aggregated by the GNN, more often in a weighted manner. This implicit dependence of GNNs on graph topology during inference time increases their latency. For example, GNNs cannot be applied to perform fraud detection for ongoing payment transactions because of the inherent latency arising from the excessive time consumed while processing the transaction graphs.
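To illustrate this dependency, the following minimal NumPy sketch (a toy illustration, not the claimed method; the data and the simple mean aggregation are hypothetical) shows that predicting for a single node forces a graph lookup and a fetch of every neighbor's features at inference time:

```python
import numpy as np

def gnn_aggregate(node, features, adj, weight):
    """One message-passing step for a single node: fetch and
    average neighbor features, then apply a linear transform."""
    neighbors = np.nonzero(adj[node])[0]   # graph lookup required at inference time
    gathered = features[neighbors]         # neighbor features must be fetched
    aggregated = gathered.mean(axis=0)     # simple (unweighted) mean aggregation
    return aggregated @ weight             # transform into the output space

# Toy graph: 4 nodes with 3 features each; node 0 is connected to nodes 1 and 2.
features = np.array([[1., 0., 0.],
                     [0., 1., 0.],
                     [0., 0., 1.],
                     [1., 1., 1.]])
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]])
weight = np.eye(3)
out = gnn_aggregate(0, features, adj, weight)  # output depends on nodes 1 and 2
```

The prediction for node 0 cannot be computed without the features of nodes 1 and 2, which is precisely the graph dependency that a standalone student model avoids.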
Further, GNNs face scalability constraints as well since the computational resources required for processing a graph increase with the size of the graph. Therefore, for tasks where swift responses are desired, it is common to use simpler classification models such as a Multi-Layer Perceptron (MLP) despite their subpar performance with graph structured data and their focus on just the node’s content.
To address these problems, various conventional approaches have been developed. One such conventional approach aims to solve the scalability constraint of GNNs by distilling the knowledge from trained GNNs to MLPs. In other words, knowledge is distilled from a teacher GNN model to a student MLP model to solve the scalability constraint. However, this approach still requires the graph structure representation during the inference process, which in turn significantly increases the inference latency. Although there exist strategies to sparsify or simplify the graph structure and enhance the speed of GNNs, such as minimizing computations like multiplication and accumulation through pruning and quantization, the inherent graph dependency persists. In other words, the main hurdle that remains unaddressed is the interdependence of data for learning GNNs, which significantly limits the potential for further speed improvement.
Thus, there exists a technological need for technical solutions for learning or training a student ML model that does not rely on graph dependencies during inference or learning while still being capable of effectively capturing the topological information or structural information of the homogeneous graph.
SUMMARY
There exists a need for techniques to overcome one or more limitations stated above such as scalability constraints, high latency, intensive computational requirements, and so on. Various embodiments of the present disclosure provide methods and systems for training a student ML model that does not rely on graph dependencies during inference or learning while still being capable of effectively capturing the topological information or structural information of a homogeneous graph.
To achieve the above and other objectives of the present disclosure, in one embodiment, a computer-implemented method for training a student machine learning model with graph structure information is disclosed.
The computer-implemented method performed by a server system includes accessing a homogeneous graph from a database associated with the server system. The homogeneous graph includes a set of nodes. The set of nodes includes a set of labeled nodes and a set of unlabeled nodes. Herein, a set of edges exists between the set of labeled nodes and the set of unlabeled nodes such that each node of the set of nodes is associated with a plurality of features. The computer-implemented method further includes generating, by a teacher ML model associated with the server system, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node. The computer-implemented method further includes generating an adjacency matrix based, at least in part, on the set of nodes. The adjacency matrix indicates a global structure of the set of nodes in the homogeneous graph. The computer-implemented method further includes extracting a local adjacency matrix from the adjacency matrix such that the local adjacency matrix indicates a local structure of the set of labeled nodes in the homogeneous graph. The computer-implemented method further includes training a student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion and a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion. The first set of operations includes initializing the student ML model based, at least in part, on a set of model parameters. The first set of operations further includes determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix.
The first set of operations includes performing, for each labeled node of the set of labeled nodes: (1) generating a set of labeled node features based, at least in part, on masking a label of each labeled node, (2) generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding, and (4) determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label. The first set of operations further includes updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.
In another embodiment, a server system is disclosed. The server system includes a communication interface and a memory including executable instructions. The server system also includes a processor communicably coupled to the memory. The processor is configured to execute the instructions to cause the server system, at least in part, to access a homogeneous graph from a database associated with the server system. The homogeneous graph includes a set of nodes. The set of nodes includes a set of labeled nodes and a set of unlabeled nodes. Herein, a set of edges exists between the set of labeled nodes and the set of unlabeled nodes such that each node of the set of nodes is associated with a plurality of features. The server system is further caused to generate, by a teacher ML model associated with the server system, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node. The server system is further caused to generate an adjacency matrix based, at least in part, on the set of nodes. The adjacency matrix indicates a global structure of the set of nodes in the homogeneous graph. The server system is further caused to extract a local adjacency matrix from the adjacency matrix such that the local adjacency matrix indicates a local structure of the set of labeled nodes in the homogeneous graph. The server system is further caused to train a student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion and a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion. The first set of operations includes initializing the student ML model based, at least in part, on a set of model parameters.
The first set of operations further includes determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix. The first set of operations includes performing, for each labeled node of the set of labeled nodes: (1) generating a set of labeled node features based, at least in part, on masking a label of each labeled node, (2) generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding, and (4) determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label. The first set of operations further includes updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.
In yet another embodiment, a non-transitory computer-readable storage medium is disclosed. The non-transitory computer-readable storage medium includes computer-executable instructions that, when executed by at least a processor of a server system, cause the server system to perform a method. The method includes accessing a homogeneous graph from a database associated with the server system. The homogeneous graph includes a set of nodes. The set of nodes includes a set of labeled nodes and a set of unlabeled nodes. Herein, a set of edges exists between the set of labeled nodes and the set of unlabeled nodes such that each node of the set of nodes is associated with a plurality of features. The method further includes generating, by a teacher ML model associated with the server system, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node. The method further includes generating an adjacency matrix based, at least in part, on the set of nodes. The adjacency matrix indicates a global structure of the set of nodes in the homogeneous graph. The method further includes extracting a local adjacency matrix from the adjacency matrix such that the local adjacency matrix indicates a local structure of the set of labeled nodes in the homogeneous graph. The method further includes training a student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion and a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion. The first set of operations includes initializing the student ML model based, at least in part, on a set of model parameters. The first set of operations further includes determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix.
The first set of operations includes performing, for each labeled node of the set of labeled nodes: (1) generating a set of labeled node features based, at least in part, on masking a label of each labeled node, (2) generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding, and (4) determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label. The first set of operations further includes updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

BRIEF DESCRIPTION OF THE FIGURES
For a more complete understanding of example embodiments of the present technology, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1 illustrates a representation of an environment related to at least some example embodiments of the present disclosure;
FIG. 2 illustrates a simplified block diagram of a server system, in accordance with an embodiment of the present disclosure;
FIGS. 3A, 3B, and 3C, collectively, illustrate an architecture for training and deploying a student ML model, in accordance with an embodiment of the present disclosure;
FIGS. 4A, 4B, and 4C, collectively, illustrate experimental results of various experiments performed, in accordance with one or more embodiments of the present disclosure;
FIG. 5 illustrates a process flow diagram depicting a method for training a teacher ML model, in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a process flow diagram depicting a method for training the student ML model, in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a process flow diagram depicting a method for performing a first set of operations, in accordance with an embodiment of the present disclosure;
FIGS. 8A and 8B, collectively, illustrate a process flow diagram depicting a method for performing a second set of operations, in accordance with an embodiment of the present disclosure; and
FIG. 9 illustrates a simplified block diagram of a payment server, in accordance with an embodiment of the present disclosure.
The drawings referred to in this description are not to be understood as being drawn to scale except if specifically noted, and such drawings are only exemplary in nature.

DETAILED DESCRIPTION
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in an embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present disclosure. Similarly, although many of the features of the present disclosure are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present disclosure is set forth without any loss of generality to, and without imposing limitations upon, the present disclosure.
Embodiments of the present disclosure may be embodied as an apparatus, a system, a method, or a computer program product. Accordingly, embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit”, “engine”, “module”, or “system”. Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable storage media having computer-readable program code embodied thereon.
The terms “account holder”, “user”, “cardholder”, “consumer”, and “buyer” are used interchangeably throughout the description and refer to a person who has a payment account or a payment card (e.g., credit card, debit card, etc.) associated with the payment account, that will be used by them at a merchant to perform a payment transaction. The payment account may be opened via an issuing bank or an issuer server.
The term “merchant”, used throughout the description generally refers to a seller, a retailer, a purchase location, an organization, or any other entity that is in the business of selling goods or providing services, and it can refer to either a single business location or a chain of business locations of the same entity.
The terms “payment network” and “card network” are used interchangeably throughout the description and refer to a network or collection of systems used for the transfer of funds through the use of cash substitutes. Payment networks may use a variety of protocols and procedures in order to process the transfer of money for various types of transactions. Payment networks are companies that connect an issuing bank with an acquiring bank to facilitate online payment. Transactions that may be performed via a payment network may include product or service purchases, credit purchases, debit transactions, fund transfers, account withdrawals, etc. Payment networks may be configured to perform transactions via cash substitutes that may include payment cards, letters of credit, checks, financial accounts, etc. Examples of networks or systems configured to perform or function as payment networks include those operated by Mastercard®. (Mastercard is a registered trademark of Mastercard International Incorporated located in Purchase, N.Y.).
The term “payment card”, used throughout the description, refers to a physical or virtual card linked with a financial or payment account that may be presented to a merchant or any such facility to fund a financial transaction via the associated payment account. Examples of the payment card include, but are not limited to, debit cards, credit cards, prepaid cards, virtual payment numbers, virtual card numbers, forex cards, charge cards, e-wallet cards, and stored-value cards. A payment card may be a physical card that may be presented to the merchant for funding the payment. Alternatively, or additionally, the payment card may be embodied in the form of data stored in a user device, where the data is associated with a payment account such that the data can be used to process the financial transaction between the payment account and a merchant's financial account.
The term “payment account”, used throughout the description refers to a financial account that is used to fund a financial transaction. Examples of the financial account include, but are not limited to, a savings account, a credit account, a checking account, and a virtual payment account. The financial account may be associated with an entity such as an individual person, a family, a commercial entity, a company, a corporation, a governmental entity, a non-profit organization, and the like. In some scenarios, the financial account may be a virtual or temporary payment account that can be mapped or linked to a primary financial account, such as those accounts managed by payment wallet service providers, and the like.
The terms “payment transaction”, “financial transaction”, “event”, and “transaction” are used interchangeably throughout the description and refer to a transaction or transfer of payment of a certain amount being initiated by the cardholder. More specifically, they refer to electronic financial transactions including, for example, online payment, payment at a terminal (e.g., Point Of Sale (POS) terminal), and the like. Generally, a payment transaction is performed between two entities, such as a buyer and a seller. It is to be noted that a payment transaction is followed by a payment transfer of a transaction amount (i.e., monetary value) from one entity (e.g., issuing bank associated with the buyer) to another entity (e.g., acquiring bank associated with the seller), in exchange of any goods or services.
OVERVIEW
Various embodiments of the present disclosure provide methods, systems, user devices, and computer program products for training a student ML model with graph structure information.
In an embodiment, a server system is configured to train the student ML model with graph structure information. The server system is configured to access a homogeneous graph from a database associated with the server system. In an example, the homogeneous graph may include a set of nodes. The set of nodes may include a set of labeled nodes and a set of unlabeled nodes such that a set of edges exists between the set of labeled nodes and the set of unlabeled nodes. Each node of the set of nodes is associated with a plurality of features. In particular, accessing the homogeneous graph may further cause the server system to access an entity-related dataset from the database. In one example, the entity-related dataset includes information related to a plurality of entities. Further, the server system is configured to generate a set of features corresponding to each entity of the plurality of entities based, at least in part, on the information related to the plurality of entities. Furthermore, the server system is configured to generate the homogeneous graph based, at least in part, on the set of features for each entity such that each particular node of the homogeneous graph corresponds to a particular entity of the plurality of entities. In a non-limiting implementation, the plurality of entities may be at least one of a plurality of cardholders, a plurality of merchants, a plurality of issuers, and a plurality of acquirers.
In another embodiment, the server system is configured to train a teacher ML model based, at least in part, on performing a third set of operations iteratively until a performance of the teacher ML model reaches a third predefined criterion. In an example, the third set of operations includes initializing the teacher ML model based, at least in part, on a set of teacher model parameters. Then, the third set of operations includes performing, for each labeled node of the set of labeled nodes: (1) generating a set of labeled node features based, at least in part, on masking the label of each labeled node, (2) generating, by the teacher ML model, a fourth embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the teacher ML model, a teacher label prediction based, at least in part, on the corresponding fourth embedding, (4) determining a teacher cross entropy loss based, at least in part, on the corresponding teacher label prediction and the corresponding label, and (5) updating the set of teacher model parameters based, at least in part, on backpropagating the teacher cross entropy loss of each labeled node.
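The third set of operations amounts to a standard supervised loop over the labeled nodes. A minimal NumPy sketch follows, assuming a hypothetical one-layer teacher trained with softmax cross entropy on toy data (the actual teacher is a GNN-based model, and the data, learning rate, and iteration count here are illustrative only):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy labeled nodes: 4 nodes, 3 features, 2 classes (hypothetical data).
X = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
y = np.array([0, 1, 0, 1])
W = np.zeros((3, 2))                   # set of teacher model parameters

for _ in range(200):                   # iterate until the criterion is met
    grad = np.zeros_like(W)
    for i in range(len(X)):            # for each labeled node
        p = softmax(X[i] @ W)          # teacher label prediction
        t = np.eye(2)[y[i]]            # one-hot ground-truth label
        grad += np.outer(X[i], p - t)  # gradient of the cross entropy loss
    W -= 0.1 * grad / len(X)           # backpropagate and update the parameters

preds = np.array([softmax(x @ W).argmax() for x in X])
```

On this separable toy data the loop converges to correct predictions for all four labeled nodes.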
In another embodiment, the server system is configured to generate, using the teacher ML model, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node. In an example, the teacher ML model is a Graph Neural Network (GNN) based ML model.
In another embodiment, the server system is configured to generate an adjacency matrix based, at least in part, on the set of nodes. The adjacency matrix indicates a global structure of the set of nodes in the homogeneous graph. Further, the server system extracts a local adjacency matrix from the adjacency matrix such that the local adjacency matrix indicates a local structure of the set of labeled nodes in the homogeneous graph.
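Assuming the labeled nodes occupy known row and column indices of the adjacency matrix (a hypothetical indexing for illustration), extracting the local adjacency matrix reduces to selecting the corresponding sub-matrix:

```python
import numpy as np

# Global adjacency matrix for 5 nodes; suppose nodes 0, 1, and 3 are labeled.
adj = np.array([[0, 1, 0, 1, 0],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 0, 1],
                [0, 0, 1, 1, 0]])
labeled = np.array([0, 1, 3])

# Local adjacency matrix: rows and columns restricted to the labeled nodes,
# capturing the local structure among labeled nodes only.
local_adj = adj[np.ix_(labeled, labeled)]
```

Here `np.ix_` selects the cross product of the labeled indices, so `local_adj[i, j]` records whether labeled nodes `labeled[i]` and `labeled[j]` are connected in the global graph.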
In another embodiment, the server system is configured to train the student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion and a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion. In an example, the student ML model is a classifier-based ML model.
In an example, the first set of operations may include initializing the student ML model based, at least in part, on a set of model parameters. Then, the first set of operations may include determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix. Then, the first set of operations may include performing, for each labeled node of the set of labeled nodes: (1) generating a set of labeled node features based, at least in part, on masking a label of each labeled node, (2) generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding, and (4) determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label. Then, the first set of operations may include updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.
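Steps (1) and (4) above can be sketched as follows, assuming a hypothetical feature layout in which the label occupies the trailing dimensions of the node's feature vector (the disclosure does not fix this layout; the numbers are illustrative):

```python
import numpy as np

def mask_label(features_with_label, label_dim):
    """Zero out the label portion of a node's feature vector so the
    model cannot see the ground-truth label during training."""
    masked = features_with_label.copy()
    masked[-label_dim:] = 0.0
    return masked

def cross_entropy(pred, label):
    """Cross entropy between a predicted distribution and a hard label."""
    return -np.log(pred[label])

# Hypothetical node: 3 content features followed by a 2-dim one-hot label.
node = np.array([0.2, 0.5, 0.1, 1.0, 0.0])
x = mask_label(node, label_dim=2)     # set of labeled node features
pred = np.array([0.8, 0.2])           # first label prediction (illustrative)
loss = cross_entropy(pred, label=0)   # first cross entropy loss
```

The masked vector `x` is what the student ML model embeds in step (2); the loss is what gets backpropagated together with the neighborhood contrastive loss.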
In an example, the second set of operations may include generating a global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features and the adjacency matrix. In particular, to generate the global embedding, the server system is configured to generate an initial global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features. Then, the server system is configured to train a neural network model to predict the initial global embedding for each node based, at least in part, on the local adjacency matrix. Further, the server system is configured to determine, via the neural network model, the global embedding for each node of the homogeneous graph.
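A minimal sketch of this step follows, with the neural network reduced to a single linear layer fit in closed form and neighbor-propagated features used as one illustrative choice of initial global embedding (the disclosure fixes neither of these choices; the graph is hypothetical):

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, 2 features each.
X = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.5]])
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)

# One illustrative choice of initial global embedding:
# degree-normalized neighbor-propagated features.
deg = adj.sum(axis=1, keepdims=True)
initial = (adj @ X) / deg

# "Neural network" reduced to a single linear layer, trained in closed form
# to predict the initial embedding from the raw node features.
W, *_ = np.linalg.lstsq(X, initial, rcond=None)
global_emb = X @ W  # global embedding for each node of the graph
```

Because the trained predictor maps raw features to structure-aware embeddings, it can produce a global embedding for a node without consulting the graph at inference time.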
Then, the second set of operations may include re-initializing the student ML model based, at least in part, on the set of model parameters. Then, the second set of operations may include performing for each labeled node of the set of labeled nodes: (1) generating the set of labeled node features based, at least in part, on masking the label of each labeled node, (2) generating, by the student ML model, a second embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a second label prediction based, at least in part, on the corresponding second embedding and the corresponding global embedding, (4) determining a second cross entropy loss based, at least in part, on the corresponding second label prediction and the corresponding label.
Then, the second set of operations may include performing for each unlabeled node of the set of unlabeled nodes: (1) generating a set of unlabeled features based, at least in part, on masking the soft label of each unlabeled node, (2) generating, by the student ML model, a third embedding based, at least in part, on the corresponding set of unlabeled features, (3) determining, by the student ML model, a third label prediction based, at least in part, on the corresponding third embedding and the corresponding global embedding, and (4) determining a Kullback–Leibler (KL) divergence loss based, at least in part, on the corresponding third label prediction and the corresponding soft label. Then, the second set of operations may include updating the set of model parameters based, at least in part, on backpropagating the second cross entropy loss of each labeled node and the KL divergence loss for each unlabeled node.
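The KL divergence loss in step (4) measures how far the student's predicted distribution is from the teacher's soft label; a minimal sketch with hypothetical distributions:

```python
import numpy as np

def kl_divergence(soft_label, prediction, eps=1e-12):
    """KL(soft_label || prediction): divergence of the student's
    predicted distribution from the teacher's soft label."""
    p = np.clip(soft_label, eps, 1.0)
    q = np.clip(prediction, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

soft = np.array([0.7, 0.3])   # teacher's soft label for an unlabeled node
pred = np.array([0.6, 0.4])   # student's third label prediction
loss = kl_divergence(soft, pred)
```

The loss is zero only when the two distributions match, so backpropagating it pulls the student's predictions on unlabeled nodes toward the teacher's soft labels, which is the distillation step.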
In another embodiment, the server system is configured to receive a classification request for an entity associated with an individual node from the set of nodes for a classification task. In response, the server system is configured to generate, by the student ML model, a classification prediction for the individual node based, at least in part, on the corresponding set of features of the individual node.
The various embodiments of the present disclosure provide multiple advantages and technical effects while addressing technical problems such as how to train a student ML model with the structure information or topological information of a graph while reducing the latency and computational resources required during model deployment for classification tasks.
To that end, the various embodiments of the present disclosure provide an approach for training the student ML model with graph structure information. As described herein, the server system is configured to train the student ML model with local structural information through a neighborhood contrastive learning loss. The present approach deliberately and adaptively blends information found in the graph topology into the student ML model. The student ML model is pre-trained with structural information along with the Knowledge Distillation (KD), referred to as structure induction with KD. Through experiments (described later), it is observed that inference with the student ML model is approximately 200 times faster than with conventional GNNs, without sacrificing much accuracy on average. Further, the student ML model has enhanced accuracy by approximately 11% compared to standalone conventional MLPs while outperforming the conventional Graph-less Neural Networks (GLNNs) by approximately 1.47% on average. Further, training the student ML model with the global embedding allows the model to learn from the global structure of the graph as well.
It is understood that the proposed student ML model does not rely on graph dependencies during inference yet is still capable of effectively capturing the topological information of the graph. This is achieved by pre-training the student ML model to induce structural information before the actual distillation process. As described in detail later, a contrastive learning-based methodology is employed for this purpose. The student ML model is pre-trained on the Neighborhood Contrastive loss (NC loss). The essence of using this NC loss is that each node’s k-hop neighbors are viewed as positive instances, while all other nodes are deemed negative instances. This aspect encourages the positive instances to be nearer to the intended node in the embedding space, while at the same time pushing the negative instances to be more distant.
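The NC loss described above can be sketched for a single anchor node as follows. This is an InfoNCE-style formulation under stated assumptions: the embeddings, the temperature `tau`, and the helper name `nc_loss` are illustrative and not prescribed by the disclosure.

```python
import numpy as np

def nc_loss(embeddings, anchor, positives, tau=0.5):
    """Sketch of a Neighborhood Contrastive loss for one anchor node.
    Nodes in `positives` (the anchor's k-hop neighbors) are pulled toward
    the anchor; all remaining nodes act as negative instances."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z[anchor]          # cosine similarity to the anchor
    sims[anchor] = -np.inf        # exclude the anchor itself
    weights = np.exp(sims / tau)  # exp(-inf) = 0, so the anchor drops out
    return float(-np.log(weights[positives].sum() / weights.sum()))

# illustrative 2-D embeddings for four nodes; node 1 lies close to the
# anchor (node 0), while node 2 points the opposite way
emb = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
near_loss = nc_loss(emb, anchor=0, positives=[1])  # neighbor as positive
far_loss = nc_loss(emb, anchor=0, positives=[2])   # distant node as positive
```

Minimizing this loss over all anchors drives neighbor embeddings together, which is why `near_loss` is already smaller than `far_loss` in this toy configuration.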
To conclude, the student ML model does not require the graph structure at inference time. Further, the structural pre-training of the student ML model induces structural information into the student ML model. Furthermore, the proposed student ML model performs on par with GNNs and simultaneously is approximately 200 times faster in inference. Furthermore, the student ML model outperforms GLNN and MLP by a measurable margin.
Various embodiments of the present disclosure are described hereinafter with reference to FIGS. 1 to 9.
FIG. 1 illustrates an exemplary representation of an environment 100 related to at least some embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, training a student ML model, and the like.
The environment 100 generally includes a plurality of components such as a server system 102, a payment network 112 including a payment server 114, a plurality of entities such as a plurality of cardholders 104(1), 104(2), …, 104(N) (collectively, referred to as the ‘plurality of cardholders 104’ and ‘N’ is a Natural number), a plurality of merchants 106(1), 106(2), …, 106(N) (collectively, referred to as the ‘plurality of merchants 106’ and ‘N’ is a Natural number), a plurality of acquirers 108(1), 108(2), …, 108(N) (collectively, referred to as the ‘plurality of acquirers 108’ and ‘N’ is a Natural number), and a plurality of issuers 110(1), 110(2), …, 110(N) (collectively, referred to as the ‘plurality of issuers 110’ and ‘N’ is a Natural number), each coupled to, and in communication with (and/or with access to) a network 116.
It is noted that these entities may be classified or segregated based, at least in part, on their relationship in the payment network 112 or the network 116 with each other in the payment ecosystem. The network 116 may include, without limitation, a Light Fidelity (Li-Fi) network, a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an Infrared (IR) network, a Radio Frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts or users illustrated in FIG. 1, or any combination thereof.
Various entities in the environment 100 may connect to the network 116 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, future communication protocols or any combination thereof. For example, the network 116 may include multiple different networks, such as a private network made accessible by the server system 102 and a public network (e.g., the Internet, etc.) through which the server system 102, the plurality of acquirers 108 (also referred to as acquirer servers 108), the plurality of issuers 110 (also referred to as issuer servers 110), and the payment server 114 may communicate.
In an embodiment, the plurality of cardholders 104 (or cardholders 104) uses one or more payment cards 118(1), 118(2), … 118(N) (collectively, referred to hereinafter as the ‘plurality of payment cards 118’ and ‘N’ is a Natural number) respectively to make payment transactions at the plurality of merchants 106 (or merchants 106). A cardholder (e.g., the cardholder 104(1)) may be any individual, representative of a corporate entity, a non-profit organization, or any other person that is presenting payment account details during an electronic payment transaction with a merchant (e.g., the merchant 106(1)). The cardholder (e.g., the cardholder 104(1)) may have a payment account issued by an issuing bank (not shown in figures) associated with an issuer server (e.g., issuer server 110(1)) from the plurality of the issuer servers 110 (explained later) and may be provided a payment card (e.g., the payment card 118(1)) with financial or other account information encoded onto the payment card (e.g., the payment card 118(1)) such that the cardholder (i.e., the cardholder 104(1)) may use the payment card 118(1) to initiate and complete a payment transaction using a bank account at the issuing bank.
In an example, the cardholders 104 may use their corresponding electronic devices (not shown in figures) to access a mobile application or a website associated with the issuing bank, or any third-party payment application. In various non-limiting examples, electronic devices may refer to any electronic devices such as, but not limited to, Personal Computers (PCs), tablet devices, Personal Digital Assistants (PDAs), voice-activated assistants, Virtual Reality (VR) devices, smartphones, and laptops.
In an embodiment, the merchants 106 may include retail shops, restaurants, supermarkets or establishments, government and/or private agencies, or any such places equipped with Point of Sale (POS) terminals, where individuals such as the cardholders 104 visit for performing financial transactions in exchange for any goods and/or services.
In one scenario, the cardholders 104 may use their corresponding payment accounts or payment cards (e.g., the plurality of payment cards 118 (or payment cards 118)) to conduct payment transactions with the merchants 106. Moreover, it may be noted that each of the cardholders 104 may use their corresponding payment card from the payment cards 118 differently or make the payment transaction using different means of payment. For instance, the cardholder 104(1) may enter payment account details on an electronic device (not shown) associated with the cardholder 104(1) to perform an online payment transaction. In another example, the cardholder 104(2) may utilize the payment card 118(2) to perform an offline payment transaction. For example, the cardholder 104(3) may enter details of the payment card 118(3) to transfer funds in the form of fiat currency on an e-commerce platform to buy goods. In another instance, each cardholder (e.g., the cardholder 104(1)) of the cardholders 104 may transact at any merchant (e.g., the merchant 106(1)) from the merchants 106.
In one embodiment, the cardholders 104 are associated with the plurality of issuer servers 110 (or issuer servers 110). In one embodiment, an issuer server such as the issuer server 110(1) is associated with a financial institution normally called an “issuer bank”, “issuing bank” or simply “issuer”, in which a cardholder (e.g., the cardholder 104(1)) may have the payment account, and which also issues a payment card (such as a credit card or a debit card) and provides microfinance banking services (e.g., payment transactions using credit/debit cards) for processing electronic payment transactions to the cardholder (e.g., the cardholder 104(1)).
In an embodiment, the merchants 106 are associated with the plurality of acquirer servers 108. In an embodiment, each merchant (e.g., the merchant 106(1)) is associated with an acquirer server (e.g., the acquirer server 108(1)). In one embodiment, the acquirer server 108(1) is associated with a financial institution (e.g., a bank) that processes financial transactions for the merchant 106(1). This can be an institution that facilitates the processing of payment transactions for physical stores, merchants (e.g., the merchant 106(1)), or institutions that own platforms that make either online purchases or purchases made via software applications possible (e.g., shopping cart platform providers and in-app payment processing providers). The terms “acquirer”, “acquirer bank”, “acquiring bank”, or “acquirer server” will be used interchangeably herein.
In one embodiment, the payment network 112 may be used by the payment card issuing authorities as a payment interchange network. Examples of the plurality of payment cards 118 may include debit cards, credit cards, etc. Similarly, examples of payment interchange networks include but are not limited to, a Mastercard® payment system interchange network. The Mastercard® payment system interchange network is a proprietary communications standard promulgated by Mastercard International Incorporated® for the exchange of electronic payment transaction data between the plurality of issuers 110 and the plurality of acquirers 108 that are members of Mastercard International Incorporated®.
As explained earlier, conventional approaches have explored the possibility of integrating these two models to harness their combined strengths. An example of such a conventional approach is knowledge distillation frameworks like GLNN. GLNN operates by distilling the knowledge of a GNN trained on graph data to a student model. The distillation process is carried out with the help of the soft labels generated by the trained teacher GNN model. This eliminates the requirement of message passing at the inference time and allows the MLPs to be exclusively utilized for inference, with just the node feature content. Thus, reducing their latency. However, this conventional approach comes with a limitation of overreliance on node features. In particular, the student model relies only on the node feature information and fails to completely grasp the essence of the graph topology and node interconnections. In other words, MPNNs are limited by over-reliance on potentially limited node features and high inference latency caused by implicit topology use.
GNNs have been widely used for graph learning tasks due to their ability to capture complex graph structures. Further, convolutional networks have been extended to graphs through GNNs using Message-Passing Neural Networks (MPNNs) with the introduction of the Graph Convolutional Network (GCN). These conventional approaches can be further divided into Spatial and Spectral GCNs. Most of the conventional GNNs can be considered as MPNNs with architectural differences. For example, the Graph Attention Network (GAT) incorporates an attention mechanism to aggregate features from neighbors with different weights. GraphSAGE applies an efficient aggregation function to learn node features from the local neighborhood. DeepGCNs and GCNII utilize residual connections to aggregate neighbors from multiple hops and further address the over-smoothing problem. However, these MPNNs only extract the local structural information.
Many conventional approaches have tried to sparsify or simplify the graph structure to decrease the number of operations at the inference time. Still, the inherent dependency on the graph is not resolved. Graph neural networks like MPNNs rely on two main information sources, namely node features and graph topology. Node features are continuous and regular, making them easy to encode in neural networks. However, graph topology is discrete and irregular, so MPNNs only use it implicitly to propagate node features between neighbors. This implicit use of topology leads to two fundamental, coupled limitations of MPNNs. The first limitation is over-reliance on node features. Since topology only fetches neighbors’ features, if node features are uninformative, topology provides no valuable signal. MPNNs degrade severely in low-feature settings, hurting practicality in industrial applications where node features are limited for privacy, fairness, or availability reasons. The second limitation is high inference latency. It is noted that propagating features over graph edges requires many neighbor feature lookups and operations, slowing inference. This prevents deployment in latency-sensitive applications. This “neighbor explosion” issue worsens as model depth increases. These two key issues arise because MPNNs only implicitly leverage graph topology to propagate features. The goal is to address them by explicitly encoding topology information, but this is challenging because topology is discrete and irregular, unlike continuous node features. To conclude, the conventional MLPs and GNNs have their inherent strengths and weaknesses.
Further, some conventional approaches involve training less complex ‘student’ GNNs to emulate the performance of their more complex ‘teacher’ counterparts. However, this process often requires extensive message passing and multi-hop neighborhood extraction, leading to latency issues. Techniques such as Local Structure Preserving (LSP) distillation and small GNN (also referred to as TinyGNN) have been developed to circumvent this issue, but they still rely heavily on message passing. Some conventional approaches propose MLP-based student models, eliminating the need for message passing. However, these MLP models still require the graph representation at inference time, resulting in latency issues. Some other conventional approaches such as GLNN and NOise-robust Structure-aware MLPs On Graphs (NOSMOG) try to capture the structural information of the graph through positional encoding methods, thereby decreasing the graph dependency at inference time. There also exist conventional approaches like Simple Graph Convolution (SGC) which simplify the GNN architecture by removing the non-linearities in the network. Though these strategies enhance the inference speed of GNNs to some degree, they cannot completely eradicate the fundamental overhead caused by Message Passing (MP) due to the dependency on neighbor data. Thus, there is a need to effectively represent the discrete, irregular topology information of graphs in a student ML model or neural network.
The above-mentioned technical problem among other problems is addressed by one or more embodiments implemented by the server system 102 of the present disclosure. In one embodiment, the server system 102 is configured to perform one or more of the operations described herein.
In one embodiment, the environment 100 may further include a database 122 coupled with the server system 102. In an example, the server system 102 coupled with the database 122 is embodied within the payment server 114; however, in other examples, the server system 102 can be a standalone component (acting as a hub) connected to any of the acquirer servers 108 and any of the issuer servers 110. The database 122 may be incorporated in the server system 102, may be an individual entity connected to the server system 102, or may be a database stored in cloud storage. In one embodiment, the database 122 may store a student Machine Learning (ML) model 120, an entity-related dataset, and other necessary machine instructions required for implementing the various functionalities of the server system 102 such as firmware data, operating system, and the like. In a particular non-limiting instance, the server system 102 may locally store the student ML model 120 as well (as depicted in FIG. 1).
In an example, the entity-related dataset stored in the database 122 includes information related to a plurality of entities, and a relationship between each of the plurality of entities. For instance, in the financial domain, the entity-related dataset may be a historical transaction dataset. In this scenario, the relational dataset includes real-time transaction data of the cardholders 104 and the merchants 106. To that end, the transaction data may also be called merchant-cardholder interaction data as well. The transaction data may include but is not limited to, transaction attributes, such as transaction amount, source of funds such as bank or credit cards, transaction channel used for loading funds such as POS terminal or Automated Teller Machine (ATM), transaction velocity features such as count and transaction amount sent in the past ‘x’ number of days to a particular user, transaction location information, external data sources, merchant country, merchant Identifier (ID), cardholder ID, cardholder product, cardholder Permanent Account Number (PAN), Merchant Category Code (MCC), merchant location data or merchant co-ordinates, merchant industry, merchant super industry, ticket price, and other transaction-related data.
In another example, the student ML model 120 may be an AI or an ML based model that is configured or trained to perform a downstream task such as classification. In a non-limiting example, the student ML model 120 is a classifier-based ML model (or a differential classifier model). Various examples of classifier-based ML models include MLPs, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and so on. It is noted that the student ML model has been explained in detail later in the present disclosure with reference to FIGS. 3A to 3C. In addition, the database 122 provides a storage location for data and/or metadata obtained from various operations performed by the server system 102.
In an embodiment, the server system 102 is configured to train the student ML model 120 by inducing the local and global structure of a homogeneous graph along with knowledge distillation from a teacher ML model. The homogeneous graph can be created for the plurality of entities based on the features associated with each entity. Homogeneous graphs provide a detailed insight into the relationship between different entities of the same type within a network. For example, in the financial domain, the homogeneous graph may be generated for the cardholders 104 or the merchants 106. In this example, the homogeneous graph may be called a cardholder-cardholder and a merchant-merchant graph, respectively. Further, the nodes within such graphs will belong to the cardholders 104 and the merchants 106, respectively. In another example, the homogeneous graph may be generated for the acquirers 108 and the issuers 110 as well. In this example, the homogeneous graph may be called an acquirer-acquirer graph or an issuer-issuer graph, respectively.
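A cardholder-cardholder graph of the kind described above can be sketched as follows, using the shared-merchant relationship described elsewhere herein (two cardholders are linked if they have transacted at the same merchant). The identifiers and the function name are illustrative assumptions; the disclosure does not prescribe a specific construction routine.

```python
from collections import defaultdict
from itertools import combinations

def cardholder_graph(transactions):
    """Build a homogeneous cardholder-cardholder adjacency from
    (cardholder_id, merchant_id) transaction pairs: an edge links two
    cardholders who transacted at the same merchant."""
    by_merchant = defaultdict(set)
    for cardholder, merchant in transactions:
        by_merchant[merchant].add(cardholder)
    adjacency = defaultdict(set)
    for cardholders in by_merchant.values():
        # every pair of cardholders sharing this merchant gets an edge
        for u, v in combinations(sorted(cardholders), 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency

# illustrative transactions: cardholders c1..c3, merchants m1..m2
txns = [("c1", "m1"), ("c2", "m1"), ("c2", "m2"), ("c3", "m2")]
graph = cardholder_graph(txns)
# c1-c2 share m1; c2-c3 share m2; c1 and c3 are not directly linked
```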
The teacher ML model may be any GNN-based ML model that is configured to learn from the set of features associated with the set of nodes within a homogeneous graph. The training of the student ML model 120 performed by the server system 102 can be classified into two stages, namely, a structural information induction stage and a knowledge distillation stage. During the structural information induction stage, the student ML model 120 is pre-trained with the local structural information of the homogeneous graph through neighborhood contrastive learning. Contrastive learning is a powerful technique that involves the comparison and contrast of model data representations. This technique has considerably revolutionized the learning of visual representations. It plays an instrumental role in the development of methods such as the Simple framework for Contrastive Learning of visual Representations (SimCLR) and Contrastive Language-Image Pre-training (CLIP). Contrastive learning can be performed for various applications using self-supervised and supervised learning, significantly impacting natural language processing and graph learning. Recent advancements, particularly in unsupervised representation learning, showcase the potential of contrastive learning. Methods like SimCLR have proven to outperform traditional pre-training methods requiring labeled data. Efforts to understand contrastive learning’s success have emphasized geometric approaches and the role of augmentations in connecting latent subclasses.
During the knowledge distillation stage, the teacher ML model learns from a set of labeled nodes within the set of nodes to predict soft labels for a set of unlabeled nodes within the set of nodes. Further, the student ML model 120 is trained using the soft labels of the set of unlabeled nodes along with the set of labeled nodes. The act of learning using the soft labels generated by the teacher ML model allows the server system 102 to distill the knowledge from the teacher ML model to the student ML model, efficiently.
Upon successful completion of the training process, the student ML model 120 is deployed for inferencing or classification by the server system 102 for any concerned node. During deployment, it is understood that the student ML model 120 does not require the graph structure and can operate directly on the node features of the concerned node. Since the student ML model 120 does not need to process the graph structure during deployment, the inference time of the student ML model 120 is significantly reduced, along with its computational requirements. It is noted that the various aspects described with reference to FIG. 1 have been described further in detail with reference to FIG. 2 later on.
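At deployment time, the classification step described above reduces to a plain feed-forward pass over the concerned node's own features, with no adjacency lookups. A minimal sketch follows; the layer sizes (8-dimensional features, 16 hidden units, 3 classes), the random weights, and the helper name `classify` are illustrative assumptions rather than the trained parameters of the student ML model 120.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in weights for a two-layer MLP student; in practice these would
# be the trained parameters of the student ML model 120
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)

def classify(node_features):
    # inference uses only the node's own feature vector; no adjacency
    # matrix or neighbor feature lookups are needed at deployment time
    hidden = np.maximum(node_features @ W1 + b1, 0.0)  # ReLU layer
    logits = hidden @ W2 + b2
    return int(np.argmax(logits))

prediction = classify(rng.normal(size=8))  # predicted class index
```

Because the forward pass touches only one feature vector, its cost is independent of graph size, which is the basis of the latency advantage noted above.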
It should be understood that the server system 102 is a separate part of the environment 100, and may operate apart from (but still in communication with, for example, via the network 116) any third-party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 102 may be incorporated, in whole or in part, into one or more parts of the environment 100.
It is pertinent to note that the various embodiments of the present disclosure have been described herein with respect to examples from the financial domain; however, it should be noted that the various embodiments of the present disclosure can be applied to a wide variety of applications, and the same will be covered within the scope of the present disclosure. For instance, for recommender systems, the plurality of entities may be users or items. To that end, the various embodiments of the present disclosure apply to various applications as long as a dataset pertaining to the desired application can be represented in the form of a homogeneous graph.
The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. In addition, the server system 102 should be understood to be embodied in at least one computing device in communication with the network 116, which may be specifically configured, via executable instructions, to perform steps as described herein, and/or embodied in at least one non-transitory computer-readable media.
FIG. 2 illustrates a simplified block diagram of a server system 200, in accordance with an embodiment of the present disclosure. It is noted that the server system 200 is identical to the server system 102 of FIG. 1. In one embodiment, the server system 200 is a part of the payment network 112 or integrated within the payment server 114. In some embodiments, the server system 200 is embodied as a cloud-based and/or Software as a Service (SaaS) based architecture.
The server system 200 includes a computer system 202 and a database 204. It is noted that the database 204 is identical to the database 122 in FIG. 1. The computer system 202 includes at least one processor 206 (also referred to as the processor 206) for executing instructions, a memory 208, a communication interface 210, a user interface 212, and a storage interface 214 that communicate with each other via a bus 216.
In some embodiments, the database 204 is integrated within the computer system 202. For example, the computer system 202 may include one or more hard disk drives as the database 204. The user interface 212 is any component capable of providing an administrator (not shown) of the server system 200, the ability to interact with the server system 200. This user interface 212 may be a Graphical User Interface (GUI) or Human Machine Interface (HMI) that can be used by the administrator to configure the various operational parameters of the server system 200. The storage interface 214 is any component capable of providing the processor 206 with access to the database 204. The storage interface 214 may include, for example, an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a Redundant Array of Independent Disks (RAID) controller, a Storage Area Network (SAN) adapter, a network adapter, and/or any component providing the processor 206 with access to the database 204. In one non-limiting example, the database 204 is configured to store an entity-related dataset 228, a teacher ML model 230, a student ML model 232, and the like. It is noted that the student ML model 232 is identical to the student ML model 120 of FIG. 1.
The processor 206 includes suitable logic, circuitry, and/or interfaces to execute operations for training the student ML model 232 with graph topological/structure information using knowledge distillation, generating an inference (i.e., a classification) prediction using the student ML model 232, and the like. In other words, the processor 206 includes suitable logic, circuitry, and/or interfaces to execute operations for learning and operating various ML models. Examples of the processor 206 include but are not limited to, an Application-Specific Integrated Circuit (ASIC) processor, a Reduced Instruction Set Computing (RISC) processor, a Graphical Processing Unit (GPU), a Complex Instruction Set Computing (CISC) processor, a Field-Programmable Gate Array (FPGA), and the like.
The memory 208 includes suitable logic, circuitry, and/or interfaces to store a set of computer-readable instructions for performing various operations described herein. Examples of the memory 208 include a Random-Access Memory (RAM), a Read-Only Memory (ROM), a removable storage drive, a Hard Disk Drive (HDD), and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 208 in the server system 200, as described herein. In another embodiment, the memory 208 may be realized in the form of a database server or a cloud storage working in conjunction with the server system 200, without departing from the scope of the present disclosure.
The processor 206 is operatively coupled to the communication interface 210, such that the processor 206 is capable of communicating with a remote device (i.e., to/from a remote device 218) such as the plurality of issuer servers 110, the plurality of acquirer servers 108, the payment server 114, or communicating with any entity connected to the network 116 (as shown in FIG. 1).
It is noted that the server system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. Further, while the various operations of the present disclosure have been explained with reference to examples from the financial domain, it would be apparent to a person skilled in the art that the scope of these operations is not limited to the same and they can be used in various other industries as well. It is noted that the server system 200 may include fewer or more components than those depicted in FIG. 2.
In one implementation, the processor 206 includes a data pre-processing module 220, a graph generation module 222, a model training module 224, and a classification module 226. It should be noted that components, described herein, such as the data pre-processing module 220, the graph generation module 222, the model training module 224, and the classification module 226 can be configured in a variety of ways, including electronic circuitries, digital arithmetic, and logic blocks, and memory systems in combination with software, firmware, and embedded technologies.
In an embodiment, the data pre-processing module 220 includes suitable logic and/or interfaces for accessing an entity-related dataset 228 from the database 204. In various non-limiting examples, the entity-related dataset 228 may include information related to a plurality of entities, and a relationship between the plurality of entities. In various non-limiting examples, the plurality of entities may include the cardholders 104, the merchants 106, the issuer servers 110, and the acquirer servers 108 as depicted in FIG. 1. Further, the information related to these entities may include information related to a plurality of historical payment transactions performed by the plurality of cardholders 104 with the plurality of merchants 106. For instance, a relationship between two distinct cardholders may be defined if they have performed a payment transaction at the same merchant in the past, and vice versa. It is noted that this non-limiting example is specific to the financial industry or payment ecosystem; however, the various operations of the present disclosure are not limited to the same. To that end, the entity-related dataset 228 can be configured to include different information specific to any field of operation. Therefore, it is understood that the various embodiments of the present disclosure apply to a variety of different fields of operation and the same is covered within the scope of the present disclosure.
Returning to the previous example, the entity-related dataset 228 may include information related to a plurality of historical payment transactions performed within a predetermined interval of time (e.g., 6 months, 12 months, 24 months, etc.). In some other non-limiting examples, the entity-related dataset 228 includes information related to at least merchant name identifier, unique merchant identifier, timestamp information (i.e., transaction date/time), geo-location related data (i.e., latitude and longitude of the cardholder/merchant), MCC, merchant industry, merchant super industry, information related to payment instruments involved in the set of historical payment transactions, cardholder identifier, Permanent Account Number (PAN), merchant name, country code, transaction identifier, transaction amount, and the like.
In one example, the entity-related dataset 228 may define a relationship between each of the plurality of entities. In a non-limiting example, a relationship between different cardholder accounts may be defined by a transaction performed by them at the same merchant. For instance, when the cardholder 104(1) purchases an item from the merchant 106(1) and the cardholder 104(2) also purchases any item from the merchant 106(1), a relationship is said to be established between the cardholder 104(1) and the cardholder 104(2).
In another example, the entity-related dataset 228 may include information related to past payment transactions such as transaction date, transaction time, geo-location of a transaction, transaction amount, transaction label (e.g., fraudulent or non-fraudulent), and the like. In yet another embodiment, the entity-related dataset 228 may include information related to the acquirer servers 108 such as the date of merchant registration with the acquirer server (such as the acquirer server 108(1)), the amount of payment transactions performed at the acquirer server 108(1) in a day, the number of payment transactions performed at the acquirer server 108(1) in a day, maximum transaction amount, minimum transaction amount, number of fraudulent merchants or non-fraudulent merchants registered with the acquirer server 108(1), and the like.
In addition, the data pre-processing module 220 is configured to generate a set of features corresponding to each entity of the plurality of entities based, at least in part, on the information related to the plurality of entities present in the entity-related dataset 228. In various non-limiting examples, the data pre-processing module 220 may utilize any feature generation approach to generate the set of features. It is understood that such feature generation techniques are already well known in the art, therefore the same are not explained here for the sake of brevity.
In another embodiment, the data pre-processing module 220 is communicably coupled to the graph generation module 222 and is configured to transmit the set of features to the graph generation module 222.
In an embodiment, the graph generation module 222 includes suitable logic and/or interfaces for generating a homogeneous graph based, at least in part, on the set of features for each entity. The homogenous graph includes a set of nodes such that each node represents a distinct entity from the plurality of entities. Also, each node of the set of nodes is associated with a plurality of features of the corresponding entity which it represents. Additionally, a set of edges exists between the set of nodes, i.e., the nodes of the homogeneous graph are connected to each other via edges. Each edge of the set of edges may indicate information related to a relationship between two distinct nodes connected by it. In one example, the set of nodes is classified into a set of labeled nodes and a set of unlabeled nodes. As may be understood, the information present in the entity-related dataset 228 for each entity may not always have a label assigned to it. In scenarios where features are generated for an entity that is unlabeled, the node generated for the said entity will also be unlabeled. For example, in the financial domain, the homogeneous graph may be generated for the cardholders 104. In this example, the homogeneous graph may be called a cardholder-cardholder graph. Further, the set of nodes may be the plurality of cardholders 104 and the set of edges may represent common merchants at which distinct cardholders performed transactions.
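As a toy illustration of this merchant-mediated relationship (the function and variable names below are hypothetical and not part of the disclosed system), an edge list for a cardholder-cardholder graph may be derived from (cardholder, merchant) transaction pairs as follows:

```python
from collections import defaultdict
from itertools import combinations

def build_cardholder_graph(transactions):
    """Build a cardholder-cardholder edge set: two cardholders are
    linked if they both transacted at the same merchant."""
    by_merchant = defaultdict(set)
    for cardholder, merchant in transactions:
        by_merchant[merchant].add(cardholder)
    edges = set()
    for cardholders in by_merchant.values():
        for u, v in combinations(sorted(cardholders), 2):
            edges.add((u, v))
    return edges

# c1 and c2 both shop at m1; c3 shops alone at m2.
txns = [("c1", "m1"), ("c2", "m1"), ("c3", "m2")]
edges_out = build_cardholder_graph(txns)   # {('c1', 'c2')}
```

In this sketch an edge is undirected and unweighted; the disclosure leaves room for edges to carry richer relationship information (e.g., counts of common merchants).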
It is noted that a representation of a homogeneous graph (see, 302) has been explained further in detail later in the present disclosure with reference to FIG. 3A. Upon generation of the homogenous graph, the homogeneous graph may be stored in the database 204 associated with the server system 200. It is noted that when the server system 200 has to perform any operation related to the graph, the server system 200 may access the homogeneous graph from the database 204. In a situation where the graph is not available, the server system 200 may generate the homogeneous graph based on the process described earlier.
In another embodiment, the graph generation module 222 is communicably coupled to the model training module 224 and is configured to transmit the homogeneous graph to the model training module 224.
In an embodiment, the model training module 224 includes suitable logic and/or interfaces for training the teacher ML model 230. To train the teacher ML model 230, the model training module 224 is configured to perform a set of operations (also referred to as a third set of operations) iteratively till a performance of the teacher ML model 230 reaches a predefined criterion (also referred to as a third predefined criterion). It is noted that the third predefined criterion may refer to a point in the iterative process where the value of losses associated with the teacher ML model 230 either minimizes or saturates (i.e., stops or effectively ceases to decrease with successive iterations). Alternatively, the third predefined criterion may refer to a point in the iterative process where the improvement in performance of the teacher ML model 230 saturates or effectively ceases to improve with successive iterations, or reaches a desired level. It is understood that the learning operation performed for the teacher ML model 230 is an example of supervised learning. In an example, the teacher ML model 230 may be a Graph Neural Network (GNN) based ML model such as a GraphSAGE model, a GLNN model, and so on.
The third set of operations includes initializing the teacher ML model based, at least in part, on a set of teacher model parameters. In various non-limiting examples, the set of teacher model parameters may define the various aspects related to the various neural network layers of the teacher ML model 230, such as a set of shared layers and a set of classification layers of the teacher ML model 230, i.e., a number of layers, a number of hidden dimensions, learning rate, weights of different layers, weight decay, normalization factor, fan out, and the like. The third set of operations further includes performing, for each labeled node of the set of labeled nodes, a set of steps. The steps include (1) generating the set of labeled node features based, at least in part, on masking the label of each labeled node, (2) generating, by the teacher ML model 230, a fourth embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the teacher ML model 230, a teacher label prediction based, at least in part, on the corresponding fourth embedding, and (4) determining a teacher cross entropy loss based, at least in part, on the corresponding teacher label prediction and the corresponding label. It is understood that the cross entropy loss enables the teacher ML model 230 to learn from the ground-truth labels. The third set of operations further includes updating the set of teacher ML model parameters based, at least in part, on backpropagating the teacher cross entropy loss of each labeled node. The backpropagation can be performed using various known optimization algorithms that help to update the set of teacher model parameters. As may be understood, as an iteration is performed and the process moves to a successive iteration, the teacher ML model 230 is reinitialized with the updated set of teacher model parameters. In other words, for each iteration the set of teacher model parameters is updated.
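The third set of operations amounts to a standard supervised training loop: predict, measure cross entropy against the label, backpropagate, update. A minimal self-contained sketch follows, using a plain softmax classifier on toy data as a stand-in for the teacher GNN (all dimensions, data, and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy labeled nodes: N nodes, d features, K classes.
N, d, K = 20, 8, 3
X_L = rng.normal(size=(N, d))              # labeled node features
Y_L = np.eye(K)[rng.integers(0, K, N)]     # one-hot labels

W = rng.normal(scale=0.1, size=(d, K))     # teacher model parameters
lr, losses = 0.2, []
for _ in range(300):                       # iterate until the loss saturates
    P = softmax(X_L @ W)                   # teacher label predictions
    losses.append(-np.mean(np.sum(Y_L * np.log(P + 1e-12), axis=1)))
    W -= lr * X_L.T @ (P - Y_L) / N        # backpropagate the CE loss
```

The stopping condition here is a fixed iteration count for brevity; the disclosure's third predefined criterion (loss saturation or a desired performance level) would replace it in practice.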
Further, the teacher ML model 230 may be used by the model training module 224 to generate a soft label for each unlabeled node of the plurality of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node. As may be understood, since the teacher ML model 230 has learned from the labeled nodes, it is capable of inferring a label for any unlabeled node in the homogenous graph.
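A soft label is typically the class-probability output of the trained teacher for an unlabeled node. A minimal sketch (the temperature value is an assumption for illustration, not a parameter specified in this disclosure):

```python
import numpy as np

def soft_labels(logits, T=2.0):
    """Class-probability soft labels from teacher logits via a
    temperature-scaled softmax (T is an assumed hyperparameter)."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

probs = soft_labels([[2.0, 0.5, -1.0]])    # one unlabeled node's logits
```

Each row of `probs` is a distribution over the K classes; these distributions are what the student later matches via the KL-divergence loss.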
In another embodiment, the model training module 224 is configured to train the student ML model 232. The training process of the student ML model 232 is divided into a pre-training stage and a training stage. During the pre-training stage, the student ML model 232 learns local structure information of the homogeneous graph. On the other hand, during the training stage, the student ML model 232 learns from the knowledge of the teacher ML model 230. To determine the structure information of the graph, at first, the model training module 224 generates an adjacency matrix based, at least in part, on the set of nodes. The adjacency matrix is a matrix that stores a global structure of the set of nodes in the homogeneous graph. Then, the model training module 224 extracts a local adjacency matrix from the adjacency matrix. The local adjacency matrix can be defined as a matrix that describes a local structure of the set of labeled nodes in the homogeneous graph.
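The adjacency matrix and the extraction of the local adjacency matrix for the labeled nodes can be illustrated on a toy graph (node indices and the labeled subset are hypothetical):

```python
import numpy as np

# Toy homogeneous graph with 5 nodes; edges as (u, v) pairs.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5
A = np.zeros((N, N), dtype=int)            # global adjacency matrix
for u, v in edges:
    A[u, v] = A[v, u] = 1                  # undirected, unit edge weights

labeled = [0, 1, 3]                        # indices of the labeled nodes
A_local = A[np.ix_(labeled, labeled)]      # local adjacency matrix
```

`A` captures the global structure of the whole node set, while `A_local` keeps only the rows and columns of the labeled nodes, i.e., their connections among themselves.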
Further, the model training module 224 is configured to pre-train the student ML model 232 by performing a set of operations (also referred to as a first set of operations) iteratively till a performance of the student ML model 232 reaches a first predefined criterion. It is noted that the first predefined criterion may refer to a point in the iterative process where the value of losses associated with the student ML model 232 either minimizes or saturates (i.e., stops or effectively ceases to decrease with successive iterations). Alternatively, the first predefined criterion may refer to a point in the iterative process where the improvement in performance of the student ML model 232 during the pre-training stage either saturates or effectively ceases to improve with successive iterations, or reaches a desired level. It is understood that the learning operation performed for the student ML model 232 during the pre-training stage is an example of supervised learning. In an example, the student ML model 232 may be a classifier-based ML model such as an MLP model, an LSTM model, and so on.
The first set of operations includes initializing the student ML model 232 based, at least in part, on a set of model parameters. In various non-limiting examples, the set of model parameters may define the various aspects related to the various neural network layers of the student ML model 232, such as a set of shared layers and a set of classification layers of the student ML model 232, i.e., a number of layers, a number of hidden dimensions, learning rate, weights of different layers, weight decay, normalization factor, fan out, and the like. Further, the first set of operations includes determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix. Further, the first set of operations includes performing, for each labeled node of the set of labeled nodes, a set of steps. The set of steps includes (1) generating a set of labeled node features based, at least in part, on masking a label of each labeled node, (2) generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding, and (4) determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label. Further, the first set of operations includes updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node. As may be understood, as an iteration is performed and the process moves to a successive iteration, the student ML model 232 is reinitialized with the updated set of model parameters. In other words, for each iteration the set of model parameters is updated.
As may be understood, since the student ML model 232 learns from the neighborhood contrastive loss and the first cross entropy loss of each labeled node, it can capture the structure information of the homogenous graph.
Further, the model training module 224 is configured to train the student ML model 232 by performing another set of operations (also referred to as a second set of operations) iteratively till a performance of the student ML model 232 reaches a second predefined criterion. It is noted that the second predefined criterion may refer to a point in the iterative process where the value of losses associated with the student ML model 232 either minimizes or saturates (i.e., stops or effectively ceases to decrease with successive iterations). Alternatively, the second predefined criterion may refer to a point in the iterative process where the improvement in performance of the student ML model 232 during the training stage either saturates or effectively ceases to improve with successive iterations, or reaches a desired level. It is understood that the learning operation performed for the student ML model 232 during the training stage is an example of semi-supervised learning.
The second set of operations includes generating a global embedding for each node of the homogenous graph based, at least in part, on the plurality of features and the adjacency matrix. In particular, to generate the global embedding, the model training module 224 is configured to implement a RandNE algorithm on the homogenous graph. Then, a regressor (such as a neural network model) is trained to predict the global embedding given the local structure derived from the adjacency matrix. Thereafter, the neural network regressor is used for predicting the global embedding for inference data points. To elaborate, a homogenous graph is constructed, meaning a graph where all nodes represent the same type of entity. The RandNE algorithm is applied to this graph to generate an initial global embedding for each node. RandNE stands for Random Network Embedding and is an algorithm to learn representations of nodes in a network. A regression model, specifically a neural network, is trained to predict the global embedding of a node given its local network structure. The local structure is represented through the adjacency matrix of the graph. For new unseen nodes, the trained neural network regressor is used to predict their global embeddings based only on their local connections in the graph.
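The global-embedding step can be sketched as follows, using an iterated random projection in the spirit of RandNE and a linear least-squares regressor as a simplified stand-in for the neural network regressor (the graph, order weights, and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 6, 3
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                                # symmetric toy adjacency matrix

# Iterated random projection in the spirit of RandNE:
# U = a0*R + a1*(A @ R) + a2*(A @ A @ R), with Gaussian R.
R = rng.normal(size=(N, dim))
U, P = np.zeros((N, dim)), R.copy()
for a in (1.0, 0.5, 0.25):                 # illustrative order weights
    U += a * P
    P = A @ P

# Least-squares regressor (a linear stand-in for the neural network)
# mapping a node's adjacency row (local structure) to its embedding.
W, *_ = np.linalg.lstsq(A, U, rcond=None)
U_pred = A @ W                             # embeddings from local structure
```

At inference time, only the adjacency row of a new node is needed to obtain `U_pred`, mirroring the disclosure's point that the full RandNE computation need not be re-run per node.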
Further, the second set of operations includes re-initializing the student ML model 232 based, at least in part, on the set of model parameters. It is understood that this set of model parameters corresponds to the latest model parameters obtained during the pre-training stage. Further, the second set of operations includes performing, for each labeled node of the set of labeled nodes, a set of steps. The set of steps includes (1) generating the set of labeled node features based, at least in part, on masking the label of each labeled node, (2) generating, by the student ML model, a second embedding based, at least in part, on the corresponding set of labeled node features, (3) determining, by the student ML model, a second label prediction based, at least in part, on the corresponding second embedding and the corresponding global embedding, and (4) determining a second cross entropy loss based, at least in part, on the corresponding second label prediction and the corresponding label.
Further, the second set of operations includes performing, for each unlabeled node of the set of unlabeled nodes, another set of steps. This set of steps includes (1) generating a set of unlabeled features based, at least in part, on masking the soft label of each unlabeled node, (2) generating, by the student ML model, a third embedding based, at least in part, on the corresponding set of unlabeled features, (3) determining, by the student ML model, a third label prediction based, at least in part, on the corresponding third embedding and the corresponding global embedding, and (4) determining a Kullback–Leibler (KL) divergence loss based, at least in part, on the corresponding third label prediction and the corresponding soft label.
Further, the second set of operations includes updating the set of model parameters based, at least in part, on backpropagating the second cross entropy loss of each labeled node and the KL divergence loss of each unlabeled node. As may be understood, since the student ML model 232 learns from the KL divergence loss of each unlabeled node and the second cross entropy loss of each labeled node, it can learn from the knowledge of the teacher ML model 230 and the global structure information of the graph as well.
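The KL divergence between a teacher soft label and a student prediction, as used in the step above, can be sketched for a single node (the probability vectors are toy values):

```python
import numpy as np

def kl_divergence(soft_label, student_probs, eps=1e-12):
    """KL(teacher soft label || student prediction) for one node."""
    p = np.asarray(soft_label, dtype=float) + eps
    q = np.asarray(student_probs, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

teacher = [0.7, 0.2, 0.1]                  # soft label from the teacher
student = [0.6, 0.3, 0.1]                  # student's label prediction
loss = kl_divergence(teacher, student)     # small positive scalar
```

The loss is zero exactly when the student's distribution matches the teacher's, so minimizing it drives the student to mimic the teacher on unlabeled nodes.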
In another embodiment, the model training module 224 is communicably coupled to classification module 226 and is configured to share the trained student ML model 232 with the classification module 226.
In an embodiment, the classification module 226 includes suitable logic and/or interfaces for receiving a classification request for an entity associated with an individual node from the set of nodes for a classification task. The classification request may correspond to any classification task such as fraud detection, label assignment, and so on. The classification module 226 is further configured to generate a classification prediction for the individual node based, at least in part, on the corresponding set of features of the individual node. In an embodiment, the classification module 226 is configured to generate the classification prediction using the trained student ML model 232. It is noted that since the student ML model 232 is trained using the structure information of the graph, it does not need to process the graph structure for the individual node during inferencing, thus leading to reduced latency in generating the classification prediction.
FIGS. 3A, 3B, and 3C, collectively, illustrate an architecture (see, 300, 320, 330) for training and deploying a student ML model (e.g., student MLP model 310), in accordance with an embodiment of the present disclosure. It should be noted that the student MLP model 310 is similar to the student ML model 232 of FIG. 2.

A homogeneous graph such as the homogeneous graph 302 can be denoted as G = (V, E), where V represents the set of nodes and E represents the set of edges. The set of features 312 of each node can be defined as X ∈ R^(N×d), where N is the total number of nodes in the homogeneous graph 302 and d is the feature dimension of each node. Each row of X can be represented as x_v. The server system 200 is further configured to generate an adjacency matrix 314 based on the homogeneous graph 302. The adjacency matrix 314 can be defined as A ∈ R^(N×N), where a_uv denotes the edge weight between nodes u and v; a_uv is 1 if (u, v) ∈ E, or 0 otherwise. For a node classification task, the targets are represented by Y ∈ R^(N×K), in which each row y_v is a K-dimensional one-hot vector for a node v. The set of nodes can be divided into two subsets based on the labels, i.e., a set of labeled nodes and a set of unlabeled nodes. The set of labeled nodes is represented using the superscript L, i.e., V^L, X^L, and Y^L. Similarly, the set of unlabeled nodes is represented using the superscript U, i.e., V^U, X^U, and Y^U.
In GNNs, for a given node u ∈ V, messages are aggregated from the node's neighbours N(u) to learn a node embedding h_u ∈ R^d with embedding dimension d. The node embedding in the l-th layer, h_u^((l)), is learned by first aggregating (AGG) the embeddings in the neighborhood, followed by an update (UPD) operation:
h_u^((l)) = UPD(h_u^((l-1)), AGG({h_v^((l-1)) : v ∈ N(u)})) … Eqn. 1
As per various embodiments, different GNN models may be used as the teacher ML model 230, such as GCN, GraphSAGE, GAT, and so on. The corresponding non-matrix equations for GCN, GraphSAGE, and GAT using common notation are provided below:
GCN: h_u^((l+1)) = σ(Σ_v (1/√(d_u · d_v)) · A_uv · h_v^((l)) · W^((l))) … Eqn. 2
where A is the adjacency matrix 314, d_u and d_v are the degrees of nodes u and v, h_v^((l)) is the input feature vector of node v in layer l, W^((l)) is the weight matrix for layer l, and σ is an activation function.
GraphSAGE: h_u^((l+1)) = σ(AGGREGATE_(v∈N(u))(h_v^((l))) · W^((l))) … Eqn. 3
where N(u) is the neighbour set of node u and AGGREGATE is a function, such as a mean or max pool, over the neighbour features.
GAT: h_u^((l+1)) = σ(Σ_(v∈N(u)) α_uv · h_v^((l)) · W^((l))) … Eqn. 4
where,
α_uv = softmax_v(a(h_u^((l)), h_v^((l)))) … Eqn. 5
where a is an attention mechanism, such as a dot product, and softmax_v normalizes the attention coefficients over the neighbours v of node u.
The teacher GNN model 304 is trained by minimizing a cross entropy loss over the set of labeled nodes:
L = Σ_(v∈V^L) L_CE(ŷ_v, y_v) … Eqn. 6
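The aggregate-and-update step of Eqn. 1, specialized to the mean aggregation of Eqn. 3, can be sketched as a single layer (the toy adjacency, features, and weights below are illustrative):

```python
import numpy as np

def sage_layer(H, A, W, sigma=np.tanh):
    """One GraphSAGE-style layer (Eqn. 3): mean-aggregate the
    neighbour embeddings, project with W, then apply sigma."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    H_agg = (A @ H) / deg                  # mean over the neighbours
    return sigma(H_agg @ W)

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)     # toy 3-node graph
H = np.eye(3)                              # initial node features
W = np.full((3, 2), 0.5)                   # illustrative layer weights
H_next = sage_layer(H, A, W)               # shape (3, 2)
```

Stacking such layers and applying the cross entropy loss of Eqn. 6 to the final-layer outputs of the labeled nodes yields the teacher training objective.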
Further, the trained teacher GNN model 304 is used to generate predictions for every node in the homogeneous graph 302, including those set aside for validation and testing, to produce the soft labels 306. The student MLP model is then trained using these soft labels 306. However, the true labels of the test data are never directly used for training purposes. It is noted that the main focus is on node classification in a semi-supervised setting, where very limited labels are available. For instance, out of about 20,000 nodes in Pubmed (described later), only 60 labelled nodes are utilized for training, and in Citeseer (described later), just 120 out of 2,000 nodes have labels for training.
Conventional graph distillation models like GLNN focus purely on mimicking the output of the teacher GNN model 304 without providing any structural information to the student model. To bolster graph learning and support the student MLP model in capturing structural information, the proposed approach includes pre-training the student MLP model with a neighborhood contrastive loss. For generating node embeddings with graph connection information, it is logical to presume that nodes which are linked to each other would be closer to each other in the embedding space than the nodes which are not linked at all. This idea aligns well with the foundation of contrastive learning. Driven by this, the Neighborhood Contrastive (NC) loss is used to induce structural information in the student MLP model without any explicit message passing. It is noted that MLP training can be done batch-wise, as there is no explicit requirement of message passing. To obtain the graph connectivity information while calculating the NC loss, the adjacency matrix 314 is used. In the case of the NC loss, the k-hop neighbors of each node are considered to be positive examples and the remaining nodes are considered to be negative (i.e., negative sample 324) as shown in FIG. 3B. This aspect promotes the positive examples to be closer to the target node in embedding space, while simultaneously driving the negative examples farther away. k is a hyperparameter and can be changed as needed. The NC loss for node u can be formulated by the equation below:
L_u = −log [ (Σ_(v∈B) 1_(v≠u) · ω_uv · exp(sim(z_u, z_v)/τ)) / (Σ_(w∈B) 1_(w≠u) · exp(sim(z_u, z_w)/τ)) ] … Eqn. 7
where B is the training batch of nodes, sim represents the cosine similarity between two vectors, and τ represents the temperature. ω_uv is dependent on k: it is non-zero for the k-hop (see, 326) neighbour nodes of the target node and zero for the remaining nodes.
During each batch process, b = |B| nodes are selected at random, and the related adjacency details are extracted into A* ∈ R^(b×b) along with the node attributes X* ∈ R^(b×d), where d is the feature vector dimension. There might be instances where, due to the random nature of the selection, a node might not have any positive samples in the batch. In such scenarios, the loss for that specific node in the batch is set to zero.
ω_uv = { C, if node v is a k-hop neighbour of node u; 0, if node v is not a k-hop neighbour of node u } … Eqn. 8
where C ∈ R is a constant.
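Eqns. 7 and 8 can be sketched as follows (the batch embeddings and the positive mask are toy values, and C is taken as 1 so that ω_uv reduces to a boolean mask):

```python
import numpy as np

def nc_loss(Z, pos_mask, tau=0.5):
    """Neighborhood Contrastive loss (Eqns. 7-8) over a batch of
    embeddings Z; pos_mask[u, v] is True when v is a k-hop
    neighbour of u (i.e., omega_uv with C = 1)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = np.exp(Zn @ Zn.T / tau)          # exp(cosine similarity / tau)
    np.fill_diagonal(sim, 0.0)             # the 1_{v != u} indicator
    num = (sim * pos_mask).sum(axis=1)
    den = sim.sum(axis=1)
    valid = num > 0                        # skip nodes with no positives
    return float(-np.log(num[valid] / den[valid]).mean())

Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pos = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
loss = nc_loss(Z, pos)                     # positive scalar
```

Minimizing this loss pulls the k-hop neighbours of each node together in the embedding space while pushing the remaining batch nodes apart, without any message passing in the MLP itself.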
Along with this NC loss, a standard cross entropy (CE) loss for node classification is used. The NC loss is applied on the embeddings (i.e., the output of the second-last layer) and the CE loss is applied on the output of the last (or softmax) layer. The combined optimization objective can be represented by the equation given below:
L = α · Σ_(u∈B) L_CE(ŷ_u, y_u) + (1−α) · Σ_(u∈B) L_NC(h_u, h_(N_u)) … Eqn. 9
where y_u is the ground truth label, ŷ_u is the output of the last layer, B ⊆ V^L is the training batch of nodes, h_u is the output of the second-last layer, h_(N_u) are the outputs of the second-last layer corresponding to nodes in the k-hop neighbourhood of u, and α ∈ [0, 1] is a parameter used to balance the two components of the loss.
It is noted that not all the nodes in the k-hop subgraph are considered to be positive examples (i.e., positive sample 322), but only the k-hop neighbours are considered to be positive. This is a generic formulation made to accommodate both homophilous and heterophilous graphs. It is noted that all the datasets used for experimentation later on are predominantly homophilous graphs. According to the experimentation on homophilous datasets, k > 1 does not really help the model. Therefore, k is chosen to be 1 across all the datasets for which performance is tested. However, k greater than 1 would be useful for heterogeneous or bi-partite graphs, where the 1-hop or other close neighbours would not necessarily be placed close to the target node in the embedding space.
The idea of pre-training the student model is to induce the structural information in the model weights itself. Now, to actually perform the distillation process, the student MLP model 310 has to be trained to mimic the predictions of the teacher GNN model 304. To perform this, KL-divergence loss is used. It is noted that to train the student MLP model 310 on the set of labelled nodes, a cross entropy loss is used. The training process of the student MLP model 310 can be performed batch-wise as there is no implicit graph topology dependency. In order to utilize both the ground truth labels and the soft labels 306, the optimization objective can be formulated using the following equation:
L = λ · Σ_(v∈V^L) L_CE(ŷ_v, y_v) + (1−λ) · Σ_(v∈V) L_KL(ŷ_v, z_v) … Eqn. 10
where L_CE is the cross entropy loss computed over the set of labeled nodes V^L, L_KL is the KL-divergence loss computed over all the nodes V, z_v is the soft label generated by the teacher GNN model 304 for node v, and λ ∈ [0, 1] is a parameter used to balance the two components of the loss.
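The weighted combination of the two loss terms in the distillation objective can be sketched as simple arithmetic (the per-node loss values and the balance parameter below are illustrative):

```python
def distillation_loss(ce_labeled, kl_all, lam=0.6):
    """Weighted distillation objective: lam * sum of CE losses over
    the labeled nodes plus (1 - lam) * sum of KL losses over all
    nodes (lam is an illustrative balance parameter)."""
    return lam * sum(ce_labeled) + (1 - lam) * sum(kl_all)

ce = [0.2, 0.4]          # per-labeled-node cross entropy losses
kl = [0.1, 0.05, 0.3]    # per-node KL divergence losses
total = distillation_loss(ce, kl)          # 0.6*0.6 + 0.4*0.45 = 0.54
```

With lam near 1 the student leans on the ground-truth labels; with lam near 0 it leans on the teacher's soft labels.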
Pseudo-code:
Input: Graph G = (V, E), the set of labeled nodes V_L ⊆ V, the set of unlabeled nodes V_U ⊆ V, node features X.
Output: The learned student MLP model is distilled from the teacher GNN model,
Initialize the teacher GNN model with l_1 layers,
Initialize an empty set Z for the soft labels generated by the
teacher GNN model,
Initialize the optimizer for the teacher GNN model,
for each node v in V_L do
for each layer l in GNN do
compute the updated node features using h_v^((l)) = UPD(h_v^((l-1)), AGG({h_u^((l-1)) : u ∈ N(v)}))
end for
Compute the loss for the teacher GNN model given by L = Σ_(v∈V^L) L_CE(ŷ_v, y_v)
Backpropagate the loss to update the teacher GNN model parameters
end for
Use the trained teacher GNN model to compute the soft labels z_v for each v ∈ V and add them to the soft labels set Z,
Initialize the student MLP model with l_2 layers,
Initialize the optimizer for the student MLP model,
for each node v in V_L do
Compute the output of the second last layer h_v,
Compute the output of the last layer of the MLP, ŷ_v,
Compute the loss given by L_v = α · L_CE(ŷ_v, y_v) + (1−α) · L_NC(h_v, h_(N_v)),
Backpropagate the loss to update the parameters,
end for
Use the pre-trained student MLP model for distillation,
Initialize the optimizer for the student MLP model,
for each node v in V_L do
Compute the logit ŷ_v for v using the student MLP model,
Fetch the corresponding soft label z_v from Z,
Compute the distillation loss given by L = λ · L_CE(ŷ_v, y_v) + (1−λ) · L_KL(ŷ_v, z_v),
Backpropagate the loss to update the MLP parameters.
end for

It is noted that the global structure of the homogeneous graph 302 is also imparted to the pre-trained student MLP model 308 in the training stage, along with the knowledge distillation losses. First, the RandNE algorithm is used to create an initial global embedding for each node in the homogeneous graph 302. In particular, RandNE analyses the full graph structure to learn feature representations that encode global graph properties. Then, a neural network model is trained to predict these RandNE global embeddings based only on local graph structure. The local structure for a node is captured through its local adjacency matrix, which expresses its direct connections to neighbor nodes.
The trained neural network acts as a local approximation of the global RandNE algorithm. For new unseen nodes added to the homogeneous graph 302, the neural network can rapidly generate embeddings by looking only at their local connections. This avoids needing to re-run the slower RandNE algorithm on the full graph every time a new node is added.
In other words, RandNE provides the initial global embeddings, while the neural network learns to estimate these from local structure. This allows efficient generation of embeddings for new nodes. The entire process produces node representations that reflect both local neighborhoods as well as global graph positions. The pre-trained student MLP model 308 is then used for generating an inference 316 based on input data (i.e., the homogeneous graph 302, the soft labels 306, etc.) and data from the student MLP model 310. Upon successful completion of the training process, the student ML model 120 is deployed for inferencing or classification by the server system 102 for any concerned node. Referring to FIG. 3C, it is understood that a deployed student ML model 334 does not require the graph structure (i.e., the homogenous graph 302) and can operate directly on node features of the concerned node 332. Since the deployed student ML model 334 does not need to process the graph structure during deployment, the inferencing time (i.e., of an inference 336) of the deployed student ML model 334 is significantly reduced, along with reduced computational requirements as well.
FIGS. 4A, 4B, and 4C, collectively, illustrate experimental results of various experiments performed, in accordance with one or more embodiments of the present disclosure. It is noted that various experiments have been conducted on publicly available datasets to train and test the student ML model 232. For the purposes of experimentation, a student MLP model 310 is chosen as the student ML model 232. Furthermore, in order to maintain a balanced comparison, various GNN-based models such as GraphSAGE with GCN aggregation, GCN, GAT, and APPNP have been considered as the teacher ML model 230. It is noted that these GNN models are well known in the art, therefore their architectures have not been explained herein for the sake of brevity. Various publicly available CPF datasets such as Cora, Citeseer, and Pubmed, and two large OGB datasets such as Arxiv and Products, have been used to evaluate the proposed approach of training the student ML model 232. As would be evident from Table 1, the Arxiv dataset and the Products dataset are very large when compared to the other three CPF datasets, and a small subset of nodes from these datasets is labelled and is used for training the teacher GNN model 304. It is noted that the various experiments performed are exemplary in nature and do not construe a limitation on the scope of the present disclosure. Furthermore, the various experimental results provided herein are approximate in nature and may vary if reproduced.
Dataset Nodes Edges Feature Dimension Classes
Cora 2708 5429 1433 7
Citeseer 3327 4732 3703 6
Pubmed 19717 44338 500 3
Arxiv 169K 1.1M 128 40
Products 2.5M 61M 100 47
Table 1: Statistics of datasets used for experimentation.
The Cora dataset is a dataset that consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. In this dataset, each publication is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. For experimentation using the Cora dataset, the task is considered as a publication classification which translates to seven class node categorization.
The CiteSeer dataset is a dataset that consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. In this dataset, each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. For experimentation using the CiteSeer dataset, the task is considered as a publication classification which translates to six class node categorization.
The Pubmed dataset is a dataset that consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. In this dataset, each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. For experimentation using the Pubmed dataset, the task is considered as a publication classification which translates to three class node categorization.
The Arxiv dataset is a dataset that represents the citation network of all Computer Science Arxiv papers. Each node is an Arxiv paper, and each edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. For experimentation using the Arxiv dataset, the task is considered as a prediction of the 40 subject areas of Arxiv CS papers, which are manually labelled by the papers' authors.
The Products dataset is a dataset that represents an Amazon product co-purchasing network. Nodes represent products sold on Amazon, and edges between two products indicate that the products are purchased together. Node features are generated by extracting bag-of-words features from the product descriptions. For experimentation using the Products dataset, the task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used as target labels.
For the experiments, the mean of ten separate runs with different random seeds is reported. Accuracy of the various ML models is used to measure the model performance. Further, validation data is used to select the optimal model, and the results on test data are reported. The experiments, including those performed on various baseline models and the student ML model 232, have been conducted using the PyTorch framework, alongside the DGL library for GNN methods, and utilized Adam as the optimization tool. Further, the hardware employed for performing these experiments featured a Dual AMD Rome 7742 CPU running at 2.25GHz with 128 cores, complemented by an NVIDIA A100 GPU having 20GB of memory.
For the student MLP model 310, a grid search algorithm is used for tuning the hyperparameters (i.e., the set of model parameters). A search of the learning rate from [0.01, 0.005, 0.001], the weight decay from [0, 0.001, 0.002, 0.005, 0.01], the dropout from [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6], the temperature parameter from [0.5, 1.0, 2.0], and the loss-weighting coefficient used in equation 9 from [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6] is performed.
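The grid search described above can be sketched in a few lines. This is a minimal illustration rather than the tuning code of the disclosure: `evaluate` is a hypothetical callback that trains the student MLP model with a given configuration and returns its validation accuracy, and `kd_coefficient` stands in for the loss-weighting coefficient of equation 9.

```python
import itertools

# Hyperparameter grid mirroring the search space described above.
GRID = {
    "learning_rate": [0.01, 0.005, 0.001],
    "weight_decay": [0, 0.001, 0.002, 0.005, 0.01],
    "dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "temperature": [0.5, 1.0, 2.0],
    "kd_coefficient": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # hypothetical name
}

def grid_search(evaluate):
    """Exhaustively evaluate every configuration; return the best one."""
    keys = list(GRID)
    best_score, best_config = float("-inf"), None
    for values in itertools.product(*(GRID[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)  # e.g., validation accuracy for this config
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

Because the search is exhaustive (3 × 5 × 7 × 3 × 7 = 2205 configurations here), each configuration's training run should be kept cheap, for example by early stopping on the validation split.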
The student MLP model 310 is compared to the GNN (i.e., GraphSAGE), the MLP, and the state-of-the-art method (GLNN), under the same experimental setting. As illustrated in Table 3, on average, the student MLP model 310 outperforms an MLP of the same complexity by an approximate margin between 9% and 12% across all five datasets. Further, the student MLP model 310 outperforms the teacher GNN model (i.e., GraphSAGE) across the three CPF datasets by an approximate average between 1% and 2%, and falls behind on the two OGB datasets. Moreover, the student MLP model 310 outperforms the state-of-the-art GLNN by an approximate average between 1% and 2% across all the datasets, including the Arxiv dataset and the Products dataset. This shows that the student MLP model 310 is able to capture topological information or graph structure information better than both the GNN and GLNN. The results on the OGB datasets can be attributed to the classic model complexity and accuracy trade-off. As the OGB datasets are much larger than the CPF datasets, it is natural that they need to be trained on more complex models. Since models of similar complexity are compared, subpar performance on the OGB datasets is expected. It is noted that an increase in accuracy can be achieved by sacrificing inference time.
Hyperparameter Cora-Citeseer-Pubmed datasets Arxiv dataset Products dataset
num layers 2 3 3
hidden dim 128 256 256
learning rate 0.01 0.01 0.003
weight decay 0.0005 0 0
normalization - batch batch
fan out [5,5] [5,10,15] [5,10,15]
Table 2: Hyperparameter setting used for training the teacher model.
Further, the student MLP model 310 is compared with Simplifying Graph Convolutional networks (SGC). For smaller datasets, it is observed that SGC has an inferior performance when compared to the student MLP model 310. For larger datasets, like the OGB-Arxiv dataset and the OGB-Products dataset, SGC has a significant degradation in performance due to its lack of expressivity.
Experiments have been performed to highlight the efficacy of the student MLP model 310 as well. Such experiments specifically focus on the correlation between its prediction precision and the speed at which it delivers results. The Products dataset is used for this exploration. As depicted by a representation 400 in FIG. 4A, the student MLP model 310 stands out in its performance, delivering an impressive accuracy of approximately 75% to 80% while maintaining a speedy model inference time of about 1.35ms. More specifically, the student MLP model 310 outperforms similar models operating within the same timeframe. For example, the likes of GLNN, SGC, and MLPs lag behind, only hitting the 67%, 69%, and 60% accuracy marks, respectively. Furthermore, the models that can match the student MLP model's accuracy require a significantly longer period for inference, further illustrating its superior efficiency. The various experimental results are shown in Table 3. In Table 3, ΔMLP, ΔGNN, ΔGLNN, and ΔSGC represent the differences between the student MLP model 310 and the conventional models, i.e., MLP, SAGE, GLNN, and SGC, respectively.
Dataset SAGE MLP GLNN SGC Student MLP Model ΔMLP ΔGNN ΔGLNN ΔSGC
Cora 75.75 56.3 75.76 74.35 77.05 20.75 1.30 1.29 2.70
Citeseer 60 56.56 61.2 59.80 62 5.44 2 0.8 2.20
Pubmed 74 62.72 74.6 73.22 76 13.28 2 1.4 2.78
Arxiv 68.39 55.77 62.98 63.40 64.57 8.8 -3.82 1.59 1.17
Products 75.06 59.79 66.85 67.90 69.12 9.33 -5.94 2.27 1.22
Table 3: Accuracy results across different datasets.
Further, it is observed that inducing structural information through the neighborhood contrastive loss and utilizing the soft labels 306 for distillation both play a pivotal role in enhancing the performance beyond a basic MLP model. To put things into perspective, the graph structural information captured by the teacher GNN model is implicitly transferred to the student MLP model 310 through the soft labels 306. On the other hand, structural information is explicitly induced by using the contrastive loss. To gauge the significance of the soft labels 306 in improving the performance of the student MLP model 310, a plot of the variation of accuracy on the Cora dataset against the coefficient of the KL divergence in Equation 10 is depicted by a representation 410 in FIG. 4B. As illustrated in FIG. 4B, as this coefficient grows, there is an almost linear uptick in accuracy, peaking near 0.8, after which it declines. From this experiment, it is understood that the soft labels 306 give a significant boost in accuracy when the KL divergence term is given sufficient weight in the loss function.
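The interplay between the supervised cross entropy and the KL-divergence term discussed above can be illustrated with a small numeric sketch. It assumes a weighted objective of the form λ·CE + (1 − λ)·KL, consistent with treating the coefficient of the KL divergence as the complement of the cross-entropy weight; the function names and the temperature-scaled softmax are illustrative conventions from standard knowledge distillation, not the exact equations of the disclosure.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): the distillation term matching the soft labels."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs))

def distillation_loss(lam, ce_loss, teacher_logits, student_logits, temperature=1.0):
    """Weighted objective: lam * cross entropy + (1 - lam) * KL divergence."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return lam * ce_loss + (1 - lam) * kl_divergence(t, s)
```

With `lam` near 0 the student is driven almost entirely by the teacher's soft labels, while `lam` near 1 reduces to plain supervised training; the peak near 0.8 in FIG. 4B corresponds to giving the KL term most, but not all, of the weight.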
Further, the performance of the student MLP model 310 is checked using different teacher GNN models across other GNN architectures. The average performance on the three CPF datasets, i.e., Cora dataset, Citeseer dataset, and Pubmed dataset across different teacher GNNs such as GCN, GAT, GraphSAGE, APPNP is illustrated using a representation 430 in FIG. 4C.
The experimental results show a nearly identical level of competence across all four teacher GNN models. It is noted that performance is slightly reduced with APPNP as the teacher GNN model. This is because the synergy between this teacher GNN model and the student MLP model 310 is not well established: in APPNP, the predictions are generated by applying an MLP to the node features and then propagated through the graph using an adaptation of a page-rank algorithm. As a result, the extra insights provided to the student MLP by APPNP are significantly fewer than those offered by the other teacher models. However, the student MLP model 310 consistently beats the state-of-the-art GLNN across all the teacher models.
As may be understood, the structure induction allows the student MLP model 310 to learn a more efficient and compact structure than the teacher GNN model. In particular, KD alone trains the student ML model 232 to mimic the teacher's outputs, but the student MLP model 310 may retain an unnecessarily complex structure. Structure induction acts as an additional regularization method, preventing overfitting to the teacher's outputs. This results in a student MLP model 310 that is less prone to replicating any suboptimal behaviors or quirks of the teacher GNN model. Hence, for smaller datasets, it is observed that the student MLP model 310 outperforms both GLNN and the teacher GNN model. For larger datasets, not having the test-time graph hurts the performance somewhat, though it still remains much better than GLNN. Since having no test-time graph is essential for scalability and fast inference times, the slightly lower accuracies on large graph datasets are a tradeoff to achieve the same.
As described earlier, the student MLP model 310 is pretrained using the neighborhood contrastive loss with the InfoNCE objective, and knowledge distillation is then applied to the student MLP model 310 as the next step. This approach has multiple benefits. Pretraining on a large unlabeled dataset allows the model to learn good general-purpose representations of the data before fine-tuning on a downstream task, which often leads to better performance compared to training only on the typically smaller labeled downstream dataset. Further, pretraining is usually done in a self-supervised manner by defining a pretext task that does not require labels, which allows the model to take advantage of significant amounts of unlabeled data. As may be understood, joint training on both unsupervised and supervised objectives can make optimization more difficult compared to separating the stages; the different losses may compete with each other, which is avoided by using the pre-training stage. Further, pretraining on unlabeled data allows for better generalization and less overfitting compared to joint training on the downstream labeled data only. Furthermore, pretraining provides a good model initialization for the downstream task, which leads to faster convergence compared to random initialization. It is noted that decoupling the pretraining and fine-tuning stages gives more flexibility. For instance, the same pretrained student ML model 232 can be fine-tuned on many different tasks rather than having to be trained from scratch each time.
Additionally, pretraining followed by fine-tuning takes better advantage of unlabeled data, provides flexibility, improves generalization, leads to better end-task performance, and uses fewer labels. However, the tradeoff is that it requires training the student ML model 232 in two separate stages. Table 4 provides the approximate accuracies with the one-step training process and the two-step pretraining-training process. As illustrated in Table 4, pretraining achieves much better accuracies for all datasets.
Dataset One-Stage Training Two-Stage Pretraining
Cora 74.35 77.05
Citeseer 60.40 62.00
Pubmed 73.84 76.00
Arxiv 63.08 64.57
Products 66.84 69.12
Table 4: One Stage Training vs Two Stage with pre-training of the student MLP model
Additionally, Table 5 shows the approximate percentage of labels required to achieve, with pretraining, accuracies similar to those of the one-step training process. It is observed that this process significantly reduces the number of labels required for the same accuracy. There are other benefits, such as no negative transfer (losses competing with each other) and more flexibility, as explained above.
Dataset One-Stage Training Two-Stage Pretraining
Cora 100 84.25
Citeseer 100 86.80
Pubmed 100 83.27
Arxiv 100 91.50
Products 100 87.82
Table 5: Percentage of labels required with two step pre-training for achieving one step training accuracies.
It can be concluded from the experiments that GNNs achieve high accuracy but have slow inference due to the computational graph dependency, while MLPs are fast but less accurate on graph data. The proposed student ML model 232 can be an MLP trained by knowledge distillation from a GNN teacher with contrastive local structure induction. This eliminates the graph dependency for fast inference while retaining strong performance. Comprehensive experiments on the datasets show the student MLP model 310 is approximately 200 times faster on average than GNNs with competitive accuracy. This demonstrates the student MLP model's potential for deploying low-latency models. The current knowledge distillation techniques used to train the student MLP model 310 are basic, and there is room for improvement. More sophisticated and advanced distillation methods could further enhance the accuracy and performance of the student MLP model 310, and the same would be covered within the scope of the present disclosure.
FIG. 5 illustrates a process flow diagram depicting a method 500 for training the teacher ML model 230, in accordance with an embodiment of the present disclosure. The method 500 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 500 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 500, and combinations of operations in the method 500 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 500. The process flow starts at operation 502.
At operation 502, the method 500 includes training, by the server system 200, the teacher ML model 230 based, at least in part, on performing the third set of operations iteratively till a performance of the teacher ML model 230 reaches the third predefined criterion. The third set of operations includes performing operations 504, 506, and 508.
At operation 504, the method 500 includes initializing the teacher ML model 230 based, at least in part, on the set of teacher model parameters.
At operation 506, the method 500 includes performing for each labeled node of the set of labeled nodes operations 506A to 506D.
At operation 506A, the method 500 includes generating the set of labeled node features based, at least in part, on masking the label of each labeled node.
At operation 506B, the method 500 includes generating, by the teacher ML model 230, the fourth embedding based, at least in part, on the corresponding set of labeled node features.
At operation 506C, the method 500 includes determining, by the teacher ML model 230, the teacher label prediction based, at least in part, on the corresponding fourth embedding.
At operation 506D, the method 500 includes determining the teacher cross entropy loss based, at least in part, on the corresponding teacher label prediction and the corresponding label.
At operation 508, the method 500 includes updating the set of teacher ML model parameters based, at least in part, on backpropagating the teacher cross entropy loss of each labeled node.
FIG. 6 illustrates a process flow diagram depicting a method 600 for training the student ML model 232, in accordance with an embodiment of the present disclosure. The method 600 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 600 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 600, and combinations of operations in the method 600 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 600. The process flow starts at operation 602.
At operation 602, the method 600 includes accessing, by the server system 200, the homogenous graph from the database 204 associated with the server system 200. The homogenous graph includes the set of nodes. Further, the set of nodes includes a set of labeled nodes and a set of unlabeled nodes. Herein, the set of edges exists between the set of labeled nodes and the set of unlabeled nodes such that each node of the set of nodes is associated with a plurality of features.
At operation 604, the method 600 includes generating, by the teacher ML model 230 associated with the server system 200, the soft label 306 for each unlabeled node of the plurality of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node.
At operation 606, the method 600 includes generating, by the server system 200, the adjacency matrix 314 based, at least in part, on the set of nodes. The adjacency matrix 314 indicates the global structure of the set of nodes in the homogeneous graph 302.
At operation 608, the method 600 includes extracting, by the server system 200, the local adjacency matrix from the adjacency matrix 314. The local adjacency matrix indicates the local structure of the set of labeled nodes in the homogeneous graph 302.
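One plausible realization of this extraction, assuming the local adjacency matrix is simply the submatrix of the global adjacency matrix restricted to the labeled nodes (the disclosure does not prescribe the exact mechanics here), is:

```python
def extract_local_adjacency(adjacency, labeled_indices):
    """Return the submatrix of `adjacency` restricted to the labeled nodes.

    `adjacency` is a square 0/1 matrix (list of lists) over all nodes of the
    homogeneous graph; `labeled_indices` selects the rows and columns that
    correspond to the labeled subset.
    """
    return [[adjacency[i][j] for j in labeled_indices] for i in labeled_indices]
```

The result preserves exactly the edges among labeled nodes, which is the local structure the contrastive pretraining stage operates on.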
At operation 610, the method 600 includes training, by the server system 200, the student ML model 232 based, at least in part, on performing operation 610A and operation 610B.
At operation 610A, the method 600 includes performing the first set of operations iteratively till the performance of the student ML model 232 reaches the first predefined criterion.
At operation 610B, the method 600 includes performing the second set of operations iteratively till the performance of the student ML model 232 reaches the second predefined criterion.
FIG. 7 illustrates a process flow diagram depicting a method 700 for performing the first set of operations, in accordance with an embodiment of the present disclosure. The method 700 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 700 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 700, and combinations of operations in the method 700 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 700. The process flow starts at operation 702.
At operation 702, the method 700 includes initializing the student ML model 232 based, at least in part, on the set of model parameters.
At operation 704, the method 700 includes determining the neighborhood contrastive loss based, at least in part, on the local adjacency matrix.
At operation 706, the method 700 includes performing for each labeled node of the set of labeled nodes operation 706A to operation 706D.
At operation 706A, the method 700 includes generating a set of labeled node features based, at least in part, on masking the label of each labeled node.
At operation 706B, the method 700 includes generating, by the student ML model 232, a first embedding based, at least in part, on the corresponding set of labeled node features.
At operation 706C, the method 700 includes determining, by the student ML model 232, a first label prediction based, at least in part, on the corresponding first embedding.
At operation 706D, the method 700 includes determining the first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label.
At operation 708, the method 700 includes updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.
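The neighborhood contrastive loss used in method 700 can be sketched with a standard InfoNCE-style formulation, under the assumption that each node's neighbors in the local adjacency matrix act as positives and all remaining nodes act as negatives; the dot-product similarity and the function name are illustrative choices rather than the disclosed equations.

```python
import math

def info_nce_neighborhood_loss(embeddings, local_adjacency, temperature=1.0):
    """InfoNCE-style loss: for each anchor node, pull its graph neighbors
    (positives, per the local adjacency matrix) close in embedding space
    and push the remaining nodes (negatives) away."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        # Exponentiated similarities of anchor i to every node.
        sims = [math.exp(dot(embeddings[i], embeddings[j]) / temperature)
                for j in range(n)]
        denom = sum(s for j, s in enumerate(sims) if j != i)
        for j in range(n):
            if j != i and local_adjacency[i][j]:
                total += -math.log(sims[j] / denom)
                count += 1
    return total / max(count, 1)
```

Minimizing this quantity pulls each node's embedding toward those of its graph neighbors and away from the rest, which is how local structure is induced into the student MLP without requiring the graph at inference time.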
FIGS. 8A and 8B, collectively, illustrate a process flow diagram depicting a method 800 for performing a second set of operations, in accordance with an embodiment of the present disclosure. The method 800 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 800 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 800, and combinations of operations in the method 800 may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 800. The process flow starts at operation 802.
At operation 802, the method 800 includes generating the global embedding for each node of the homogenous graph such as homogeneous graph 302 based, at least in part, on the plurality of features and the local adjacency matrix.
At operation 804, the method 800 includes re-initializing the student ML model 232 based, at least in part, on the set of model parameters.
At operation 806, the method 800 includes performing for each labeled node of the set of labeled nodes operation 806A to operation 806D.
At operation 806A, the method 800 includes generating the set of labeled node features based, at least in part, on masking the label of each labeled node.
At operation 806B, the method 800 includes generating, by the student ML model 232, a second embedding based, at least in part, on the corresponding set of labeled node features.
At operation 806C, the method 800 includes determining, by the student ML model 232, the second label prediction based, at least in part, on the corresponding second embedding and the corresponding global embedding.
At operation 806D, the method 800 includes determining the second cross entropy loss based, at least in part, on the corresponding second label prediction and the corresponding label.
At operation 808, the method 800 includes performing for each unlabeled node of the set of unlabeled nodes operation 808A to operation 808D.
At operation 808A, the method 800 includes generating the set of unlabeled features based, at least in part, on masking the soft label 306 of each unlabeled node.
At operation 808B, the method 800 includes generating, by the student ML model 232, a third embedding based, at least in part, on the corresponding set of unlabeled features.
At operation 808C, the method 800 includes determining, by the student ML model 232, the third label prediction based, at least in part, on the corresponding third embedding and the corresponding global embedding.
At operation 808D, the method 800 includes determining the Kullback–Leibler (KL) divergence loss based, at least in part, on the corresponding third label prediction and the corresponding soft label 306.
At operation 810, the method 800 includes updating the set of model parameters based, at least in part, on backpropagating the second cross entropy loss of each labeled node and the KL divergence loss for each unlabeled node.
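The quantity backpropagated at operation 810 can be sketched as the sum of the per-node losses computed at operations 806D and 808D. The helper names below are hypothetical and the formulation is a simplified plain-Python illustration, not the disclosed implementation:

```python
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label_index, eps=1e-12):
    """Supervised loss for a labeled node (operation 806D)."""
    return -math.log(probs[label_index] + eps)

def kl_divergence(p, q, eps=1e-12):
    """Distillation loss against a teacher soft label (operation 808D)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def second_stage_loss(labeled_logits, labels, unlabeled_logits, soft_labels):
    """Total loss for one pass of the second set of operations: cross entropy
    over the labeled nodes plus KL divergence between the student's
    predictions and the teacher's soft labels over the unlabeled nodes."""
    loss = sum(cross_entropy(softmax(z), y)
               for z, y in zip(labeled_logits, labels))
    loss += sum(kl_divergence(t, softmax(z))
                for z, t in zip(unlabeled_logits, soft_labels))
    return loss
```

The labeled nodes contribute a supervised cross-entropy signal, while the unlabeled nodes contribute a distillation signal that matches the student's predictions to the teacher's soft labels 306.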
FIG. 9 illustrates a simplified block diagram of the payment server 900, in accordance with an embodiment of the present disclosure. The payment server 900 is an example of the payment server 114 of FIG. 1. The payment server 900 and the server system 200 may use the payment network 112 as a payment interchange network. Examples of payment interchange networks include, but are not limited to, Mastercard® payment system interchange network.
The payment server 900 includes a processing module 902 configured to extract programming instructions from a memory 904 to provide various features of the present disclosure. The components of the payment server 900 provided herein may not be exhaustive, and the payment server 900 may include more or fewer components than those depicted in FIG. 9. Further, two or more components may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities. Some components of the payment server 900 may be configured using hardware elements, software elements, firmware elements, and/or a combination thereof.
Via a communication interface 906, the processing module 902 receives the request from a remote device 908, such as the issuer server 110, the acquirer server 108, or the server system 102. The request may be a request for conducting the payment transaction. The communication may be achieved through API calls, without loss of generality. The payment server 900 includes a database 910. The database 910 also includes transaction processing data such as issuer ID, country code, acquirer ID, and Merchant Identifier (MID), among others.
When the payment server 900 receives the payment transaction request from the acquirer server 108 or a payment terminal (e.g., IoT device), the payment server 900 may route the payment transaction request to an issuer server (e.g., the issuer server 110(1)). The database 910 stores transaction identifiers for identifying transaction details such as transaction amount, IoT device details, acquirer account information, transaction records, merchant account information, and the like.
In one example embodiment, the acquirer server 108(1) is configured to send an authorization request message to the payment server 900. The authorization request message includes, but is not limited to, the payment transaction request.
The processing module 902 further sends the payment transaction request to the issuer server 110(1) for facilitating the payment transactions from the remote device 908. The processing module 902 is further configured to notify the remote device 908 of the transaction status in the form of an authorization response message via the communication interface 906. The authorization response message includes, but is not limited to, a payment transaction response received from the issuer server 110(1). Alternatively, in one embodiment, the processing module 902 is configured to send an authorization response message for declining the payment transaction request, via the communication interface 906, to the acquirer server 108(1). In one embodiment, the processing module 902 executes similar operations performed by the server system 200, however, for the sake of brevity, these operations are not explained herein.
The disclosed methods with reference to FIGS. 5-7, FIG. 8A and FIG. 8B, or one or more operations of the server system 200 may be implemented using software including computer-executable instructions stored on one or more computer-readable media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (e.g., DRAM or SRAM), or nonvolatile memory or storage components (e.g., hard drives or solid-state nonvolatile memory components, such as Flash memory components) and executed on a computer (e.g., any suitable computer, such as a laptop computer, netbook, Web book, tablet computing device, smartphone, or other mobile computing devices). Such software may be executed, for example, on a single local computer or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a remote web-based server, a client-server network (such as a cloud computing network), or other such networks) using one or more network computers. Additionally, any of the intermediate or final data created and used during the implementation of the disclosed methods or systems may also be stored on one or more computer-readable media (e.g., non-transitory computer-readable media) and are considered to be within the scope of the disclosed technology. Furthermore, any of the software-based embodiments may be uploaded, downloaded, or remotely accessed through a suitable communication means. Such a suitable communication means includes, for example, the Internet, the World Wide Web (WWW), an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
Although the invention has been described with reference to specific exemplary embodiments, it is noted that various modifications and changes may be made to these embodiments without departing from the broad scope of the invention. For example, the various operations, blocks, etc., described herein may be enabled and operated using hardware circuitry (for example, Complementary Metal Oxide Semiconductor (CMOS) based logic circuitry), firmware, software, and/or any combination of hardware, firmware, and/or software (for example, embodied in a machine-readable medium). For example, the apparatuses and methods may be embodied using transistors, logic gates, and electrical circuits (for example, Application Specific Integrated Circuit (ASIC) circuitry and/or in Digital Signal Processor (DSP) circuitry).
Particularly, the server system 200 and its various components may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry). Various embodiments of the invention may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause the processor or the computer to perform one or more operations. A computer-readable medium storing, embodying, or encoded with a computer program, or similar language, may be embodied as a tangible data storage device storing one or more software programs that are configured to cause the processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein. In some embodiments, the computer programs may be stored and provided to a computer using any type of non-transitory computer-readable media. Non-transitory computer-readable media includes any type of tangible storage media. Examples of non-transitory computer-readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g., magneto-optical disks), Compact Disc Read-Only Memory (CD-ROM), Compact Disc Recordable (CD-R), Compact Disc Rewritable (CD-R/W), Digital Versatile Disc (DVD), BLU-RAY® Disc (BD), and semiconductor memories (such as mask ROM, Programmable ROM (PROM), Erasable PROM (EPROM), flash memory, Random Access Memory (RAM), etc.). Additionally, a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. In some embodiments, the computer programs may be provided to a computer using any type of transitory computer-readable media.
Examples of transitory computer-readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer-readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Various embodiments of the invention, as discussed above, may be practiced with steps and/or operations in a different order, and/or with hardware elements in configurations different from those disclosed. Therefore, although the invention has been described based on these exemplary embodiments, it is noted that certain modifications, variations, and alternative constructions may be apparent and well within the scope of the invention.
Although various exemplary embodiments of the invention are described herein in a language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
CLAIMS
WE CLAIM:

1. A computer-implemented method, comprising:
accessing, by a server system, a homogeneous graph from a database associated with the server system, the homogeneous graph comprising a set of nodes, the set of nodes comprising a set of labeled nodes and a set of unlabeled nodes, wherein a set of edges exists between the set of labeled nodes and the set of unlabeled nodes, and wherein each node of the set of nodes is associated with a plurality of features;
generating, by a teacher Machine Learning (ML) model associated with the server system, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node;
generating, by the server system, an adjacency matrix based, at least in part, on the set of nodes, the adjacency matrix indicating a global structure of the set of nodes in the homogeneous graph;
extracting, by the server system, a local adjacency matrix from the adjacency matrix, the local adjacency matrix indicating a local structure of the set of labeled nodes in the homogeneous graph; and
training, by the server system, a student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion, the first set of operations comprising:
initializing the student ML model based, at least in part, on a set of model parameters;
determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix;
performing for each labeled node of the set of labeled nodes:
generating a set of labeled node features based, at least in part, on masking a label of each labeled node;
generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding; and
determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label; and
updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.
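The first set of operations in claim 1 centres on a neighborhood contrastive loss computed from the local adjacency matrix. A minimal NumPy sketch of one plausible InfoNCE-style formulation is given below; the cosine-similarity form, the temperature tau, and the treatment of neighbours as positive pairs are illustrative assumptions, not taken from the specification:

```python
import numpy as np

def neighborhood_contrastive_loss(Z, A, tau=0.5):
    """InfoNCE-style loss that pulls embeddings of adjacent nodes together.

    Z : (n, d) node embeddings produced by the student model.
    A : (n, n) binary local adjacency matrix over the labeled nodes.
    Assumed formulation: neighbours act as positives, all other nodes
    as negatives; the specification may define the loss differently.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine similarity
    S = np.exp(Zn @ Zn.T / tau)                        # scaled similarities
    np.fill_diagonal(S, 0.0)                           # exclude self-pairs
    pos = (S * A).sum(axis=1)                          # neighbour (positive) mass
    denom = S.sum(axis=1)                              # all-pairs mass
    mask = pos > 0                                     # nodes with >= 1 neighbour
    return float(-np.mean(np.log(pos[mask] / denom[mask])))
```

Backpropagating this quantity alongside the per-node cross entropy (the last step of claim 1) would encourage adjacent labeled nodes to share nearby embeddings, which is how graph structure reaches a student model that never sees edges at inference time.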

2. The computer-implemented method as claimed in claim 1, wherein training the student ML model further comprises:
performing a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion, the second set of operations comprising:
generating a global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features and the adjacency matrix;
re-initializing the student ML model based, at least in part, on the set of model parameters;
performing for each labeled node of the set of labeled nodes:
generating the set of labeled node features based, at least in part, on masking the label of each labeled node;
generating, by the student ML model, a second embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the student ML model, a second label prediction based, at least in part, on the corresponding second embedding and the corresponding global embedding; and
determining a second cross entropy loss based, at least in part, on the corresponding second label prediction and the corresponding label;
performing for each unlabeled node of the set of unlabeled nodes:
generating a set of unlabeled features based, at least in part, on masking the soft label of each unlabeled node;
generating, by the student ML model, a third embedding based, at least in part, on the corresponding set of unlabeled features;
determining, by the student ML model, a third label prediction based, at least in part, on the corresponding third embedding and the corresponding global embedding; and
determining a Kullback–Leibler (KL) divergence loss based, at least in part, on the corresponding third label prediction and the corresponding soft label; and
updating the set of model parameters based, at least in part, on backpropagating the second cross entropy loss of each labeled node and the KL divergence loss for each unlabeled node.
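For the unlabeled nodes, claim 2 distils the teacher's soft labels into the student through a Kullback-Leibler divergence between each soft label and the corresponding student prediction. A minimal sketch of that loss, assuming both inputs are row-normalized probability matrices:

```python
import numpy as np

def kl_divergence_loss(student_probs, soft_labels, eps=1e-12):
    """Mean KL(teacher soft label || student prediction) over nodes.

    student_probs : (n, c) student class probabilities (rows sum to 1).
    soft_labels   : (n, c) teacher soft labels (rows sum to 1).
    The direction of the divergence is an assumption; distillation
    setups commonly use KL from the teacher to the student.
    """
    p = np.clip(soft_labels, eps, 1.0)    # teacher distribution
    q = np.clip(student_probs, eps, 1.0)  # student distribution
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```

The loss is zero when the student reproduces the teacher's soft labels exactly and grows as the two distributions diverge, which is what makes it a suitable per-node distillation signal in the final backpropagation step of claim 2.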

3. The computer-implemented method as claimed in claim 2, wherein generating the global embedding comprises:
generating, by the server system, an initial global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features;
training, by the server system, a neural network model to predict the initial global embedding for each node based, at least in part, on the local adjacency matrix; and
determining, by the server system via the neural network model, the global embedding for each node of the homogeneous graph.
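Claim 1 (the extraction step) and claim 3 (training against the local adjacency matrix) both presuppose restricting the global adjacency matrix to the labeled nodes. One plausible reading of that restriction, sketched in NumPy; the index-based sub-matrix selection is an assumption, and the specification may define the extraction differently:

```python
import numpy as np

def extract_local_adjacency(A, labeled_idx):
    """Restrict the global adjacency matrix A to the labeled nodes.

    A           : (n, n) adjacency matrix over all nodes.
    labeled_idx : indices of the labeled nodes within the full graph.
    Returns the sub-matrix keeping only edges among labeled nodes.
    """
    idx = np.asarray(labeled_idx)
    return A[np.ix_(idx, idx)]  # select matching rows and columns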

4. The computer-implemented method as claimed in claim 1, further comprising:
receiving, by the server system, a classification request for an entity associated with an individual node from the set of nodes for a classification task; and
generating, by the student ML model, a classification prediction for the individual node based, at least in part, on the corresponding set of features of the individual node.

5. The computer-implemented method as claimed in claim 1, further comprising:
training, by the server system, the teacher ML model based, at least in part, on performing a third set of operations iteratively until a performance of the teacher ML model reaches a third predefined criterion, the third set of operations comprising:
initializing the teacher ML model based, at least in part, on a set of teacher model parameters;
performing for each labeled node of the set of labeled nodes:
generating the set of labeled node features based, at least in part, on masking the label of each labeled node;
generating, by the teacher ML model, a fourth embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the teacher ML model, a teacher label prediction based, at least in part, on the corresponding fourth embedding; and
determining a teacher cross entropy loss based, at least in part, on the corresponding teacher label prediction and the corresponding label; and
updating the set of teacher model parameters based, at least in part, on backpropagating the teacher cross entropy loss of each labeled node.
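The inner loop shared by claims 1 and 5 masks each node's label before the forward pass and scores the resulting prediction with cross entropy. A minimal sketch of those two steps, under the illustrative assumption that the label occupies known feature columns and that masking means zeroing them out:

```python
import numpy as np

def mask_label(node_features, label_cols):
    """Zero out label-derived columns so the model cannot copy the label.

    Column-zeroing is an assumed masking scheme, used here only for
    illustration; the specification may mask labels differently.
    """
    masked = node_features.copy()
    masked[label_cols] = 0.0
    return masked

def cross_entropy(pred_probs, label_idx, eps=1e-12):
    """Cross entropy between predicted class probabilities and a hard label."""
    return float(-np.log(max(pred_probs[label_idx], eps)))
```

Masking prevents label leakage during training, and the cross entropy rewards the model only for recovering the hidden label from the remaining features, mirroring the "generating ... based on masking" and "determining a ... cross entropy loss" steps recited above.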

6. The computer-implemented method as claimed in claim 1, wherein accessing the homogeneous graph comprises:
accessing, by the server system, an entity-related dataset from the database, the entity-related dataset comprising information related to a plurality of entities;
generating, by the server system, a set of features corresponding to each entity of the plurality of entities based, at least in part, on the information related to the plurality of entities; and
generating, by the server system, the homogeneous graph based, at least in part, on the set of features for each entity, wherein each particular node of the homogeneous graph corresponds to each particular entity of the plurality of entities.

7. The computer-implemented method as claimed in claim 6, wherein the plurality of entities comprises at least one of a plurality of cardholders, a plurality of merchants, a plurality of issuers, and a plurality of acquirers.

8. The computer-implemented method as claimed in claim 1, wherein the teacher ML model is a Graph Neural Network (GNN) based ML model.

9. The computer-implemented method as claimed in claim 1, wherein the student ML model is a classifier-based ML model.

10. A server system, comprising:
a memory configured to store instructions;
a communication interface; and
a processor in communication with the memory and the communication interface, the processor configured to execute the instructions stored in the memory and thereby cause the server system, at least in part, to:
access a homogeneous graph from a database associated with the server system, the homogeneous graph comprising a set of nodes, the set of nodes comprising a set of labeled nodes and a set of unlabeled nodes, wherein a set of edges exists between the set of labeled nodes and the set of unlabeled nodes, and wherein each node of the set of nodes is associated with a plurality of features;
generate, by a teacher Machine Learning (ML) model associated with the server system, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node;
generate an adjacency matrix based, at least in part, on the set of nodes, the adjacency matrix indicating a global structure of the set of nodes in the homogeneous graph;
extract a local adjacency matrix from the adjacency matrix, the local adjacency matrix indicating a local structure of the set of labeled nodes in the homogeneous graph; and
train a student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion, the first set of operations comprising:
initializing the student ML model based, at least in part, on a set of model parameters;
determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix;
performing for each labeled node of the set of labeled nodes:
generating a set of labeled node features based, at least in part, on masking a label of each labeled node;
generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding; and
determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label; and
updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.

11. The server system as claimed in claim 10, wherein training the student ML model further comprises:
performing a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion, the second set of operations comprising:
generating a global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features and the adjacency matrix;
re-initializing the student ML model based, at least in part, on the set of model parameters;
performing for each labeled node of the set of labeled nodes:
generating the set of labeled node features based, at least in part, on masking the label of each labeled node;
generating, by the student ML model, a second embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the student ML model, a second label prediction based, at least in part, on the corresponding second embedding and the corresponding global embedding; and
determining a second cross entropy loss based, at least in part, on the corresponding second label prediction and the corresponding label;
performing for each unlabeled node of the set of unlabeled nodes:
generating a set of unlabeled features based, at least in part, on masking the soft label of each unlabeled node;
generating, by the student ML model, a third embedding based, at least in part, on the corresponding set of unlabeled features;
determining, by the student ML model, a third label prediction based, at least in part, on the corresponding third embedding and the corresponding global embedding; and
determining a Kullback–Leibler (KL) divergence loss based, at least in part, on the corresponding third label prediction and the corresponding soft label; and
updating the set of model parameters based, at least in part, on backpropagating the second cross entropy loss of each labeled node and the KL divergence loss for each unlabeled node.

12. The server system as claimed in claim 11, wherein, for generating the global embedding, the server system is further caused, at least in part, to:
generate an initial global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features;
train a neural network model to predict the initial global embedding for each node based, at least in part, on the local adjacency matrix; and
determine, via the neural network model, the global embedding for each node of the homogeneous graph.

13. The server system as claimed in claim 10, wherein the server system is further caused, at least in part, to:
receive a classification request for an entity associated with an individual node from the set of nodes for a classification task; and
generate, by the student ML model, a classification prediction for the individual node based, at least in part, on the corresponding set of features of the individual node.

14. The server system as claimed in claim 10, wherein the server system is further caused, at least in part, to:
train the teacher ML model based, at least in part, on performing a third set of operations iteratively until a performance of the teacher ML model reaches a third predefined criterion, the third set of operations comprising:
initializing the teacher ML model based, at least in part, on a set of teacher model parameters;
performing for each labeled node of the set of labeled nodes:
generating the set of labeled node features based, at least in part, on masking the label of each labeled node;
generating, by the teacher ML model, a fourth embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the teacher ML model, a teacher label prediction based, at least in part, on the corresponding fourth embedding; and
determining a teacher cross entropy loss based, at least in part, on the corresponding teacher label prediction and the corresponding label; and
updating the set of teacher model parameters based, at least in part, on backpropagating the teacher cross entropy loss of each labeled node.

15. The server system as claimed in claim 10, wherein the server system is further caused, at least in part, to:
access an entity-related dataset from the database, the entity-related dataset comprising information related to a plurality of entities;
generate a set of features corresponding to each entity of the plurality of entities based, at least in part, on the information related to the plurality of entities; and
generate the homogeneous graph based, at least in part, on the set of features for each entity, wherein each particular node of the homogeneous graph corresponds to each particular entity of the plurality of entities.

16. The server system as claimed in claim 15, wherein the plurality of entities comprises at least one of a plurality of cardholders, a plurality of merchants, a plurality of issuers, and a plurality of acquirers.

17. The server system as claimed in claim 10, wherein the teacher ML model is a Graph Neural Network (GNN) based ML model.

18. The server system as claimed in claim 10, wherein the student ML model is a classifier-based ML model.

19. A non-transitory computer-readable storage medium comprising computer-executable instructions that, when executed by at least one processor of a server system, cause the server system to perform a method comprising:
accessing a homogeneous graph from a database associated with the server system, the homogeneous graph comprising a set of nodes, the set of nodes comprising a set of labeled nodes and a set of unlabeled nodes, wherein a set of edges exists between the set of labeled nodes and the set of unlabeled nodes, and wherein each node of the set of nodes is associated with a plurality of features;
generating, by a teacher Machine Learning (ML) model associated with the server system, a soft label for each unlabeled node of the set of unlabeled nodes based, at least in part, on the plurality of features corresponding to each unlabeled node;
generating an adjacency matrix based, at least in part, on the set of nodes, the adjacency matrix indicating a global structure of the set of nodes in the homogeneous graph;
extracting a local adjacency matrix from the adjacency matrix, the local adjacency matrix indicating a local structure of the set of labeled nodes in the homogeneous graph; and
training a student ML model based, at least in part, on performing a first set of operations iteratively until a performance of the student ML model reaches a first predefined criterion, the first set of operations comprising:
initializing the student ML model based, at least in part, on a set of model parameters;
determining a neighborhood contrastive loss based, at least in part, on the local adjacency matrix;
performing for each labeled node of the set of labeled nodes:
generating a set of labeled node features based, at least in part, on masking a label of each labeled node;
generating, by the student ML model, a first embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the student ML model, a first label prediction based, at least in part, on the corresponding first embedding; and
determining a first cross entropy loss based, at least in part, on the corresponding first label prediction and the corresponding label; and
updating the set of model parameters based, at least in part, on backpropagating the neighborhood contrastive loss and the first cross entropy loss of each labeled node.

20. The non-transitory computer-readable storage medium as claimed in claim 19, wherein training the student ML model further comprises:
performing a second set of operations iteratively until the performance of the student ML model reaches a second predefined criterion, the second set of operations comprising:
generating a global embedding for each node of the homogeneous graph based, at least in part, on the plurality of features and the adjacency matrix;
re-initializing the student ML model based, at least in part, on the set of model parameters;
performing for each labeled node of the set of labeled nodes:
generating the set of labeled node features based, at least in part, on masking the label of each labeled node;
generating, by the student ML model, a second embedding based, at least in part, on the corresponding set of labeled node features;
determining, by the student ML model, a second label prediction based, at least in part, on the corresponding second embedding and the corresponding global embedding; and
determining a second cross entropy loss based, at least in part, on the corresponding second label prediction and the corresponding label;
performing for each unlabeled node of the set of unlabeled nodes:
generating a set of unlabeled features based, at least in part, on masking the soft label of each unlabeled node;
generating, by the student ML model, a third embedding based, at least in part, on the corresponding set of unlabeled features;
determining, by the student ML model, a third label prediction based, at least in part, on the corresponding third embedding and the corresponding global embedding; and
determining a Kullback–Leibler (KL) divergence loss based, at least in part, on the corresponding third label prediction and the corresponding soft label; and
updating the set of model parameters based, at least in part, on backpropagating the second cross entropy loss of each labeled node and the KL divergence loss for each unlabeled node.

Documents

Application Documents

# Name Date
1 202441000268-STATEMENT OF UNDERTAKING (FORM 3) [02-01-2024(online)].pdf 2024-01-02
2 202441000268-POWER OF AUTHORITY [02-01-2024(online)].pdf 2024-01-02
3 202441000268-FORM 1 [02-01-2024(online)].pdf 2024-01-02
4 202441000268-FIGURE OF ABSTRACT [02-01-2024(online)].pdf 2024-01-02
5 202441000268-DRAWINGS [02-01-2024(online)].pdf 2024-01-02
6 202441000268-DECLARATION OF INVENTORSHIP (FORM 5) [02-01-2024(online)].pdf 2024-01-02
7 202441000268-COMPLETE SPECIFICATION [02-01-2024(online)].pdf 2024-01-02
8 202441000268-Proof of Right [08-04-2024(online)].pdf 2024-04-08