
Methods And Systems For Generating A Transformer Model Architecture Capable Of Learning From A Graph

Abstract: Methods and systems for generating a structure-aware transformer model architecture capable of learning from graph-structured data are disclosed. A method performed by a server system includes accessing a homogeneous graph, determining multiple node features for each node, and training a structure-related neural network of a structure-aware transformer model architecture to determine similarity metrics between pairs of nodes based on the homogeneous graph and a contrastive loss. The method includes computing, by the structure-related neural network, a positional encoding for each node based on the similarity metrics between the corresponding pairs of nodes. The method includes generating an updated feature vector for each node based on concatenating the multiple node features with the positional encoding. Thereafter, the method includes facilitating a tokenization layer of the structure-aware transformer model architecture to generate an aggregated token sequence for each node based on the updated feature vector. The aggregated token sequence includes an aggregation of a set of tokens and a global context vector.


Patent Information

Application #
Filing Date
31 October 2023
Publication Number
18/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

MASTERCARD INTERNATIONAL INCORPORATED
2000 Purchase Street, Purchase, NY 10577, United States of America

Inventors

1. Sanjay Kumar Patnala
Second floor, building number-15, J-8, DLF Phase-2, Gurgaon 122002, Haryana, India
2. Akshay Sethi
A-80, Meera Bagh, Paschim Vihar, New Delhi, Delhi 110087, India
3. Siddhartha Asthana
7/108 Malviya Nagar, New Delhi, Delhi 110017, India
4. Sonia Gupta
D-801, Emaar Emerald Estate, Gurgaon, Haryana 122001 India
5. Sumedh B G
#1616, 6th Cross, 6th Main, 2nd Stage, Vijayanagar, Mysore 570017, Karnataka, India

Specification

Description: The present disclosure relates to artificial intelligence-based processing systems and, more particularly, to electronic methods and complex processing systems for generating a transformer model architecture; more specifically, to methods and systems for generating a structure-aware transformer model architecture capable of learning from graph-structured data.
BACKGROUND
With the advancement of deep learning-based techniques for performing various predictive tasks, several challenges have emerged in training the learning models, such as the complexity of specific tasks, data types, and application domains under resource constraints. To address these challenges, different deep learning architectures have been developed. One such category of deep learning architecture operates on graph-based data. As may be understood, real-world information such as social networks, molecular networks, citation networks, etc., is better represented as graphs. In addition, graphs have several advantages: their node-link structure stores a large amount of information, distributed computing can be implemented on graphs, which saves computation and reduces time complexity, and they are well suited to solving relational problems. As a result, graph-structured data is generally preferred for several tasks and in several application domains.
However, deriving useful insights from the information contained within a graph's structure and node properties is difficult due to the intrinsic complexity of graphs, which results from their distinctive topology and subtle structural nuances. As a result, the graph neural network (GNN) was proposed, whose key design element is pairwise message passing: the nodes in the graph iteratively update their representations by exchanging information with their neighbors. Nonetheless, because GNNs are restricted to aggregating information from one-hop neighbors, they fail to capture global and long-range dependencies. They are also less effective because they cause over-smoothing and over-squashing and are bounded by the 1-Weisfeiler-Lehman (WL) test. It is noted that the 1-WL test is a graph isomorphism test that can be used to check the expressivity of an algorithm dealing with graph data.
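To make the one-hop restriction concrete, a single round of mean-aggregation message passing can be sketched as follows. This is a minimal illustration on a hypothetical four-node path graph, not the disclosed architecture; the choice of mean aggregation is an assumption.

```python
import numpy as np

# Hypothetical 4-node path graph: 0-1, 1-2, 2-3 (no isolated nodes assumed).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)  # one-hot node features for readability

def message_pass(A, X):
    """One round of mean aggregation over one-hop neighbors."""
    deg = A.sum(axis=1, keepdims=True)
    return (A @ X) / deg

H1 = message_pass(A, X)
# After one round, node 0 "sees" only node 1; information from node 3
# reaches node 0 only after three rounds. This is why many layers must
# be stacked to capture long-range dependencies, which in turn causes
# the over-smoothing noted above.
H2 = message_pass(A, H1)
print(H1[0])
```

Even after two rounds, node 0's representation still carries no signal from node 3, illustrating why per-hop message passing alone overlooks long-range interactions.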
Further, in the fields of Computer Vision (CV) and Natural Language Processing (NLP), a deep learning architecture that has achieved state-of-the-art performance is a transformer. It may be noted that a distinctive trait of transformers is their intrinsic flexibility in information aggregation which emerges from their encoding process. The encoding process in the transformers defines the boundaries of information aggregation to be dynamic and is determined by the properties of the input. It may be noted that the dynamicity arises from the usage of the attention mechanism.
To that end, driven by the significant performance of the transformers in domains dependent upon Euclidean spaces, various approaches have been implemented to integrate this robust architecture into the field of graph-based representation learning. However, it may be noted that embedding intricate properties of graphs within Euclidean space and subsequently directing them to a transformer framework is an extremely challenging task. Therefore, the conventional approaches that enable a transformer architecture to learn from graph-structured data have several drawbacks such as requiring a lot of computation and having quadratic time complexity. For instance, in one of the conventional approaches, Laplacian eigenvectors are used for computing positional encodings. Nevertheless, it has limitations due to challenges in eigenvalue multiplicity, eigenvector sign ambiguity, and normalization issues.
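For context, the Laplacian-eigenvector approach referenced above can be sketched in a few lines under common conventions (the example graph is hypothetical, and isolated nodes are assumed absent). The O(N³) eigendecomposition and the arbitrary eigenvector signs are exactly the drawbacks noted in the text:

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """First k non-trivial eigenvectors of the symmetric normalized
    Laplacian, used as node positional encodings in several
    conventional graph transformers."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    # Full eigendecomposition: O(N^3) for an N x N Laplacian.
    eigvals, eigvecs = np.linalg.eigh(L)
    # Skip the trivial eigenvector (eigenvalue ~0). Note the signs of
    # the remaining eigenvectors are arbitrary, and repeated
    # eigenvalues make the basis non-unique -- the ambiguities
    # mentioned in the text.
    return eigvecs[:, 1:k + 1]

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_positional_encoding(A, 2)
print(pe.shape)  # one 2-dimensional encoding per node
```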
Some other conventional approaches include methods that use random walks, ego-graphs, or substructure sampling for node sampling. However, these approaches also have limitations especially linked to efficiency, thoroughness, and uniformity of sampling. Further, substantial loss of information is also observed because of non-uniform and non-superficial sampling of random walks or ego-graphs. Furthermore, even though some of the conventional approaches can partially overcome the above-mentioned drawbacks, they still are restricted to a predefined count of hops in the graph and do not consider nodes beyond that count. As a result, long-range interactions beyond a particular count of hops are overlooked which is undesirable.
Thus, there exists a need for technical solutions such as methods and systems for generating a structure-aware transformer model architecture capable of learning from graph-structured data while overcoming the technical drawbacks.
SUMMARY
Various embodiments of the present disclosure provide methods and systems for generating a structure-aware transformer model architecture capable of learning from graph-structured data.
In an embodiment, a computer-implemented method for generating a structure-aware transformer model architecture capable of learning from graph-structured data is disclosed. The computer-implemented method performed by a server system includes accessing a homogeneous graph from a database associated with the server system. The homogeneous graph includes a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users. Further, each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes. The method further includes determining a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph. Further, the method includes training a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss. Further, the method includes computing, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph. The method includes generating an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
In another embodiment, a server system is disclosed. The server system includes a communication interface and a memory including executable instructions. The server system also includes a processor communicably coupled to the memory. The processor is configured to execute the instructions to cause the server system, at least in part, to access a homogeneous graph from a database associated with the server system. The homogeneous graph includes a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users. Further, each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes. The server system is further caused to determine a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph. Further, the server system is caused to train a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss. Furthermore, the server system is caused to compute, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph. The server system is caused to generate an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
In yet another embodiment, a non-transitory computer-readable storage medium is disclosed. The non-transitory computer-readable storage medium includes computer-executable instructions that, when executed by at least a processor of a server system, cause the server system to perform a method. The method includes accessing a homogeneous graph from a database associated with the server system. The homogeneous graph includes a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users. Further, each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes. The method further includes determining a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph. Further, the method includes training a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss. Further, the method includes computing, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph. The method includes generating an updated feature vector for each node, based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
BRIEF DESCRIPTION OF THE FIGURES
For a more complete understanding of example embodiments of the present technology, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1 illustrates a schematic representation of an environment related to at least some example embodiments of the present disclosure;
FIG. 2 illustrates a simplified block diagram of a server system, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a schematic representation of an architecture of a structure-aware transformer model, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a schematic representation of a process of generating positional encodings, in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a schematic representation of a process of generating updated feature vectors, in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a schematic representation of a node sampling mechanism, in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a flow diagram depicting a method for generating a structure-aware transformer model architecture capable of learning from graph-structured data, in accordance with an embodiment of the present disclosure; and
FIG. 8 illustrates a simplified block diagram of a payment server, in accordance with an embodiment of the present disclosure.
The drawings referred to in this description are not to be understood as being drawn to scale except if specifically noted, and such drawings are only exemplary in nature.
DETAILED DESCRIPTION
In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details. Descriptions of well-known components and processing techniques are omitted to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in an embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present disclosure. Similarly, although many of the features of the present disclosure are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present disclosure is set forth without any loss of generality to, and without imposing limitations upon, the present disclosure.
Embodiments of the present disclosure may be embodied as an apparatus, a system, a method, or a computer program product. Accordingly, embodiments of the present disclosure may take the form of an entire hardware embodiment, an entire software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit”, “engine”, “module”, or “system”. Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable storage media having computer-readable program code embodied thereon.
The terms “payment transaction”, “financial transaction”, “e-commerce transaction”, “digital transaction”, and “transaction” are used interchangeably throughout the description and refer to a payment transaction of a certain amount initiated by the cardholder. In other words, it may be referred to as an agreement carried out between a buyer and a seller to exchange goods or services for assets in the form of a payment (e.g., cash, fiat currency, digital assets, cryptographic currency, coins, tokens, etc.). Financial transactions may include, for example, online payments and payments at a terminal (e.g., a point of sale (POS) terminal, an ATM, self-service kiosks, and the like). Moreover, online payments can be performed on electronic-commerce (e-commerce) platforms that are provided either on web browsers or by merchants in the form of mobile applications.
The terms “cardholder”, “user”, “account holder”, “consumer”, and “buyer” are used interchangeably throughout the description and refer to a person who has a payment account or at least one payment card (e.g., a credit card, a debit card, etc.), which may or may not be associated with the payment account and which will be used by a merchant to complete a payment transaction that may be initiated by the cardholder. The payment account may be opened via an issuing bank or an issuer server.
The term “merchant”, used throughout the description generally refers to a seller, a retailer, a purchase location, an organization, or any other entity that is in the business of selling goods or providing services, and it can refer to either a single business location or a chain of business locations of the same entity.
The term “payment account” used throughout the description refers to a financial account that is used to fund a financial transaction (interchangeably referred to as a “payment transaction” or “transaction”). Examples of the financial account include but are not limited to a savings account, a credit account, a checking account, and a virtual payment account. The financial account may be associated with an entity such as an individual person, a family, a commercial entity, a company, a corporation, a governmental entity, a non-profit organization, and the like. In some scenarios, the financial account may be a virtual or temporary payment account that can be mapped or linked to a primary financial account, such as those accounts managed by payment wallet service providers, and the like.
The term “issuer”, used throughout the description, refers to a financial institution normally called an “issuer bank” or “issuing bank” in which an individual or an institution may have an account. The issuer also issues a payment card, such as a credit card, a debit card, etc. Further, the issuer may also facilitate online banking services such as electronic money transfer, bill payment, etc., to the cardholders through a server called “issuer server” throughout the description. The terms “issuer”, “issuer bank”, “issuing bank” or “issuer server” will be used interchangeably herein.
Further, the term “acquirer” refers to a financial institution (e.g., a bank) that processes financial transactions for merchants. In other words, this can be an institution that facilitates the processing of payment transactions for physical stores, merchants, or institutions that own platforms that make either online purchases or purchases made via software applications possible (e.g., shopping cart platform providers and in-app payment processing providers). The terms “acquirer”, “acquirer bank”, “acquiring bank”, and “acquirer server” will be used interchangeably herein.
The terms “payment network” and “card network” are used interchangeably throughout the description and refer to a network or collection of systems used for the transfer of funds using cash substitutes. Payment networks may use a variety of different protocols and procedures to process the transfer of money for various types of transactions. Payment networks are companies that connect an issuing bank with an acquiring bank to facilitate online payment. Transactions that may be performed via a payment network may include product or service purchases, credit purchases, debit transactions, fund transfers, account withdrawals, etc. Payment networks may be configured to perform transactions via cash substitutes that may include payment cards, letters of credit, checks, financial accounts, etc. Examples of networks or systems configured to perform as payment networks include those operated by companies such as Mastercard®.
The term “payment card”, used throughout the description, refers to a physical or virtual card that may or may not be linked with a financial or payment account that may be presented to a merchant or any such facility to fund a financial transaction via the associated payment account. Examples of payment cards include, but are not limited to, debit cards, credit cards, prepaid cards, virtual payment numbers, virtual card numbers, forex cards, charge cards, e-wallet cards, and stored-value cards. A payment card may be a physical card that may be presented to the merchant for funding the payment. Alternatively, or additionally, the payment card may be embodied in the form of data stored in a user device, where the data is associated with a payment account such that the data can be used to process the financial transaction between the payment account and a merchant’s financial account.
OVERVIEW
Various embodiments of the present disclosure provide methods, systems, electronic devices, and computer program products for generating a structure-aware transformer model architecture capable of learning from graph-structured data. In a specific embodiment, the server system may be embodied within a payment server associated with a payment network. In an embodiment, the server system is configured to generate a homogeneous graph from the graph-structured data. Further, in an embodiment, the server system is configured to access the homogeneous graph from a database associated with the server system. The homogeneous graph includes a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users. Further, each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes.
The server system is further configured to determine a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph. Further, the server system is configured to train a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss. In an embodiment, the structure-aware transformer model architecture corresponds to a Graph Neural Network (GNN)-transformer Machine learning (ML)-based model architecture. Further, in an embodiment, the structure-related neural network of the structure-aware transformer model architecture corresponds to a multi-layer perceptron (MLP) neural network.
In an embodiment, for training the structure-related neural network, the server system may also be configured to perform a set of operations iteratively until a performance of the structure-related neural network converges to predefined criteria. The set of operations may include: (1) initializing the structure-related neural network with a training dataset and one or more model parameters, wherein the training dataset includes a predefined pair of nodes comprising a reference node and a negative node, (2) determining, via the structure-related neural network, one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes based, at least in part, on an adjacency matrix corresponding to the homogeneous graph, the training dataset and the one or more model parameters, (3) generating, via the structure-related neural network, a set of positive pairs of nodes and a set of negative pairs of nodes based, at least in part, on the one or more similarity metrics, the set of positive pairs of nodes indicating a first subset of nodes that are closely related to each other, and the set of negative pairs of nodes indicating a second subset of nodes that are unrelated to each other, (4) computing, via the structure-related neural network, a first set of probability values for the corresponding set of positive pairs of nodes and a second set of probability values for the corresponding set of negative pairs of nodes, (5) computing a contrastive loss using a contrastive loss function based, at least in part, on comparison of the first set of probability values with the second set of probability values, and (6) backpropagating the contrastive loss for maximizing the difference between the first set of probability values and the second set of probability values.
Further, in an embodiment, the predefined criteria include saturation of the contrastive loss. The contrastive loss may be saturated after a plurality of iterations of the set of operations is performed. Moreover, in an embodiment, the contrastive loss may include a neighborhood contrastive loss.
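The iterative training procedure above can be sketched on a toy graph. This is an illustrative stand-in only: a single linear layer with a tanh activation substitutes for the MLP, positive/negative pairs are derived directly from adjacency, the sigmoid-of-dot-product similarity is an assumption, and gradients are taken by finite differences rather than backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph (adjacency matrix) and raw node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))
W = rng.normal(scale=0.1, size=(8, 4))  # single-layer stand-in for the MLP

# Steps (1) and (3): positive pairs are adjacent nodes (closely
# related); negative pairs are non-adjacent nodes (unrelated).
pos_pairs = [(i, j) for i in range(4) for j in range(i + 1, 4) if A[i, j]]
neg_pairs = [(i, j) for i in range(4) for j in range(i + 1, 4) if not A[i, j]]

def pair_probs(W):
    """Steps (2) and (4): similarity metrics between node pairs,
    squashed into probabilities with a sigmoid."""
    Z = np.tanh(X @ W)
    p = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))
    return (np.array([p[i, j] for i, j in pos_pairs]),
            np.array([p[i, j] for i, j in neg_pairs]))

def contrastive_loss(W):
    """Step (5): penalize low positive-pair and high negative-pair
    probabilities, maximizing the gap between the two sets."""
    p_pos, p_neg = pair_probs(W)
    return -np.log(p_pos).mean() - np.log(1.0 - p_neg).mean()

# Step (6), iterated until the loss saturates. Finite-difference
# gradients are used purely for illustration.
lr, eps = 0.2, 1e-5
losses = [contrastive_loss(W)]
for _ in range(50):
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_eps = W.copy()
        W_eps[idx] += eps
        grad[idx] = (contrastive_loss(W_eps) - losses[-1]) / eps
    W -= lr * grad
    losses.append(contrastive_loss(W))
```

Watching `losses` flatten out over iterations is the saturation criterion described above.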
Furthermore, the server system is configured to compute a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph. Herein, the positional encoding is computed by the structure-related neural network of the structure-aware transformer model architecture associated with the server system.
The server system is configured to generate an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
Moreover, in an example embodiment, the server system is configured to generate an aggregated token sequence for each node based, at least in part, on the updated feature vector for each node. In an embodiment, the server system generates the aggregated token sequence via a tokenization layer of the structure-aware transformer model architecture. Further, in an embodiment, for the server system to generate the aggregated token sequence for each node, the server system is configured to generate a set of tokens for each node. Each token from the set of tokens may be generated based, at least in part, on aggregating a first set of updated feature vectors corresponding to a first set of nodes that are within a predefined count of hops from the corresponding node. The server system is further configured to generate a global context vector for each node based, at least in part, on aggregating a second set of updated feature vectors corresponding to a second set of nodes that are beyond the predefined count of hops from the corresponding node. Further, the server system is configured to aggregate the set of tokens for each node with the global context vector for each node based, at least in part, on aggregation criteria, thereby generating the aggregated token sequence for each node.
In an embodiment, the aggregation criteria include that the predefined count of hops for generating the aggregated token sequence for a particular node is equal to the shortest-path distance between the corresponding node and the neighbor nodes of the corresponding node. The neighbor nodes may include the first set of nodes.
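Under these criteria, the tokenization step can be sketched as follows. This is a minimal illustration: the mean is assumed as the aggregation function, the graph and feature values are hypothetical, and simple concatenation into a sequence stands in for the aggregation criteria.

```python
import numpy as np

def hop_distances(A):
    """All-pairs shortest-path hop counts via BFS (small graphs)."""
    n = A.shape[0]
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0
        frontier, d = [s], 0
        while frontier:
            d += 1
            nxt = []
            for u in frontier:
                for v in np.nonzero(A[u])[0]:
                    if dist[s, v] == np.inf:
                        dist[s, v] = d
                        nxt.append(v)
            frontier = nxt
    return dist

def tokenize(A, updated_features, k):
    """Per node: its own vector, one token per hop level 1..k (mean of
    that level's updated feature vectors), and a global context vector
    aggregating all nodes beyond k hops."""
    dist = hop_distances(A)
    dim = updated_features.shape[1]
    sequences = []
    for v in range(A.shape[0]):
        tokens = [updated_features[v]]
        for h in range(1, k + 1):
            mask = dist[v] == h
            tokens.append(updated_features[mask].mean(axis=0)
                          if mask.any() else np.zeros(dim))
        beyond = dist[v] > k  # nodes past the predefined hop count
        tokens.append(updated_features[beyond].mean(axis=0)
                      if beyond.any() else np.zeros(dim))
        sequences.append(np.stack(tokens))
    return np.stack(sequences)  # shape (N, k + 2, d)

# Hypothetical path graph 0-1-2-3 with 3-dimensional updated features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
feats = np.eye(4, 3)
seqs = tokenize(A, feats, k=2)
print(seqs.shape)  # (4, 4, 3): self, 1-hop, 2-hop, global context
```

For node 0 here, the global context token aggregates only node 3 (the sole node beyond two hops), which is how long-range information survives the hop cutoff.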
Furthermore, in an embodiment, the server system is further configured to train a transformer layer of the structure-aware transformer model architecture for performing a node classification task based, at least in part, on the aggregated token sequence corresponding to each node.
Various embodiments of the present disclosure offer multiple advantages and technical effects. For instance, the present disclosure proposes a structure-aware transformer model architecture that can fully exploit both the semantic and the structural knowledge provided by the graph in an optimal way. Further, since the present disclosure utilizes graph transformers rather than GNNs, the need to pass messages to one-hop neighbors for information aggregation is eliminated; instead, k-hop aggregation is performed, and hence long-range dependencies or interactions can be captured. In addition, since the message-passing mechanism is eliminated, the risk of over-smoothing and over-squashing is alleviated. Also, the architecture proposed in the present disclosure performs better than the 1-WL test, as graph transformers are more expressive than GNNs.
Furthermore, structural information such as local sub-structural knowledge and global structural knowledge is also considered, by using the neighborhood contrastive loss for positional encoding and by considering all nodes beyond k hops, respectively. As a result, the proposed architecture is more effective and more efficient in terms of computation and time. More specifically, it may be noted that the use of the contrastive loss makes the implementation of the proposed architecture computationally less expensive, i.e., O(b²·d), where b is the batch size and d is the feature dimension, in comparison to Laplacian matrix decomposition (which has a computational complexity of O(N³) for decomposing an N×N Laplacian matrix). Moreover, there is a significant improvement in the performance of the model due to the usage of positional encodings obtained by considering the contrastive loss.
More specifically, it may be noted that the present disclosure emphasizes not only the performance of the algorithm but also its scalability. Further, by leveraging a neighborhood aggregation sampling approach, the attention scope is significantly narrowed from N² nodes to merely K nodes. This reduces the computational complexity of the attention mechanism to O(K²·N·d) from the original O(N²), where d denotes the feature dimension.
Moreover, the aggregation operations involved in generating input tokens can be performed offline as well, thereby enhancing efficiency. Hence, the overall complexity of the present disclosure is O(K²·N·d). This linear time complexity ensures that the method proposed in the present disclosure is well equipped to manage larger graphs efficiently, leading to a scalable graph learning architecture.
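A back-of-the-envelope comparison of the complexity terms above, using hypothetical sizes (these numbers are illustrative, not from the disclosure):

```python
# Hypothetical sizes: N nodes, batch size b, feature dim d, K tokens per node.
N, b, d, K = 100_000, 512, 64, 8

laplacian_cost = N ** 3            # O(N^3): Laplacian eigendecomposition
contrastive_cost = b ** 2 * d      # O(b^2 * d): per-batch contrastive similarity
full_attention = N ** 2            # O(N^2): dense pairwise attention
sampled_attention = K ** 2 * N * d # O(K^2 * N * d): attention over K tokens/node

# The contrastive positional encoding and sampled attention are both
# orders of magnitude cheaper at this scale.
print(laplacian_cost // contrastive_cost)
print(full_attention / sampled_attention)
```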
In addition, in the financial domain, transactional graphs can be provided as input to the proposed architecture, and node embeddings can be obtained. These node embeddings can then be used for performing different downstream tasks such as fraud detection, credit risk assessment, customer relationship management, anomaly detection, etc., to improve the performance of different Artificial Intelligence (AI) products available in the market. Thus, in conclusion, it may be noted that the strength of the proposed transformer architecture lies in its ability to model complex relationships, handle graph-structured data efficiently, and provide valuable insights for various financial tasks and decision-making processes.
Various example embodiments of the present disclosure are described hereinafter with reference to FIGS. 1 to 8.
FIG. 1 illustrates a schematic representation of an environment 100 related to at least some example embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, computing positional encodings, concatenating with input node features, performing sampling of nodes, and the like.
The environment 100 generally includes a plurality of entities such as a server system 102, a plurality of cardholders 104(1), 104(2), … 104(N) (collectively referred to hereinafter as a ‘plurality of cardholders 104’ or simply ‘cardholders 104’), a plurality of merchants 106(1), 106(2), … 106(N) (collectively referred to hereinafter as a ‘plurality of merchants 106’ or simply ‘merchants 106’), a plurality of issuer servers 108(1), 108(2), … 108(N) (collectively referred to hereinafter as a ‘plurality of issuer servers 108’ or simply ‘issuer servers 108’), a plurality of acquirer servers 110(1), 110(2), … 110(N) (collectively referred to hereinafter as a ‘plurality of acquirer servers 110’ or simply ‘acquirer servers 110’), a payment network 112 including a payment server 114, and a database 116 each coupled to, and in communication with (and/or with access to) a network 118. Herein, ‘N’ is a non-zero natural number. The network 118 may include, without limitation, a Light Fidelity (Li-Fi) network, a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a Radio Frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts or users illustrated in FIG. 1, or any combination thereof.
Various entities in the environment 100 may connect to the network 118 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, New Radio (NR) communication protocols, any future communication protocol, or any combination thereof. In some instances, the network 118 may utilize a secure protocol (e.g., Hypertext Transfer Protocol Secure (HTTPS), Secure Sockets Layer (SSL), and/or any other protocol or set of protocols) for communicating with the various entities depicted in FIG. 1.
In an embodiment, the cardholder (e.g., the cardholder 104(1)) may be any individual, representative of a corporate entity, a non-profit organization, or any other person who is presenting payment account details during an electronic payment transaction. The cardholder (e.g., the cardholder 104(1)) may have a payment account issued by an issuing bank (not shown in figures) associated with an issuer server (e.g., the issuer server 108(1)) and may be provided with a payment card with financial or other account information encoded onto the payment card such that the cardholder (i.e., the cardholder 104(1)) may use the payment card to initiate and complete a payment transaction using a bank account at the issuing bank.
In another embodiment, the cardholders 104 may use their corresponding electronic devices (not shown in figures) to access a mobile application or a website associated with the issuing bank, or any third-party payment application to perform a payment transaction. In various non-limiting examples, the electronic devices may refer to any electronic devices such as, but not limited to, Personal Computers (PCs), tablet devices, smart wearable devices, Personal Digital Assistants (PDAs), voice-activated assistants, Virtual Reality (VR) devices, smartphones, laptops, and the like.
In an embodiment, the merchants 106 may include retail shops, restaurants, supermarkets or establishments, government and/or private agencies, or any such places equipped with POS terminals, where the cardholders 104 visit to perform financial transactions in exchange for any goods and/or services.
In one scenario, the cardholders 104 may use their corresponding payment accounts to conduct payment transactions with the merchants 106. Moreover, it may be noted that each of the cardholders 104 may use their corresponding payment cards differently or make the payment transaction using different means of payment such as net banking, Unified Payments Interface (UPI) payment, card transaction, cheque transaction, etc. For instance, the cardholder 104(1) may enter payment account details on an electronic device (not shown) associated with the cardholder 104(1) to perform an online payment transaction. In another instance, the cardholder 104(2) may utilize a payment card to perform an offline payment transaction. In yet another instance, the cardholder 104(3) may enter details of the payment card to transfer funds in the form of fiat currency on an e-commerce platform to buy goods.
In one embodiment, the cardholders 104 may be associated with financial institutions, such as issuing banks, that are associated with the issuer servers 108. To that end, it is noted that the terms "issuer bank", "issuing bank" or simply "issuer", and "issuer servers" may hereinafter be used interchangeably. It may be understood that a cardholder (e.g., the cardholder 104(1)) may have the payment account with the issuing bank (that may issue a payment card, such as a credit card or a debit card, to the cardholders 104). Further, the issuing banks provide banking services (e.g., payment transactions using credit/debit cards) for processing electronic payment transactions to the cardholder (e.g., the cardholder 104(1)).
In an embodiment, the merchants 106 are generally associated with financial institutions such as acquiring banks who are associated with the acquirer servers 110. The terms “acquirer”, “acquiring bank”, “acquirer server”, and “acquirer servers” will be used interchangeably hereinafter. The acquiring bank can be an institution that facilitates the processing of payment transactions for physical stores, merchants, or institutions that own platforms that make either online purchases or purchases made via software applications possible.
In an embodiment, the interactions or the payment transactions between the cardholders 104 and the merchants 106 can be best represented in the form of a graph having multiple nodes and edges. As may be understood, the nodes in a graph can be of different types indicating entities between which the interaction may happen based on the type of network. Examples of entities may include cardholders, merchants, users, atoms in a molecule, members of a family, friends in a social network, etc. Further, the edges may indicate the type of interaction between the nodes which is also based on the type of network. In a non-limiting example of the payment network such as the payment network 112, the graph may include a bi-partite graph of payment transactions between the cardholders 104 and the merchants 106.
As mentioned earlier, graph-structured data is beneficial for performing several tasks, as its node-link structure preserves a lot of information, reduces time complexity, and is helpful for solving relational problems. The tasks may include fraud detection, credit risk assessment, customer relationship management, anomaly detection, facilitating the implementation of distributed computing, and the like. However, due to the intrinsic complexities of graphs, it is difficult to derive useful insights from the information contained in the structure of the graphs. Also, conventional graphical data processing techniques have limitations as they fail to capture global and long-range dependencies and cause over-smoothing, over-squashing, etc. On the other hand, transformers have achieved state-of-the-art performance as a deep learning architecture in several fields. Thus, there is significant development in the research on approaches to transfer the architecture of transformers to the field of graph-based representation learning.
As may be understood, conventional transformers take a sequence of tokens as input and use the attention mechanism to calculate the similarity/dependence between each pair of tokens. This corresponds to the features associated with each token in the input sequence. Considering this, it may be noted that a token's updated features are simply the sum of linear transformations of the features of all the tokens, weighted by their importance. Further, this information is concatenated with positional encodings, facilitating the transformer to understand the relative position of each token. Further, mathematical operations are performed on this concatenated data to obtain more complex patterns and relationships between the tokens. This is followed by residual connections and layer normalization to make sure that information is not lost during processing and that training is more stable.
Further, the architecture of conventional transformers generally includes tokenizers for generating a sequence of tokens from the input data. This is followed by an embedding layer for generating vector representations of the tokens. The architecture further includes transformer layers and, optionally, un-embedding layers. Transformer layers can be one of two types: encoder and decoder. Some of the conventional transformers possess both the encoder and the decoder. The role of the transformer layers is to carry out repeated transformations on the vector representations, extracting more and more linguistic information. These layers include alternating multi-head self-attention (MSA) and position-wise feed-forward network (FFN) sub-layers. The same can be explained using mathematical symbols and vectors. Consider H ∈ ℝ^(n×d) to be the input to an MSA module, where ℝ denotes the real numbers, n is the number of tokens, and d is the hidden dimension. This input H is projected into three matrices Q, K, and V using the projection matrices W_Q ∈ ℝ^(d×d_Q), W_K ∈ ℝ^(d×d_K), and W_V ∈ ℝ^(d×d_V). The self-attention may then be calculated as:
Q = HW_Q, K = HW_K, V = HW_V … Eqn. 1
Attn(H) = softmax(QK^T/√(d_K)) … Eqn. 2
The role of the attention matrix of Eqn. 2 is to capture the pair-wise semantic similarity between the tokens in the input sequence. More specifically, it calculates the dot product between each token pair after projection, and the softmax function is applied row-wise. Thus, it is understood that such transformer architectures fail to efficiently process the graph-structured data as they are trained to operate on input that is in the form of a sequence of tokens.
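As a concrete illustration, Eqns. 1 and 2 can be sketched as follows; the function and variable names are illustrative, and a numerically stable softmax is used:

```python
import numpy as np

def attention_matrix(H: np.ndarray, W_Q: np.ndarray, W_K: np.ndarray) -> np.ndarray:
    """Eqns. 1-2: project the tokens into Q and K, then take the row-wise
    softmax of the scaled dot products, yielding the pairwise attention
    (semantic similarity) matrix over the token sequence."""
    Q, K = H @ W_Q, H @ W_K
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))          # n = 5 tokens, hidden dimension d = 8
A = attention_matrix(H, rng.standard_normal((8, 4)), rng.standard_normal((8, 4)))
```

Each row of the resulting matrix is a probability distribution over all tokens, which is what makes the quadratic cost in the number of tokens apparent.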
Moreover, as mentioned earlier, conventional approaches that enable a transformer to learn from the graph-structured data have several drawbacks. For example, a conventional approach using Laplacian eigenvectors for computing the positional encodings has limitations due to challenges in eigenvalue multiplicity, eigenvector sign ambiguity, and normalization issues. Thus, it may be understood that conventional approaches require a lot of computation and hence possess quadratic time complexity for merely computing the positional encodings.
Further, some other conventional approaches have drawbacks associated with node sampling as well. It may be noted that these drawbacks are linked to the efficiency, thoroughness, and uniformity of the sampling process, and they ultimately result in a substantial loss of information. To mitigate this, a conventional approach is known in the art in which the sampling mechanism used for node sampling enables an input token i to aggregate node features from the ith-hop neighbors. This approach achieves remarkable efficiency while ensuring minimal information loss. Nevertheless, a drawback associated with this approach is that long-range interactions beyond a particular count of hops (k hops) in the graph are overlooked.
Therefore, there is a need for a technical solution for generating a new transformer model architecture that is capable of learning from the graph-structured data more efficiently in terms of both computation complexity and time consumption.
The above-mentioned technical problems, among other problems, are addressed by one or more embodiments implemented by the server system 102 and the methods thereof provided in the present disclosure. In an embodiment, it may be noted that the methods and systems proposed in the present disclosure can be used in any domain or industry to perform the classification task with different classes. However, for the sake of explanation, analysis, and performance comparison, the example of using the proposed system in the payment industry of the financial domain is considered in the present disclosure. Nonetheless, the present disclosure is applicable in various other industries such as healthcare, financial technology, hospitality, media, travel, law enforcement, and the like. In one embodiment, the server system 102 is configured to perform one or more of the operations described herein.
As may be understood, since the present disclosure is intended to develop a new transformer architecture that can efficiently learn from the graph-structured data, an unweighted and undirected graph can be considered for the sake of the explanation of the implementation of the proposed architecture. In an embodiment, in the financial industry, the graph-structured data may correspond to transactional graph input data, i.e., a graph indicating the payment transactions between the cardholders 104 and the merchants 106. Thus, in an embodiment, the server system 102 may store the graph-structured data in the database 116 of the server system 102.
In various non-limiting examples, the database 116 may include one or more Hard Disk Drives (HDD), Solid-State Drives (SSD), an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a redundant array of independent disks (RAID) controller, a Storage Area Network (SAN) adapter, a network adapter, and/or any component providing the server system 102 with access to the database 116. In one implementation, the database 116 may be viewed, accessed, amended, updated, and/or deleted by an administrator (not shown) associated with the server system 102 through a database management system (DBMS) or relational database management system (RDBMS) present within the database 116.
Further, it may be noted that, in a specific example, the server system 102 coupled with the database 116 is embodied within a payment server associated with the payment processor; however, in other examples, the server system 102 can be a standalone component (acting as a hub) connected to the issuer servers and the acquirer servers. The database 116 may be incorporated in the server system 102, may be an individual entity connected to the server system 102, or may be a database stored in cloud storage.
In a non-limiting embodiment, the server system 102 is configured to facilitate the generation of a structure-aware transformer model architecture that is capable of learning from the graph-structured data. As may be understood, for generating the structure-aware transformer model architecture, the server system 102 may have to access the homogeneous graph from the database 116. In an embodiment, the server system 102 is configured to access the homogeneous graph from the database 116. In a non-limiting example, the homogeneous graph may include a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users. Further, each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes. It may be noted that, initially, the server system 102 generates the homogeneous graph from the graph-structured data. Moreover, in an embodiment, in the payment industry, the plurality of users may correspond to the cardholders 104 or the merchants 106. Thus, each node may represent either a cardholder or a merchant, while each edge may represent the relationship between different cardholders or merchants. Moreover, it is understood that the structure-aware transformer operates on the homogeneous graph, which requires each node to be of the same type.
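As an illustrative sketch of obtaining a homogeneous graph from bipartite transaction data, cardholder nodes can be linked whenever they transact with a common merchant (a standard bipartite projection). The record layout and identifiers below are hypothetical, not taken from the disclosure:

```python
from collections import defaultdict

# hypothetical transaction records: (cardholder_id, merchant_id, amount)
transactions = [("c1", "m1", 50.0), ("c2", "m1", 20.0), ("c1", "m2", 75.0)]

# group cardholders by the merchant they transacted with
by_merchant = defaultdict(set)
for cardholder, merchant, _ in transactions:
    by_merchant[merchant].add(cardholder)

# homogeneous cardholder graph: an edge links cardholders sharing a merchant
edges = set()
for cardholders in by_merchant.values():
    ordered = sorted(cardholders)
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            edges.add((ordered[i], ordered[j]))
```

Every node in the resulting graph is of the same type (a cardholder), satisfying the homogeneity requirement noted above.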
In an example implementation, the graph-structured data may include historical information corresponding to a plurality of payment transactions performed between the plurality of cardholders 104 and the plurality of merchants 106. In an embodiment, the historical information may include, but is not limited to, transaction attributes, such as transaction amount, source of funds such as bank accounts, debit cards or credit cards, transaction channel used for loading funds such as POS terminal or ATM, transaction velocity features such as count and transaction amount sent in the past ‘x’ number of days to a particular user, transaction location information, external data sources, merchant country, merchant Identifier (ID), cardholder ID, cardholder product, cardholder Permanent Account Number (PAN), Merchant Category Code (MCC), merchant location data or merchant co-ordinates, merchant industry, merchant super industry, ticket price, and other transaction-related data.
Further, the server system 102 may be configured to determine a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph. Furthermore, the process of developing the structure-aware transformer model architecture capable of learning from the graph-structured data can be split into at least three steps: generating positional encodings, performing node sampling, and introducing biases explicitly. Accordingly, the present disclosure introduces a new approach for computing the positional encodings that is easy to implement and efficient in terms of time. To that end, the present disclosure introduces positional encodings that are trained via a contrastive loss, such as a Neighborhood Contrastive Loss. The Neighborhood Contrastive Loss is used for computing the positional encodings to make sure that nodes in the graph-structured data with strong structural ties yield high dot-product values for their positional encodings. Further, it also ensures that the intricate topology of the entire graph in the graph-structured data is effectively encapsulated within the positional embedding space.
In an embodiment, the server system 102 is configured to train a structure-related neural network of the structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and the contrastive loss. Further, the server system 102 is configured to compute a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph. The process of computing the positional encodings based on the Neighborhood Contrastive loss is explained later in the present disclosure.
Furthermore, the server system 102 may be configured to generate an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
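The concatenation step can be sketched as follows; the dimensions (16 node features, 8 positional-encoding dimensions) and the use of random placeholders for the trained encodings are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
node_feats = rng.standard_normal((5, 16))   # plurality of node features, d = 16
pos_enc = rng.standard_normal((5, 8))       # stand-in for trained positional encodings

# updated feature vector: node features concatenated with the positional encoding
updated = np.concatenate([node_feats, pos_enc], axis=1)   # shape (5, 24)
```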
Moreover, the server system 102 may be configured to generate an aggregated token sequence for each node based, at least in part, on the updated feature vector for each node. In an embodiment, the server system 102 may generate the aggregated token sequence via a tokenization layer of the structure-aware transformer model architecture. It may be noted that the present disclosure discloses a new sampling technique that also considers nodes beyond the kth hop. As a result, the present disclosure efficiently captures short-range interactions as well as long-range interactions, all without having a negative impact on time complexity. The process of the implementation of the new node sampling technique is explained later in the present disclosure. Finally, the server system 102 is configured to train a transformer layer of the structure-aware transformer model architecture for performing a node classification task based, at least in part, on the aggregated token sequence corresponding to each node. It may be noted that the structure-aware transformer model architecture thus generated is a new model architecture for a structure-aware transformer model. Thus, in an embodiment, the database 116 may also store one or more AI or ML models such as the structure-aware transformer model.
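One hedged way to realize the global context vector in the aggregated token sequence is to append a single summary token to each node's hop-wise token sequence; taking the mean over all node features, as below, is an assumption for illustration rather than the disclosed method:

```python
import numpy as np

def append_global_token(tokens: np.ndarray, all_feats: np.ndarray) -> np.ndarray:
    """Extend each node's hop-wise token sequence of shape (n, t, d) with
    one global context vector (here, the mean over all node features),
    so nodes beyond the k-th hop still contribute to the input."""
    n, _, d = tokens.shape
    g = np.broadcast_to(all_feats.mean(axis=0), (n, 1, d))
    return np.concatenate([tokens, g], axis=1)   # shape (n, t + 1, d)

feats = np.arange(12.0).reshape(4, 3)           # 4 nodes, d = 3
hop_tokens = np.stack([feats, feats], axis=1)   # placeholder hop-0/hop-1 tokens
seq = append_global_token(hop_tokens, feats)
```

Because the global token is shared across nodes, it adds only O(1) tokens per node and does not change the linear scaling discussed earlier.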
In other various examples, the database 116 may also include multifarious data, for example, social media data, Know Your Customer (KYC) data, payment data, trade data, employee data, Anti Money Laundering (AML) data, market abuse data, Foreign Account Tax Compliance Act (FATCA) data, and fraudulent payment transaction data. In addition, the database 116 provides a storage location for data and/or metadata obtained from various operations performed by the server system 102.
In one embodiment, the payment network 112 may be used by the payment card issuing authorities as a payment interchange network. Examples of the payment cards include debit cards, credit cards, etc. Similarly, examples of payment interchange networks include but are not limited to, a Mastercard® payment system interchange network. The Mastercard® payment system interchange network is a proprietary communications standard promulgated by Mastercard International Incorporated® for the exchange of electronic payment transaction data between issuers and acquirers that are members of Mastercard International Incorporated®. (Mastercard is a registered trademark of Mastercard International Incorporated located in Purchase, N.Y.).
It should be understood that the server system 102 is a separate part of the environment 100, and may operate apart from (but still in communication with, for example, via the network 118) any third-party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 102 may be incorporated, in whole or in part, into one or more parts of the environment 100.
The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. In addition, the server system 102 should be understood to be embodied in at least one computing device in communication with the network 118, which may be specifically configured, via executable instructions, to perform steps as described herein, and/or embodied in at least one non-transitory computer-readable medium.
FIG. 2 illustrates a simplified block diagram of a server system 200, in accordance with an embodiment of the present disclosure. The server system 200 is identical to the server system 102 of FIG. 1. In one embodiment, the server system 200 is a part of the payment network 112 or integrated within the payment server 114. In some embodiments, the server system 200 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture.
The server system 200 includes a computer system 202 and a database 204. The computer system 202 includes at least one processor 206 (herein, referred to interchangeably as ‘processor 206’) for executing instructions, a memory 208, a communication interface 210, a user interface 212, and a storage interface 214. The one or more components of the computer system 202 communicate with each other via a bus 216. The components of the server system 200 provided herein may not be exhaustive and the server system 200 may include more or fewer components than those depicted in FIG. 2. Further, two or more components depicted in FIG. 2 may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities.
In some embodiments, the database 204 is integrated into the computer system 202. In one non-limiting example, the database 204 is configured to store the graph-structured data such as the graph-structured data 218. In a non-limiting example, as mentioned earlier in the present disclosure, the graph-structured data 218 may include historical information corresponding to the plurality of payment transactions performed between the plurality of cardholders (e.g., the cardholders 104) and the plurality of merchants (e.g., the merchants 106). In an embodiment, the graphical representation may include a plurality of nodes and a plurality of edges connected with each other.
Further, in an embodiment, the database 204 may also store one or more AI or ML models such as a structure-aware transformer model 220 that may be used by the server system 200 for further processing the graph-structured data 218. In addition, the database 204 provides a storage location for data and/or metadata obtained from various operations performed by the server system 200.
Further, the computer system 202 may include one or more hard disk drives as the database 204. The user interface 212 is an interface such as a Human Machine Interface (HMI) or a software application that allows users such as an administrator to interact with and control the server system 200 or one or more parameters associated with the server system 200. It may be noted that the user interface 212 may be composed of several components that vary based on the complexity and purpose of the application. Examples of components of the user interface 212 may include visual elements, controls, navigation, feedback and alerts, user input and interaction, responsive design, user assistance and help, accessibility features, and the like. More specifically these components may correspond to icons, layout, color schemes, buttons, sliders, dropdown menus, tabs, links, error/success messages, mouse and touch interactions, keyboard shortcuts, tooltips, screen readers, and the like.
The storage interface 214 is any component capable of providing the processor 206 access to the database 204. The storage interface 214 may include, for example, an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a RAID controller, a SAN adapter, a network adapter, and/or any component providing the processor 206 with access to the database 204.
The processor 206 includes suitable logic, circuitry, and/or interfaces to execute operations for determining a plurality of node features for each node, training the structure-related neural network of a structure-aware transformer model architecture to determine one or more similarity metrics between pairs of nodes, computing a positional encoding, generating an updated feature vector, and the like. Examples of the processor 206 include, but are not limited to, an Application-Specific Integrated Circuit (ASIC) processor, a Reduced Instruction Set Computing (RISC) processor, a Graphical Processing Unit (GPU), a complex instruction set computing (CISC) processor, a Field-Programmable Gate Array (FPGA), and the like.
The memory 208 includes suitable logic, circuitry, and/or interfaces to store a set of computer-readable instructions for performing operations. Examples of the memory 208 include a Random-Access Memory (RAM), a Read-Only Memory (ROM), a removable storage drive, a Hard Disk Drive (HDD), and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 208 in the server system 200, as described herein. In another embodiment, the memory 208 may be realized in the form of a database server or a cloud storage working in conjunction with the server system 200, without departing from the scope of the present disclosure.
The processor 206 is operatively coupled to the communication interface 210, such that the processor 206 is capable of communicating with a remote device 222 such as the issuer servers 108, the acquirer servers 110, the payment server 114, or communicating with any entity connected to the network 118 (as shown in FIG. 1).
It is noted that the server system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is noted that the server system 200 may include fewer or more components than those depicted in FIG. 2.
In one implementation, the processor 206 includes a data pre-processing module 224, a positional encoding computation module 226, a concatenation module 228, and an aggregation module 230. It should be noted that components, described herein, such as the data pre-processing module 224, the positional encoding computation module 226, the concatenation module 228, and the aggregation module 230 can be configured in a variety of ways, including electronic circuitries, digital arithmetic, and logic blocks, and memory systems in combination with software, firmware, and embedded technologies. Moreover, it may be noted that the data pre-processing module 224, the positional encoding computation module 226, the concatenation module 228, and the aggregation module 230 may be communicably coupled with each other to exchange information with each other for performing the one or more operations facilitated by the server system 200.
In an embodiment, the data pre-processing module 224 includes suitable logic and/or interfaces for accessing the graph-structured data 218 from the database 204. As may be understood, the graph-structured data 218 may include a graphical representation of historical information corresponding to a plurality of payment transactions performed between a plurality of cardholders (e.g., the cardholders 104) and a plurality of merchants (e.g., the merchants 106).
In an embodiment, the data pre-processing module 224 accesses a homogeneous graph from the database 204. The homogeneous graph includes a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users. Further, each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes. In the payment network 112, the plurality of users may be either the cardholders 104 or the merchants 106 as homogeneity requires the nodes in a graph to be of the same type. Moreover, it may be noted that the data pre-processing module 224 generates the homogeneous graph from the graph-structured data 218.
The data pre-processing module 224 further determines a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph. In an example embodiment of the payment network 112, the plurality of node features may include a plurality of merchant-related features or a plurality of cardholder-related features. In an embodiment, the plurality of merchant-related features may include a count of approved transactions for different predefined periods, amounts corresponding to the approved transactions for the predefined periods, amounts corresponding to card-not-present (CNP) transactions for different predefined periods, a count of card-not-present (CNP) transactions for different predefined periods, a count of card-present (CP) transactions for different predefined periods, amounts corresponding to the CP transactions for different predefined periods, merchant identity (ID), and the like. In another embodiment, the plurality of cardholder-related features may include an approved spending amount, approved transactions, total spend, total transactions, etc. Further, the homogeneous graph may be provided to the positional encoding computation module 226.
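A simple sketch of deriving a few merchant-related node features from raw transaction records follows; the record layout (merchant ID, amount, card-present flag) and feature names are hypothetical, chosen to mirror the counts and amounts listed above:

```python
from collections import defaultdict

# hypothetical transaction records: (merchant_id, amount, card_present)
records = [("m1", 50.0, True), ("m1", 20.0, False), ("m2", 75.0, True)]

# accumulate per-merchant counts and amounts as node features
features = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0,
                                "cnp_count": 0, "cp_count": 0})
for merchant_id, amount, card_present in records:
    f = features[merchant_id]
    f["txn_count"] += 1
    f["total_amount"] += amount
    f["cp_count" if card_present else "cnp_count"] += 1
```

In practice, such aggregates would be computed per predefined period (e.g., per week or month) before being assembled into each node's feature vector.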
In an embodiment, the positional encoding computation module 226 includes suitable logic and/or interfaces for training a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss. The structure-aware transformer model architecture may correspond to a model architecture of the structure-aware transformer model 220.
Further, for training the structure-related neural network, the positional encoding computation module 226 performs a set of operations iteratively until a performance of the structure-related neural network converges to predefined criteria. The predefined criteria include saturation of the contrastive loss. The contrastive loss may be saturated after a plurality of iterations of the set of operations is performed. The set of operations may be further elaborated in terms of the configuration of the server system 200. Thus, the server system 200 is configured to initialize the structure-related neural network with a training dataset and one or more model parameters. The training dataset includes a predefined pair of nodes including a reference node and a negative node. As used herein, the term ‘reference node’ refers to a target node concerning which all the computations related to a task are performed. In an embodiment, the reference node refers to a target node for which a positional encoding is computed at a particular instant of time. Further, as used herein, the term ‘negative node’ refers to a node that is farther from or not related to the reference node. Herein, information related to the negative node is provided while training the structure-related neural network to assist the structure-related neural network to learn to distinguish between neighborhood nodes and farther or unrelated nodes.
As may be understood, for training any model or a neural network, its model parameters may have to be set based on which the corresponding model may start to perform its operation. At later stages of training the model, the model parameters get updated, thereby improving the performance of the model. Thus, in an embodiment, the one or more model parameters may include weights, biases, epochs, hyperparameters such as a hop count, a split percentage of an input dataset for training, validation, and testing operations, etc.
Further, the server system 200 may be configured to determine, via the structure-related neural network, one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes based, at least in part, on an adjacency matrix corresponding to the homogeneous graph, the training dataset and the one or more model parameters. In an embodiment, each pair of nodes from the one or more pairs of nodes may include a target node and a neighbor node. Herein, the neighbor node may correspond to a kth hop node from the target node. Further, the adjacency matrix provides information about nodes that are adjacent to each other or that are related to each other. As used herein, the term ‘similarity metric’ may refer to a parameter that provides information about the similarity and/or dissimilarity between a pair of nodes. Examples of the one or more similarity metrics may include a count of neighbor nodes to a particular node, a count of common neighbor nodes for each node, a distance between nodes based on a hop count, and the like.
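By way of a non-limiting illustration only, the similarity metrics described above may be derived directly from the adjacency matrix. The following Python sketch (the function name, the chosen metrics, and the maximum hop count are illustrative assumptions, not part of the claimed embodiments) computes a count of common neighbor nodes, node degrees, and a hop-count distance for a pair of nodes in an unweighted, undirected graph:

```python
import numpy as np

def similarity_metrics(adj: np.ndarray, u: int, v: int, max_hops: int = 3):
    """Illustrative similarity metrics between nodes u and v of an
    unweighted, undirected graph, given its adjacency matrix."""
    n = adj.shape[0]
    # Count of common neighbor nodes shared by u and v.
    common_neighbors = int(np.sum(adj[u] * adj[v]))
    # Degree (count of neighbor nodes) of each node.
    deg_u, deg_v = int(adj[u].sum()), int(adj[v].sum())
    # Hop distance: smallest k such that (A^k)[u, v] > 0.
    reach = np.eye(n)
    hop_distance = None
    for k in range(1, max_hops + 1):
        reach = reach @ adj       # reach is now A^k
        if reach[u, v] > 0:
            hop_distance = k
            break
    return {"common_neighbors": common_neighbors,
            "degree_u": deg_u, "degree_v": deg_v,
            "hop_distance": hop_distance}

# Path graph 0-1-2-3 as a toy example.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
m = similarity_metrics(A, 0, 2)
```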
Furthermore, the server system 200 may be configured to generate, via the structure-related neural network, a set of positive pairs of nodes and a set of negative pairs of nodes based, at least in part, on the one or more similarity metrics. The set of positive pairs of nodes indicates a first subset of nodes that are closely related to each other, and the set of negative pairs of nodes indicates a second subset of nodes that are unrelated to each other. Further, the server system 200 may be configured to compute, via the structure-related neural network, a first set of probability values for the corresponding set of positive pairs of nodes and a second set of probability values for the corresponding set of negative pairs of nodes. The server system 200 may be further configured to compute a contrastive loss using a contrastive loss function based, at least in part, on the comparison of the first set of probability values with the second set of probability values. Finally, the server system 200 may be configured to backpropagate the contrastive loss for maximizing the difference between the first set of probability values and the second set of probability values. Moreover, the contrastive loss may include a neighborhood contrastive loss. The details of the contrastive loss and the process of training the structure-related neural network are explained further in the present disclosure.
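As a non-limiting sketch of how the set of positive pairs of nodes and the set of negative pairs of nodes might be formed, the following Python snippet labels a pair as positive when the two nodes are reachable within k hops of each other and as negative otherwise; the k-hop reachability criterion and the helper name are illustrative assumptions:

```python
import numpy as np

def make_pairs(adj: np.ndarray, k: int = 2):
    """Label node pairs as positive (reachable within k hops) or
    negative (beyond k hops) using powers of the adjacency matrix."""
    n = adj.shape[0]
    reach = np.zeros((n, n))
    power = np.eye(n)
    for _ in range(k):
        power = power @ adj   # walks of length 1, 2, ..., k
        reach += power
    positives, negatives = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (positives if reach[i, j] > 0 else negatives).append((i, j))
    return positives, negatives

# Path graph 0-1-2-3: node 3 is three hops from node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pos, neg = make_pairs(A, k=2)
```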
Further, the positional encoding computation module 226 is also configured to compute the positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph. In an embodiment, the positional encoding computation module 226 is configured to compute the positional encoding by the structure-related neural network of the structure-aware transformer model architecture associated with the server system. The plurality of node features and the positional encoding may be provided to the concatenation module 228.
The concatenation module 228 includes suitable logic and/or interfaces for generating an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node. In an embodiment, the concatenation module 228 may facilitate applying a concatenation operator onto the plurality of node features and the positional encoding of each node. The updated feature vector is provided to the aggregation module 230.
The aggregation module 230 includes suitable logic and/or interfaces for generating an aggregated token sequence for each node based, at least in part, on the updated feature vector for each node. In an embodiment, the aggregation module 230 may be configured to generate the aggregated token sequence for each node via a tokenization layer of the structure-aware transformer model architecture.
Further, for the aggregation module 230 to generate the aggregated token sequence for each node, the aggregation module 230 may have to generate a set of tokens for each node. Each token from the set of tokens may be generated based, at least in part, on aggregating a first set of updated feature vectors corresponding to a first set of nodes that are within a predefined count of hops from the corresponding node (otherwise also referred to as a target node). The aggregation module 230 further generates a global context vector for each node based, at least in part, on aggregating a second set of updated feature vectors corresponding to a second set of nodes that are beyond the predefined count of hops from the corresponding node. Further, the aggregation module 230 aggregates the set of tokens for each node with the global context vector for each node based, at least in part, on aggregation criteria, thereby generating the aggregated token sequence for each node.
In an embodiment, the aggregation criteria include the predefined count of hops for generating the aggregated token sequence for a particular node (or the target node) is equal to a shortest-path distance between the corresponding node (the target node) and neighbor nodes of the corresponding node. The neighbor nodes may include the first set of nodes. In an embodiment, the structure-aware transformer model architecture corresponds to a Graph Neural Network (GNN)-transformer Machine learning (ML)-based model architecture. Further, the structure-related neural network of the structure-aware transformer model architecture corresponds to a multi-layer perceptron (MLP) neural network. Herein, the MLP neural network, in the context of contrastive learning may refer to a neural network that is used to map data points into a shared embedding space, where similar data points are closer, and dissimilar data points are farther apart. More specifically, the MLP neural network may be a feedforward neural network having multiple layers of interconnected neurons.
In an embodiment, to make sure that the performance of the new model architecture has improved, an experiment is conducted. It may be noted that the experiment is conducted on different types of datasets. In an example implementation, the datasets correspond to five public benchmark datasets including three citation networks. The citation networks used are Cora, Citeseer, and Pubmed. The other two are the Photo and Computer datasets. It may be noted that in a citation network, each node represents a paper, and an edge represents a citation between those papers. In accordance with an example implementation, details associated with the datasets considered for conducting the experiment are as follows:
Datasets     Nodes    Edges    Feature dimension    Classes
Cora          2708     5429     1433                 7
Citeseer      3327     4732     3703                 6
Pubmed       19717    44338      500                 3
Photo         7650     238K      745                 8
Computer     13752     491K      767                10
Table 1: Statistics of datasets used
As may be understood, the dataset is split into a training dataset, a validation dataset, and a testing dataset. The datasets considered for the sake of the experiment may also be split randomly approximately as 60 percent (%), 20%, and 20% for training, validation, and testing purposes, respectively.
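The random 60/20/20 split described above may be sketched as follows; the helper function and the fixed seed are illustrative assumptions (the node count 2708 corresponds to the Cora dataset of Table 1):

```python
import numpy as np

def split_indices(n: int, seed: int = 0):
    """Random ~60/20/20 train/validation/test split of node indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(2708)  # e.g., the Cora node count
```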
In an example implementation, the approximate experiment results are obtained upon implementation of the proposed architecture disclosed in the present disclosure for the above-mentioned datasets. Further, approximate experiment results are obtained for conventional approaches or baseline models such as the Graph Convolutional Network (GCN) model, the Graph Attention Network (GAT), the Self-Attention Network (SAN), etc., and compared with those of the proposed architecture. In an example implementation, the comparison of the approximate performance percentage values for the conventional approaches and the proposed architecture is as follows:
Approach                                            Cora            Citeseer        Pubmed          Photo           Computer
Conventional approaches                             80-91%          67-79%          78-89% or OOM   92-95%          89-91% or OOM
Structure-aware transformer model architecture
(proposed architecture)                             91.90% ± 0.27   80.80% ± 0.41   89.85% ± 0.14   96.41% ± 0.11   92.36% ± 0.17
Table 2: Experiment results
Upon observing the results, it is clear that the proposed model architecture consistently outperforms conventional approaches by a substantial margin. More specifically, the observations made may prove that the fundamental message passing of GNNs is not the optimal way of dealing with graph data structures. Further, purely transformer-based approaches may likewise not be the best method for dealing with graph-structured data. Further, it may be observed that some of the conventional approaches can also run Out-of-Memory (OOM) on medium-sized datasets.
Further, in an embodiment, an ablation study may also be performed on the above-mentioned datasets to make sure that the concatenation of the positional encodings (PEs) with the node features has a significant impact on the performance of the proposed model architecture. More specifically, it may be noted that the experiment may be performed on the above-mentioned datasets using the proposed model architecture without concatenating the node features with the positional encodings that are trained with the contrastive loss. From Table 2 it may be understood that the positional encodings are of greater help to datasets with high dependence on structural knowledge and inductive biases when compared to datasets where the semantic knowledge of node features is overpowering the topology.
Similarly, the experiment may be performed on the above-mentioned datasets by removing the extra token i.e., the global context vector from the input sequence. In an example implementation, the approximate ablation study results may be as follows:
Component                              Cora     Citeseer   Pubmed   Photo   Computer
No PEs and no global context vector    90.37    80.12      89.39    95.36   90.87
Global context vector but no PEs       90.93    80.71      89.70    95.94   92.25
PEs but no global context vector       91.11    80.72      89.55    96.27   91.71
PEs and global context vector          91.90    80.80      89.85    96.41   92.36
Table 3: Ablation results
From Table 3, it is evident that this extra token of global context information gains substantial importance if a dataset that has extremely low inductive biases and a good number of long-range interactions is considered. It is also very clear that the percentage gain solely due to the provision of this extra token varies over datasets.
FIG. 3 illustrates a schematic representation 300 of an architecture of a structure-aware transformer model (e.g., the structure-aware transformer model 220), in accordance with an embodiment of the present disclosure. In a non-limiting implementation, the process of generating the structure-aware transformer model architecture capable of learning from the graph-structured data such as the graph-structured data 218, can be split into at least three steps. In an embodiment, the at least three steps may include generating positional encodings, performing node sampling, and introducing biases explicitly. The present disclosure mainly discloses a new process of generating the positional encodings and a new process of node sampling.
In an embodiment, as the architecture is supposed to be designed to be capable of learning from the graph-structured data 218, inputs 302 to the architecture correspond to graphical data such as the homogeneous graph obtained from the graph-structured data 218. The inputs 302 to the architecture may have to be made compatible with a transformer architecture. Thus, in an embodiment, the data pre-processing module 224 accesses the homogeneous graph from the database 204. In an embodiment, the homogeneous graph includes a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users and each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes. The data pre-processing module 224 further determines the plurality of node features based, at least in part, on the graph-structured data 218. In a non-limiting example, the plurality of node features may include features that can be related to each node in a graph based on the type of network of the graph. For example, in the case of a transactional graph corresponding to a cardholder, the plurality of node features may include transaction amount, transaction date and time, transaction type, cardholder personal information, and the like.
In some embodiments, it may be noted that the data pre-processing module 224 may also generate a plurality of input embeddings (e.g., input embeddings 304) for the corresponding inputs 302 as shown in FIG. 3. Further, as may be understood when applying a transformer architecture to the homogeneous graph, the inclusion of positional encodings 306 (otherwise also referred to as positional encoding(s) 306) assumes an essential role in delivering structural insights. Moreover, in the absence of the positional encodings 306, an attention score may be solely dependent on semantic similarities of the plurality of node features. Thus, the main reason behind the introduction of the positional encodings 306 is to capture core structural nuances and explicitly integrate this information within the transformer architecture. To that end, in an embodiment, the positional encoding computation module 226 is configured to compute the positional encoding 306 for each node from the homogeneous graph. Further, a node embedding i.e., an updated feature vector 308 (otherwise also referred to as updated feature vector(s) 308) may be generated by concatenating the plurality of node features with the corresponding positional encodings 306 for each node.
Further, as it is understood that one of the fundamental operations of a transformer is the aggregation of information, a dot product between the node embeddings of two nodes in the graph may be computed for information aggregation. The positional encodings 306 may have to be computed in such a way that if two nodes have a strong structural influence on each other, then the dot product of the node embeddings of the corresponding two nodes should be substantially high. Moreover, these two nodes may have to be placed close to each other. Further, if two nodes are not structurally related, then the dot product of the node embeddings of the corresponding two nodes has to be a smaller value. Also, the corresponding nodes should be placed far away from each other.
Further, to achieve the above-mentioned condition, the positional encodings 306 for each node are trained using a contrastive loss framework. The contrastive loss framework involves the use of contrastive loss 310 for training the structure-aware transformer model 220 to generate the positional encoding 306. The process of generating the positional encoding 306 is explained further in the present disclosure.
Further, the updated feature vectors 308 are passed through a neighborhood features aggregation process (see, 312) for generating an aggregated token sequence 314 including k-hop aggregated vectors 316 along with a global context vector 318 as shown in FIG. 3. The process of performing the neighborhood feature aggregation process is explained further in the present disclosure.
Upon obtaining the k-hop aggregated vectors 316 and the global context vector 318, they are fed to a transformer block 320 for further processing. In an example embodiment, the k-hop aggregated vectors 316 and the global context vector 318 may be provided to the transformer block 320 and trained on performing a node classification task. In an embodiment, the transformer block 320 is similar to a conventional transformer block and hence includes a multi-head self-attention (MSA) block 322 and a position-wise feedforward network (FFN) 324. Thus, in an embodiment, the k-hop aggregated vectors 316 and the global context vector 318 are passed through the MSA block 322 and subsequently through the FFN 324. In a non-limiting implementation, the following equations provide outputs of the MSA block 322 and the FFN block 324 respectively:
Z′_v^(l) = MSA(LN(Z_v^(l-1))) + Z_v^(l-1) … Eqn. 3
Z_v^(l) = FFN(LN(Z′_v^(l))) + Z′_v^(l) … Eqn. 4
Herein, l = 1, 2, …, L denotes the lth layer of the transformer, and LN denotes LayerNorm, a normalization function applied to the aggregated token sequence 314. Thus, it may be understood that the transformer block 320 also includes one or more normalization layers 326 (otherwise also referred to as add and norm layer 326) associated with the MSA block 322 and the FFN 324 that facilitate speeding up the training by normalizing intermediate outputs of these blocks such as the MSA block 322 and the FFN 324. The details of the transformer block 320 and components of the transformer block 320 are not disclosed in detail for the sake of brevity, as they are well known to a person skilled in the art.
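In a non-limiting illustration, Eqns. 3 and 4 may be sketched in Python with NumPy as a single-head, pre-norm transformer block with residual connections (the single attention head, the ReLU feed-forward network, and the weight shapes are simplifying assumptions, not the claimed implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: normalize each token vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(z, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over the sequence z.
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v

def transformer_block(z, Wq, Wk, Wv, W1, b1, W2, b2):
    # Eqn. 3: Z' = MSA(LN(Z)) + Z (residual connection around attention).
    z_prime = self_attention(layer_norm(z), Wq, Wk, Wv) + z
    # Eqn. 4: Z = FFN(LN(Z')) + Z' (position-wise feed-forward network).
    h = np.maximum(layer_norm(z_prime) @ W1 + b1, 0.0)
    return h @ W2 + b2 + z_prime
```

Applied to an aggregated token sequence of shape (k+1) × d, the block preserves the sequence shape, as required for stacking L such layers.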
Further, attention coefficients may be calculated using the generated Z values based on a softmax layer using the following non-limiting exemplary equation:
a_k = exp((Z_0 ∥ Z_k) W_a^T) / Σ_(i=1)^K exp((Z_0 ∥ Z_i) W_a^T) … Eqn. 5
Herein, W_a ∈ R^(1×2d_m) denotes the learnable projection, ∥ denotes concatenation, and k = 1, …, K.
Finally, outputs 328, i.e., the updated node representations, are obtained using these attention coefficients, which can be used in a prediction layer 330 for performing predictions 332 in several classification tasks. In a non-limiting implementation, the outputs 328 may be computed using the following equation:
Z_out = Z_0 + Σ_(k=1)^K a_k Z_k … Eqn. 6
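In a non-limiting illustration, Eqns. 5 and 6 may be sketched as follows; the function name and the representation of W_a as a flat vector of length 2d_m are illustrative assumptions:

```python
import numpy as np

def readout(Z, Wa):
    """Eqns. 5-6: attention-weighted readout. Z has shape (K+1, d_m),
    with Z[0] the target-node token and Z[1:] the k-hop tokens;
    Wa (shape (2*d_m,)) plays the role of the learnable projection W_a."""
    z0, rest = Z[0], Z[1:]
    # Eqn. 5: a_k proportional to exp((Z_0 || Z_k) W_a^T).
    logits = np.concatenate(
        [np.broadcast_to(z0, rest.shape), rest], axis=-1) @ Wa
    a = np.exp(logits - logits.max())
    a /= a.sum()
    # Eqn. 6: Z_out = Z_0 + sum_k a_k Z_k.
    return z0 + (a[:, None] * rest).sum(0)
```

Note that with a zero projection the coefficients reduce to a uniform average of the k-hop tokens, which makes the residual structure of Eqn. 6 apparent.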
FIG. 4 illustrates a schematic representation 400 of a process of generating positional encoding (PE) such as the positional encoding 306, in accordance with an embodiment of the present disclosure. In a non-limiting implementation, as may be understood, the present disclosure is intended to develop a new transformer architecture that can efficiently learn from the homogeneous graph. In an instance, an unweighted and undirected graph can be considered for the sake of the explanation of the implementation of the proposed architecture. To that end, consider a graph, G = (V, E) (see, 402), where V = {v_1, v_2, …, v_n} are vertices or nodes, E indicates edges in the graph G, and n = |V|. Further, its graph structure information can be represented as an adjacency matrix A ∈ R^(n×n) and a diagonal degree matrix represented by D. Thus, in an embodiment, the data pre-processing module 224 may be configured to generate the adjacency matrix such as the adjacency matrix 404 from the homogeneous graph.
Further, in an embodiment, the normalized adjacency matrix is expressed as Â = D̃^(-1/2) Ã D̃^(-1/2), where Ã denotes the adjacency matrix with self-loops and D̃ denotes the corresponding degree matrix. Every node v ∈ V is associated with a feature vector x_v ∈ X, where X ∈ R^(n×d) is the feature matrix that describes the node attributes and d is the dimension of the node feature vector. It may be understood that the adjacency matrix 404 captures connections and relationships between the plurality of nodes in the homogeneous graph. For example, the adjacency matrix 404 includes information such as binary values, weight values, directed and undirected information, graph representations, etc.
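In a non-limiting illustration, the normalized adjacency matrix with self-loops may be computed as follows (a dense NumPy sketch; a sparse representation would typically be used for large graphs):

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops:
    A_hat = D~^(-1/2) (A + I) D~^(-1/2)."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d = A_tilde.sum(1)                      # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D~^(-1/2)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt
```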
Upon obtaining the adjacency matrix 404 from the homogeneous graph, the positional encoding computation module 226 then trains the structure-aware transformer model 220 using the contrastive loss 310 to generate the positional encoding 306. More specifically, the positional encoding computation module 226 trains the structure-aware transformer model 220 to determine one or more similarity metrics between each pair of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and the contrastive loss 310.
In an embodiment, the positional encoding computation module 226 takes the adjacency matrix 404 as input and trains the structure-aware transformer model 220 using a concept of contrastive learning. As used herein, the term ‘contrastive learning’ refers to a machine learning (ML) technique that focuses on teaching a model to understand differences and similarities between datapoints. Thus, in an embodiment, the positional encoding computation module 226 may train a structure-related neural network of the structure-aware transformer model architecture by performing a set of operations iteratively until the performance of the structure-aware transformer model 220 converges to predefined criteria. Herein, the structure-aware transformer model architecture is associated with the structure-aware transformer model 220. The set of operations includes initially, choosing two nodes at a time and facilitating the structure-related neural network to learn to distinguish them based, at least in part, on the adjacency matrix 404. Before this step, the structure-related neural network is initialized with the training dataset and the one or more model parameters as explained earlier in the present disclosure. The training dataset includes a predefined pair of nodes including a reference node and a negative node.
Further, the positional encoding computation module 226 determines the one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes based, at least in part, on the adjacency matrix 404 corresponding to the homogeneous graph, the training dataset and the one or more model parameters. The positional encoding computation module 226 further generates the one or more positive pairs of nodes and the one or more negative pairs of nodes. As may be understood, the one or more positive pairs of nodes indicate nodes that are closely related to each other, and the one or more negative pairs of nodes indicate nodes that are unrelated. Moreover, the positional encoding computation module 226 computes a first set of probability values for the corresponding set of positive pairs of nodes and a second set of probability values for the corresponding set of negative pairs of nodes. This is followed by the computation of the contrastive loss 310 based, at least in part, on the comparison of the first set of probability values with the second set of probability values.
As mentioned earlier, the contrastive loss 310 is used for training the structure-related neural network. It may be noted that the contrastive loss 310 may be computed using a contrastive loss function. Further, the positional encoding computation module 226 may backpropagate the contrastive loss 310 for minimizing the distance between the one or more positive pairs of nodes and maximizing the distance between the one or more negative pairs of nodes. In other words, backpropagating the contrastive loss 310 maximizes the difference between the first set of probability values and the second set of probability values. Thus, it may be understood that the contrastive loss function facilitates positioning similar nodes closer to each other and dissimilar nodes farther in the graphical representation. In an embodiment, the predefined criteria may include saturation of the contrastive loss 310. Herein, the contrastive loss 310 may be saturated after a plurality of iterations of the set of operations is performed. In a non-limiting example, the contrastive loss 310 may include a neighborhood contrastive loss which may be represented by the following exemplary equation:
l_i = -log((Σ_(j=1)^B θ_ij exp(sim(z_i, z_j)/τ)) / (Σ_(k=1)^B exp(sim(z_i, z_k)/τ))) … Eqn. 7
Herein, sim indicates the cosine similarity and τ denotes the temperature parameter. In an embodiment, the hyperparameter m may be chosen such that the m-hop neighbors are the positive samples in the loss. Further, it may be noted that the training is performed batch-wise by using Eqn. 7 for the contrastive loss 310. Furthermore, θ_ij denotes the strength of the connection between nodes i and j; it takes a non-zero value only when node j is an m-hop neighbor of node i. This loss essentially makes the node representation of node i closer, in the final embedding space, to the node representations of the nodes in its m-hop neighborhood.
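In a non-limiting illustration, the neighborhood contrastive loss of Eqn. 7 may be sketched as follows; the default temperature value and the dense batch-wise similarity computation are illustrative assumptions:

```python
import numpy as np

def neighborhood_contrastive_loss(Z, theta, tau=0.5):
    """Eqn. 7 over a batch of B node embeddings Z (shape B x d).
    theta[i, j] is non-zero only when j is an m-hop neighbor of node i."""
    # Cosine similarity matrix over the batch.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T
    e = np.exp(sim / tau)
    num = (theta * e).sum(1)   # numerator: positives weighted by theta
    den = e.sum(1)             # denominator: all pairs in the batch
    return float(np.mean(-np.log(num / den)))
```

As a sanity check, when all embeddings coincide and each node has exactly one positive, each ratio is 1/B and the loss equals log(B).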
Further, in an embodiment, the positional encoding computation module 226 computes the positional encoding 306 for each node based, at least in part, on the one or more determined similarity metrics between each pair of nodes in the homogeneous graph. In an example embodiment, the positional encoding computation module 226 computes the positional encoding 306 for each node using the structure-related neural network of the structure-aware transformer model architecture. In one embodiment, the positional encoding may include positive samples that represent proximate neighbors. In another embodiment, the positional encoding may include negative samples that represent distant non-neighbors.
FIG. 5 illustrates a schematic representation 500 of a process of generating updated feature vectors, in accordance with an embodiment of the present disclosure. As may be understood, upon obtaining the positional encoding 306 for each node and extracting the plurality of node features 502 for each node in the graph-structured data 218, both have to be combined with each other. Thus, in an embodiment, the concatenation module 228 may be configured to generate an updated feature vector 504 for each node based, at least in part, on concatenating the plurality of node features 502 for each node with the corresponding positional encoding 306 for each node. Herein, the updated feature vector 504 is substantially similar to the updated feature vector 308 of FIG. 3.
In a non-limiting implementation, the plurality of node features 502 may be represented as a matrix having a dimension of N × d. Herein, N represents the number of rows of the matrix, one row of node features per node, and d represents the number of columns, indicating the dimension of each node's feature vector. Further, assume that the positional encoding 306 has a dimension of N × r, wherein N corresponds to the count of the positional encodings (one per node) and r corresponds to the dimension of each positional encoding.
Further, upon concatenation, the updated feature vector 504 thus generated may be represented as a matrix having a dimension of N × (d + r), where N represents the count of the updated feature vectors and (d + r) represents the dimension of each updated feature vector, which is the sum of the dimension of the corresponding node features (d) and the dimension of the corresponding positional encoding (r), as shown in FIG. 5.
Moreover, in a non-limiting example, mathematically, the updated feature vector 504 for a node v may be represented by the following exemplary equation:
H(v) = F(v) || P(v) … Eqn. 8
Herein, H(v) refers to the updated feature vector 504 for node v, F(v) refers to the node feature vector for node v, P(v) refers to the positional encoding vector for node v, and || refers to a concatenation operator.
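In a non-limiting illustration, the concatenation of Eqn. 8 may be performed for all N nodes at once; the shapes below are illustrative only:

```python
import numpy as np

# Illustrative shapes: N nodes, d-dimensional features, r-dimensional PEs.
N, d, r = 5, 4, 3
rng = np.random.default_rng(0)
F = rng.normal(size=(N, d))          # node feature matrix (N x d)
P = rng.normal(size=(N, r))          # positional encodings (N x r)
H = np.concatenate([F, P], axis=1)   # Eqn. 8: H(v) = F(v) || P(v)
assert H.shape == (N, d + r)
```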
FIG. 6 illustrates a schematic representation 600 of a node sampling mechanism, in accordance with an embodiment of the present disclosure. As may be understood, upon obtaining the updated feature vector 504 for each node in the homogeneous graph, the neighborhood feature aggregation process may have to be implemented that facilitates a node to decide how much information to aggregate from which node. Generally, each node is represented as a token in an input sequence to be provided to a transformer block. However, the present disclosure determines a unique input sequence for every individual node. Therefore, it may be noted that the aggregation module 230 may be configured to generate an aggregated token sequence 602 for each node based, at least in part, on the updated feature vector 504 for each node. In an embodiment, for generating the aggregated token sequence 602, the aggregation module 230 may be configured to generate a set of tokens (e.g., k-hop aggregated vectors 604) for each node. Each token from the set of tokens may be generated based, at least in part, on aggregating a first set of updated feature vectors corresponding to a first set of nodes that are within a predefined count of hops (e.g., k-hops) from the corresponding node. Further, the aggregation module 230 may be configured to generate a global context vector (e.g., the global context vector 606) for each node based, at least in part, on aggregating a second set of updated feature vectors corresponding to a second set of nodes that are beyond the predefined count of hops from the corresponding node. Finally, the aggregation module 230 may aggregate the set of tokens for each node with the global context vector 606 for each node based, at least in part, on aggregation criteria, for generating the aggregated token sequence 602 for each node.
Also, the aggregated token sequence 602, the k-hop aggregated vectors 604, and the global context vector 606 are similar to the aggregated token sequence 314, the k-hop aggregated vectors 316, and the global context vector 318 of FIG. 3.
Furthermore, for the sake of explanation of the neighborhood feature aggregation process and to understand the formation of the aggregated token sequence 602 for a given node v, a k-hop neighbor node u may be defined such that the following condition is satisfied:
d(u,v)=k … Eqn. 9
Herein, d is the shortest-path distance between the node u and the node v. Also, it may be noted that the condition mentioned in the exemplary Equation 9 corresponds to the aggregation criteria. In other words, the aggregation criteria may include the predefined count of hops that are considered for generating the aggregated token sequence 602 for a particular node (e.g., node v) is equal to a shortest-path distance (e.g., the shortest-path distance d) between the corresponding node and neighbor nodes (u) of the corresponding node v. Further, in an example embodiment, each element in the aggregated token sequence 602 of the node v may be represented as follows:
x_v^k = Σ_(u : d(u,v)=k) H(u) … Eqn. 10
Furthermore, in an embodiment, based on the exemplary Eqn. 10, it may be understood that the aggregated token sequence 602 of the node v may look like:
S_v = (x_v^0, x_v^1, x_v^2, …, x_v^k), wherein k is a hyperparameter … Eqn. 11
Therefore, it may be understood that, according to Equations 10 and 11, the aggregated token sequence 602 may be determined for each node in the homogeneous graph based on the nodes nearest to the corresponding node, with the number of hops k set as a hyperparameter. In other words, it may be noted that k tokens are generated by aggregating the 1-hop neighborhood, the 2-hop neighborhood, and so on until the k-hop neighborhood. Moreover, each element in the aggregated token sequence 602 for each node may be referred to as a k-hop aggregated vector 604, and an additional token is appended to these. In a non-limiting example, the additional token, i.e., the global context vector 606, is generated by aggregating all the nodes outside the k-hop neighborhood. In other words, it may be noted that the aggregation module 230 may be configured to generate (k+1) tokens through the neighborhood aggregation process as explained above.
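For illustration only, the hop-wise aggregation of Equations 9 through 11 may be sketched as follows. The function name khop_token_sequence, the adjacency-list dictionary, and the dense feature array are illustrative choices and not part of the disclosure; a breadth-first search stands in for any shortest-path computation.

```python
from collections import deque

import numpy as np


def khop_token_sequence(adj, features, v, k):
    """Build the token list (x_v^0, ..., x_v^k) for node v.

    adj      : dict mapping each node index to its neighbor list
    features : (N, d) array of updated feature vectors H
    v        : target node index
    k        : hop-count hyperparameter
    """
    # BFS from v to obtain shortest-path distances d(u, v) (Eqn. 9).
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)

    # x_v^h aggregates H(u) over all u with d(u, v) == h (Eqn. 10),
    # yielding the sequence S_v of Eqn. 11.
    tokens = []
    for h in range(k + 1):
        hop_nodes = [u for u, d in dist.items() if d == h]
        tokens.append(np.sum([features[u] for u in hop_nodes], axis=0)
                      if hop_nodes else np.zeros(features.shape[1]))
    return tokens
```

In this sketch, summation is used as the aggregation operator; other permutation-invariant operators (e.g., mean) would fit the same structure.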
More specifically, it may be noted that the aggregation module 230 may initially be configured to compute the summation of the updated feature vectors of all N nodes in the homogeneous graph. The aggregation module 230 may then be configured to store the computed result in the database 204, which is accessible to all the nodes in the homogeneous graph. Further, the aggregation module 230 is configured to compute, for each node, a summation of the tokens within its sequence. Later, the aggregation module 230 is configured to determine a summation of the updated feature vectors associated with nodes surpassing the k-hop boundary by subtracting the summation of tokens of a particular node from the summation of the updated feature vectors of all the N nodes. The summation of updated feature vectors associated with nodes surpassing the k-hop boundary corresponds to the global context vector 606.
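The subtraction described above can be sketched as below; the function name global_context and the array-based representation are illustrative assumptions. Because the hop shells are disjoint, the sum of a node's in-sequence tokens equals the sum of updated feature vectors of all nodes within k hops, so one global total shared across nodes suffices.

```python
import numpy as np


def global_context(features, sequence_tokens):
    """Global context vector via the subtraction trick.

    features        : (N, d) array of updated feature vectors for all N nodes
    sequence_tokens : the node's k-hop aggregated vectors (x_v^0, ..., x_v^k)
    """
    total = features.sum(axis=0)              # computed once, shared by all nodes
    in_sequence = np.sum(sequence_tokens, axis=0)
    return total - in_sequence                # nodes beyond the k-hop boundary
```

This is what keeps the long-range term cheap: the global total is computed once for the whole graph, so each node's global context costs only one vector subtraction.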
Further, the introduction of the global context vector 606 within the aggregated token sequence 602 for each node extends a model’s capacity to cover long-range global interactions. Also, it may be noted that the long-range global interactions are covered without compromising on time complexity.
In some embodiments, the global context vector 606 may fail to provide the salient information. During such scenarios, the attention weight associated with the corresponding nodes falls to zero. However, in scenarios when the global context vector 606 conveys meaningful insights, enhancement in the performance of the structure-aware transformer model 220 may be observed.
FIG. 7 illustrates a flow diagram depicting a method 700 for generating a structure-aware transformer model architecture capable of learning from graph-structured data, in accordance with an embodiment of the present disclosure. The method 700 depicted in the flow diagram may be executed by, for example, the server system 200. The sequence of operations of the method 700 may not necessarily be executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 700, and combinations of operations in the method 700, may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The plurality of operations is depicted in the process flow of the method 700. The process flow starts at operation 702.
At operation 702, the method 700 includes accessing, by a server system (e.g., the server system 200), a homogeneous graph from a database (e.g., the database 204) associated with the server system 200, the homogeneous graph including a plurality of nodes connected via a set of edges. Each node of the plurality of nodes represents a user of a plurality of users, and each edge between one or more pairs of nodes represents a relationship between users associated with the corresponding one or more pairs of nodes.
At operation 704, the method 700 includes determining, by the server system 200, a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph.
At operation 706, the method 700 includes training, by the server system 200, a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss.
At operation 708, the method 700 includes computing, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system 200, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph.
At operation 710, the method 700 includes generating, by the server system 200, an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
At operation 712, the method 700 includes generating, by a tokenization layer of the structure-aware transformer model architecture, an aggregated token sequence for each node based, at least in part, on the updated feature vector for each node.
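The contrastive training referenced at operation 706 may be sketched as below. The disclosure does not specify the loss formula; a logistic (sigmoid-based) contrastive objective is assumed here for illustration, and a plain learnable embedding matrix stands in for the structure-related neural network. The function name contrastive_step and the gradient-descent update are illustrative assumptions.

```python
import numpy as np


def contrastive_step(embed, pos_pairs, neg_pairs, lr=0.01):
    """One illustrative contrastive training step over node similarity scores.

    embed     : (N, d) learnable node embedding matrix (stand-in for the
                structure-related neural network's output)
    pos_pairs : (u, v) pairs of closely related nodes
    neg_pairs : (u, v) pairs of unrelated nodes
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    loss = 0.0
    grad = np.zeros_like(embed)
    for u, v in pos_pairs:
        p = sigmoid(embed[u] @ embed[v])      # similarity-derived probability
        loss += -np.log(p + 1e-9)             # push positive pairs toward 1
        grad[u] += (p - 1.0) * embed[v]
        grad[v] += (p - 1.0) * embed[u]
    for u, v in neg_pairs:
        p = sigmoid(embed[u] @ embed[v])
        loss += -np.log(1.0 - p + 1e-9)       # push negative pairs toward 0
        grad[u] += p * embed[v]
        grad[v] += p * embed[u]

    embed -= lr * grad                        # backpropagation step
    return loss
```

Iterating this step widens the gap between positive-pair and negative-pair probabilities, matching the convergence criterion of saturating contrastive loss described in the claims.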
FIG. 8 illustrates a simplified block diagram of a payment server, in accordance with an embodiment of the present disclosure. The payment server 800 is an example of the payment server 114 of FIG. 1A. The payment server 800 and the server system 200 may use the payment network 112 as a payment interchange network. Examples of payment interchange networks include, but are not limited to, the Mastercard® payment system interchange network.
The payment server 800 includes a processing module 802 configured to extract programming instructions from a memory 804 to provide various features of the present disclosure. The components of the payment server 800 provided herein may not be exhaustive, and the payment server 800 may include more or fewer components than that depicted in FIG. 8. Further, two or more components may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities. Some components of the payment server 800 may be configured using hardware elements, software elements, firmware elements, and/or a combination thereof.
Via a communication module 806, the processing module 802 receives a request from a remote device 808, such as the issuer servers 108, the acquirer servers 110, or the server system 102. The request may be a request for conducting the payment transaction. The communication may be achieved through API calls, without loss of generality. The payment server 800 includes a database 810. The database 810 also includes transaction processing data such as issuer ID, country code, acquirer ID, and merchant identifier (MID), among others.
When the payment server 800 receives a payment transaction request from the acquirer servers 110 or a payment terminal (e.g., IoT device), the payment server 800 may route the payment transaction request to the issuer servers 108. The database 810 stores transaction identifiers for identifying transaction details such as transaction amount, IoT device details, acquirer account information, transaction records, merchant account information, and the like.
In one example embodiment, the acquirer servers 110 are configured to send an authorization request message to the payment server 800. The authorization request message includes, but is not limited to, the payment transaction request.
The processing module 802 further sends the payment transaction request to the issuer servers 108 for facilitating the payment transactions from the remote device 808. The processing module 802 is further configured to notify the remote device 808 of the transaction status in the form of an authorization response message via the communication module 806. The authorization response message includes, but is not limited to, a payment transaction response received from the issuer servers 108. Alternatively, in one embodiment, the processing module 802 is configured to send an authorization response message for declining the payment transaction request, via the communication module 806, to the acquirer servers 110. In one embodiment, the processing module 802 executes similar operations performed by the server system 200, however, for the sake of brevity, these operations are not explained herein.
The disclosed method with reference to FIG. 7, or one or more operations of the server system 200, may be implemented using software including computer-executable instructions stored on one or more computer-readable media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (e.g., DRAM or SRAM), or nonvolatile memory or storage components (e.g., hard drives or solid-state nonvolatile memory components, such as Flash memory components)) and executed on a computer (e.g., any suitable computer, such as a laptop computer, netbook, Web book, tablet computing device, smartphone, or other mobile computing devices). Such software may be executed, for example, on a single local computer or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a remote web-based server, a client-server network (such as a cloud computing network), or other such networks) using one or more network computers. Additionally, any of the intermediate or final data created and used during the implementation of the disclosed methods or systems may also be stored on one or more computer-readable media (e.g., non-transitory computer-readable media) and are considered to be within the scope of the disclosed technology. Furthermore, any of the software-based embodiments may be uploaded, downloaded, or remotely accessed through a suitable communication means. Such a suitable communication means includes, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
Although the disclosure has been described with reference to specific exemplary embodiments, it is noted that various modifications and changes may be made to these embodiments without departing from the broad scope of the disclosure. For example, the various operations, blocks, etc., described herein may be enabled and operated using hardware circuitry (for example, Complementary Metal Oxide Semiconductor (CMOS) based logic circuitry), firmware, software, and/or any combination of hardware, firmware, and/or software (for example, embodied in a machine-readable medium). For example, the apparatuses and methods may be embodied using transistors, logic gates, and electrical circuits (for example, Application-Specific Integrated Circuit (ASIC) circuitry and/or in Digital Signal Processor (DSP) circuitry).
Particularly, the server system 200 and its various components may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry). Various embodiments of the disclosure may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause a processor or the computer to perform one or more operations. A computer-readable medium storing, embodying, or encoded with a computer program, or similar language, may be embodied as a tangible data storage device storing one or more software programs that are configured to cause a processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein. In some embodiments, the computer programs may be stored and provided to a computer using any type of non-transitory computer-readable media. Non-transitory computer-readable media includes any type of tangible storage media. Examples of non-transitory computer-readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g., magneto-optical disks), Compact Disc Read-Only Memory (CD-ROM), Compact Disc Recordable (CD-R), Compact Disc Rewritable (CD-R/W), Digital Versatile Disc (DVD), BLU-RAY® Disc (BD), and semiconductor memories (such as mask ROM, Programmable ROM (PROM), Erasable PROM (EPROM), flash memory, Random Access Memory (RAM), etc.). Additionally, a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. In some embodiments, the computer programs may be provided to a computer using any type of transitory computer-readable media.
Examples of transitory computer-readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer-readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Various embodiments of the disclosure, as discussed above, may be practiced with steps and/or operations in a different order, and/or with hardware elements in configurations, which are different from those which are disclosed. Therefore, although the disclosure has been described based on these exemplary embodiments, it is noted that certain modifications, variations, and alternative constructions may be apparent and well within the scope of the disclosure.
Although various exemplary embodiments of the disclosure are described herein in a language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
Claims:

1. A computer-implemented method, comprising:
accessing, by a server system, a homogeneous graph from a database associated with the server system, the homogeneous graph comprising a plurality of nodes connected via a set of edges, each node of the plurality of nodes representing a user of a plurality of users and each edge between one or more pairs of nodes representing a relationship between users associated with the corresponding one or more pairs of nodes;
determining, by the server system, a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph;
training, by the server system, a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss;
computing, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph; and
generating, by the server system, an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
2. The computer-implemented method as claimed in claim 1, further comprising:
generating, by a tokenization layer of the structure-aware transformer model architecture associated with the server system, an aggregated token sequence for each node based, at least in part, on the updated feature vector for each node.
3. The computer-implemented method as claimed in claim 2, wherein generating the aggregated token sequence for each node comprises:
generating a set of tokens for each node, each token from the set of tokens generated based, at least in part, on aggregating a first set of updated feature vectors corresponding to a first set of nodes that are within a predefined count of hops from the corresponding node;
generating a global context vector for each node based, at least in part, on aggregating a second set of updated feature vectors corresponding to a second set of nodes that are beyond the predefined count of hops from the corresponding node; and
aggregating the set of tokens for each node with the global context vector for each node based, at least in part, on aggregation criteria, for generating the aggregated token sequence for each node.
4. The computer-implemented method as claimed in claim 3, wherein the aggregation criteria comprise the predefined count of hops for generating the aggregated token sequence for a particular node being equal to a shortest-path distance between the corresponding node and neighbor nodes of the corresponding node, the neighbor nodes comprising the first set of nodes.
5. The computer-implemented method as claimed in claim 3, further comprising:
training, by the server system, a transformer layer of the structure-aware transformer model architecture for performing a node classification task based, at least in part, on the aggregated token sequence corresponding to each node.
6. The computer-implemented method as claimed in claim 3, wherein the structure-aware transformer model architecture corresponds to a Graph Neural Network (GNN)-transformer Machine learning-based model architecture.
7. The computer-implemented method as claimed in claim 1, wherein the structure-related neural network of the structure-aware transformer model architecture corresponds to a multi-layer perceptron (MLP) neural network.
8. The computer-implemented method as claimed in claim 1, wherein training, by the server system, the structure-related neural network of the structure-aware transformer model architecture comprises performing a set of operations iteratively until a performance of the structure-related neural network converges to predefined criteria, the set of operations comprising:
initializing the structure-related neural network with a training dataset and one or more model parameters, wherein the training dataset comprises a predefined pair of nodes comprising a reference node and a negative node;
determining, via the structure-related neural network, the one or more similarity metrics between one or more pairs of nodes from the plurality of nodes based, at least in part, on an adjacency matrix corresponding to the homogeneous graph, the training dataset and the one or more model parameters;
generating, via the structure-related neural network, a set of positive pairs of nodes and a set of negative pairs of nodes based, at least in part, on the one or more similarity metrics, the set of positive pairs of nodes indicating a first subset of nodes that are closely related to each other, and the set of negative pairs of nodes indicating a second subset of nodes that are unrelated to each other;
computing, via the structure-related neural network, a first set of probability values for the corresponding set of positive pairs of nodes and a second set of probability values for the corresponding set of negative pairs of nodes;
computing a contrastive loss using a contrastive loss function based, at least in part, on comparison of the first set of probability values with the second set of probability values; and
backpropagating the contrastive loss for maximizing the difference between the first set of probability values and the second set of probability values.
9. The computer-implemented method as claimed in claim 8, wherein the predefined criteria comprise saturation of the contrastive loss, the contrastive loss being saturated after a plurality of iterations of the set of operations is performed.
10. The computer-implemented method as claimed in claim 1, wherein the contrastive loss comprises a neighborhood contrastive loss.
11. The computer-implemented method as claimed in claim 1, wherein the server system is a payment server associated with a payment network.
12. A server system, comprising:
a communication interface;
a memory comprising executable instructions; and
a processor communicably coupled to the communication interface and the memory, the processor configured to cause the server system to at least:
access a homogeneous graph from a database associated with the server system, the homogeneous graph comprising a plurality of nodes connected via a set of edges, each node of the plurality of nodes representing a user of a plurality of users, and each edge between one or more pairs of nodes representing a relationship between users associated with the corresponding one or more pairs of nodes;
determine a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph;
train a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss;
compute, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph; and
generate an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.
13. The server system as claimed in claim 12, wherein the server system is further caused to at least:
generate, by a tokenization layer of the structure-aware transformer model architecture, an aggregated token sequence for each node based, at least in part, on the updated feature vector for each node.
14. The server system as claimed in claim 13, wherein to generate the aggregated token sequence for each node, the server system is further caused to at least:
generate a set of tokens for each node, each token from the set of tokens being generated based, at least in part, on aggregating a first set of updated feature vectors corresponding to a first set of nodes that are within a predefined count of hops from the corresponding node;
generate a global context vector for each node based, at least in part, on aggregating a second set of updated feature vectors corresponding to a second set of nodes that are beyond the predefined count of hops from the corresponding node; and
aggregate the set of tokens for each node with the global context vector for each node based, at least in part, on aggregation criteria, thereby generating the aggregated token sequence for each node.
15. The server system as claimed in claim 14, wherein the aggregation criteria comprise the predefined count of hops for generating the aggregated token sequence for a particular node being equal to a shortest-path distance between the corresponding node and neighbor nodes of the corresponding node, the neighbor nodes comprising the first set of nodes.
16. The server system as claimed in claim 14, wherein the server system is further caused to at least:
train a transformer layer of the structure-aware transformer model architecture for performing a node classification task based, at least in part, on the aggregated token sequence corresponding to each node.
17. The server system as claimed in claim 14, wherein the structure-aware transformer model architecture corresponds to a Graph Neural Network (GNN)-transformer Machine learning-based model architecture.
18. The server system as claimed in claim 12, wherein the structure-related neural network of the structure-aware transformer model architecture corresponds to a multi-layer perceptron neural network.
19. The server system as claimed in claim 12, wherein the server system is further caused to at least train the structure-related neural network of the structure-aware transformer model architecture by performing a set of operations iteratively until a performance of the structure-related neural network converges to predefined criteria, the set of operations comprising:
initializing the structure-related neural network with a training dataset and one or more model parameters, wherein the training dataset comprises a predefined pair of nodes comprising a reference node and a negative node;
determining, via the structure-related neural network, one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes based, at least in part, on an adjacency matrix corresponding to the homogeneous graph, the training dataset and the one or more model parameters;
generating, via the structure-related neural network, a set of positive pairs of nodes and a set of negative pairs of nodes based, at least in part, on the one or more similarity metrics, the set of positive pairs of nodes indicating a first subset of nodes that are closely related to each other, and the set of negative pairs of nodes indicating a second subset of nodes that are unrelated to each other;
computing, via the structure-related neural network, a first set of probability values for the corresponding set of positive pairs of nodes and a second set of probability values for the corresponding set of negative pairs of nodes;
computing a contrastive loss using a contrastive loss function based, at least in part, on comparison of the first set of probability values with the second set of probability values; and
backpropagating the contrastive loss for maximizing the difference between the first set of probability values and the second set of probability values.
20. A non-transitory computer-readable storage medium comprising computer-executable instructions that, when executed by at least a processor of a server system, cause the server system to perform a method comprising:
accessing a homogeneous graph from a database associated with the server system, the homogeneous graph comprising a plurality of nodes connected via a set of edges, each node of the plurality of nodes representing a user of a plurality of users, and each edge between one or more pairs of nodes representing a relationship between users associated with the corresponding one or more pairs of nodes;
determining a plurality of node features for each node in the homogeneous graph based, at least in part, on the homogeneous graph;
training a structure-related neural network of a structure-aware transformer model architecture, to determine one or more similarity metrics between the one or more pairs of nodes from the plurality of nodes in the homogeneous graph based, at least in part, on the homogeneous graph and a contrastive loss;
computing, by the structure-related neural network of the structure-aware transformer model architecture associated with the server system, a positional encoding for each node based, at least in part, on the one or more determined similarity metrics between the corresponding one or more pairs of nodes in the homogeneous graph; and
generating an updated feature vector for each node based, at least in part, on concatenating the plurality of node features for each node with the corresponding positional encoding for each node.

Documents

Application Documents

# Name Date
1 202341074091-STATEMENT OF UNDERTAKING (FORM 3) [31-10-2023(online)].pdf 2023-10-31
2 202341074091-POWER OF AUTHORITY [31-10-2023(online)].pdf 2023-10-31
3 202341074091-FORM 1 [31-10-2023(online)].pdf 2023-10-31
4 202341074091-FIGURE OF ABSTRACT [31-10-2023(online)].pdf 2023-10-31
5 202341074091-DRAWINGS [31-10-2023(online)].pdf 2023-10-31
6 202341074091-DECLARATION OF INVENTORSHIP (FORM 5) [31-10-2023(online)].pdf 2023-10-31
7 202341074091-COMPLETE SPECIFICATION [31-10-2023(online)].pdf 2023-10-31
8 202341074091-Proof of Right [12-03-2024(online)].pdf 2024-03-12