Abstract: Embodiments provide methods and systems for performing task predictions. The method performed by a server system includes accessing historical tabular data, including a series of entries associated with an entity, as input training data. Each entry includes a set of data fields. The method includes pre-training a global neural network model based on a plurality of operations. The plurality of operations includes determining a plurality of intra-relational embeddings corresponding to the set of data fields of each entry based on a field transformer layer. The plurality of operations also includes determining a plurality of inter-relational embeddings corresponding to each entry based on a sequence encoding transformer layer of the global neural network model and the plurality of intra-relational embeddings. The method includes fine-tuning the pre-trained global neural network model based on task-specific training data of a particular downstream prediction task.
Description:
FORM 2
THE PATENTS ACT 1970
(39 of 1970)
&
The Patent Rules 2003
COMPLETE SPECIFICATION
(refer section 10 & rule 13)
TITLE OF THE INVENTION:
NEURAL NETWORK BASED METHODS AND SYSTEMS FOR PERFORMING TASK PREDICTIONS
APPLICANT(S):
Name:
Nationality:
Address:
MASTERCARD INTERNATIONAL INCORPORATED
United States of America
2000 Purchase Street, Purchase, NY 10577, United States of America
PREAMBLE TO THE DESCRIPTION
The following specification particularly describes the invention and the manner in which it is to be performed.
DESCRIPTION
(See next page)
NEURAL NETWORK BASED METHODS AND SYSTEMS FOR PERFORMING TASK PREDICTIONS
TECHNICAL FIELD
The present disclosure relates to artificial intelligence technology and, more particularly to, electronic methods and complex processing systems for performing various downstream task predictions based on a hierarchical Bidirectional Encoder Representations from Transformers (BERT) model.
BACKGROUND
Automated time-series prediction for tabular data is an important step in many industries, especially in retail and financial services where a large number of time-series patterns need to be predicted on a weekly or monthly basis for financial planning and analysis. For example, cardholder transaction data can be represented in a tabular form. In fact, tabular data is omnipresent across various industries and is also used to develop several high-impact machine learning models (e.g., fraud detection, lead propensity modeling, etc.). Various critical industries, including, for example, financial services, health care, logistics, etc., rely heavily on data in structured tabular format.
The tabular data can mainly be classified into two types: static time-series data and dynamic time-series data. For dealing with static time-series data, conventional machine learning models such as gradient boosting, random forest, etc., are widely used. Contrary to static time-series data, dynamic time-series data has a sequential pattern that cannot be easily processed using the above-mentioned conventional machine learning models. In recent years, transformers have gained increasing popularity for the prediction and/or analysis of tabular data.
The conventional algorithms used to model tabular data are mainly tree-based algorithms. However, the usage of such tree-based algorithms has various drawbacks or limitations. For example, tree-based algorithms require labels for the training data; when labeled data is insufficient, applying semi-supervised training methods to tree-based algorithms becomes very challenging. Additionally, feature engineering, which plays an important role in deciding the performance of the underlying model, requires considerable time and brainstorming. Nowadays, transformers are mainly used for natural language processing (NLP) tasks due to their proven accuracy and efficiency in various state-of-the-art architectures.
Thus, there exists a need for a technical solution for modeling a neural network architecture to perform various downstream task predictions.
SUMMARY
Various embodiments of the present disclosure provide methods and systems for performing various task predictions based on the Bidirectional Encoder Representations from Transformers (BERT) model.
In an aspect, a computer-implemented method is disclosed. The method includes accessing, by a server system, historical tabular data from a database as input training data. The historical tabular data includes a series of entries associated with an entity represented in the form of tabular data. Each entry includes a set of data fields. In addition, the method includes pre-training, by the server system, a global neural network model based, at least in part, on a plurality of operations. The plurality of operations includes determining, by the server system, a plurality of intra-relational embeddings corresponding to the set of data fields of each entry based, at least in part, on a field transformer layer. The plurality of intra-relational embeddings of each entry represents relationships among the set of data fields of each entry. The plurality of operations further includes determining, by the server system, a plurality of inter-relational embeddings corresponding to each entry based, at least in part, on a sequence encoding transformer layer of the global neural network model and the plurality of intra-relational embeddings. The plurality of inter-relational embeddings represents temporal relationships among the plurality of entries. Moreover, the method includes fine-tuning, by the server system, the pre-trained global neural network model based, at least in part, on task-specific training data of a particular downstream prediction task.
Other aspects and example embodiments are provided in the drawings and the detailed description that follows.
BRIEF DESCRIPTION OF THE FIGURES
For a more complete understanding of example embodiments of the present technology, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1A is an exemplary representation of an environment related to at least some embodiments of the present disclosure;
FIG. 1B is another exemplary representation of an environment related to at least some embodiments of the present disclosure;
FIG. 2 is a simplified block diagram of a server system, in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic representation of a neural network architecture of a hierarchical BERT model, in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic representation of a transaction field transformer included in a field transformer layer of the global neural network model, in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic representation of a transaction sequence encoding transformer of the global neural network model, in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic representation of a pre-training process of the global neural network model, in accordance with an embodiment of the present disclosure;
FIG. 7 is a schematic representation of a fine-tuning process of the global neural network model, in accordance with an embodiment of the present disclosure;
FIG. 8 is a process flow chart of a method for training the global neural network model, in accordance with an embodiment of the present disclosure;
FIG. 9 is a process flow for predicting fraudulent transactions of a cardholder, in accordance with an embodiment of the present disclosure;
FIG. 10 is a process flow chart of a computer-implemented method for training the global neural network model to perform various downstream task predictions; and
FIG. 11 is a simplified block diagram of a payment server, in accordance with an embodiment of the present disclosure.
The drawings referred to in this description are not to be understood as being drawn to scale except if specifically noted, and such drawings are only exemplary in nature.
DETAILED DESCRIPTION
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearance of the phrase “in an embodiment” in various places in the specification is not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present disclosure. Similarly, although many of the features of the present disclosure are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present disclosure is set forth without any loss of generality to, and without imposing limitations upon, the present disclosure.
Embodiments of the present disclosure may be embodied as an apparatus, system, method, or computer program product. Accordingly, embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit”, “engine”, “module”, or “system”. Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable storage media having computer-readable program code embodied thereon.
The term "payment account" used throughout the description refers to a financial account that is used to fund a financial transaction. Examples of the financial account include, but are not limited to, a savings account, a credit account, a checking account, and a virtual payment account. The financial account may be associated with an entity such as an individual person, a family, a commercial entity, a company, a corporation, a governmental entity, a non-profit organization, and the like. In some scenarios, the financial account may be a virtual or temporary payment account that can be mapped or linked to a primary financial account, such as those accounts managed by payment wallet service providers, and the like.
The term "payment card", used throughout the description, refers to a physical or virtual card linked with a financial or payment account that may be presented to a merchant or any such facility to fund a financial transaction via the associated payment account. Examples of the payment card include, but are not limited to, debit cards, credit cards, prepaid cards, virtual payment numbers, virtual card numbers, forex cards, charge cards, e-wallet cards, and stored-value cards. A payment card may be a physical card that may be presented to the merchant for funding the payment. Alternatively, or additionally, the payment card may be embodied in the form of data stored in a user device, where the data is associated with a payment account such that the data can be used to process the financial transaction between the payment account and a merchant's financial account.
The term "payment network", used herein, refers to a network or collection of systems used for the transfer of funds through the use of cash-substitutes. Payment networks may use a variety of different protocols and procedures to process the transfer of money for various types of transactions. Transactions that may be performed via a payment network may include product or service purchases, credit purchases, debit transactions, fund transfers, account withdrawals, etc. Payment networks may be configured to perform transactions via cash-substitutes, which may include payment cards, letters of credit, checks, financial accounts, etc. Examples of networks or systems configured to perform as payment networks include those operated by payment networks such as Mastercard®.
The term "merchant", used throughout the description generally refers to a seller, a retailer, a purchase location, an organization, or any other entity that is in the business of selling goods or providing services, and it can refer to either a single business location or a chain of business locations of the same entity.
The terms "account holder", "user", “cardholder”, and "customer" are used interchangeably throughout the description and refer to a person who has a payment account or a payment card (e.g., credit card, debit card, etc.) associated with the payment account, that will be used by a merchant to perform a payment transaction. The payment account may be opened via an issuing bank or an issuer server.
The term "Bidirectional Encoder Representations from Transformers (BERT)", used throughout the description generally refers to a method of pre-training language representations. In general, pre-training refers to initially training a BERT model on a large set of training data (e.g., textual data) and then utilizing the training results (e.g., embeddings) to perform various natural language processing (NLP) based downstream tasks, including, for example, sentiment analysis, next-word prediction, question answering, and the like.
The term "embeddings", used throughout the description, generally refers to a low-dimensional representation of a high-dimensional vector. More specifically, the embedding space is a relatively low-dimensional space into which a high-dimensional vector can be translated. Embeddings make it easier to perform machine learning-related tasks on large inputs.
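By way of a non-limiting illustration, the translation of a high-dimensional (e.g., one-hot) vector into a low-dimensional embedding may be sketched as a simple lookup table. The vocabulary, dimensions, and values below are purely hypothetical:

```python
# Hypothetical vocabulary of a categorical data field (e.g., a merchant category).
vocab = ["grocery", "fuel", "travel", "dining"]

# A one-hot encoding of "travel" is a sparse vector with one dimension per term.
one_hot = [1 if term == "travel" else 0 for term in vocab]

# An embedding table translates each vocabulary entry into a dense,
# lower-dimensional vector (2 dimensions here, purely for illustration).
embedding_table = {
    "grocery": [0.10, -0.40],
    "fuel":    [0.25,  0.05],
    "travel":  [-0.30, 0.80],
    "dining":  [0.15, -0.10],
}

embedding = embedding_table["travel"]  # low-dimensional translation of the one-hot vector
```

In practice, the embedding table's values are learned model parameters rather than fixed constants.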
OVERVIEW
Various embodiments of the present disclosure provide methods and systems for performing various downstream task predictions (e.g., financial downstream task predictions) based on a global neural network model. The global neural network model may implement a dual transformer based hierarchical Bidirectional Encoder Representations from Transformers (BERT) model. The hierarchical BERT model may initially be pre-trained on input training data and then be fine-tuned based on a particular downstream prediction task (for example, payment fraud detection, etc.).
As stated above, conventional algorithms used to model tabular data are usually tree-based algorithms. However, utilizing such conventional tree-based algorithms to model tabular data has some serious drawbacks. One drawback is the need for labels for the training data; when labeled data is insufficient, semi-supervised training of the conventional algorithms becomes challenging. Another drawback is that feature engineering consumes considerable time and brainstorming.
In addition, the conventional semi-supervised learning approach mainly consists of two parts: (1) creating a pre-trained model to learn the data distribution (unsupervised) and then (2) learning downstream tasks to predict labels (supervised). Generally, BERT is a transformer-based machine learning technique used for NLP pre-training. In other words, BERT is a method of pre-training language representations. The training results of the BERT model can further be utilized to perform various NLP tasks, including, for example, sentiment analysis, question answering, next sentence prediction, and so on. More specifically, the BERT model is utilized to learn contextual embeddings for words.
Thus, at least one of the technical problems addressed by the present disclosure includes (a) modeling of dynamic time-series tabular data, (b) dealing with cold start problem in case of absence of complete tabular data, (c) usage of static features (e.g., static transaction features) along with dynamic features (e.g., dynamic transaction features), and (d) generating embeddings based on BERT architecture.
To overcome such technical problems or limitations, the present disclosure describes a server system that is configured to implement a global neural network model. The global neural network model is implemented based, at least in part, on a hierarchical BERT neural network architecture. The global neural network model includes initial field embedding layers, a field transformer layer, and a sequence encoding transformer layer. In one embodiment, the server system includes at least a processor and a memory. In one non-limiting example, the server system is a payment server associated with a payment network. In one implementation, the server system is configured to implement the global neural network model based on BERT architecture for pre-training transaction representations. The training results of the global neural network model are then used to perform downstream prediction tasks such as fraudulent payment transaction prediction, cross-border payment transaction prediction, and the like.
More specifically, the server system is configured to utilize the transformers to model dynamic tabular data. In general, each row in the dynamic tabular data is treated as a sentence and further each feature is treated as a word, given the constraint that each feature should have a finite vocabulary. In addition, each sentence (i.e., each row) will have a fixed length, and the order of words (e.g., features) will not have any real importance. In one embodiment, the server system is configured to implement a semi-supervised transformer-based hierarchical BERT model architecture for modeling dynamic time-series tabular data (also known as multivariate time series data) to also capture the temporal component of the data.
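By way of a non-limiting illustration, the "row as a sentence" analogy above may be sketched by assigning each data field its own finite vocabulary and mapping every field value to a token identifier. The field names and vocabularies below are hypothetical:

```python
# Hypothetical per-field vocabularies; each data field has a finite vocabulary,
# analogous to the word vocabulary of a sentence.
field_vocabs = {
    "mcc":           {"grocery": 0, "fuel": 1, "travel": 2},
    "channel":       {"online": 0, "in_store": 1},
    "amount_bucket": {"low": 0, "medium": 1, "high": 2},
}

def tokenize_entry(entry):
    """Treat one tabular row as a fixed-length 'sentence' of field tokens."""
    return [field_vocabs[field][value] for field, value in entry.items()]

entry = {"mcc": "travel", "channel": "online", "amount_bucket": "high"}
tokens = tokenize_entry(entry)  # every row yields the same fixed length
```

Because each row has the same fields, every tokenized "sentence" has the same fixed length, consistent with the constraint noted above.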
Initially, the server system is configured to access historical tabular data from a database as input training data. In one non-limiting example, the historical tabular data may correspond to historical payment transaction data. The historical tabular data includes a series of entries associated with an entity represented in the form of tabular data. In one non-limiting example, the series of entries corresponds to a series of payment transactions. Generally, an entity may correspond to something that exists separately from something else and has its own identity. In one non-limiting example, the entity may correspond to a cardholder. In various non-limiting examples, the entity may correspond to a person, institution, organization, legal entity, a state-owned enterprise, and the like. Each entry further includes a set of data fields. In one non-limiting example where the entity corresponds to a cardholder, the plurality of entries in the tabular data may correspond to a plurality of payment transactions performed by the cardholder. The cardholder may perform the plurality of payment transactions with the facilitation of a payment card, or a payment account issued by an issuer bank. Each entry (i.e., payment transaction) further includes the set of data fields (e.g., a set of transaction fields).
The server system is then configured to pre-train the global neural network model (i.e., hierarchical BERT model) based, at least in part, on a plurality of operations. The plurality of operations includes converting the set of data fields into a set of embeddings based on initial field embedding layers and determining a plurality of intra-relational embeddings (e.g., a plurality of intra-transactional embeddings) corresponding to the set of data fields of each entry (e.g., payment transaction) based, at least in part, on the field transformer layer. The plurality of intra-relational embeddings of each entry represents relationships among the set of data fields of each entry.
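By way of a non-limiting illustration, the field transformer layer's attention over the fields of a single entry may be sketched as single-head scaled dot-product self-attention. The dimensions and randomly initialized weights below are illustrative and not those of any particular embodiment:

```python
import numpy as np

rng = np.random.default_rng(0)

num_fields, dim = 4, 8                   # one entry with 4 data fields, 8-dim embeddings
X = rng.normal(size=(num_fields, dim))   # initial field embeddings of one entry

# Single-head scaled dot-product self-attention: every field attends to every
# other field of the same entry, yielding intra-relational embeddings.
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over fields
intra_relational = weights @ V                   # shape: (num_fields, dim)
```

Each row of `weights` sums to one, so each field's intra-relational embedding is a convex combination of value projections of all fields in the same entry.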
In addition, the plurality of operations includes determining a plurality of inter-relational embeddings corresponding to each entry based, at least in part, on the sequence encoding transformer layer of the global neural network model and the plurality of intra-relational embeddings. The plurality of inter-relational embeddings represents temporal relationships among the plurality of entries (e.g., payment transactions).
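By way of a non-limiting illustration, the sequence encoding transformer layer may be sketched as a second attention stage: each entry's intra-relational embeddings are first pooled into one entry-level vector, and self-attention is then applied across the time-ordered entries. The pooling choice (mean pooling here) and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

seq_len, num_fields, dim = 5, 4, 8   # 5 entries, each with 4 field embeddings
# Intra-relational embeddings produced by the field transformer layer (stubbed
# here with random values for illustration).
intra = rng.normal(size=(seq_len, num_fields, dim))

# Pool each entry's field embeddings into one entry-level vector
# (mean pooling here; a [CLS]-style token is an equally valid choice).
entries = intra.mean(axis=1)                     # shape: (seq_len, dim)

# Self-attention across entries captures temporal, inter-relational structure.
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = entries @ Wq, entries @ Wk, entries @ Wv
scores = Q @ K.T / np.sqrt(dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over entries
inter_relational = weights @ V                   # shape: (seq_len, dim)
```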
The global neural network model is pre-trained based at least on one of the training methods. The training methods include a masked language modeling (MLM) method and a replaced token detection (RTD) method. The server system is further configured to provide the plurality of inter-relational embeddings corresponding to the set of data fields of each entry (e.g., payment transaction) to multi-layer perceptron (MLP) layers. Furthermore, the server system is configured to update layer parameters of the global neural network model based, at least in part, on a loss value of the MLP layers.
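By way of a non-limiting illustration, the masked language modeling (MLM) pre-training objective may be sketched as randomly replacing a fraction of field tokens with a reserved mask token; an MLP head is then trained to recover the original tokens. The mask token identifier, the 15% masking rate, and the -100 "ignore" label below are conventional but hypothetical choices:

```python
import random

MASK_ID = 99            # hypothetical token id reserved for [MASK]
MASK_RATE = 0.15

def mask_tokens(tokens, rng):
    """Randomly replace ~15% of field tokens with [MASK]; the model is then
    trained (via an MLP head and, e.g., a cross-entropy loss) to recover them."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_RATE:
            masked.append(MASK_ID)
            labels.append(tok)      # original token becomes the training label
        else:
            masked.append(tok)
            labels.append(-100)     # conventional "ignore" label for unmasked positions
    return masked, labels

rng = random.Random(42)
masked, labels = mask_tokens([3, 7, 1, 5, 2, 8, 4, 6], rng)
```

The replaced token detection (RTD) alternative substitutes plausible tokens instead of a mask token and trains the head to classify each position as original or replaced.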
In one implementation, the server system is configured to determine whether a number of the series of entries (e.g., payment transactions) is less than a predetermined sequence length of the global neural network model. In response to determining that the number of the series of entries (i.e., payment transactions) is less than the predetermined sequence length of the global neural network model, the server system is configured to insert dummy entries (i.e., dummy payment transactions) in the input training data.
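By way of a non-limiting illustration, the insertion of dummy entries may be sketched as left-padding each entity's entry sequence up to the model's predetermined sequence length, together with a mask distinguishing real entries from dummy ones. The field values and the [PAD] marker below are hypothetical:

```python
SEQ_LEN = 5                      # predetermined sequence length of the model
DUMMY_ENTRY = {"mcc": "[PAD]", "channel": "[PAD]", "amount_bucket": "[PAD]"}

def pad_entries(entries, seq_len=SEQ_LEN):
    """Insert dummy entries when an entity has fewer entries than the model's
    predetermined sequence length; the mask flags real (1) vs. dummy (0) positions."""
    if len(entries) >= seq_len:
        return entries[-seq_len:], [1] * seq_len   # keep the most recent entries
    pad = [dict(DUMMY_ENTRY) for _ in range(seq_len - len(entries))]
    mask = [0] * len(pad) + [1] * len(entries)
    return pad + entries, mask

# Cold-start scenario: only two real entries are available.
history = [{"mcc": "fuel", "channel": "online", "amount_bucket": "low"},
           {"mcc": "travel", "channel": "in_store", "amount_bucket": "high"}]
padded, mask = pad_entries(history)
```

The mask can then be used so that attention weights over dummy positions are suppressed during training and inference.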
The server system is configured to fine-tune the pre-trained global neural network model based, at least in part, on task-specific training data of a particular downstream prediction task (e.g., financial downstream prediction task). The particular downstream prediction task is at least one of: (a) prediction of fraudulent transactions, (b) prediction of the future cross-border payment transactions for the cardholder, and the like.
To fine-tune the pre-trained global neural network model, the server system is configured to provide the task-specific training data to the global neural network model. The task-specific training data may include past entries (e.g., past payment transactions) associated with a plurality of cardholders with labeled task-specific indicators. In addition, the server system is configured to determine a contextualized embedding corresponding to each past entry (i.e., past payment transaction) based, at least in part, on an output of the pre-trained global neural network model.
The server system is further configured to provide a combination of the contextualized embedding and static entry features (e.g., static transaction features) corresponding to each past entry (e.g., past payment transaction) as an input to a task-specific neural network. Furthermore, the server system is configured to determine a loss value associated with the task-specific neural network based, at least in part, on the input.
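By way of a non-limiting illustration, the fine-tuning input described above, i.e., a contextualized embedding concatenated with static entry features and fed to a task-specific neural network, may be sketched as follows. The dimensions, randomly initialized weights, and binary cross-entropy loss are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

dim, num_static = 8, 3
contextualized = rng.normal(size=dim)            # output of the pre-trained model
static_features = rng.normal(size=num_static)    # e.g., static transaction features

# Combine dynamic and static information into one task-specific input.
x = np.concatenate([contextualized, static_features])

# Minimal task-specific network: one hidden layer and a sigmoid output
# (e.g., probability that a payment transaction is fraudulent).
W1, b1 = rng.normal(size=(dim + num_static, 16)), np.zeros(16)
W2 = rng.normal(size=16)
hidden = np.maximum(x @ W1 + b1, 0.0)            # ReLU activation
prob = 1.0 / (1.0 + np.exp(-(hidden @ W2)))      # task prediction
prob = np.clip(prob, 1e-7, 1 - 1e-7)             # numerical stability

label = 1.0                                      # labeled task-specific indicator
loss = -(label * np.log(prob) + (1 - label) * np.log(1 - prob))  # BCE loss
```

The resulting loss value can then be back-propagated to update the task-specific network's parameters and, optionally, the pre-trained model's layer parameters as well.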
In an embodiment, the server system is configured to update neural network parameters of the task-specific neural network based, at least in part, on the loss value of task-specific neural networks. In another embodiment, the server system is configured to update neural network parameters and layer parameters of the task-specific neural network and the global neural network model, respectively, based, at least in part, on the loss value of task-specific neural networks.
In one implementation, the historical tabular data corresponds to historical payment transaction data, the entries correspond to payment transactions, the entity corresponds to a cardholder, and the set of data fields corresponds to a set of transaction fields. Additionally, the plurality of intra-relational embeddings corresponds to a plurality of intra-transactional embeddings and the plurality of inter-relational embeddings corresponds to a plurality of inter-transactional embeddings.
Various embodiments of the present disclosure offer multiple technical advantages and technical effects. For instance, the present disclosure provides a server system configured to generate relational embeddings based on a hierarchical Bidirectional Encoder Representations from Transformers (BERT) architecture. Additionally, the server system is configured to utilize dynamic time-series sequential data to determine the relational embeddings. Further, the server system is configured to utilize transfer learning methods (e.g., fine-tuning) to save time and resources. Furthermore, the server system is configured to deal with the cold start problem by adding dummy entries to a series of historical entries in the tabular data if the series of historical entries is not enough to generate the relational embeddings. The server system is also configured to use a combination of static features and relational embeddings during the fine-tuning step to extract more information. In a nutshell, the server system provides a scalable and powerful global neural network model (i.e., based on BERT architecture) that generalizes intelligence over sequential data.
Various example embodiments of the present disclosure are described hereinafter with reference to FIGS. 1A-1B to FIG. 11.
FIG. 1A is an exemplary representation of an environment 100 related to at least some embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, performing downstream task predictions based on neural network models by capturing sequential information in tabular data, etc. The environment 100 generally includes a server system 102, a plurality of entities 104a, 104b, and 104c, and an entity database 106, each coupled to, and in communication with (and/or with access to) a network 108. The network 108 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber-optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among the entities illustrated in FIG. 1A, or any combination thereof.
Various entities in the environment 100 may connect to the network 108 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, future communication protocols or any combination thereof. For example, the network 108 may include multiple different networks, such as a private network or a public network (e.g., the Internet, etc.).
In one implementation, the plurality of entities 104a-104c may correspond to persons or things that have their independent and separate existence. In some examples, the plurality of entities 104a-104c may correspond to persons, organizations, institutions, government enterprises, corporations, individuals, firms, society, and the like.
The entity database 106 stores information associated with the plurality of entities 104a-104c in a tabular form, i.e., in the form of rows and columns. In one example, the entity database 106 may store historical tabular data associated with the entities 104a-104c. In one implementation, the entity database 106 may store information associated with each entity (e.g., the entity 104a) in a single table. In addition, each row in the table represents an entry associated with the entity 104a. In an example, the entries may correspond to interactions, relations, associations, transactions, and the like. Further, columns in the tabular data represent data fields associated with the entries.
This information may be transmitted to another computing device, for example, the server system 102 described herein according to various example embodiments. The historical tabular data may be stored, accessed, or read from the entity database 106 with the facilitation of a database management system (DBMS) and/or relational database management system (RDBMS).
The server system 102 is configured to perform one or more of the operations described herein. The server system 102 is configured to train a global neural network model for dynamic tabular data (i.e., tabular data stored in the entity database 106). The global neural network model includes initial field embedding layers and a hierarchical BERT model 112. The initial field embedding layers convert data fields of each entry into a set of embeddings that is further provided as an input to the two-level transformer architecture (i.e., the BERT model 112). The first-level transformer captures and applies attention at the row level and the second-level transformer applies attention to the sequence to learn inter-relational dependencies and generates a contextualized embedding corresponding to each entry. The contextualized embeddings are then used as inputs for task-specific neural networks for performing downstream task predictions.
The server system 102 is a separate part of the environment 100 and may operate apart from (but still in communication with, for example, via the network 108), the plurality of entities 104a-104c and any third-party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 102 may actually be incorporated, in whole or in part, into one or more parts of the environment 100, for example, an entity server (not shown in figures) associated with the entity 104a. In addition, the server system 102 should be understood to be embodied in at least one computing device in communication with the network 108, which may be specifically configured, via executable instructions, to perform as described herein, and/or embodied in at least one non-transitory computer-readable media.
In one embodiment, the server system 102 coupled with a database 110 is embodied within the entity server (not shown in figures); however, in other examples, the server system 102 can be a standalone component (acting as a hub) connected to the entity server. The database 110 may be incorporated in the server system 102 or may be an individual component connected to the server system 102 or may be a database stored in cloud storage. In one embodiment, the database 110 may store layer parameters (e.g., weights and biases) and hyperparameters associated with the hierarchical BERT model 112.
FIG. 1B is another exemplary representation of an environment 120 related to at least some embodiments of the present disclosure. Although the environment 120 is presented in one arrangement, other embodiments may include the parts of the environment 120 (or other parts) arranged otherwise depending on, for example, performing financial downstream task predictions based on neural network models by capturing sequential information in financial data, etc. The environment 120 generally includes a server system 122, a plurality of cardholders 124a, 124b, and 124c, a plurality of merchants 126a, 126b, and 126c, an issuer server 128, an acquirer server 130, a payment network 134 including a payment server 136, and a transaction database 138, each coupled to, and in communication with (and/or with access to) a network 132. The network 132 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber-optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among the entities illustrated in FIG. 1B, or any combination thereof.
Various entities in the environment 120 may connect to the network 132 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, future communication protocols or any combination thereof. For example, the network 132 may include multiple different networks, such as a private network made accessible by the payment network 134 to the issuer server 128, the acquirer server 130, the payment server 136, and a wallet server 146, separately, and a public network (e.g., the Internet, etc.). In one example, the network 132 is similar to the network 108.
The cardholder (e.g., the cardholder 124a) may be an individual, a representative of a corporate entity, a non-profit organization, or any other person. In addition, each cardholder may have a payment account issued by a corresponding issuing bank (associated with the issuer server 128) and may be provided with a payment card with financial or other account information encoded onto the payment card such that each of the plurality of cardholders 124a-124c may use the payment card to initiate and complete a payment transaction using a bank account at the issuing bank. Examples of the payment card may include, but are not limited to, a smartcard, a debit card, a credit card, and the like.
In one embodiment, the issuer server 128 is a financial institution that manages accounts of multiple account holders (e.g., the plurality of cardholders 124a-124c). In addition, account details of the payment accounts established with the issuer bank are stored in account holder profiles of the account holders (e.g., the plurality of cardholders 124a-124c) in memory of the issuer server 128 or on a cloud server associated with the issuer server 128. The terms “issuer server”, “issuer”, or “issuing bank” will be used interchangeably herein.
Further, the cardholders 124a-124c may perform the payment transactions at the merchants 126a-126c. In an example, the plurality of cardholders 124a-124c may transact at the plurality of merchants 126a-126c (e.g., using a merchant terminal) to perform payment transactions for purchasing goods and/or services offered by the plurality of merchants 126a-126c. In an example, the merchant terminal may include a Point-Of-Sale (POS) device, a Point-Of-Purchase (POP) device, a Point-Of-Interaction (POI) device, and the like. In one implementation, each payment account associated with the plurality of merchants 126a-126c may be managed by an acquiring bank (e.g., the acquirer server 130).
In one embodiment, the acquirer server 130 is associated with a financial institution (e.g., a bank) that processes financial transactions. The acquirer server 130 can be an institution that facilitates the processing of payment transactions for the plurality of merchants 126a-126c, or an institution that owns platforms to make online purchases or purchases made via software applications possible (e.g., shopping cart platform providers and in-app payment processing providers). The terms “acquirer”, “acquirer bank”, “acquiring bank”, or “acquirer server” will be used interchangeably herein.
In one non-limiting example, a cardholder opens a merchant application on their user device to perform a payment transaction using stored information of a primary account number (PAN) and/or a payment card associated therewith to purchase goods or services offered by a merchant (e.g., the merchant 126a). After the cardholder submits a payment from the payment account associated with the payment card, a payment authorization request associated with the submitted payment is authorized in near real-time, and the required funds associated with the payment are kept pending for the payment transaction. The funds are then debited from the payment account associated with the cardholder and credited to the payment account associated with the merchant (e.g., the merchant 126a). The funds are exchanged in place of goods or services provided by the merchant 126a to the cardholder.
In general (not in accordance with embodiments of the present disclosure), there are many analytical artificial intelligence (AI) models used to predict different financial indicators for such payment transactions. For example, a fraud risk model is used to detect whether a payment transaction is fraudulent or not. Such analytical AI models are developed mainly based on card-level features or static transaction features of such payment transactions. However, the analytical AI models do not consider sequential transaction patterns of the cardholders and therefore fall short in providing accurate predictions.
To overcome the above-mentioned issues, the server system 122 is configured to perform one or more of the operations described herein. In one implementation, the server system 122 is identical to the server system 102. The server system 122 is configured to train a global neural network model for dynamic tabular data. The global neural network model includes initial field embedding layers and a hierarchical BERT model 142. The initial field embedding layers convert the transaction fields of each payment transaction into a set of embeddings that is further provided as an input to the two-level transformer architecture (i.e., the hierarchical BERT model 142). The first-level transformer applies attention at the row level, and the second-level transformer applies attention to the transaction sequence to learn inter-transaction dependencies and generate a contextualized embedding corresponding to each payment transaction. The contextualized embeddings are then used as inputs for task-specific neural networks for performing downstream task predictions.
The server system 122 is a separate part of the environment 120 and may operate apart from (but still in communication with, for example, via the network 132) the plurality of cardholders 124a-124c, the plurality of merchants 126a-126c, and any third-party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 122 may actually be incorporated, in whole or in part, into one or more parts of the environment 120, for example, the cardholder 124a. In addition, the server system 122 should be understood to be embodied in at least one computing device in communication with the network 132, which may be specifically configured, via executable instructions, to perform as described herein, and/or embodied in at least one non-transitory computer-readable medium.
In one embodiment, the server system 122 coupled with a database 140 is embodied within the payment server 136; however, in other examples, the server system 122 can be a standalone component (acting as a hub) connected to the issuer server 128. The database 140 may be incorporated in the server system 122, may be an individual entity connected to the server system 122, or may be a database stored in cloud storage. In one embodiment, the database 140 may store layer parameters (e.g., weights and biases) and hyperparameters associated with the hierarchical BERT model 142. In one example, the database 140 is identical to the database 110.
The transaction database 138 stores information of the plurality of payment transactions performed by the plurality of cardholders 124a-124c. For example, the transaction database 138 may store authorization, clearing, and/or chargeback data associated with the cardholders 124a-124c. The transaction database 138 may store historical payment transaction data associated with the cardholders 124a-124c. This information may be transmitted to another computing device, for example, the server system 122 described herein according to various example embodiments.
In one embodiment, the payment network 134 may be used by the payment card issuing authorities as a payment interchange network. The payment network 134 may include a plurality of payment servers such as the payment server 136. Examples of the payment interchange network include, but are not limited to, the Mastercard® payment system interchange network. The Mastercard® payment system interchange network is a proprietary communications standard promulgated by Mastercard International Incorporated® for the exchange of financial transactions among a plurality of financial institutions that are members of Mastercard International Incorporated®. (Mastercard is a registered trademark of Mastercard International Incorporated located in Purchase, N.Y.).
The number and arrangement of systems, devices, and/or networks shown in FIG. 1B are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1B. Furthermore, two or more systems or devices shown in FIG. 1B may be implemented within a single system or device, or a single system or device shown in FIG. 1B may be implemented as multiple, distributed systems or devices. Additionally, or alternatively, a set of systems (e.g., one or more systems) or a set of devices (e.g., one or more devices) of the environment 120 may perform one or more functions described as being performed by another set of systems or another set of devices of the environment 120.
Referring now to FIG. 2, a simplified block diagram of a server system 200 is shown, in accordance with an embodiment of the present disclosure. The server system 200 is identical to the server system 102 of FIG. 1A or the server system 122 of FIG. 1B. In some embodiments, the server system 200 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture.
In one embodiment, the server system 200 is configured to perform various downstream task-specific predictions (e.g., financial downstream task-specific predictions) for entities (e.g., the cardholders 124a-124c) by training a global neural network model based on dynamic data (e.g., dynamic cardholder transaction data) arranged in a tabular fashion. The server system 200 is configured to implement unsupervised learning methods for multivariate time-series data (e.g., transaction data) that requires modeling both inter-relational and intra-relational dependencies.
The server system 200 includes a computer system 202 and a database 204. The computer system 202 includes at least one processor 206 for executing instructions, a memory 208, a communication interface 210, and a storage interface 214 that communicate with each other via a bus 212.
In some embodiments, the database 204 is integrated within the computer system 202. For example, the computer system 202 may include one or more hard disk drives as the database 204. The storage interface 214 is any component capable of providing the processor 206 with access to the database 204. The storage interface 214 may include, for example, an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a RAID controller, a SAN adapter, a network adapter, and/or any component providing the processor 206 with access to the database 204. In one embodiment, the database 204 is configured to store a hierarchical BERT model 228 and a task-specific neural network model 230. The hierarchical BERT model 228 is identical to the hierarchical BERT model 112 of FIG. 1A or the hierarchical BERT model 142 of FIG. 1B. In an example, the database 204 is identical to the database 110 of FIG. 1A. In another example, the database 204 is identical to the database 140 of FIG. 1B.
Examples of the processor 206 include, but are not limited to, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphical processing unit (GPU), a field-programmable gate array (FPGA), and the like. The memory 208 includes suitable logic, circuitry, and/or interfaces to store a set of computer-readable instructions for performing operations. Examples of the memory 208 include a random-access memory (RAM), a read-only memory (ROM), a removable storage drive, a hard disk drive (HDD), and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 208 in the server system 200, as described herein. In another embodiment, the memory 208 may be realized in the form of a database server or cloud storage working in conjunction with the server system 200, without departing from the scope of the present disclosure.
The processor 206 is operatively coupled to the communication interface 210 such that the processor 206 is capable of communicating with a remote device 216 such as, the payment server 136, the issuer server 128, or communicating with any entity connected to the network 132 (as shown in FIG. 1B). In one embodiment, the processor 206 is configured to access tabular data (e.g., payment transaction data) associated with an entity (e.g., the cardholder 124a) from a database (e.g., the transaction database 138).
It is to be noted that the server system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is to be noted that the server system 200 may include fewer or more components than those depicted in FIG. 2.
In one embodiment, the processor 206 includes a data pre-processing engine 218, a feature generation engine 220, a pre-training engine 222, a fine-tuning engine 224, and a task prediction engine 226. It should be noted that components, described herein, such as the data pre-processing engine 218, the feature generation engine 220, the pre-training engine 222, the fine-tuning engine 224, and the task prediction engine 226 can be configured in a variety of ways, including electronic circuitries, digital arithmetic and logic blocks, and memory systems in combination with software, firmware, and embedded technologies.
With reference to FIG. 1A, the data pre-processing engine 218 includes suitable logic and/or interfaces for accessing historical tabular data associated with the entities 104a-104c from the entity database 106. In some examples, the historical tabular data may represent interactions, associations, relations, transactions, and the like. In particular, the data pre-processing engine 218 may query the entity database 106 to access a table storing information associated with an entity (e.g., the entity 104b).
With reference to FIG. 1B, the data pre-processing engine 218 includes suitable logic and/or interfaces for accessing historical payment transaction data associated with the cardholders 124a-124c from the transaction database 138. In particular, the data pre-processing engine 218 may query the transaction database 138 for the plurality of payment transactions performed by the cardholders 124a-124c based on account identifiers of the cardholders 124a-124c. In some implementations, the data pre-processing engine 218 may access the plurality of payment transactions associated with the cardholders 124a-124c from the issuer server 128.
In one example, the plurality of payment transactions is represented in the form of a table (see Table 1). The plurality of payment transactions may include a sequence of card transactions for a particular cardholder. Each row represents a particular payment transaction and consists of transaction fields that can be continuous or categorical, and each column indicates one transaction field associated with the payment transactions. In general, the plurality of payment transactions constitutes dynamic time-series data.
Card Year Month Day Time Amount Merchant city Zip MCC Cross Border Is Fraud?
0 2021 2 1 06:21 $134.09 La Verne 91750 5300 No No
0 2021 4 10 06:42 $38.48 Monterey Park 91754 5411 Yes No
0 2021 5 21 06:22 $120.34 La Verne 91754 5411 No No
0 2021 8 20 17:45 $128.95 Monterey Park 91754 5651 Yes No
0 2021 9 30 06:23 $104.71 La Verne 91750 5912 No No
0 2021 11 26 13:53 $86.19 Monterey Park 91755 5970 Yes No
0 2021 12 08 05:51 $93.84 Monterey Park 91754 5411 Yes No
Table 1
It can be noted that all rows are transactions of a particular cardholder with the same card, and each row is an independent record. To learn the embeddings of a particular cardholder, the processor 206 is configured to learn context both within each transaction (intra-transaction) and across transactions (inter-transaction). This is an example of dynamic tabular data. In order to unlock the potential of language modeling techniques for tabular data, the data pre-processing engine 218 is configured to quantize continuous fields so that each transaction field is defined on its own local finite vocabulary. In a similar manner, information associated with an entity (e.g., the entity 104c) can be represented in the form of tabular data. The rows of the tabular data can represent associations, interactions, transactions, and the like of a particular entity, where each row is independent. Also, the columns may represent the set of data fields associated with the entities 104a-104c.
In some implementations, a number of the plurality of data fields is pre-defined based, at least in part, on one or more downstream prediction tasks. For example, the number and type of the plurality of data fields may vary based on various downstream prediction tasks.
In an example, the plurality of entries (e.g., payment transactions) may be represented in a table as T1, T2… TN. In addition, each entry (e.g., entry T1) is then segmented into a plurality of data fields. The plurality of data fields may be represented as F1, F2…FN.
In some non-limiting examples, the plurality of transaction fields may include a cross-border flag, payment type, payment amount, merchant category code (MCC), terminal flag, unique merchant identifier, geo-location, payment means, timestamp information, and the like. In some other non-limiting examples, the plurality of transaction fields may include merchant country, merchant state, merchant city, merchant location ID, payment currency, acquiring bank, acquiring country, issuing bank, issuing country, and the like. In yet some other non-limiting examples, the plurality of transaction fields may include card product type, e-commerce indicator, contactless payment flag, recurring transaction flag, user presence flag, and the like.
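As a non-limiting illustration, a single entry can be segmented into named data fields before quantization and embedding. The field names and values below are hypothetical examples drawn from Table 1 and are not part of the claimed model:

```python
# One entry (payment transaction) segmented into named data fields.
# Field names are illustrative assumptions, not claimed identifiers.
entry = {
    "amount": 134.09,
    "mcc": 5300,
    "merchant_city": "La Verne",
    "cross_border": "No",
    "timestamp": "2021-02-01T06:21",
}

# The ordered (field, value) pairs play the role of F1..FN for this entry.
fields = list(entry.items())
```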
In some implementations, the data pre-processing engine 218 is configured to perform operations (such as data-cleaning, normalization, feature extraction, and the like) on entries (e.g., payment transactions) associated with each entity (e.g., the cardholder 124a). In one example, the data pre-processing engine 218 is configured to input raw transaction history of cardholders associated with the issuer server 128 as an input to the feature generation engine 220.
With reference to FIG. 1A, the feature generation engine 220 includes suitable logic and/or interfaces for generating a set of features via initial field embedding layers. In general, the initial field embedding layers include multiple neural network layers. The initial field embedding layers are configured to generate a set of embeddings based on the plurality of entries. In particular, the feature generation engine 220 is configured to provide the plurality of data fields corresponding to each entry as an input to the initial field embedding layers.
With reference to FIG. 1B, the feature generation engine 220 includes suitable logic and/or interfaces for generating a set of transaction features via the initial field embedding layers. In general, the initial field embedding layers include multiple neural network layers. The initial field embedding layers are configured to generate a set of embeddings based on the plurality of payment transactions. In particular, the feature generation engine 220 is configured to provide the plurality of transaction fields corresponding to each payment transaction as an input to the initial field embedding layers.
In one implementation, the set of transaction features may be divided into categorical features and continuous features. In one example, for categorical features, each category can act as a word, and ‘n’ words can be derived from the categories. In one example, the continuous features can be discretized into ‘n’ bins, and words can then be derived from the bins. The set of transaction features may be determined from or engineered from the plurality of payment transactions. In an example, the initial field embedding layers are configured to convert each payment transaction into a 64-length vector. In another example, the initial field embedding layers are configured to convert each payment transaction into a 768-length vector.
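The binning of continuous fields into a local finite vocabulary can be sketched as follows. This is a minimal illustration assuming quantile-based bin edges; the `make_bins` and `quantize` helpers are hypothetical, and embodiments may use any quantization rule:

```python
import numpy as np

def make_bins(values, n_bins):
    """Compute quantile bin edges so a continuous field gets its own
    finite vocabulary of n_bins 'words' (hypothetical helper)."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantiles
    return np.quantile(values, qs)

def quantize(value, edges, field_name):
    """Map a continuous value to a word such as 'amount_bin_3'."""
    bin_id = int(np.searchsorted(edges, value))
    return f"{field_name}_bin_{bin_id}"

# Amounts taken from Table 1 as sample data
amounts = [134.09, 38.48, 120.34, 128.95, 104.71, 86.19, 93.84]
edges = make_bins(amounts, n_bins=4)
words = [quantize(a, edges, "amount") for a in amounts]
```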
In an example, for 30 payment transactions, the initial field embedding layers are configured to generate a 30 × 768 embedding (i.e., one 768-length vector for each of the 30 payment transactions). In another example, for 10 payment transactions, the initial field embedding layers are configured to generate a 10 × 768 embedding. The set of embeddings generated from the initial field embedding layers is then fed as an input to the pre-training engine 222.
In one embodiment, the initial field embedding layers are part of the global neural network model and layer parameters of the initial field embedding layers also get updated during pre-training of the global neural network model.
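The initial field embedding layers can be illustrated as trainable lookup tables, one per transaction field. The sketch below is a simplified, non-trainable stand-in (random weights, toy one-word vocabularies, mean pooling into one 768-length vector per transaction) and is not the claimed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class FieldEmbedding:
    """A minimal stand-in for one initial field embedding layer: a lookup
    table from a field's local vocabulary to vectors. In the global model,
    this table would be updated during pre-training."""
    def __init__(self, vocab, dim):
        self.index = {word: i for i, word in enumerate(vocab)}
        self.table = rng.normal(size=(len(vocab), dim))

    def __call__(self, word):
        return self.table[self.index[word]]

# Hypothetical 5-field transaction; one embedding layer per field
fields = ["amount_bin_2", "mcc_5411", "cross_border_no", "city_la_verne", "zip_91750"]
layers = [FieldEmbedding([w], dim=768) for w in fields]  # toy one-word vocabularies
embeddings = [layer(w) for layer, w in zip(layers, fields)]

# Pool the per-field vectors into a single 768-length transaction vector
transaction_vec = np.mean(embeddings, axis=0)
```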
In one example, the feature generation engine 220 may use natural language processing (NLP) algorithms to generate the set of features. The set of features may be converted into a vector format to be fed as an input to the initial field embedding layers.
Additionally, or alternatively, the feature generation engine 220 may also generate static features (e.g., static card-level features or transaction features) associated with each entity (e.g., the entity 104a). In some examples, the static transaction features may include time-series features extracted over a year per merchant type, purchased product information, merchant country, transaction type, etc.
The pre-training engine 222 includes suitable logic and/or interfaces for training the global neural network model based, at least in part, on the historical tabular data (e.g., the historical payment transaction data). The global neural network model includes the initial field embedding layers and the hierarchical BERT model 228. The hierarchical BERT model 228 is implemented based on the BERT neural network architecture.
The term “BERT” herein stands for Bidirectional Encoder Representations from Transformers. In general, BERT extends transformers by provisioning bidirectional training. Rather than handling a sequence in a strictly left-to-right or right-to-left fashion, BERT attends to the entire sequence at once, such that each element is encoded using both its left and right context.
The global neural network model includes one or more neural network layers including: (i) the initial field embedding layers, (ii) a field transformer layer, and (iii) a sequence encoding transformer layer. With reference to FIG. 1A, the global neural network model applies attention at the row level (i.e., the entry field level) and on the entry sequence to learn both intra- and inter-entry dependencies to generate a contextualized embedding corresponding to each entry. With reference to FIG. 1B, the global neural network model applies attention at the row level (i.e., the transaction field level) and on the transaction sequence to learn both intra- and inter-transaction dependencies to generate a contextualized embedding corresponding to each payment transaction. The contextualized embeddings are then used as inputs to perform various downstream task predictions (e.g., fraud detection, customer attrition prediction, etc.).
At first, the pre-training engine 222 inputs the set of embeddings corresponding to each entry into the field transformer layer. In other words, the pre-training engine 222 is configured to input the plurality of data fields or transaction fields (e.g., F1, F2… FN) associated with each entry or each payment transaction (e.g., T1) to the field transformer layer of the hierarchical BERT model 228. The field transformer layer of the hierarchical BERT model 228 is configured to determine a plurality of intra-relational embeddings (interchangeably referred to as intra-transactional embeddings in case of payment transactions) corresponding to the set of data fields or the set of transaction fields of each entry or each payment transaction. In an example, the field transformer layer may include one or more field transformers. With reference to FIG. 1A, the plurality of intra-relational embeddings of each entry represents relationships among the set of data fields of each entry. With reference to FIG. 1B, the plurality of intra-transactional embeddings of each payment transaction represents relationships among the set of transaction fields of each payment transaction.
In one implementation, the number of the plurality of intra-relational embeddings is equal to the number of the set of data fields. In particular, the field transformer layer is configured to generate one intra-relational embedding for each data field of the entry. In an example, the plurality of intra-relational embeddings may be represented as F1’, F2’…FN’.
Thereafter, the pre-training engine 222 is configured to provide the plurality of intra-relational embeddings for each entry as an input to a sequence encoding transformer layer of the hierarchical BERT model 228. The pre-training engine 222 is configured to determine a plurality of inter-relational embeddings (interchangeably referred to as inter-transactional embeddings in case of payment transactions) corresponding to each entry or payment transaction based, at least in part, on the sequence encoding transformer layer of the hierarchical BERT model 228. The sequence encoding transformer layer may include one or more sequence encoding transformers. The plurality of inter-relational embeddings represents temporal relationships among the plurality of entries. In an example, the plurality of entries may be represented as T1, T2…TN and the plurality of inter-relational embeddings may be represented as T1’, T2’… TN’.
In an example, suppose that the number of data fields in the set is 5 (e.g., F1, F2…F5) and the number of entries in the plurality of entries (e.g., T1, T2…T30) is 30. Therefore, the pre-training engine 222 is configured to provide a vector of length 30 × 768 as an input to the hierarchical BERT model 228, since the number of entries is 30 and each entry is converted into a 768-length vector. Further, the field transformer layer of the hierarchical BERT model 228 is configured to determine encoded transactions (i.e., a plurality of intra-relational embeddings) corresponding to each entry to represent the relationships among the 5 fields of each entry. In particular, the field transformer layer of the hierarchical BERT model 228 is configured to generate 5 intra-relational embeddings (e.g., intra-transactional embeddings or encoded transactions) (e.g., F1’, F2’, F3’, F4’, and F5’) for each entry (e.g., payment transaction) of the 30 entries fed as an input to the hierarchical BERT model 228.
The sequence encoding transformer layer is configured to determine the plurality of inter-relational embeddings (e.g., 30 inter-transactional embeddings) (e.g., T1’, T2’…T30’) for the 30 entries (e.g., T1, T2…T30). Moreover, the 30 inter-relational embeddings may then be analyzed to perform one or more downstream prediction tasks.
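The two-level attention flow described above can be sketched as follows, using the example dimensions of 30 entries, 5 fields, and 768-length vectors. The single-head attention below omits the learned projections, feed-forward sublayers, and residual connections of a full transformer layer, so it is an illustrative sketch rather than the claimed model:

```python
import numpy as np

rng = np.random.default_rng(1)

def self_attention(X):
    """Single-head scaled dot-product self-attention over the rows of X.
    A real transformer layer would add Q/K/V projections, an FFN, and
    residual connections; those are omitted here for brevity."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

n_entries, n_fields, dim = 30, 5, 768
field_embeds = rng.normal(size=(n_entries, n_fields, dim))

# Level 1: field transformer - attention across the 5 fields of each entry
# yields the intra-relational embeddings F1'..F5' per entry.
intra = np.stack([self_attention(entry) for entry in field_embeds])

# Pool each entry's field embeddings into one vector per entry.
entry_vecs = intra.mean(axis=1)

# Level 2: sequence encoding transformer - attention across the 30 entries
# yields the inter-relational embeddings T1'..T30'.
inter = self_attention(entry_vecs)
```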
In general, the sequence length of the hierarchical BERT model 228 is a predetermined hyperparameter. Sometimes, the number of entries of a particular entity may be less than the predetermined sequence length during the training phase, which may lead to a cold start problem. To counter this problem, the pre-training engine 222 is configured to include dummy or pseudo entries in the input training data. In one implementation, the pre-training engine 222 may formulate synthetic entry data of the entity 104a.
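Padding a short history with pseudo entries can be sketched as follows; the `pad_entries` helper and the `<PAD>` token are hypothetical:

```python
def pad_entries(entries, seq_len, pad_entry):
    """Left-pad an entity's history with pseudo entries when it is shorter
    than the model's fixed sequence length; truncate to the most recent
    seq_len entries otherwise (hypothetical helper)."""
    if len(entries) >= seq_len:
        return entries[-seq_len:]
    return [pad_entry] * (seq_len - len(entries)) + entries

history = ["t1", "t2", "t3"]           # only 3 real entries
padded = pad_entries(history, seq_len=5, pad_entry="<PAD>")
# padded == ["<PAD>", "<PAD>", "t1", "t2", "t3"]
```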
In order to learn representations for multivariate tabular data, the pre-training engine 222 may employ at least one of the following training methods: (i) masked language modelling (MLM), and (ii) replaced token detection (RTD). It is to be noted that the selection of the pre-training method (i.e., MLM or RTD) depends on the downstream prediction task, and the hierarchical BERT model 228 can be trained based on only one of the MLM method or the RTD method at a time. A detailed explanation of pre-training of the hierarchical BERT model 228 using the MLM method or the RTD method is hereinafter provided with reference to FIG. 6, and therefore, it is not reiterated here for the sake of brevity.
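The input-corruption step of MLM pre-training can be sketched as follows; the 15% mask probability and the field tokens are illustrative assumptions, not claimed values:

```python
import random

random.seed(7)
MASK = "<MASK>"

def mask_fields(fields, mask_prob=0.15):
    """MLM-style input corruption: randomly replace field tokens with a
    mask token; during pre-training the model learns to recover them."""
    masked, targets = [], []
    for tok in fields:
        if random.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)       # label the model must predict
        else:
            masked.append(tok)
            targets.append(None)      # no loss on unmasked positions
    return masked, targets

fields = ["amount_bin_2", "mcc_5411", "cross_border_no", "city_91750"]
masked, targets = mask_fields(fields)
```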
The fine-tuning engine 224 includes suitable logic and/or interfaces for fine-tuning the pre-trained hierarchical BERT model 228 based, at least in part, on task-specific training data of a particular downstream prediction task. The task-specific training data herein refers to training data corresponding to a particular downstream task. In an example, the task-specific training data for a downstream task A can be different from the task-specific training data for a downstream task B. In one example, the task-specific training data may include past payment transactions of cardholders with labeled task-specific indicators. For example, for a fraud detection task, the task-specific training data may include the past payment transactions of the cardholders with corresponding fraud indicators.
In an example, the downstream prediction task may correspond to the prediction of contactless transactions. In another example, the downstream prediction task may correspond to the prediction of cross-border transactions. In yet another example, the downstream prediction task may correspond to the prediction of wallet transactions.
Upon providing the task-specific training data to the global neural network model or the hierarchical BERT model 228, the fine-tuning engine 224 is configured to determine a contextualized embedding (i.e., an inter-relational embedding) corresponding to each of the past entries based, at least in part, on an output of the hierarchical BERT model 228. Thereafter, the fine-tuning engine 224 is configured to concatenate the contextualized embedding with static features corresponding to the respective entry to determine a plurality of modified embeddings (with reference to FIG. 1A). With reference to FIG. 1B, the fine-tuning engine 224 is configured to concatenate the contextualized embedding with static transaction features corresponding to the respective payment transaction to determine a plurality of modified embeddings. The plurality of modified embeddings is then fed as an input to a task-specific neural network (e.g., LSTM, MLP, etc.). Further, back-propagation is performed through the task-specific neural network down to the initial field embedding layers and the layers of the hierarchical BERT model 228 to update the neural network parameters (e.g., weights, biases, etc.) based on a loss function (e.g., a cross-entropy (CE) loss function, a mean squared error (MSE) loss function, etc.). It is to be noted that the objective of the back-propagation is to minimize the loss function of the task-specific neural network.
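The fine-tuning flow described above can be sketched as follows. The feature sizes, the logistic head standing in for the task-specific neural network, and the single gradient step are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Contextualized embedding from the pre-trained model (768-length here)
# concatenated with hypothetical static features (10 engineered values)
contextual = rng.normal(size=768)
static = rng.normal(size=10)
modified = np.concatenate([contextual, static])   # modified embedding

# A minimal task-specific head: a logistic layer for a binary task such
# as fraud detection; embodiments may instead use an LSTM or MLP.
w = rng.normal(size=modified.shape[0]) * 0.01
b = 0.0
prob = sigmoid(modified @ w + b)

# One step of the cross-entropy gradient for the head's weights; in the
# full model this gradient would be back-propagated further into the
# hierarchical BERT layers and initial field embedding layers.
label = 1.0
grad_w = (prob - label) * modified
```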
A detailed explanation of fine-tuning the hierarchical BERT model 228 is hereinafter provided with reference to FIG. 7, and therefore, it is not reiterated here for the sake of brevity.
The task prediction engine 226 includes suitable logic and/or interfaces for analyzing the plurality of inter-relational embeddings to perform the one or more downstream prediction tasks. In some examples, the one or more downstream prediction tasks may include prediction of contactless transactions, prediction of cross-border transactions, prediction of wallet transactions, and the like.
In an example, the task prediction engine 226 may transmit one or more notifications to the issuer server 128 or the payment server 136 based, at least in part, on the one or more downstream prediction tasks. In some examples, the one or more notifications may be transmitted based on the one or more downstream prediction tasks such as identifying cardholder segments that are likely to attrite, identifying cardholder segments that are likely to perform fraudulent transactions, and the like.
In one experiment, the hierarchical BERT model 228 is compared with a deep learning model (DLM). The DLM model is a neural network model already known in the art. The downstream prediction task is to predict whether a cardholder is likely to perform a cross-border transaction in the next quarter or not. For pre-training the hierarchical BERT model 228, payment transactions performed through 2.5 million payment cards are used as input training data. In addition, the sequence length is selected as 30 and the stride is selected as 15. Therefore, in a first batch, payment transactions 1 to 30 are input to the hierarchical BERT model 228. Further, in a second batch, payment transactions 16 to 45 are input to the hierarchical BERT model 228. Furthermore, in a third batch, payment transactions 31 to 60 are input to the hierarchical BERT model 228, and so on. The transaction fields (or the transaction features) used corresponding to the downstream prediction task (e.g., prediction of cross-border transaction) are “dw_net_pd_amt”, “industry_code”, “CrossBorderFlag”, “payment_type”, and “flag_atm”. For fine-tuning the hierarchical BERT model 228, payment transactions performed through 3.3 million payment cards are used as an input (i.e., training transactions) with an 80:20 split. The sequence length is selected as 30. In addition, payment transactions performed through 255 million payment cards are used as an input (i.e., testing transactions).
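The sliding-window batching described above can be sketched as follows (a minimal illustration using 0-indexed, half-open windows; the function name and exact indexing convention are assumptions, not part of the specification):

```python
def window_indices(num_transactions, seq_len=30, stride=15):
    """Yield (start, end) index pairs for overlapping training windows.

    Mirrors the batching described above: a sequence length of 30 with a
    stride of 15 covers transactions 1-30, 16-45, 31-60, and so on
    (expressed here as half-open 0-indexed slices).
    """
    starts = range(0, max(num_transactions - seq_len + 1, 1), stride)
    return [(s, s + seq_len) for s in starts]

# For a card with 60 transactions, three overlapping windows are produced:
print(window_indices(60))  # [(0, 30), (15, 45), (30, 60)]
```

Each pair can then be used to slice a cardholder's transaction sequence into one batch element.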
The hierarchical BERT model 228 was compared with the DLM model based on the performance metrics recall and precision. It was observed that, to achieve similar values of recall and precision, the DLM model had to be trained on 732 features whereas the hierarchical BERT model 228 was trained on only 5 features, thereby reducing feature-engineering effort and training time. The results of the comparison are illustrated below in Table 2. Also, it is to be noted that an increase in the number of features can significantly improve the performance metrics of the hierarchical BERT model 228 as compared to the DLM model.
Experiment model                        Recall   Precision
DLM model (732 features)                0.89     0.02
Hierarchical BERT model (5 features)    0.89     0.02
Table 2: Performance metrics comparison between the DLM model and the hierarchical BERT model
In some implementations, during the execution phase, the server system 200 is configured to access past entry data or past transaction data associated with an entity (e.g., the entity 104a) or a cardholder from a database (e.g., the entity database 106 or the transaction database 138). More specifically, the past entry data is processed (e.g., through the data pre-processing engine 218) and sent through the two-level transformer architecture of the hierarchical BERT model 228.
In one example, the server system 200 is configured to store layer parameters of the global neural network model in a quick-access master table in the database 204. In some implementations, a client (through a client device) sends a request to the server system 200 to retrieve predictions or embeddings for certain cardholders/dates for a particular downstream task. Based on the request received from the client, the server system 200 is configured to access the quick-access master table stored in the database 204 to fine-tune the global neural network model for the particular downstream task based on task-specific training data.
Thus, after pre-training, the global neural network model has learned generalized representations of sequential transaction patterns for different entities or cardholders that can be used in downstream task predictions. During fine-tuning, the global neural network model can be combined with the task-specific neural network in a plug-and-play manner.
FIG. 3 is a schematic representation 300 of a neural network architecture of the hierarchical BERT model 228, in accordance with an embodiment of the present disclosure.
With reference to FIG. 1B, the processor 206 is configured to implement the hierarchical BERT model 228 to learn inter-transactional as well as intra-transactional dependencies and relationships. The hierarchical BERT model 228 includes a two-level transformer architecture in a hierarchical manner (as shown in FIG. 3).
The hierarchical BERT model 228 includes a transaction field transformer layer 302 and a transaction sequence encoding transformer layer 304. The transaction field transformer layer 302 includes a plurality of transaction field transformers. The transaction field transformer layer 302 is configured to process the rows of the dynamic tabular time-series transaction data individually i.e., process a plurality of transaction fields in each of the plurality of payment transactions. Moreover, the transaction field transformer layer 302 is configured to generate a plurality of intra-transactional embeddings 306 (i.e., encoded transactions) based on the set of embeddings corresponding to the transaction fields of each of the plurality of payment transactions that are generated by the initial field embedding layers. In particular, the transaction field transformer layer 302 is configured to capture intra-transactional relationships (i.e., local relationship among various transaction fields). For example, the transaction field transformer layer 302 is configured to contextualize how an e-commerce transaction changes the interpretation of a cross-border transaction. In particular, the transaction field transformer layer 302 is configured to output an 'encoding' of each transaction, contextualizing all the information contained in a transaction such that the information can further be used for prediction.
The plurality of intra-transactional embeddings 306 is then fed as an input to the transaction sequence encoding transformer layer 304. As explained above, the transaction sequence encoding transformer layer 304 is configured to generate a plurality of inter-transactional embeddings (i.e., contextualized embeddings, see, 308). In one implementation, the transaction sequence encoding transformer layer 304 is configured to output an n-dimensional encoding vector per transaction. The plurality of inter-transactional embeddings 308 is then processed or analyzed to perform the one or more downstream prediction tasks. In particular, the transaction sequence encoding transformer layer 304 is configured to capture inter-transaction relationships (i.e., global relationship among the plurality of transactions). For example, the transaction sequence encoding transformer layer 304 is configured to contextualize how an e-commerce airline payment transaction performed 3 months ago relates to a card present (CP) taxi fare payment transaction performed this week. In particular, the transaction sequence encoding transformer layer 304 is configured to output an 'encoding' of the transaction sequence that deeply understands the cardholder’s history (e.g., prior behavior). The plurality of inter-transactional embeddings 308 can be then fed as an input into a task-specific neural network model 310 (e.g., logistic regression, classification, etc.) to perform various downstream prediction tasks.
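The two-level flow described above can be sketched in NumPy as a shape-level illustration (the identity query/key/value projections and the mean-pooling of field encodings into one vector per transaction are simplifying assumptions made here; the actual model uses learned projections and transformer layers):

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections
    (an illustrative simplification; the model uses learned projections)."""
    k = X.shape[-1]
    scores = X @ X.T / np.sqrt(k)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)      # softmax over the last axis
    return A @ X

T, F, E = 30, 5, 16                          # transactions, fields, embedding size
rng = np.random.default_rng(0)
fields = rng.normal(size=(T, F, E))          # output of the initial field embedding layers

# Level 1: the transaction field transformer contextualizes the fields
# within each transaction (intra-transactional embeddings 306).
intra = np.stack([self_attention(fields[t]) for t in range(T)])   # (T, F, E)

# Pool the field axis into one encoding per transaction (an assumed pooling
# step; the text only states that one vector per transaction is produced).
encoded = intra.mean(axis=1)                                      # (T, E)

# Level 2: the sequence encoding transformer contextualizes the
# transactions against each other (inter-transactional embeddings 308).
inter = self_attention(encoded)                                   # (T, E)
print(inter.shape)                                                # (30, 16)
```

The resulting per-transaction vectors correspond to the n-dimensional encodings that are fed to the task-specific neural network model 310.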
FIG. 4 is a schematic representation 400 of a transaction field transformer included in a transaction field transformer layer of the hierarchical BERT model 228, in accordance with an embodiment of the present disclosure. As explained above, a transaction field transformer 402 is configured to capture intra-transactional relationships (i.e., local) among the plurality of payment transactions of the cardholder 124a.
At first, the processor 206 is configured to convert each transaction feature (i.e., each field of the plurality of transaction fields) of a particular payment transaction X into an embedding space via initial field embedding layers 404. The set of embeddings is fed as an input to the transaction field transformer 402. With reference to FIG. 4, f(i) denotes the various transaction fields of the payment transaction X, and f(1) to f(5) denote five transaction fields (e.g., transaction features) of the payment transaction X. In one implementation, the transaction field transformer 402 is configured to apply a multi-headed attention mechanism, which runs an attention computation several times in parallel. Each of these parallel computations is called an attention head. The attention module splits its query, key, and value parameters N ways and passes each split independently through a separate head. All of these attention calculations are then combined to produce a final attention score. This is called multi-head attention, and it gives the transformer greater power to encode multiple relationships and nuances for each transaction field.
The schematic representation 400 specifically shows mathematical calculations performed to generate transactional embedding of transaction feature f (2) by incorporating the context of other features of the payment transaction X. In the transaction field transformer 402, a self-attention layer takes its input in the form of three parameters - key, query, and value. All three parameters are similar in structure, associated with each transaction field of the transaction fields. Each input embedding (i.e., each embedding of the set of embeddings) is projected on these matrices, to generate their corresponding key, query, and value vectors. The transaction field transformer 402 produces an encoded representation for each transaction field in the transaction fields that captures the meaning and position of each transaction field.
Let K ∈ R^(m×k) be the matrix denoting the key vectors of all the embeddings. Let Q ∈ R^(m×k) be the matrix denoting the query vectors of all the embeddings. Let V ∈ R^(m×v) be the matrix denoting the value vectors of all the embeddings. Here, m denotes the number of embeddings input to the transaction field transformer 402. Further, k and v denote the dimensions of the key and value vectors, respectively.
Moreover, each input embedding of the set of intra-transactional embeddings attends to all other embeddings (i.e., the remaining set of transactional embeddings) through an attention head. Mathematically, the attention head may be calculated as:
attention(K, Q, V) = A·V …Eqn. (1)
where A = softmax((Q·K^T)/√k). For each embedding, the attention matrix A ∈ R^(m×m) calculates how much that embedding attends to the other embeddings, thus transforming it into a contextual embedding. In Eqns. (2) and (3), the multi-head self-attention is denoted as MHSA, the feed-forward layer is denoted as FF, the initial field embedding layers are denoted as E(), and the layer normalization layer is denoted as LN; the intra-transactional embeddings can then be calculated using Equation 2 and Equation 3.
Z(i) = LN(MHSA(E(x1, x2, …, xm))) …Eqn. (2)
RE(i) = LN(FF(Z(i))) + Z(i) …Eqn. (3)
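Eqn. (1) can be sketched in NumPy as follows (a minimal single-head illustration; the learned projection matrices that would produce K, Q, and V from the field embeddings are replaced here by random inputs):

```python
import numpy as np

def attention(K, Q, V):
    """Scaled dot-product attention per Eqn. (1): attention(K, Q, V) = A·V,
    where A = softmax((Q·K^T)/√k)."""
    k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(k)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

m, k, v = 5, 8, 8                    # m field embeddings; key/value dimensions k and v
rng = np.random.default_rng(1)
K = rng.normal(size=(m, k))          # key vectors,   K ∈ R^(m×k)
Q = rng.normal(size=(m, k))          # query vectors, Q ∈ R^(m×k)
V = rng.normal(size=(m, v))          # value vectors, V ∈ R^(m×v)

out, A = attention(K, Q, V)
print(out.shape, A.shape)            # (5, 8) (5, 5); each row of A sums to 1
```

In the multi-head case, this computation is repeated per head on split projections and the results are concatenated.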
Thus, the transaction field transformer 402 is configured to output an embedding of dimension (number of transaction fields × embedding size) per transaction. In an example, the number of transactions in a sequence is denoted as T and the embedding size is denoted as E. Then, after passing through the transaction field transformer 402, the output dimension becomes T × number of fields × E. These embeddings only contain intra-row information (i.e., information about the various fields within a payment transaction).
FIG. 5 is a schematic representation 500 of a transaction sequence encoding transformer (e.g., the transaction sequence encoding transformer layer 304) of the hierarchical BERT model 228, in accordance with an embodiment of the present disclosure. As explained above, a transaction sequence encoding transformer 502 is configured to capture attention between different transactions (e.g., the plurality of transactions) associated with a cardholder of the plurality of cardholders 124a-124c.
In addition, a transaction field transformer 504 is configured to generate the set of intra-transactional embeddings. Let RE1, RE2, …, RET denote the set of intra-transactional embeddings generated for the plurality of transactions, one per transaction. Here, T denotes the number of transactions (i.e., rows). In an example, for 10 payment transactions, the transaction field transformer 504 is configured to generate 10 transactional embeddings. In another example, for 30 payment transactions, the transaction field transformer 504 is configured to generate 30 transactional embeddings.
The transaction sequence encoding transformer 502 is further configured to capture contextual information among the plurality of payment transactions. In particular, the set of transactional embeddings is passed through the next N layers of the transaction sequence encoding transformer 502 to generate contextualized embedding for each transaction that includes information about other transactions. More specifically, the intra-transactional embeddings (RE1-RET) of all the payment transactions are fed as input to the transaction sequence encoding transformer 502. The transaction sequence encoding transformer 502 then calculates self-attention between different transactions (i.e., the plurality of transactions) and generates the final embedding (i.e., the contextualized embedding) of a transaction. The transaction sequence encoding transformer 502 is then configured to generate the final embeddings (i.e., the plurality of inter-transactional embeddings) for the plurality of transactions. The plurality of inter-transactional embeddings is then fed as an input to the task prediction engine 226 (for example, multi-layer perceptron (MLP) layers) for performing various downstream prediction tasks.
With reference to FIG. 5, let X (i) denote the transactions (e.g., the encoded transactions output from the transaction field transformer 504). The schematic representation 500 shows mathematical calculations performed to generate contextualized embeddings of X (3) (i.e., third transaction) by incorporating field-level embeddings (i.e., transactional embeddings) of other transactions in the same sequence. In a similar manner, contextualized embeddings for the rest of the transactions can be calculated.
For the transaction sequence encoding transformer 502, if the multi-head self-attention is denoted as MHSA, feed-forward is denoted as FF, the initial field embedding layers are denoted as E (), and layer normalization is denoted as LN, then the embeddings (i.e., the contextualized embeddings) can be calculated using Equation 4 and Equation 5.
Z(i) = LN(MHSA(E(T1, T2, …, Tn))) …Eqn. (4)
SE(i) = LN(FF(Z(i))) + Z(i) …Eqn. (5)
The plurality of inter-transactional embeddings can then be fed as an input to the task prediction engine 226 to perform the one or more downstream prediction tasks.
FIG. 6 is a schematic representation 600 of a pre-training process of the global neural network model, in accordance with an embodiment of the present disclosure. As mentioned previously, the processor 206 is configured to pre-train the global neural network model in a self-supervised manner.
The plurality of inter-relational embeddings generated from the hierarchical BERT model 228 can directly be used on an end-to-end supervised labelled task. However, it is noted that the performance of the hierarchical BERT model 228 tends to improve when it is pre-trained on a self-supervision task. The server system 200 is configured to perform pre-training of the hierarchical BERT model 228 with either a masked language modeling (MLM) method or a replaced token detection (RTD) method based, at least in part, on the downstream prediction task.
For example, for a downstream prediction task A, pre-training the hierarchical BERT model 228 with the MLM method may result in better performance of the hierarchical BERT model 228. In another example, for a downstream prediction task B, pre-training the hierarchical BERT model 228 with the RTD method may result in better performance of the hierarchical BERT model 228.
With reference to FIG. 6, payment transactions 602 can be represented as t1, t2…tn. Here, n is a natural number. The payment transactions 602 may represent historical payment transactions performed by a cardholder.
In addition, the server system 200 is configured to segment each payment transaction into a plurality of transaction fields. For example, the transaction tn is segmented into the plurality of transaction fields f1, f2…fm. Here, m is also a natural number. The payment transactions 602 are then passed as an input through initial field embedding layers 604 to generate a set of embeddings corresponding to the payment transactions 602. In an example, the number of the set of embeddings is equal to a number of the payment transactions 602. Further, the set of embeddings is fed as an input to the hierarchical BERT model 228. Furthermore, the hierarchical BERT model 228 is configured to generate the plurality of inter-transactional embeddings 606. The plurality of inter-transactional embeddings 606 can be represented as t1′, t2′, …, tn′.
For pre-training the hierarchical BERT model 228 with the MLM method, the server system 200 is configured to mask one or more transaction fields of the payment transactions 602 with masked tokens. In one implementation, the server system 200 is configured to randomly select and mask the one or more transaction fields corresponding to the payment transactions 602 with masked tokens.
The server system 200 is then configured to generate the plurality of inter-transactional embeddings 606 based, at least in part, on the implementation of the hierarchical BERT model 228. Further, the server system 200 is configured to predict a set of actual values corresponding to the masked tokens based, at least in part, on a multi-layer perceptron (MLP) layer 608. In an example, the MLP layer 608 may include multiple MLP layers. The server system 200 is configured to update layer parameters (e.g., weights, biases, etc.) of the initial field embedding layers 604 and the hierarchical BERT model 228 based, at least in part, on a loss function. In one non-limiting example, the loss function is a cross-entropy (CE) loss function.
Generally, an “MLP” refers to a fully connected class of feed-forward artificial neural networks (ANNs). In addition, the MLP includes at least three layers: an input layer, a hidden layer, and an output layer. Generally, a “cross-entropy loss function” is a loss function used to optimize (or measure the performance of) classification models. In an example, the server system 200 is configured to update only the neural network parameters associated with the masked set of fields, back through the hierarchical BERT model 228 to the initial field embedding layers 604.
In an example, in MLM method, the server system 200 randomly selects 15% of the transaction fields of the payment transactions 602 and then replaces them with masked fields or tokens. Further, the masked tokens are passed as an input through the transaction field transformer layer and the transaction sequence encoding transformer layer of the hierarchical BERT model 228. The transaction field transformer layer and the transaction sequence encoding transformer layer facilitate learning of both intra and inter contextual row embeddings, respectively.
The MLP layer 608 predicts the original fields for the masked fields (i.e., masked tokens) from the contextual row embeddings outputted from the hierarchical BERT model 228. Moreover, the neural network parameters (e.g., weights and biases, etc.) of the initial field embedding layers 604 and the transformer layers of the hierarchical BERT model 228 are updated based on calculation of the loss function (e.g., cross-entropy loss function). It is to be noted that the cross-entropy loss function is minimized to perform the pre-training of the hierarchical BERT model 228.
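The MLM-style masking described above can be sketched as follows (a simplified illustration; the mask token and the exact sampling scheme are assumptions made here, and the embedding and prediction layers are omitted):

```python
import random

MASK = "[MASK]"   # an assumed mask token; the specification does not name one

def mask_fields(transactions, mask_prob=0.15, seed=42):
    """Randomly mask ~15% of transaction fields for MLM-style pre-training.

    Returns the masked copy plus (row, column, original value) targets that
    the MLP head is trained to recover from the contextual row embeddings.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for i, txn in enumerate(transactions):
        row = list(txn)
        for j, value in enumerate(row):
            if rng.random() < mask_prob:
                targets.append((i, j, value))
                row[j] = MASK
        masked.append(row)
    return masked, targets

txns = [["120.50", "5411", "N", "chip", "0"],
        ["80.00", "4121", "Y", "wallet", "0"]]
masked, targets = mask_fields(txns)   # ~15% of the 10 fields become "[MASK]"
```

The masked rows would then be embedded and passed through both transformer layers, with the cross-entropy loss computed against the recorded targets.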
In one embodiment, for pre-training the hierarchical BERT model 228 with RTD method, the server system 200 is configured to corrupt one or more transaction fields of the payment transactions 602. More specifically, a set of transaction fields is randomly chosen and then replaced with corrupted values (i.e., random false values).
The server system 200 is then configured to generate the plurality of inter-transactional embeddings 606 based, at least in part, on implementation of the hierarchical BERT model 228. Further, the server system 200 is configured to identify the corrupted set of transaction fields of the payment transactions 602 based, at least in part, on execution of the MLP layers 608. Furthermore, the server system 200 is configured to update neural network parameters (e.g., weights, biases, etc.) associated with the corrupted set of fields based, at least in part, on the loss function. In a non-limiting example, the loss function is a cross-entropy (CE) loss function.
In the RTD method, the server system 200 is configured only to determine whether a transaction field is corrupted or not and does not determine the actual value of the corresponding transaction field. However, in the MLM method, along with the determination of the masked field, the server system 200 is also configured to predict the actual value for the masked field. Therefore, the RTD method is more efficient than the MLM method; however, the MLM method is more accurate than the RTD method.
In an example, in the RTD method, the server system 200 replaces original features (i.e., original transaction fields) with a random value of that corresponding feature. Further, the transaction fields are passed as an input through the hierarchical BERT model 228 to generate the plurality of inter-transactional embeddings 606. Furthermore, the generated plurality of inter-transactional embeddings 606 is passed through the MLP layers 608. Moreover, in the RTD method, the MLP layer 608 only predicts whether a field is corrupted or not instead of predicting the original fields (i.e., original values). Therefore, the RTD method is more efficient than the MLM method since the RTD method is simply a binary classification task (to predict whether a transaction field is corrupted or not) whereas the MLM method is a multiclass classification problem (to predict actual value for the masked transaction field).
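The RTD-style corruption can be sketched as follows (a simplified illustration; the `field_values` mapping and the 15% corruption rate are assumptions, and a randomly drawn replacement may coincide with the original value, which a production implementation would typically resample):

```python
import random

def corrupt_fields(transactions, field_values, corrupt_prob=0.15, seed=7):
    """Replace ~15% of transaction fields with a random value drawn from that
    field's observed values, and emit binary labels (1 = corrupted) for the
    RTD head.  `field_values` maps each column index to its candidate values.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for txn in transactions:
        row, lab = list(txn), [0] * len(txn)
        for j in range(len(row)):
            if rng.random() < corrupt_prob:
                row[j] = rng.choice(field_values[j])
                lab[j] = 1   # binary classification target: field was replaced
        corrupted.append(row)
        labels.append(lab)
    return corrupted, labels
```

The MLP layer then only has to predict the binary label per field, which is why RTD pre-training is a cheaper objective than MLM's multiclass prediction.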
To pre-train the global neural network model based on either the MLM method or RTD method, the server system 200 is configured to provide the plurality of inter-transactional embeddings 606 corresponding to the set of transaction fields of each payment transaction to the MLP layers 608. The server system 200 is then configured to update layer parameters of the global neural network model up to the initial field embedding layers 604 based, at least in part, on a loss value of the MLP layers.
In one example, a number of the payment transactions 602 associated with a cardholder, accessed from the transaction database 138 is based on a predetermined sequence length. The server system 200 is configured to fine-tune the predetermined sequence length based, at least in part, on the particular downstream prediction task. For example, an authorized person (e.g., administrator) has access to modify or update the predetermined sequence length as per the requirement. In an example, for a downstream prediction task A, last 30 payment transactions of a cardholder C1 may be accessed from the transaction database 138. In another example, for a downstream task B, last 40 payment transactions of a cardholder C2 may be accessed from the transaction database 138.
The server system 200 is configured to determine whether a number of the series of the payment transactions 602 is less than the predetermined sequence length of the global neural network model. In an example, suppose the predetermined sequence length is set as 30; therefore, the last 30 payment transactions of the cardholder need to be accessed from the transaction database 138. However, suppose that the cardholder has only performed 25 transactions so far. In that case, transaction data related to 5 payment transactions is missing. This scenario can be referred to as the “cold start problem”.
In response to determining that the number of the series of the payment transactions 602 is less than the predetermined sequence length of the global neural network model, the server system 200 is configured to insert dummy payment transactions (e.g., pseudo payment transactions) in the input training data. The number of the dummy payment transactions is equal to the difference between the predetermined sequence length and the number of the series of the payment transactions. In an embodiment, each pseudo payment transaction may be a replica of a normal payment transaction performed by the cardholder in the past. In another embodiment, the server system 200 randomly generates each pseudo payment transaction. In yet another embodiment, the pseudo payment transactions may be replicas of normal payment transactions, randomly generated payment transactions, or a combination thereof.
For example, as stated in the above example, the predetermined sequence length is set as 30 whereas only last 25 payment transactions are available. In this scenario, the server system 200 must add dummy payment transactions on its own. Therefore, the server system 200 adds 5 dummy payment transactions since the difference between the predetermined sequence length (e.g., 30) and number of the series of the payment transactions 602 (e.g., 25) is 5. In this manner, the server system 200 is configured to overcome the “cold start problem” by adding dummy payment transactions to the series of the payment transactions 602.
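The padding strategy for the “cold start problem” can be sketched as follows (a minimal illustration; the default dummy-generation strategy of replicating an earlier transaction is only one of the embodiments described above, and the `make_dummy` hook is a hypothetical helper):

```python
def pad_sequence(transactions, seq_len=30, make_dummy=None):
    """Left-pad a cardholder's history with dummy (pseudo) transactions so the
    sequence matches the model's predetermined length.

    By default each dummy replicates the cardholder's earliest transaction,
    which is one of the strategies described above.
    """
    if make_dummy is None:
        make_dummy = lambda: list(transactions[0])
    missing = max(seq_len - len(transactions), 0)
    return [make_dummy() for _ in range(missing)] + list(transactions)

history = [[i, "field"] for i in range(25)]   # only 25 real transactions
padded = pad_sequence(history)                # 5 dummies + 25 real transactions
print(len(padded))                            # 30
```

A randomly generated or mixed dummy strategy can be supplied via `make_dummy` without changing the padding logic.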
FIG. 7 is a schematic representation 700 of a fine-tuning process of the global neural network model, in accordance with an embodiment of the present disclosure.
As mentioned previously, during fine-tuning, the processor 206 is configured to provide task-specific training data to the global neural network model. The task-specific training data may include past payment transactions (e.g., payment transactions 702) associated with the plurality of cardholders with labeled task-specific indicator. The labeled task-specific indicator may depend on the downstream prediction task. For example, the labeled task-specific indicator may indicate whether a payment transaction is fraudulent. With reference to FIG. 7, the payment transactions 702 can be represented as t1, t2…tn. Here, n is a natural number. The payment transactions 702 may represent historical payment transactions performed by a cardholder.
In addition, the server system 200 is configured to segment each payment transaction into a plurality of transaction fields. For example, the transaction tn is segmented into the plurality of transaction fields f1, f2…fm. Here, m is also a natural number. The payment transactions 702 are then passed as an input through initial field embedding layers 704 to generate a set of embeddings corresponding to the payment transactions 702. In one implementation, a number of the set of embeddings is equal to a number of the payment transactions 702. Further, the set of embeddings is fed as an input to the hierarchical BERT model 228. Furthermore, the hierarchical BERT model 228 is configured to determine the contextualized embedding 706 corresponding to each of the past payment transactions (i.e., the payment transactions 702) based, at least in part, on the output of the pre-trained global neural network model. The contextualized embedding 706 can be represented as t1′, t2′, …, tn′.
In one implementation, the hierarchical BERT model 228 is fine-tuned based, at least in part, on the downstream prediction task. In an example, the server system 200 is configured to fine-tune the hierarchical BERT model 228 based on prediction of fraudulent transactions. In another example, the server system 200 is configured to fine-tune the hierarchical BERT model 228 based on prediction of future cross-border payment transaction for the cardholder.
For fine-tuning the hierarchical BERT model 228, the server system 200 is configured to provide a combination of the contextualized embedding 706 and static transaction features corresponding to each of the past payment transactions as an input (see, 710) to the task-specific neural network 708. In an example, the static transaction features are concatenated with the contextualized embedding 706. In some implementations, the task-specific neural network 708 is a classification or regression model. In some non-limiting examples, the task-specific neural network model 708 is a multi-layer perceptron (MLP), a long short-term memory (LSTM) network, or the like.
In an example, the static transaction features include information of product/service/good purchased, country information, time aggregated card transaction features such as transaction count and amount aggregated at monthly and super industry levels. In addition, the static transaction features include time-series features extracted over a period of 12 months per super industry. The time-series features include, but may not be limited to, mean, median, minimum, maximum, range, standard deviation (SD), Interquartile range (IQR), upper quartile, lower quartile, kurtosis, skewness, and coefficient of variation (CV). The time series features may also include last 12 months count per super industry, last 12 months amount per super industry, total recurring count for last 12 months, total recurring amount for last 12 months, total cardholder present count for last 12 months, total cardholder present amount for last 12 months, and the like. Further, the static transaction features include card-present cross-border payment transactions performed in the last 3 months, etc.
Further, the server system 200 is configured to provide the combination of the contextualized embedding 706 and the static transaction features as an input to the task-specific neural network model 708. Examples of the task-specific neural network model 708 include, but are not limited to, long-short term memory (LSTM) and multi-layer perceptron (MLP). The task-specific neural network model 708 is configured to predict the label for the downstream prediction task.
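The concatenation of the contextualized embeddings with the static transaction features can be sketched in NumPy as follows (random arrays stand in for the model outputs and the feature vectors; the dimensions shown are illustrative assumptions):

```python
import numpy as np

T, E, S = 30, 16, 12        # transactions, contextual embedding size, static features
rng = np.random.default_rng(0)
contextual = rng.normal(size=(T, E))   # t1', ..., tT' from the hierarchical BERT model
static = rng.normal(size=(T, S))       # static transaction features per transaction

# Concatenate per transaction to form the modified embeddings that are fed
# to the task-specific network (LSTM, MLP, etc.).
modified = np.concatenate([contextual, static], axis=1)
print(modified.shape)                  # (30, 28)
```

The resulting (E + S)-dimensional rows are the inputs on which the task-specific network predicts the downstream label.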
In an example, if the downstream prediction task is cross-border transaction prediction, the task-specific neural network model 708 may be configured to output a binary value 1 if a payment transaction is a cross-border transaction or binary value 0 if the payment transaction is not a cross-border transaction. In another example, if the downstream prediction task is fraudulent transaction prediction, the task-specific neural network model 708 may be configured to output a binary value 1 if a payment transaction is a fraudulent transaction or binary value 0 if the payment transaction is not a fraudulent transaction.
Furthermore, the server system 200 is configured to determine a loss value associated with the task-specific neural network 708 based, at least in part, on the input. The server system 200 is then configured to back-propagate the loss value to the first layer of the initial field embedding layers 704 of the global neural network model to update the neural network parameters (e.g., weights, biases, etc.) associated with the task-specific neural network 708. In an example, the loss value corresponds to a cross-entropy (CE) loss. In another example, the loss value corresponds to a mean squared error (MSE) loss. In yet another example, the loss value corresponds to any other similar loss. Generally, the “mean squared error loss” is a loss function used to optimize (or measure the performance of) regression models.
In an implementation, the server system 200 is configured to update the neural network parameters of the task-specific neural network 708 based, at least in part, on the loss value of the task-specific neural network 708. In another implementation, the server system 200 is configured to update neural network parameters and layer parameters of the task-specific neural network 708 and the global neural network model, respectively, based, at least in part, on the loss value of the task-specific neural network 708.
It is to be noted that the objective of the fine-tuning is to minimize the loss value as much as possible. For example, the loss value may be minimized until it is less than a threshold value. Once the hierarchical BERT model 228 is pre-trained using either the MLM method or the RTD method, the hierarchical BERT model 228 is fine-tuned based on the particular downstream prediction task.
Performance Metrics
In one implementation, an experiment is performed to evaluate performance metrics of the hierarchical BERT model 228. In an example, the downstream prediction task is a binary classification problem of fraud detection. In addition, a synthetic credit-card transactions dataset is used as input data. In one example, the dataset has 24 million transactions performed by approximately 20,000 users. Further, for each transaction, there are a total of 12 features (including both continuous and categorical features per transaction).
In one implementation, during fine-tuning, initial layers of the transformers (i.e., the hierarchical BERT model 228) are frozen and only the layers of the task-specific neural network (e.g., MLP) are updated. In another implementation, during fine-tuning, the initial layers of the transformers (i.e., the hierarchical BERT model 228) and the layers of the task-specific neural network (e.g., the task-specific neural network 708) are both updated. The second approach of fine-tuning the hierarchical BERT model 228 and the task-specific neural network together yields better training results (as illustrated below in Table 2). This is because, in the second approach, the initial layers of the transformers (i.e., the hierarchical BERT model 228) are also updated based on the downstream prediction task.
Each transaction in the dataset has 12 features, including an 'Isfraud' feature. For data preparation, a sliding window of size 'n' with stride 's' is used over the payment transactions. Moreover, the label generated for each window is 1 if any of the payment transactions in the window has a value of 1 for the 'Isfraud' feature; otherwise, it is 0. In one example, both categorical and continuous features are encoded using a global vocabulary. The continuous features are initially quantized and then encoded.
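The windowing and labeling step described above can be sketched as follows. The function and its representation of a transaction as a dictionary are assumptions for illustration; only the window/stride/label logic is taken from the text.

```python
def make_windows(transactions, n, s):
    """Sliding windows of size n with stride s over a transaction list.

    Each transaction is a dict of features. The window label is 1 if any
    transaction inside the window has 'Isfraud' == 1, otherwise 0.
    """
    windows = []
    for start in range(0, len(transactions) - n + 1, s):
        window = transactions[start:start + n]
        label = int(any(t["Isfraud"] == 1 for t in window))
        windows.append((window, label))
    return windows
```

For example, with eight transactions where only the fifth is fraudulent, windows of size 4 with stride 2 yield labels 0, 1, 1.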
Once data preparation is complete, the input is passed to the transaction field transformer layer and the transaction sequence encoding transformer layer in batches. If the batch size is denoted as b, the window size as n, and the feature size as f, then the input shape may be denoted as (b, n, f). Further, the two levels of transformer layers (i.e., the transaction field transformer layer and the transaction sequence encoding transformer layer) generate the contextualized embeddings. Furthermore, the contextualized embeddings are used to perform downstream tasks using two different approaches. In the first approach, the generated contextualized embeddings are directly passed through a multi-layer perceptron (MLP) layer. In the second approach, the generated contextualized embeddings are passed through a long short-term memory (LSTM) network to capture temporal dependencies in the contextualized embeddings.
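The shape flow through the two transformer levels can be sketched with NumPy as follows. This is a deliberately simplified stand-in: `self_attention` omits the learned query/key/value projections, multiple heads, positional encodings, and feed-forward sublayers of a real transformer, and the mean-pooling across fields is one plausible way to combine intra-transactional embeddings into a single per-transaction vector.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention (projections omitted)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

b, n, f, d = 2, 5, 12, 16              # batch, window, fields, embedding size
field_emb = np.random.rand(b, n, f, d)  # output of the initial field embedding layers

# Level 1: the field transformer attends across the f fields of each
# transaction; pooling yields one embedding per transaction, shape (b, n, d).
intra = self_attention(field_emb.reshape(b * n, f, d)).mean(axis=1).reshape(b, n, d)

# Level 2: the sequence encoding transformer attends across the n
# transactions of each window, giving contextualized embeddings (b, n, d).
contextualized = self_attention(intra)
```

The `contextualized` array is what would then be fed to the MLP (first approach) or the LSTM (second approach).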
As explained above, the hierarchical BERT model 228 is pre-trained to improve its performance. The hierarchical BERT model 228 may be pre-trained based on either the MLM method or the RTD method. After pre-training, the hierarchical BERT model 228 is fine-tuned with the MLP and LSTM networks end-to-end using two different methods. In the first method, the initial pre-trained transformer layers were frozen and only the end layers were trained. In the second method, the transformer layers were fine-tuned while training end-to-end. Therefore, 8 experiments were performed in total, and the results of these experiments are illustrated below in Table 2. The performance of these 8 experiments was evaluated based on F1-score. Generally, the F1-score is the harmonic mean of the precision and recall scores of a classifier.
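The F1-score used to compare the 8 experiments is defined as follows; this small sketch simply restates the harmonic-mean definition given above.

```python
def f1_score(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, a classifier with precision 0.5 and recall 1.0 has an F1-score of 2/3, lower than the arithmetic mean of 0.75, since the harmonic mean penalizes imbalance between the two scores.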
| Pre-training method | Fine-tuning approach | Task-specific model | F1-score |
|---|---|---|---|
| MLM | With fine-tuning | MLP | 0.89 |
| MLM | With fine-tuning | LSTM | 0.92 |
| MLM | Without fine-tuning | MLP | 0.78 |
| MLM | Without fine-tuning | LSTM | 0.89 |
| RTD | With fine-tuning | MLP | 0.90 |
| RTD | With fine-tuning | LSTM | 0.92 |
| RTD | Without fine-tuning | MLP | 0.74 |
| RTD | Without fine-tuning | LSTM | 0.87 |
Table 2: Performance of the hierarchical BERT model evaluated based on F1-score
FIG. 8 is a process flow chart of a method 800 for training a global neural network model, in accordance with an embodiment of the present disclosure. The sequence of operations of the method 800 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 800, and combinations of operations in the method, may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The process flow starts at operation 802.
At 802, the server system 200 accesses a series of payment transactions associated with the cardholder 124a from the transaction database 138. The series of payment transactions may be arranged in the form of tabular data. Each payment transaction may include a plurality of transaction fields. In an example, the payment transactions may have been performed in an offline manner (e.g., at a merchant terminal (for example, a point-of-sale (POS) terminal, etc.)). In another example, the payment transactions may have been performed in an online manner (e.g., at a merchant website accessed on a web browser installed in a user device (for example, a laptop, mobile device, etc.)).
At 804, the server system 200 pre-trains a global neural network model by performing a plurality of operations in an iterative manner. The plurality of operations includes steps 804a-804e. The global neural network model includes the initial field embedding layers, the transaction field transformer layer, and the transaction sequence encoding transformer layer. In one embodiment, the pre-training of the global neural network model is performed based on one of a plurality of training methods, such as the MLM method and the RTD method.
At 804a, the server system 200 generates the set of embeddings corresponding to transaction fields of each payment transaction via the initial field embedding layers of the global neural network model. It is to be noted that the plurality of transaction fields may differ for different downstream prediction tasks. For example, the plurality of transaction fields for one downstream prediction task (e.g., prediction of cross-border transactions) may differ from the plurality of transaction fields for another downstream prediction task (e.g., prediction of fraudulent transactions).
At 804b, the server system 200 determines the plurality of intra-transactional embeddings corresponding to the transaction fields of each payment transaction based on the transaction field transformer layer. The plurality of intra-transactional embeddings represents relationships among the plurality of transaction fields of the payment transaction. The plurality of intra-transactional embeddings is combined to generate a single transactional embedding corresponding to the payment transaction.
At 804c, the server system 200 determines the plurality of inter-transactional embeddings corresponding to the payment transactions based on the transaction sequence encoding transformer layer. The plurality of inter-transactional embeddings captures sequential dependencies among the plurality of payment transactions.
At 804d, the server system 200 provides the plurality of inter-transactional embeddings corresponding to each payment transaction to the MLP layer. The MLP layer compares its predicted outputs with the original transaction fields or actual values, and a loss value is calculated based upon the comparison.
At 804e, the server system 200 updates layer parameters of the global neural network model (up to the initial field embedding layers) based on the calculated loss value of the MLP layer.
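For the MLM-style variant of this pre-training, the corruption step can be sketched as follows. The 15% mask probability, the `mask_id` sentinel, and the function name are assumptions for illustration; the sketch only shows which positions the MLP head at 804d would be asked to reconstruct.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility of the sketch

def mlm_masking(token_ids, mask_id, mask_prob=0.15):
    """Randomly mask encoded field tokens for MLM-style pre-training.

    Returns the corrupted token array and a boolean map of masked positions.
    The pre-training loss compares the model's predictions against the
    original values at exactly these masked positions.
    """
    tokens = np.asarray(token_ids).copy()
    mask = rng.random(tokens.shape) < mask_prob
    tokens[mask] = mask_id
    return tokens, mask
```

In the RTD variant, masked positions would instead be replaced with plausible substitute tokens, and the model would be trained to detect which tokens were replaced.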
Thereafter, at 806, the server system 200 fine-tunes the pre-trained global neural network model in a supervised manner based on task-specific training data of a particular downstream prediction task.
The sequence of steps of the method 800 need not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped together and performed in form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner.
FIG. 9 is a process flow of a method 900 for predicting fraudulent transactions of a cardholder, in accordance with an embodiment of the present disclosure. The sequence of operations of the method 900 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner. Operations of the method 900, and combinations of operations in the method, may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. The process flow starts at operation 902.
It is to be noted that the global neural network model is fine-tuned according to the fraudulent payment detection task during fine-tuning.
At operation 902, the server system 200 accesses previous payment transactions performed by a cardholder within a particular time interval (e.g., the last 3 months, 6 months, 9 months, etc.). The previous payment transactions may have been performed in an online or offline manner. In addition, each payment transaction includes transaction fields. The transaction fields may be determined based on the downstream prediction task (i.e., fraudulent payment detection).
At operation 904, the server system 200 converts transaction fields of each payment transaction into an embedding space. In particular, the transaction fields of each payment transaction are converted into the set of embeddings based, at least in part, on the initial field embedding layers of the global neural network model.
At operation 906, the server system 200 provides the converted transaction fields (i.e., the set of embeddings) of each payment transaction to the transformer layers (i.e., the transaction field transformer layer and the transaction sequence encoding transformer layer of the hierarchical BERT model 228).
At operation 908, the server system 200 determines sequential transaction features associated with the cardholder based, at least in part, on the trained global neural network model.
At operation 910, the server system 200 provides a combination of the sequential transaction features and static transaction features of the cardholder to the task-specific neural network model (e.g., fraud detection model).
At operation 912, the server system 200 predicts whether the cardholder will perform fraudulent payment transactions or not based on an output of the task-specific neural network model. In particular, for each payment transaction, the server system 200 may output 1 (for a fraudulent payment transaction) or 0 (for a non-fraudulent payment transaction).
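Operations 910 and 912 can be sketched as follows. A single logistic layer stands in for the task-specific fraud detection model here purely for illustration; the feature sizes, weights, and threshold are assumptions.

```python
import numpy as np

def predict_fraud(sequential_feats, static_feats, weights, bias, threshold=0.5):
    """Concatenate sequential and static features, score them with a
    task-specific head (a single logistic layer in this sketch), and emit
    1 (fraudulent) or 0 (non-fraudulent)."""
    x = np.concatenate([sequential_feats, static_feats])  # operation 910
    logit = float(x @ weights + bias)
    prob = 1.0 / (1.0 + np.exp(-logit))                   # fraud probability
    return 1 if prob >= threshold else 0                  # operation 912
```

In practice, the head would be the fine-tuned task-specific neural network (e.g., an MLP or LSTM) rather than a single linear layer, but the combine-score-threshold structure is the same.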
The sequence of operations of the method 900 need not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped together and performed in form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner.
FIG. 10 is a process flow chart of a computer-implemented method 1000 for training a global neural network model to perform various downstream task predictions, in accordance with an embodiment of the present disclosure. The method 1000 depicted in the flow chart may be executed by, for example, a computer system. The computer system is identical to the server system 200. Operations of the flow chart of the method 1000, and combinations of operations in the flow chart of the method 1000, may be implemented by, for example, hardware, firmware, a processor, circuitry, and/or a different device associated with the execution of software that includes one or more computer program instructions. It is noted that the operations of the method 1000 can be described and/or practiced by using a system other than these computer systems. The method 1000 starts at operation 1002.
At operation 1002, the method 1000 includes accessing, by the server system 200, historical tabular data from the entity database 106 as the input training data. The historical tabular data may include a series of entries associated with the entity 104a represented in form of tabular data. Each entry includes the set of data fields.
At operation 1004, the method 1000 includes pre-training, by the server system 200, the global neural network model by performing a plurality of operations 1004a and 1004b. The global neural network model is implemented based, at least in part, on the hierarchical BERT neural network architecture, and the global neural network model includes the initial field embedding layers, the field transformer layer, and the sequence encoding transformer layer.
At operation 1004a, the method 1000 includes determining, by the server system 200, the plurality of intra-relational embeddings corresponding to the set of data fields of each entry based, at least in part, on the field transformer layer of the global neural network model. The plurality of intra-relational embeddings of each entry represents relationships among the set of data fields of each entry.
At operation 1004b, the method 1000 includes determining, by the server system 200, the plurality of inter-relational embeddings corresponding to the series of entries based, at least in part, on the sequence encoding transformer layer of the global neural network model and the plurality of intra-relational embeddings. The plurality of inter-relational embeddings represents temporal relationships among the plurality of entries.
At operation 1006, the method 1000 includes fine-tuning, by the server system 200, the pre-trained global neural network model based, at least in part, on task-specific training data of a particular downstream prediction task.
The sequence of operations of the method 1000 need not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped together and performed in form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner.
FIG. 11 is a simplified block diagram of a payment server 1100, in accordance with an embodiment of the present disclosure. The payment server 1100 is an example of the payment server 136 of FIG. 1. The payment server 1100 and the server system 200 may use the payment network 134 as a payment interchange network. Examples of payment interchange networks include, but are not limited to, the Mastercard® payment system interchange network. In one example, the server system 200 is an example of the payment server 1100.
The payment server 1100 includes a processing system 1105 configured to extract programming instructions from a memory 1110 to provide various features of the present disclosure. The components of the payment server 1100 provided herein may not be exhaustive and that the payment server 1100 may include more or fewer components than those depicted in FIG. 11. Further, two or more components may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities. Some components of the payment server 1100 may be configured using hardware elements, software elements, firmware elements, and/or a combination thereof.
Via a communication interface 1115, the processing system 1105 receives a request from a remote device 1120, such as the issuer server 128 or the acquirer server 130. The request may be a request for conducting the payment transaction. The communication may be achieved through API calls, without loss of generality. The payment server 1100 includes a database 1125. The database 1125 also includes transaction processing data such as issuer ID, country code, acquirer ID, merchant identifier (MID), among others.
When the payment server 1100 receives a payment transaction request from the acquirer server 130 or a payment terminal (e.g., point of sale (POS) device, etc.), the payment server 1100 may route the payment transaction request to an issuer server (e.g., the issuer server 128). The database 1125 is configured to store transaction identifiers for identifying transaction details such as, transaction amount, payment card details, acquirer account information, transaction records, merchant account information, and the like.
In one example embodiment, the acquirer server 130 is configured to send an authorization request message to the payment server 1100. The authorization request message includes, but is not limited to, the payment transaction request.
The processing system 1105 further sends the payment transaction request to the issuer server 128 for facilitating the payment transactions from the remote device 1120. The processing system 1105 is further configured to notify the remote device 1120 of the transaction status in form of an authorization response message via the communication interface 1115. The authorization response message includes, but is not limited to, a payment transaction response received from the issuer server 128. Alternatively, in one embodiment, the processing system 1105 is configured to send an authorization response message for declining the payment transaction request, via the communication interface 1115, to the acquirer server 130.
The disclosed methods with reference to FIGS. 1 to 11, or one or more operations of the methods 800, 900 and 1000 may be implemented using software including computer-executable instructions stored on one or more computer-readable media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (e.g., DRAM or SRAM), or nonvolatile memory or storage components (e.g., hard drives or solid-state nonvolatile memory components, such as Flash memory components) and executed on a computer (e.g., any suitable computer, such as a laptop computer, net book, Web book, tablet computing device, smart phone, or other mobile computing devices). Such software may be executed, for example, on a single local computer or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a remote web-based server, a client-server network (such as a cloud computing network), or other such networks) using one or more network computers. Additionally, any of the intermediate or final data created and used during implementation of the disclosed methods or systems may also be stored on one or more computer-readable media (e.g., non-transitory computer-readable media) and are considered to be within the scope of the disclosed technology. Furthermore, any of the software-based embodiments may be uploaded, downloaded, or remotely accessed through a suitable communication means. Such a suitable communication means includes, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
Although the disclosure has been described with reference to specific exemplary embodiments, it is noted that various modifications and changes may be made to these embodiments without departing from the broad spirit and scope of the disclosure. For example, the various operations, blocks, etc. described herein may be enabled and operated using hardware circuitry (for example, complementary metal oxide semiconductor (CMOS) based logic circuitry), firmware, software and/or any combination of hardware, firmware, and/or software (for example, embodied in a machine-readable medium). For example, the apparatuses and methods may be embodied using transistors, logic gates, and electrical circuits (for example, application-specific integrated circuit (ASIC) circuitry and/or in Digital Signal Processor (DSP) circuitry).
Particularly, the server system 200 (e.g., the server system 102 or the server system 122) and its various components such as the computer system 202 and the database 204 may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry). Various embodiments of the disclosure may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause a processor or computer to perform one or more operations. A computer-readable medium storing, embodying, or encoded with a computer program, or similar language, may be embodied as a tangible data storage device storing one or more software programs that are configured to cause a processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein. In some embodiments, the computer programs may be stored and provided to a computer using any type of non-transitory computer readable media. Non-transitory computer readable media include any type of tangible storage media. Examples of non-transitory computer readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g., magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), DVD (Digital Versatile Disc), BD (BLU-RAY® Disc), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash memory, RAM (random access memory), etc.). Additionally, a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. 
In some embodiments, the computer programs may be provided to a computer using any type of transitory computer readable media. Examples of transitory computer readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Various embodiments of the invention, as discussed above, may be practiced with steps and/or operations in a different order, and/or with hardware elements in configurations, which are different than those which are disclosed. Therefore, although the invention has been described based upon these exemplary embodiments, it is noted that certain modifications, variations, and alternative constructions may be apparent and well within the spirit and scope of the invention.
Although various exemplary embodiments of the invention are described herein in a language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
CLAIMS
We claim:
1. A computer-implemented method, comprising:
accessing, by a server system, historical tabular data from a database as input training data, the historical tabular data comprising a series of entries associated with an entity represented in form of tabular data, each entry comprising a set of data fields;
pre-training, by the server system, a global neural network model by performing a plurality of operations, the plurality of operations comprising:
determining, by the server system, a plurality of intra-relational embeddings corresponding to the set of data fields of each entry based, at least in part, on a field transformer layer of the global neural network model, the plurality of intra-relational embeddings of each entry representing relationships among the set of data fields of each entry; and
determining, by the server system, a plurality of inter-relational embeddings corresponding to the series of entries based, at least in part, on a sequence encoding transformer layer of the global neural network model and the plurality of intra-relational embeddings, the plurality of inter-relational embeddings representing temporal relationships among the series of entries; and
fine-tuning, by the server system, the pre-trained global neural network model based, at least in part, on task-specific training data of a particular downstream prediction task.
2. The computer-implemented method as claimed in claim 1, wherein the global neural network model is implemented based, at least in part, on a hierarchical Bidirectional Encoder Representations from Transformers (BERT) neural network architecture, and wherein the global neural network model comprises initial field embedding layers, the field transformer layer, and the sequence encoding transformer layer.
3. The computer-implemented method as claimed in claim 2, wherein the plurality of operations further comprises:
converting, by the server system, the set of data fields of each entry into a set of embeddings based, at least in part, on the initial field embedding layers.
4. The computer-implemented method as claimed in claim 2, wherein the plurality of operations further comprises:
providing, by the server system, the plurality of inter-relational embeddings corresponding to the set of data fields of each entry to a multi-layer perceptron (MLP) layer; and
updating, by the server system, layer parameters of the global neural network model based, at least in part, on a loss value of the MLP layer.
5. The computer-implemented method as claimed in claim 1, wherein the global neural network model is pre-trained based at least on one of training methods, the training methods comprising a masked language modeling (MLM) method and a replaced token detection (RTD) method.
6. The computer-implemented method as claimed in claim 1, wherein fine-tuning the pre-trained global neural network model comprises:
providing, by the server system, the task-specific training data to the global neural network model, the task-specific training data comprising past entries associated with a plurality of entities with labeled task-specific indicator;
determining, by the server system, a contextualized embedding corresponding to each of the past entries based, at least in part, on an output of the pre-trained global neural network model;
providing, by the server system, a combination of the contextualized embedding and static entry features corresponding to each of the past entries as an input to a task-specific neural network; and
determining, by the server system, a loss value associated with the task-specific neural network based, at least in part, on the input.
7. The computer-implemented method as claimed in claim 6, further comprising:
updating, by the server system, neural network parameters of the task-specific neural network based, at least in part, on the loss value of the task-specific neural network.
8. The computer-implemented method as claimed in claim 6, further comprising:
updating, by the server system, neural network parameters and layer parameters of the task-specific neural network and the global neural network model, respectively, based, at least in part, on the loss value of the task-specific neural network.
9. The computer-implemented method as claimed in claim 1, further comprising:
determining, by the server system, whether a number of the series of entries is less than a predetermined sequence length of the global neural network model; and
in response to determining that the number of the series of entries is less than the predetermined sequence length of the global neural network model, inserting, by the server system, dummy entries in the input training data.
10. The computer-implemented method as claimed in claim 1, wherein the particular downstream prediction task is at least one of: (a) prediction of fraudulent transactions, and (b) prediction of future cross-border payment transaction for a cardholder.
11. The computer-implemented method as claimed in any of claims 1 to 10, wherein the historical tabular data corresponds to historical payment transaction data, wherein the entries correspond to payment transactions, wherein the entity corresponds to a cardholder, wherein the set of data fields corresponds to a set of transaction fields, wherein the plurality of intra-relational embeddings corresponds to a plurality of intra-transactional embeddings, and wherein the plurality of inter-relational embeddings corresponds to a plurality of inter-transactional embeddings.
12. A server system configured to perform the computer-implemented method as claimed in any of the claims 1-11.
| # | Name | Date |
|---|---|---|
| 1 | 202241058414-STATEMENT OF UNDERTAKING (FORM 3) [12-10-2022(online)].pdf | 2022-10-12 |
| 2 | 202241058414-POWER OF AUTHORITY [12-10-2022(online)].pdf | 2022-10-12 |
| 3 | 202241058414-FORM 1 [12-10-2022(online)].pdf | 2022-10-12 |
| 4 | 202241058414-FIGURE OF ABSTRACT [12-10-2022(online)].pdf | 2022-10-12 |
| 5 | 202241058414-DRAWINGS [12-10-2022(online)].pdf | 2022-10-12 |
| 6 | 202241058414-DECLARATION OF INVENTORSHIP (FORM 5) [12-10-2022(online)].pdf | 2022-10-12 |
| 7 | 202241058414-COMPLETE SPECIFICATION [12-10-2022(online)].pdf | 2022-10-12 |
| 8 | 202241058414-Correspondence_Form-26_28-10-2022.pdf | 2022-10-28 |
| 9 | 202241058414-Proof of Right [22-11-2022(online)].pdf | 2022-11-22 |