Abstract: SERVER SYSTEMS AND METHODS FOR IDENTIFYING AMBIGUOUS MERCHANT DATA BASED ON ARTIFICIAL INTELLIGENCE MODELS. Embodiments provide methods and systems for identifying ambiguous merchant data in electronic payment transactions using artificial intelligence. A method performed by a server system includes accessing electronic payment transaction records associated with merchants. Each of the electronic payment transaction records includes merchant data fields associated with a merchant. The method includes identifying a set of electronic payment transaction records having a matching probability score less than a predefined threshold score and applying a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters based on matching features associated with each of the set of electronic payment transaction records. The method includes determining ambiguous instances from the set of clusters by performing steps: selecting a cluster, determining labeled data points associated with instances of the cluster, providing the labeled data points for training a supervised machine learning model, evaluating accuracy metrics of the supervised machine learning model, and determining at least one ambiguous instance from the cluster based on the evaluating step. FIG. 2
Claims:
We claim:
1. A computer-implemented method, performed by a server system, comprising:
accessing a plurality of electronic payment transaction records associated with a plurality of merchants from a database, each of the plurality of electronic payment transaction records comprising merchant data fields associated with a merchant of the plurality of merchants;
identifying a set of electronic payment transaction records from the plurality of electronic payment transaction records, each of the set of electronic payment transaction records having a matching probability score less than a predefined threshold score, wherein the matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database;
applying a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters, the set of clusters constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records; and
determining ambiguous instances from the set of clusters by performing steps:
selecting a cluster,
determining labeled data points associated with one or more instances of the cluster,
in response to determining the labeled data points for the one or more instances, providing the labeled data points to a supervised machine learning model for training the supervised machine learning model,
evaluating an accuracy metrics of the supervised machine learning model, and
determining at least one ambiguous instance from the cluster based, at least in part, on the evaluating step.
2. The computer-implemented method as claimed in claim 1, wherein the steps for determining the ambiguous instances are performed iteratively.
3. The computer-implemented method as claimed in claim 1, further comprising:
sampling the cluster from the set of clusters to identify the one or more instances using pool based sampling methods; and
identifying the one or more instances by querying the sampled cluster using marginal sampling methods.
4. The computer-implemented method as claimed in claim 2, wherein evaluating the accuracy metrics of the supervised machine learning model comprises:
determining whether the accuracy metrics of the supervised machine learning model at a current iteration does have an incremental value compared to an accuracy metrics of the supervised machine learning model at a previous iteration or not, the current iteration associated with the selected cluster;
upon determining that the accuracy metrics at the current iteration does not have the incremental value compared to the accuracy metrics at the previous iteration, the computer-implemented method further comprising:
classifying a plurality of instances of the cluster as ambiguous instances;
discarding the supervised machine learning model generated at the current iteration; and
accessing the supervised machine learning model generated at the previous iteration for next iterations.
5. The computer-implemented method as claimed in claim 4, further comprising:
upon determining the accuracy metrics at the current iteration does have the incremental value compared to the accuracy metrics at the previous iteration, retaining the supervised machine learning model generated at the current iteration for the next iterations.
6. The computer-implemented method as claimed in claim 1, further comprising:
in response to not determining a labeled data point for an instance of the one or more instances, classifying the instance as an ambiguous instance.
7. The computer-implemented method as claimed in claim 1, further comprising:
determining the set of matching features associated with each of the set of electronic payment transaction records based, at least in part, on a matching of merchant data fields of each of the set of electronic payment transaction records with corresponding data stored in the merchant database.
8. The computer-implemented method as claimed in claim 7, wherein the clustering algorithm is K-prototype clustering algorithm.
9. The computer-implemented method as claimed in claim 1, wherein the merchant data fields are at least one or more of:
merchant name,
acquirer merchant identifier,
merchant address,
merchant city,
merchant zip code,
merchant state code, and
merchant country.
10. A server system, comprising:
a communication interface;
a memory comprising executable instructions; and
a processor communicably coupled to the communication interface, the processor configured to execute the executable instructions to cause the server system to at least:
access a plurality of electronic payment transaction records associated with a plurality of merchants from a database, each of the plurality of electronic payment transaction records comprising merchant data fields associated with a merchant of the plurality of merchants,
identify a set of electronic payment transaction records from the plurality of electronic payment transaction records, each of the set of electronic payment transaction records having a matching probability score less than a predefined threshold score, wherein the matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database,
apply a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters, the set of clusters constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records, and
determine ambiguous instances from the set of clusters by performing:
selecting a cluster,
determining labeled data points associated with one or more instances of the cluster,
in response to determining the labeled data points for the one or more instances, providing the labeled data points to a supervised machine learning model for training the supervised machine learning model,
evaluating an accuracy metrics of the supervised machine learning model, and
determining at least one ambiguous instance from the cluster based, at least in part, on the evaluation.
11. The server system as claimed in claim 10, wherein operations for determining the ambiguous instances from the set of clusters are performed iteratively.
12. The server system as claimed in claim 10, wherein the server system is further caused at least in part to:
sample the cluster from the set of clusters to identify the one or more instances using pool based sampling methods, and
identify the one or more instances by querying the sampled cluster using marginal sampling methods.
13. The server system as claimed in claim 10, wherein, to evaluate the accuracy metrics of the supervised machine learning model, the server system is further caused at least in part to:
determine whether the accuracy metrics of the supervised machine learning model at a current iteration does have an incremental value compared to an accuracy metrics of the supervised machine learning model at a previous iteration or not, the current iteration associated with the selected cluster;
in response to a determination that the accuracy metrics at the current iteration does not have the incremental value compared to the accuracy metrics at the previous iteration, the server system is further caused at least in part to:
classify a plurality of instances of the cluster as ambiguous instances;
discard the supervised machine learning model generated at the current iteration; and
access the supervised machine learning model generated at the previous iteration for next iterations.
14. The server system as claimed in claim 13, wherein the server system is further caused at least in part to:
in response to a determination that the accuracy metrics at the current iteration does have the incremental value compared to the accuracy metrics of the previous iteration, retain the supervised machine learning model generated at the current iteration for the next iterations.
15. The server system as claimed in claim 10, wherein the server system is further caused at least in part to:
in response to not determining a labeled data point for an instance of the one or more instances, classify the instance as an ambiguous instance.
16. The server system as claimed in claim 10, wherein the server system is further caused at least in part to:
determine the set of matching features associated with each of the set of electronic payment transaction records based on a matching of each of the set of electronic payment transaction records with corresponding data stored in the merchant database.
17. A computer-implemented method, performed by a server system, comprising:
accessing a plurality of electronic payment transaction records associated with a plurality of merchants from a database, each of the plurality of electronic payment transaction records comprising merchant data fields associated with a merchant of the plurality of merchants;
identifying a set of electronic payment transaction records from the plurality of electronic payment transaction records, each of the set of electronic payment transaction records having a matching probability score less than a predefined threshold score, wherein the matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database;
applying K-prototype clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters, the set of clusters constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records; and
determining ambiguous instances from the set of clusters by performing steps:
selecting a cluster,
sampling the cluster from the set of clusters to identify the one or more instances using pool based sampling methods,
determining labeled data points associated with one or more instances of the cluster,
in response to determining the labeled data points for the one or more instances, providing the labeled data points to a supervised machine learning model for training the supervised machine learning model,
evaluating an accuracy metrics of the supervised machine learning model, and
determining at least one ambiguous instance from the cluster based, at least in part, on the evaluating step.
18. The computer-implemented method as claimed in claim 17, wherein the steps for determining the ambiguous instances from the set of clusters are performed iteratively.
19. The computer-implemented method as claimed in claim 17, wherein evaluating the accuracy metrics of the supervised machine learning model comprises:
determining whether the accuracy metrics of the supervised machine learning model at a current iteration does have an incremental value compared to an accuracy metrics of the supervised machine learning model at a previous iteration or not, the current iteration associated with the selected cluster;
upon determining that the accuracy metrics at the current iteration does not have the incremental value compared to the accuracy metrics at the previous iteration, the computer-implemented method further comprising:
classifying a plurality of instances of the cluster as ambiguous instances;
discarding the supervised machine learning model generated at the current iteration; and
accessing the supervised machine learning model generated at the previous iteration for next iterations.
20. The computer-implemented method as claimed in claim 17, further comprising:
determining the set of matching features associated with each of the set of electronic payment transaction records based on a matching of each of the set of electronic payment transaction records with corresponding data stored in the merchant database.
Description:
FORM 2
THE PATENTS ACT 1970
(39 of 1970)
&
The Patent Rules 2003
COMPLETE SPECIFICATION
(refer section 10 & rule 13)
TITLE OF THE INVENTION:
SERVER SYSTEMS AND METHODS FOR IDENTIFYING AMBIGUOUS MERCHANT DATA BASED ON ARTIFICIAL INTELLIGENCE MODELS
APPLICANT(S):
Name:
Nationality:
Address:
MASTERCARD INTERNATIONAL INCORPORATED
United States of America
2000 Purchase Street, Purchase, NY 10577, United States of America
PREAMBLE TO THE DESCRIPTION
The following specification particularly describes the invention and the manner in which it is to be performed.
DESCRIPTION
SERVER SYSTEMS AND METHODS FOR IDENTIFYING AMBIGUOUS MERCHANT DATA BASED ON ARTIFICIAL INTELLIGENCE MODELS
TECHNICAL FIELD
The present disclosure relates to artificial intelligence processing systems and, more particularly to, electronic methods and complex processing systems for identifying ambiguous merchant data present in payment transactions by utilizing supervised and unsupervised learning techniques.
BACKGROUND
Data quality is an important component of any successful machine learning model. Traditionally, business entities employ many sophisticated machine learning models as well as large rule engines to maintain the quality of payment transaction data. However, these rule engines rely upon the syntactic properties of the merchant information embedded in the payment transaction data. Further, the merchant information is matched in a probabilistic manner to a clean merchant database such as Pitney Bowes or Factual to augment the merchant information. Moreover, transaction data from multiple payment transactions are aggregated for marketing and macroeconomic reporting to merchants.
However, the merchant information embedded in the payment transaction data is not always clean enough to provide a confident match in the clean merchant database. The payment transaction data may include noisy data that distorts the merchant information. This noisy data primarily arises during data collection, storage, and processing at the end of the acquirer or the payment aggregator that is responsible for sending the payment transaction information. Some payment transaction data include merchant information that is so ambiguous that even a human annotator may find it difficult to match it correctly with a set of candidate merchant records.
Moreover, the presence of noisy data not only reduces the predictive power of merchant entity linking models, leading to spurious matches, but also affects downstream products that depend on the payment transaction data. The machine learning model also becomes increasingly complex as it tries to learn these ambiguous cases from noisy data.
Conventionally, the problem of identifying noisy data has been tackled by unsupervised learning techniques like standard k-means, k-means noise clustering, and density-based spatial clustering of applications with noise (DBSCAN). The standard k-means algorithm fails both in cases where there is a limited number of outliers, such as in merchant location data where less than 1% of the locations might be spurious, and in cases where a comparable amount of noisy data is injected into good clusters. In both cases, since each point must be assigned to some cluster, these outliers get assigned to good clusters, corrupting them. Thus, the clusters are not pure, and distinguishing between a good cluster and a corrupted cluster becomes difficult, especially when the number of clusters and the number of features are large.
Although k-means noise clustering tries to tackle the above issues, it has a serious disadvantage: the distance parameter must be specified, which makes the technique extremely difficult to use in practice. The DBSCAN technique is reasonably robust to outliers because it does not assign them to any cluster. However, the DBSCAN technique does not work very well with clusters of similar densities (i.e., it will not be able to distinguish two corrupted clusters with similar densities). The performance of the machine learning model is also severely affected by the dimensionality of the feature vectors.
Thus, there exists a need for a technical solution that identifies and removes ambiguous/noisy merchant data from electronic payment transactions through automated means, in an unprecedented manner or to an unprecedented degree.
SUMMARY
Various embodiments of the present disclosure provide methods and systems for identifying ambiguous merchant data in electronic payment transactions using unsupervised and semi-supervised machine learning techniques.
In an embodiment, a computer-implemented method is disclosed. The computer-implemented method performed by a server system includes accessing a plurality of electronic payment transaction records associated with a plurality of merchants from a database. Each of the plurality of electronic payment transaction records includes merchant data fields associated with a merchant of the plurality of merchants. The computer-implemented method includes identifying a set of electronic payment transaction records from the plurality of electronic payment transaction records. Each of the set of electronic payment transaction records has a matching probability score less than a predefined threshold score. The matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database. The computer-implemented method includes applying a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters. The set of clusters are constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records. The computer-implemented method includes determining ambiguous instances from the set of clusters by performing steps: selecting a cluster, determining labeled data points associated with one or more instances of the cluster, in response to determining the labeled data points for the one or more instances, providing the labeled data points to a supervised machine learning model for training the supervised machine learning model, evaluating an accuracy metrics of the supervised machine learning model, and determining at least one ambiguous instance from the cluster based, at least in part, on the evaluating step.
In another embodiment, a server system is disclosed. The server system includes a communication interface, a memory comprising executable instructions and a processor communicably coupled to the communication interface. The processor is configured to execute the executable instructions to cause the server system to at least access a plurality of electronic payment transaction records associated with a plurality of merchants from a database. Each of the plurality of electronic payment transaction records includes merchant data fields associated with a merchant of the plurality of merchants. The server system is further caused to identify a set of electronic payment transaction records from the plurality of electronic payment transaction records. Each of the set of electronic payment transaction records has a matching probability score less than a predefined threshold score. The matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database. The server system is further caused to apply a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters. The set of clusters are constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records. 
The server system is further caused to determine ambiguous instances from the set of clusters by performing: selecting a cluster, determining labeled data points associated with one or more instances of the cluster, in response to determining the labeled data points for the one or more instances, providing the labeled data points to a supervised machine learning model for training the supervised machine learning model, evaluating an accuracy metrics of the supervised machine learning model, and determining at least one ambiguous instance from the cluster based, at least in part, on the evaluation.
In yet another embodiment, a computer-implemented method is disclosed. The computer-implemented method performed by a server system includes accessing a plurality of electronic payment transaction records associated with a plurality of merchants from a database. Each of the plurality of electronic payment transaction records includes merchant data fields associated with a merchant of the plurality of merchants. The computer-implemented method includes identifying a set of electronic payment transaction records from the plurality of electronic payment transaction records. Each of the set of electronic payment transaction records has a matching probability score less than a predefined threshold score. The matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database. The computer-implemented method includes applying K-prototype clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters. The set of clusters are constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records. 
The computer-implemented method includes determining ambiguous instances from the set of clusters by performing steps: selecting a cluster, sampling the cluster from the set of clusters to identify the one or more instances using pool based sampling methods, determining labeled data points associated with one or more instances of the cluster, in response to determining the labeled data points for the one or more instances, providing the labeled data points to a supervised machine learning model for training the supervised machine learning model, evaluating an accuracy metrics of the supervised machine learning model, and determining at least one ambiguous instance from the cluster based, at least in part, on the evaluating step.
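By way of a non-limiting illustration, the per-cluster steps recited above (selecting a cluster, obtaining labeled data points, training, evaluating, and determining ambiguous instances) may be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed implementation: the nearest-centroid classifier stands in for the unspecified supervised machine learning model, the held-out validation set (X_val, y_val) is assumed, "incremental value" is interpreted as a strict increase in accuracy, and the names fit, predict, accuracy, and find_ambiguous are hypothetical.

```python
def fit(X, y):
    """Train a nearest-centroid classifier: one mean vector per label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: tuple(sum(col) / len(col) for col in zip(*xs))
            for label, xs in groups.items()}


def predict(model, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(x, model[lab])))


def accuracy(model, X, y):
    """Fraction of validation instances that the model labels correctly."""
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


def find_ambiguous(clusters, oracle, X_val, y_val):
    """Iterate over clusters and return (ambiguous_instances, retained_model).

    oracle(x) returns a label, or None when no label can be determined.
    """
    train_X, train_y = [], []
    model, best_acc = None, 0.0
    ambiguous = []
    for cluster in clusters:
        labeled = [(x, oracle(x)) for x in cluster]
        # Instances for which no label is determined are treated as ambiguous.
        ambiguous.extend(x for x, lab in labeled if lab is None)
        known = [(x, lab) for x, lab in labeled if lab is not None]
        if not known:
            continue
        cand_X = train_X + [x for x, _ in known]
        cand_y = train_y + [lab for _, lab in known]
        candidate = fit(cand_X, cand_y)
        acc = accuracy(candidate, X_val, y_val)
        if acc > best_acc:
            # Accuracy improved: retain the model of the current iteration.
            model, best_acc = candidate, acc
            train_X, train_y = cand_X, cand_y
        else:
            # No improvement: discard the current model, keep the previous
            # one, and classify the instances of the cluster as ambiguous.
            ambiguous.extend(x for x, _ in known)
    return ambiguous, model
```

Under the strict-increase assumption, a cluster evaluated after the accuracy has already reached its maximum is also flagged, so a practical embodiment may use a tolerance or a different comparison.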
Other aspects and example embodiments are provided in the drawings and the detailed description that follows.
BRIEF DESCRIPTION OF THE FIGURES
For a more complete understanding of example embodiments of the present technology, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
FIG. 1 is an example representation of an environment, in which at least some example embodiments of the present disclosure can be implemented;
FIG. 2 is a simplified block diagram of a server system, in accordance with an embodiment of the present disclosure;
FIGS. 3A and 3B, collectively, represent a schematic block diagram representation of a process flow for clustering electronic payment transaction records having low matching probability score, in accordance with an embodiment of the present disclosure;
FIG. 4 is an example representation of matching merchant data fields of a payment transaction record with candidate merchant records stored at a merchant database, in accordance with an example embodiment of the present disclosure;
FIG. 5A is an example representation of constructing clusters based on a set of matching features associated with merchant data fields of a set of electronic payment transaction records, in accordance with an example embodiment of the present disclosure;
FIG. 5B is an example representation of a table depicting description of each cluster, in accordance with an example embodiment of the present disclosure;
FIGS. 6A and 6B, collectively, represent a flow chart of a process flow for identifying ambiguous electronic payment transaction records with noisy merchant data fields, in accordance with an embodiment of the present disclosure;
FIG. 7 represents a flow diagram of a method for identifying ambiguous instances in electronic payment transaction records, in accordance with another example embodiment of the present disclosure;
FIG. 8 is a simplified block diagram of a payment server, in accordance with an example embodiment of the present disclosure; and
FIG. 9 is a simplified block diagram of an acquirer server, in accordance with an example embodiment of the present disclosure.
The drawings referred to in this description are not to be understood as being drawn to scale except if specifically noted, and such drawings are only exemplary in nature.
DETAILED DESCRIPTION
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in an embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Moreover, although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to said details are within the scope of the present disclosure. Similarly, although many of the features of the present disclosure are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the present disclosure is set forth without any loss of generality to, and without imposing limitations upon, the present disclosure.
The term "acquirer" refers to an organization that transmits a purchase transaction to a payment card system for routing to the issuer of the payment card account in question. Typically, the acquirer has an agreement with merchants, wherein the acquirer receives authorization requests for purchase transactions from the merchants, and routes the authorization requests to the issuers of the payment cards being used for the purchase transactions. The terms “acquirer”, “acquiring bank” or “acquirer bank” will be used interchangeably herein. Further, one or more server systems associated with the acquirer and carrying out its functions are referred to as "acquirer server".
The term "payment network", used herein, refers to a network or collection of systems used for transfer of funds through use of cash-substitutes. Payment networks may use a variety of different protocols and procedures in order to process the transfer of money for various types of transactions. Transactions that may be performed via a payment network may include product or service purchases, credit purchases, debit transactions, fund transfers, account withdrawals, etc. Payment networks may be configured to perform transactions via cash-substitutes, which may include payment cards, letters of credit, checks, financial accounts, etc. Examples of networks or systems configured to perform as payment networks include those operated by entities such as Mastercard®.
The term "merchant", used throughout the description generally refers to a seller, a retailer, a purchase location, an organization, or any other entity that is in the business of selling goods or providing services, and it can refer to either a single business location, or a chain of business locations of the same entity. Further, the term "aggregated merchant name", used throughout the description, refers to a standard merchant name of a merchant despite variations shown by different franchisee outlets or different merchants (merchant at different geographical locations). The information associated with such aggregated merchant is ‘pre-defined’ and stored in a database available at a server system.
The term “oracle”, used throughout the description, refers to a human annotator who provides labels for merchant data fields (e.g., merchant location) in payment transaction data based on information gathered from one or more external sources. Further, the terms “oracle” and “human oracle” have been used interchangeably throughout the present description.
The term "instance", used throughout the description, generally relates to an electronic payment transaction record with merchant data fields. For example, the term "ambiguous instance" refers to ambiguous merchant data fields in the electronic payment transaction record that make the record difficult to map to a clean merchant database. Moreover, such ambiguous instances in the electronic payment transaction records are sometimes difficult to label even for the human oracle.
OVERVIEW
Various embodiments of the present disclosure provide methods, systems, electronic devices and computer program products for identifying electronic payment transaction records with ambiguous merchant data using unsupervised and semi-supervised machine learning techniques. More specifically, embodiments of the present disclosure provide a semi-supervised machine learning model for identifying merchant data fields in an electronic payment transaction record that are noisy. Such techniques for identifying ambiguous data in electronic payment transaction records help to eliminate incorrect entries in a database.
In an example, the present disclosure describes a server system that determines the ambiguous merchant data in a plurality of electronic payment transaction records. The server system includes at least a processor and a memory. In one non-limiting example, the server system is a payment server. The server system is configured to access the plurality of electronic payment transaction records associated with a plurality of merchants from a database. Each electronic payment transaction record includes at least merchant data fields associated with a merchant of the plurality of merchants. In an embodiment, the server system is configured to extract the merchant data fields from each of the plurality of electronic payment transaction records. The merchant data fields include, but are not limited to, merchant name, merchant contact number, merchant acquirer ID, merchant address (e.g., door number, street name and/or number), merchant city, merchant state code, merchant zip code, and merchant country code. The merchant data fields include categorical and numerical data. For example, merchant contact number is numerical data and merchant city is categorical data.
In one embodiment, the merchant data fields of each electronic payment transaction record of the plurality of electronic payment transaction records are matched with corresponding data in a merchant database to obtain a matching probability score. More specifically, merchant data fields of an electronic payment transaction record are mapped to a corresponding merchant attribute of a candidate merchant record to determine a similarity.
In an embodiment, the server system is configured to identify a set of electronic payment transaction records from the plurality of electronic payment transaction records with matching probability scores less than a predefined threshold score. In other words, the electronic payment transaction records whose merchant data fields are ambiguous and do not match with similar candidate merchant records are selected. More specifically, this set of electronic payment transaction records may include ambiguous merchant data fields and hence cannot be confidently matched with a high matching probability score.
The server system is configured to apply a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters. The set of clusters are constructed based, at least in part, on a set of matching features corresponding to the merchant data fields associated with each of the set of electronic payment transaction records. During matching, the server system is configured to determine a set of matching features associated with each of the set of electronic payment transaction records based on a matching of each of the set of electronic payment transaction records with corresponding data stored in the merchant database. More specifically, clustering is performed such that the electronic payment transaction records that have the same/similar set of matching features are grouped in a cluster. In other words, electronic payment transaction records that differ from corresponding candidate merchant records in the same merchant data fields are grouped. In an embodiment, the K-prototype clustering algorithm is utilized to construct the set of clusters.
In one embodiment, the server system is configured to determine ambiguous instances from the set of clusters by performing some operations iteratively. The server system is configured to select a cluster of the set of clusters. The cluster can be chosen randomly from the set of clusters. The server system is configured to identify one or more instances of the plurality of instances in the cluster. In an embodiment, a pool based sampling is used to identify the one or more instances. The server system uses active learning to identify informative instances in a cluster for labeling. More specifically, each instance of the plurality of instances is queried using marginal sampling to identify the one or more instances.
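The querying step described above can be illustrated with a minimal sketch. It assumes that the sampling referred to is margin-based uncertainty sampling, in which the instances whose two highest predicted class probabilities are closest together are treated as the most informative. The function names `margin` and `select_informative`, the pool of probabilities, and the value of `k` are hypothetical and do not appear in the disclosure.

```python
from typing import List, Sequence


def margin(probabilities: Sequence[float]) -> float:
    """Gap between the two highest class probabilities (needs >= 2 classes).

    A small gap means the classifier is undecided between two labels.
    """
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_informative(pool_probabilities: List[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool instances with the smallest margins."""
    ranked = sorted(
        range(len(pool_probabilities)),
        key=lambda i: margin(pool_probabilities[i]),
    )
    return ranked[:k]


# Hypothetical class probabilities for three instances of one cluster.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # smallest margin, most informative
    [0.55, 0.44, 0.01],  # second-smallest margin
]
queried = select_informative(pool, k=2)
print(queried)
```

Only the instances selected in this manner would be forwarded to the oracle for labeling, so the remainder of the cluster need not be sampled.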
The server system is configured to determine labeled data points associated with the one or more instances. A labeling request is sent to an oracle for determining a label for an instance. The oracle may access external databases to determine merchant related information associated with an instance to determine a label for the instance. If the oracle determines a label for the instance, the oracle sends the label to the server system. The server system provides the label to the instance to form a labeled data point. Alternatively, if the oracle is unable to determine a label for an instance, the server system classifies the instance as an ambiguous instance. The ambiguous instance implies that the merchant data field includes noisy data and/or errors that make it too ambiguous for the oracle to label.
In an embodiment, the labeled data points of one or more instances of the selected cluster are provided to a supervised machine learning model for training the supervised machine learning model. After training with the selected cluster, accuracy metrics of the supervised machine learning model generated at a current iteration are evaluated. The server system is configured to determine whether the accuracy metrics of the supervised machine learning model at the current iteration have an incremental value compared to the accuracy metrics of the supervised machine learning model at a previous iteration. More specifically, the accuracy metrics of the supervised machine learning model generated at the current iteration are compared with the accuracy metrics of the supervised machine learning model generated at the previous iteration.
In one embodiment, when the server system determines that the accuracy metrics at the current iteration do not have the incremental value compared to the accuracy metrics of the previous iteration, the server system is configured to classify a plurality of instances of the cluster as ambiguous instances and discard the supervised machine learning model generated at the current iteration. The absence of the incremental value indicates that the cluster includes ambiguous merchant data fields, which is why the accuracy metrics of the supervised machine learning model have declined. The server system is configured to access the supervised machine learning model generated at the previous iteration for the next iterations.
In another embodiment, when the server system determines that the accuracy metrics of the supervised machine learning model at the current iteration have the incremental value compared to the accuracy metrics at the previous iteration, the supervised machine learning model is retained. In other words, when the accuracy metrics of the supervised machine learning model generated at the current iteration are greater than the accuracy metrics of the supervised machine learning model generated at the previous iteration, the supervised machine learning model generated at the current iteration is retained and used during the next iterations.
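The retain-or-discard logic of the preceding paragraphs may be sketched as follows. This is an illustrative outline only: the callables `train` and `evaluate`, the dictionary layout of `clusters`, and the toy data are assumptions made for the example, and the first iteration is assumed to have no previous accuracy to compare against.

```python
from typing import Callable, Dict, List, Optional, Tuple


def process_clusters(
    clusters: Dict[str, List[dict]],
    train: Callable[[Optional[list], List[dict]], list],
    evaluate: Callable[[list], float],
) -> Tuple[Optional[list], List[str]]:
    """Train cluster by cluster, keeping a model only if accuracy improves."""
    model: Optional[list] = None
    previous_accuracy = float("-inf")  # assumption: no baseline before iteration 1
    ambiguous_clusters: List[str] = []
    for cluster_id, labeled_points in clusters.items():
        candidate = train(model, labeled_points)
        accuracy = evaluate(candidate)
        if accuracy > previous_accuracy:
            # Incremental value: retain the model from the current iteration.
            model, previous_accuracy = candidate, accuracy
        else:
            # No incremental value: discard the candidate, fall back to the
            # previous model, and flag the cluster's instances as ambiguous.
            ambiguous_clusters.append(cluster_id)
    return model, ambiguous_clusters


# Toy stand-ins: the "model" is the list of points seen so far, and the
# "accuracy" is the fraction of those points carrying clean data.
def toy_train(model: Optional[list], points: List[dict]) -> list:
    return (model or []) + points


def toy_evaluate(model: list) -> float:
    return sum(1 for p in model if p["clean"]) / len(model)


toy_clusters = {
    "c1": [{"clean": True}, {"clean": False}],
    "c2": [{"clean": True}, {"clean": True}],
    "c3": [{"clean": False}, {"clean": False}],
}
final_model, flagged = process_clusters(toy_clusters, toy_train, toy_evaluate)
print(flagged)
```

In this toy run, cluster c2 raises the accuracy from 0.5 to 0.75 and is retained, whereas cluster c3 would lower it to 0.5 and is therefore flagged.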
Without in any way limiting the scope, interpretation, or application of the claims appearing below, a technical effect of one or more of the example embodiments disclosed herein is to identify ambiguous data in merchant data fields of electronic payment transaction records automatically. Further, the present disclosure allows servers to automatically identify noisy instances in electronic payment transaction records, thereby eliminating inaccurate entries in memory and improving data accuracy and payment processing speed. Thus, the present disclosure is directed towards identifying and isolating noisy payment transaction records from payment transactions automatically, thereby reducing computational complexity in the aggregation process of the transaction records and minimizing improper computer processing.
Because the amount of payment transaction data is huge, iteratively sampling ambiguous instances is tedious and time-consuming. The present disclosure utilizes a combination of unsupervised and semi-supervised learning techniques that drastically reduces the number of instances being sampled by grouping the data into clusters using an unsupervised learning technique and then employing a semi-supervised learning technique for labeling only a subset of instances in each cluster. Additionally, clustering payment transaction records with a low matching rate aids in identifying the ambiguous data of specific merchant data fields due to which the supervised machine learning model (i.e., classifier) is not able to classify the payment transaction records correctly. Moreover, the active learning technique ensures that the representativeness of the evaluation set is good, as the server system only samples a subset of informative instances in each cluster, thereby enabling filtering-out of the most informative instances of each cluster. Since the sampling strategy in active learning aims to select the most informative instances of a cluster, the instance space is specifically sampled according to classifier characteristics and a performance function is optimized.
Various example embodiments of the present disclosure are described hereinafter with reference to FIGS. 1 to 9.
FIG. 1 illustrates an exemplary representation of an environment 100 related to at least some example embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, identifying ambiguous instances, etc. The environment 100 generally includes a plurality of entities, for example, an acquirer server 102, a payment network 104 including a payment server 106, each coupled to, and in communication with (and/or with access to) a network 110. The network 110 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber-optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among the entities illustrated in FIG. 1, or any combination thereof.
Various entities in the environment 100 may connect to the network 110 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, or any combination thereof. For example, the network 110 may include multiple different networks, such as a private network made accessible by the payment network 104 to the acquirer server 102 and the payment server 106, separately, and a public network (e.g., the Internet, etc.).
The environment 100 also includes a server system 108 configured to perform one or more of the operations described herein. In one example, the server system 108 is the payment server 106. In general, the server system 108 is configured to identify electronic payment transaction records that have ambiguous merchant data fields (e.g., merchant location). The server system 108 is a separate part of the environment 100, and may operate apart from (but still in communication with, for example, via the network 110) the acquirer server 102, the payment server 106, and any third party external servers (to access data to perform the various operations described herein). However, in other embodiments, the server system 108 may actually be incorporated, in whole or in part, into one or more parts of the environment 100, for example, the payment server 106. In addition, the server system 108 should be understood to be embodied in at least one computing device in communication with the network 110, which may be specifically configured, via executable instructions, to perform as described herein, and/or embodied in at least one non-transitory computer readable media.
In one embodiment, the acquirer server 102 is associated with a financial institution (e.g., a bank) that processes financial transactions. This can be an institution that facilitates the processing of payment transactions for physical stores, merchants, or an institution that owns platforms that make online purchases or purchases made via software applications possible (e.g., shopping cart platform providers and in-app payment processing providers).
In one embodiment, a plurality of merchants 112a, 112b, and 112c is associated with the acquirer server 102. The plurality of merchants 112a-112c may be physical stores such as retail establishments or a merchant facilitated e-commerce website interface (online store). The plurality of merchants 112a, 112b, and 112c is hereinafter collectively represented as "the merchant 112".
To accept payment transactions from customers, the merchant 112 normally establishes an account with a financial institution (i.e., “acquirer server 102”) that is part of the financial payment system. Account details of the merchant accounts established with the acquirer bank are stored in merchant profiles of the merchants in a memory of the acquirer server 102 or on a cloud server associated with the acquirer server 102. It shall be noted that the merchants 112a-112c need not all be associated with a single acquirer. The merchants may establish financial accounts with different acquirers, in which case payment transactions may be facilitated by more than one acquirer server; such arrangements are not explained herein for the sake of brevity.
In one embodiment, the merchant 112 has a payment transaction terminal (not shown in figures) that communicates directly or indirectly with the acquirer server 102. Examples of the payment transaction terminal may include, but are not limited to, a Point-of-Sale (POS) terminal and a customer device with a payment gateway application. The POS terminal is usually located at stores or facilities of the merchant 112. The merchant 112 can have more than one payment transaction terminal. In one embodiment, a customer may perform a payment transaction using the customer device (e.g., a mobile phone), which conforms to an e-commerce payment transaction.
In one example, a customer purchases goods or services from the merchant 112 using a payment card. The customer may utilize the payment card to effectuate payment by presenting/swiping the payment card at the POS terminal. Upon presentation of the physical or virtual payment card, account details (i.e., account number) are accessed by the POS terminal. The POS terminal sends payment transaction details to the acquirer server 102. The acquirer server 102 sends a payment transaction request to the server system 108 or the payment server 106 for routing the payment transaction to a card issuer associated with the customer. The payment transaction request includes a plurality of data elements. The plurality of data elements may include, but is not limited to, BIN of the card issuer of the payment card, a payment transaction identifier, a payment transaction amount, a payment transaction date/time, a payment transaction terminal identifier, a merchant name and location, an acquirer identifier, etc. In one embodiment, the payment transaction request may be an electronic message that is sent via the server system 108 or the payment server 106 to the card issuer of the payment card to request authorization for a payment transaction. The payment transaction request may comply with a message type defined by an International Organization for Standardization (ISO) 8583 standard, which is a standard for systems that exchange electronic transaction information associated with payments made by users using the payment card or the payment account.
In one example, an ISO 8583 transaction message may include one or more data elements that store data usable by the server system 108 to communicate information such as transaction requests, responses to transaction requests, inquiries, indications of fraud, security information, or the like. For example, the ISO 8583 message may include a PAN in the second data field (also known as DE2), an amount of a transaction in DE4, a date of settlement in DE15, a location of the merchant 112 in DE41, DE42, and/or DE43, or the like. In particular, the acquirer server 102 transmits merchant name, location, city, and country-code in the DE 43 data element.
The card issuer approves or denies an authorization request, and then routes, via the payment network 104, an authorization response back to the acquirer server 102. The acquirer server 102 sends the approval to the POS terminal of the merchant 112. Thereafter, seconds later, the customer completes the purchase and receives a receipt.
In one embodiment, the server system 108 accesses electronic payment transaction records stored in a transaction database 114 for reporting and data analysis. In one embodiment, the transaction database 114 is a central repository of data which is created by storing electronic payment transaction records from payment transaction requests occurring within acquirers and issuers associated with the payment network 104. The transaction database 114 stores real-time electronic payment transaction records of a plurality of merchants. The electronic payment transaction records may include, but are not limited to, payment transaction attributes, such as merchant data fields (e.g., merchant name, merchant identifier, merchant location, merchant category code (MCC)), transaction amount, source of funds such as bank or credit cards, transaction channel used for loading funds such as a POS terminal, payment transaction location information, external data sources, and other internal data to evaluate/analyze each payment transaction. In one embodiment, the server system 108 stores, reviews, and/or analyzes information used in merchant aggregation.
While storing the electronic payment transaction records, the server system 108 extracts merchant data fields (e.g., merchant location) from payment transaction data. Sometimes, the merchant data fields extracted from the electronic payment transaction records are ambiguous or noisy in nature and do not have a high probabilistic match with candidate merchant records stored in a merchant database 118 (i.e., clean merchant database). In one embodiment, the server system 108 is configured to identify payment transaction records with ambiguous merchant data fields in a fully automated manner using machine learning models and provide those payment transaction records to a human annotator for rectification.
The server system 108 is configured to perform one or more of the operations described herein. In particular, the server system 108 is configured to match merchant data fields of each of the electronic payment transaction records with candidate merchant records in the merchant database 118 (also referred to as ‘clean merchant database 118’) to obtain matching probability scores. The merchant database 118 stores merchant entries which unambiguously identify attributes of a merchant, such as merchant name and merchant location. In other words, the merchant database 118 houses candidate merchant records of the merchant 112 that include merchant data fields as registered with the acquirer server 102. More specifically, these unambiguous entries are referred to as ‘clean merchant data’.
In some example embodiments, the oracle 116 labels instances (i.e., merchant data fields of payment transaction records) when requested by the server system 108. However, some instances are so ambiguous that even the oracle 116 cannot decipher a corresponding merchant location/information. In general, such instances are noisy in nature and are identified by the server system 108 using a combination of clustering and semi-supervised machine learning techniques.
In one embodiment, the payment network 104 may be used by the payment card issuing authorities as a payment interchange network. The payment network 104 may include a plurality of payment servers such as the payment server 106. Examples of payment interchange network include, but are not limited to, Mastercard® payment system interchange network. The Mastercard® payment system interchange network is a proprietary communications standard promulgated by Mastercard International Incorporated® for the exchange of financial transactions among a plurality of financial institutions that are members of Mastercard International Incorporated®. (Mastercard is a registered trademark of Mastercard International Incorporated located in Purchase, N.Y.).
The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. Additionally, or alternatively, a set of systems (e.g., one or more systems) or a set of devices (e.g., one or more devices) of the environment 100 may perform one or more functions described as being performed by another set of systems or another set of devices of the environment 100.
Referring now to FIG. 2, a simplified block diagram of a server system 200 is shown, in accordance with an embodiment of the present disclosure. The server system 200 is similar to the server system 108. In some embodiments, the server system 200 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture. In one embodiment, the server system 200 is a part of the payment network 104 or integrated within the payment server 106. In another embodiment, the server system 200 is the acquirer server 102.
The server system 200 includes a computer system 202 and a database 204. The computer system 202 includes at least one processor 206 for executing instructions, a memory 208, a communication interface 210, and a storage interface 214 that communicate with each other via a bus 212.
In some embodiments, the database 204 is integrated within the computer system 202. For example, the computer system 202 may include one or more hard disk drives as the database 204. A storage interface 214 is any component capable of providing the processor 206 with access to the database 204. The storage interface 214 may include, for example, an Advanced Technology Attachment (ATA) adapter, a Serial ATA (SATA) adapter, a Small Computer System Interface (SCSI) adapter, a RAID controller, a SAN adapter, a network adapter, and/or any component providing the processor 206 with access to the database 204. In some example embodiments, the database 204 is configured to store one or more trained supervised machine learning models.
Examples of the processor 206 include, but are not limited to, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a field-programmable gate array (FPGA), and the like. The memory 208 includes suitable logic, circuitry, and/or interfaces to store a set of computer-readable instructions for performing operations. Examples of the memory 208 include a random-access memory (RAM), a read-only memory (ROM), a removable storage drive, a hard disk drive (HDD), and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 208 in the server system 200, as described herein. In another embodiment, the memory 208 may be realized in the form of a database server or cloud storage working in conjunction with the server system 200, without departing from the scope of the present disclosure.
The processor 206 is operatively coupled to the communication interface 210 such that the processor 206 is capable of communicating with a remote device 218 such as, the acquirer server 102, the payment server 106, or communicating with any entity connected to the network 110 (as shown in FIG. 1). Further, the processor 206 is operatively coupled to the user interface 216 for interacting with the oracle 116 for labeling merchant data fields of the electronic payment transaction records that have low matching probability scores with data stored at the merchant database 118.
It is noted that the server system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is noted that the server system 200 may include fewer or more components than those depicted in FIG. 2.
In one embodiment, the processor 206 includes a data pre-processing engine 220, a matching engine 222, a clustering engine 224, a training engine 226, and an evaluation engine 228. It should be noted that components, described herein, can be configured in a variety of ways, including electronic circuitries, digital arithmetic and logic blocks, and memory systems in combination with software, firmware, and embedded technologies.
In one embodiment, the processor 206 is configured to access a plurality of electronic payment transaction records from the transaction database 114. An electronic payment transaction record includes merchant data fields, payment transaction amount, payee identifier, transaction time/identifier, etc. In one embodiment, the merchant data fields may include, but are not limited to, merchant name, acquirer merchant identifier, merchant address (e.g., city and zip code), merchant state/country code, merchant category code (MCC), etc. Moreover, the merchant data fields include continuous/numerical data and categorical data. Examples of continuous data include, but are not limited to, acquirer merchant identifier, merchant address, merchant zip code, contact number, merchant taxpayer identifier, and the like. Examples of categorical data include, but are not limited to, merchant city, merchant state code, merchant country, a merchant business category, and the like.
It shall be noted that the merchant data fields used herein are for example purposes only and embodiments of the present disclosure can be practiced on fewer or more merchant data fields than those described herein.
The data pre-processing engine 220 includes suitable logic and/or interfaces for extracting the merchant data fields from each electronic payment transaction record. The processor 206 is configured to parse the electronic payment transaction records for extracting the merchant data fields. In one example, to parse the electronic payment transaction record, merchant data fields, transaction amount, transaction identifier, etc. in the electronic payment transaction record are separated using a defined set of delimiters (e.g., spaces, equal signs, colons, semicolons, etc.). In one embodiment, the processor 206 is configured to filter the merchant data fields to remove noise/junk characters including numbers, special characters, lowercase, punctuations, etc. that may have been introduced during storage and transmission. These noise and/or junk characters carry no significant information and are usually filtered out.
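The parsing and filtering described above can be pictured with a short sketch. The record layout (semicolon-separated `key=value` pairs), the field names, and the particular filter (dropping special characters, collapsing whitespace, and upper-casing) are assumptions chosen for illustration; the disclosure does not prescribe a specific format or filter.

```python
import re
from typing import Dict


def parse_record(raw: str) -> Dict[str, str]:
    """Split a raw record into fields using ';' and '=' as delimiters."""
    fields: Dict[str, str] = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            continue  # skip fragments that are not key=value pairs
        key, value = pair.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def scrub(value: str) -> str:
    """Remove special characters, collapse whitespace, and upper-case."""
    cleaned = re.sub(r"[^\w\s]", " ", value)
    return re.sub(r"\s+", " ", cleaned).strip().upper()


# Hypothetical raw record containing junk characters.
raw = "merchant_name=Creme  Inn##; merchant_city=New York!; amount=12.50"
record = parse_record(raw)
merchant_fields = {
    key: scrub(value)
    for key, value in record.items()
    if key.startswith("merchant_")  # keep only the merchant data fields
}
print(merchant_fields)
```

Non-merchant fields such as the transaction amount are parsed but left out of the scrubbed merchant data fields in this sketch.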
The data pre-processing engine 220 is configured to perform a scrubbing process over the merchant data fields to obtain standardized merchant data fields. Scrubbing is needed because the payment transaction data format adopted by each acquirer is different, and electronic payment transaction records coming from different acquirers may not conform to one standard format, which may introduce glitches while processing the merchant data fields. More specifically, the scrubbing process is employed by the processor 206 for merchant location data standardization and/or normalization.
However, it shall be noted that data pre-processing is optional and embodiments of the present disclosure can be practiced on the merchant data fields as received in the electronic payment transaction records.
The matching engine 222 includes suitable logic and/or interfaces for matching the electronic payment transaction records to at least one candidate merchant record in the merchant database 118. Based on the match, the matching engine 222 is configured to compute a matching probability score. In one embodiment, the matching engine 222 utilizes a blocking algorithm for matching data elements of the merchant database 118 with associated merchant data fields of the electronic payment transaction records. In a non-limiting example, each of the electronic payment transaction records is matched with a candidate merchant record in a clean merchant database such as, Pitney Bowes. More specifically, the processor 206 is configured to map the merchant data fields of an electronic payment transaction record to corresponding merchant attributes of the candidate merchant record to obtain the matching probability score.
Moreover, the blocking algorithm is utilized for matching possible candidate merchant records for the merchant data fields of each electronic payment transaction record. In an example, merchant data fields of a particular payment transaction may be as follows after data pre-processing:
Crème Inn 15 Fort Square Manhattan New York 10009 NY USA 347 691 2234
The merchant attributes of a candidate merchant record stored in the merchant database 118 that matches the merchant data fields of the above example are as follows:
Crème Inn Treat 15 Fort Square Manhattan New York 10009 NY USA 347 691 2234
The candidate merchant record and the electronic payment transaction record differ in the merchant data field (i.e., merchant name) by one token/word. The processor 206 is configured to map each merchant data field of the electronic payment transaction record to a corresponding merchant attribute in the candidate merchant record to generate the matching probability score. In the above example, the merchant data fields differ by a word in the merchant name and hence the matching probability score is 0.8.
Additionally, the matching engine 222 is configured to generate a set of matching features corresponding to merchant data fields of each electronic payment transaction record based on the matching of the merchant data fields with the candidate merchant record. The set of matching features associated with each electronic payment transaction record may be categorical and/or numerical data based on the match. An example of matching an electronic payment transaction record with candidate merchant records is shown and explained with reference to FIG. 4.
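One possible way to derive per-field matching features and an overall score is sketched below. The token-overlap (Jaccard) similarity and the plain average across fields are assumptions made for illustration; they are not stated to be the scoring used by the matching engine 222, and the resulting score is not intended to reproduce the 0.8 of the example above. The field names are likewise hypothetical.

```python
from typing import Dict, Tuple


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word tokens of two field values."""
    tokens_a, tokens_b = set(a.upper().split()), set(b.upper().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty fields are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_record(
    record: Dict[str, str], candidate: Dict[str, str]
) -> Tuple[Dict[str, float], float]:
    """Return per-field matching features and their average as a score."""
    features = {
        field: token_similarity(value, candidate.get(field, ""))
        for field, value in record.items()
    }
    score = sum(features.values()) / len(features)
    return features, score


transaction = {
    "merchant_name": "Crème Inn",
    "merchant_address": "15 Fort Square",
    "merchant_city": "Manhattan New York",
    "merchant_zip": "10009",
    "merchant_phone": "347 691 2234",
}
clean_candidate = dict(transaction, merchant_name="Crème Inn Treat")

features, score = match_record(transaction, clean_candidate)
print(features["merchant_name"], round(score, 3))
```

Here only the merchant name feature falls below 1.0, which mirrors the one-token difference between the transaction record and the candidate merchant record.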
In one embodiment, the processor 206 is configured to identify a set of electronic payment transaction records from the plurality of electronic payment transaction records having a matching probability score less than a threshold value. A low matching probability score indicates that the merchant data fields of an electronic payment transaction record may have ambiguous data that are generally noisy in nature.
The clustering engine 224 includes suitable logic and/or interfaces for applying a clustering algorithm over merchant data fields of each of the set of electronic payment transaction records for constructing a set of clusters. More specifically, the set of clusters is constructed based at least in part on the set of matching features corresponding to the merchant data fields of each set of electronic payment transaction records. In other words, the set of clusters is constructed based on matched merchant data fields of a particular payment transaction in the merchant database.
In an embodiment, the K-prototype clustering algorithm is employed to construct the set of clusters. The K-prototype clustering is a combination of K-means and K-modes clustering algorithms for clustering mixed data/attributes (categorical and numerical data).
The K-prototype clustering technique enables clustering for datasets including numerical and categorical attributes. However, embodiments may use other clustering techniques to cluster the dataset, e.g., k-means clustering and other centroid clustering algorithms.
In one embodiment, the clustering engine 224 is configured to determine a prototype (also referred to as ‘cluster center’) for each cluster. Initially, a set of matching features (also referred to as ‘matching values of a prototype’) associated with each payment transaction record is chosen as the prototype of that cluster. Further, the clustering engine 224 is configured to assign the payment transaction record to a particular cluster based on the associated set of matching features. More specifically, a dissimilarity measure is determined between the set of matching features of the payment transaction record and matching values of the prototype of each of the clusters. The payment transaction record is assigned to a cluster with whose prototype the payment transaction record has the least dissimilarity measure. In other words, the clustering engine 224 assigns the payment transaction record that is nearest to a prototype for minimizing the dissimilarity measure within the cluster. The dissimilarity measure is given by the following equation:
D(x, p) = E(x, p) + γ·C(x, p)    (Eqn. 1)
Where:
x represents the merchant data fields of a payment transaction,
p is a prototype of a cluster,
D(x, p) is the dissimilarity measure between x and p,
E(x, p) is the Euclidean distance between the numerical attributes of x and p,
C(x, p) is the number of mismatched categorical attributes between x and p, and
γ is the weightage for the categorical variable value.
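In a non-limiting illustration, the dissimilarity measure of Eqn. 1 may be sketched in Python as follows. The function name, the argument layout (numerical and categorical attributes passed as separate lists), the parameter name cat_weight for the weightage of the categorical term, and the sample values are assumptions made for this sketch and are not taken from the disclosure.

```python
import math


def dissimilarity(x_num, x_cat, p_num, p_cat, cat_weight=1.0):
    """Sketch of Eqn. 1: Euclidean distance over numerical attributes plus
    cat_weight times the number of mismatched categorical attributes."""
    # E(x, p): Euclidean distance between the numerical attributes.
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_num, p_num)))
    # C(x, p): count of categorical attributes that do not match.
    mismatches = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return euclidean + cat_weight * mismatches


# Hypothetical record and prototype: one numerical and two categorical features.
d = dissimilarity([0.72], ["Yes", "No"], [0.78], ["Yes", "Yes"], cat_weight=0.5)
print(round(d, 2))
```

A larger weightage makes categorical mismatches dominate the assignment, while a smaller weightage lets the numerical distance dominate.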
The clustering engine 224 is configured to iteratively update prototype and assign clusters for the payment transaction records till the clusters stabilize. In one embodiment, the prototype is updated by updating the matching values of the prototypes. The matching values include numerical data and categorical data. Accordingly, a weighted average of numerical data in the cluster and mode of the categorical data in that cluster are determined for updating the matching values of the prototypes. Further, each of the cluster centers is represented by a central vector or cluster center value, which may not necessarily be a member of the data set. The cluster center values may be utilized as proxy representations to replace data in the dataset generating a reduced dataset.
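As a hedged sketch of the prototype update described above, the following Python function recomputes a prototype from the members of one cluster. For simplicity it assumes equal weights, so the weighted average reduces to a plain mean; the function name and the row layout are likewise assumptions of this sketch.

```python
from collections import Counter


def update_prototype(num_rows, cat_rows):
    """Return (numerical means, categorical modes) for one cluster.

    num_rows: list of numerical attribute lists, one per record.
    cat_rows: list of categorical attribute lists, one per record.
    Assumption: every record carries equal weight.
    """
    count = len(num_rows)
    # Column-wise mean of the numerical matching values.
    num_proto = [sum(column) / count for column in zip(*num_rows)]
    # Column-wise mode of the categorical matching values.
    cat_proto = [Counter(column).most_common(1)[0][0] for column in zip(*cat_rows)]
    return num_proto, cat_proto


num_proto, cat_proto = update_prototype(
    [[0.72], [0.78]],
    [["Yes", "No"], ["Yes", "No"]],
)
```

The resulting prototype need not coincide with any record in the cluster, which is consistent with the cluster center not necessarily being a member of the data set.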
In one non-limiting example, for payment transactions R1 and R2, merchant name and merchant phone numbers are matched to corresponding data stored in the merchant database. Further, the merchant addresses differ from corresponding data in the database and are matched with a matching probability score equal to 0.3 which is lower than the predefined threshold value. Thus, based on similar matching characteristics, the payment transactions R1 and R2 are clustered into one cluster.
In another example, consider electronic payment transaction records R1, R2, and R3 are accessed. These records are matched with the merchant database and matching features associated with each merchant field are generated for respective records. The matching features {f1-1, f1-2, f1-3} are associated with the electronic payment transaction record R1, matching features {f2-1, f2-2, f2-3} are associated with an electronic payment transaction record R2 and matching features {f3-1, f3-2, f3-3} are associated with an electronic payment transaction record R3. The set of matching features for each record is shown in the below table 1:
Matching Features    R1     R2      R3
fn-1                 Yes    No      Yes
fn-2                 No     Yes     No
fn-3                 72%    100%    78%
Table 1
In the above example, the electronic payment transaction records R1 and R3 are grouped in a cluster (e.g., cluster C1) based on similar characteristics of the set of matching features, and the electronic payment transaction record R2 is grouped in another cluster (e.g., cluster C2). In other words, the electronic payment transaction records in cluster C1 match with candidate merchant records in the merchant name but differ slightly in the merchant location and the electronic payment transaction records in cluster C2 match with candidate merchant records in the merchant location but differ in the merchant name. In general, a cluster includes merchant data fields of electronic payment transaction records that are similar and are dissimilar from the merchant data fields in another cluster.
In some embodiments, it shall be noted that the clustering engine 224 may use numerical data clustering algorithms such as, but not limited to, K-means clustering, hierarchical clustering, density-based clustering, spectral and graph clustering or Gaussian Mixture if the merchant data fields include only numerical data/attributes and categorical data clustering algorithms such as, but not limited to, K-modes clustering, squeezer clustering, Cobweb clustering, Limbo clustering techniques when the merchant data fields include only categorical data/attributes.
The training engine 226 includes suitable logic and/or interfaces for training a supervised machine learning model. In an embodiment, the supervised machine learning model is trained for identifying ambiguous merchant data fields in the electronic payment transaction records. In an embodiment, the supervised machine learning model is trained using one or more instances of each set of clusters in an iterative manner. One or more instances are related to merchant data fields associated with one or more payment transactions that need to be labeled. More specifically, the training engine 226 is configured to train the supervised machine learning model using the merchant data fields of the set of clusters to identify ambiguous instances using active learning methods, iteratively.
“Active learning” generally refers to a semi-supervised learning technique that selectively labels instances (e.g., merchant location field) of the cluster. The labeled instances (also referred to as ‘labeled data points’) are used for training the supervised machine learning model. Active learning proactively selects the most desirable subset of instances in a cluster for labeling. The instances selected by active learning are the most informative instances of the cluster.
In one embodiment, the training engine 226 is configured to train the supervised machine learning model based on the labeled data points associated with the one or more instances of the cluster. In a non-limiting example, the supervised machine learning model is an XGBoost (eXtreme Gradient Boosting) model. However, it shall be noted that other supervised machine learning models such as, but not limited to, a Random Forest (RF) model may also be used to identify ambiguous instances.
In one embodiment, the XGBoost model is trained in a supervised manner and defines a model parameter and a target function based on the training. The model parameter is used to control how the value of the target variable (including a classification result or a fitting value) is determined based on the sample. The target function is used to constrain the process of training the model so as to obtain an ideal parameter. A smaller value of the target function (also known as a cost function) indicates a higher prediction precision of the XGBoost model. Training the XGBoost model is a process of driving the value of the target function below a particular value or causing it to converge to a particular degree.
In one embodiment, the XGBoost model includes a classification and regression tree (CART) function. When a classification problem is resolved, for example, whether a payment transaction has noisy merchant data fields or not (that is, a binary classification problem) is predicted, the classification tree is used.
During training, a cluster from the set of clusters is selected by the training engine 226. It shall be noted that a cluster is selected randomly from the set of clusters and/or a cluster may be selected based on a predefined rule. The cluster is provided to the training engine 226. The training engine 226 is configured to select one or more instances of a cluster for labeling. In a non-limiting example, a pool based sampling is employed and all instances in the pool are queried using margin sampling to select the one or more instances. In an embodiment, the training engine 226 requests an oracle to provide a label for the one or more instances. In an example, the training engine selects an instance (e.g., merchant address ‘4215 Spring View’, merchant name ‘Lemon Tree’) for labeling and sends it to the oracle.
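The margin sampling step mentioned above may be sketched as follows. The sketch assumes that the current model exposes, for every instance in the pool, a list of class probabilities; the function name, the pool values, and the number of queries are illustrative assumptions.

```python
def margin_sampling(probabilities, n_queries=1):
    """Return indices of the n_queries instances with the smallest margin
    between their two most probable classes (the most uncertain ones).

    probabilities: one list of class probabilities per pooled instance.
    """
    margins = []
    for index, probs in enumerate(probabilities):
        ranked = sorted(probs, reverse=True)
        # Margin = best class probability minus second-best class probability.
        margins.append((ranked[0] - ranked[1], index))
    margins.sort()
    return [index for _, index in margins[:n_queries]]


# Hypothetical pool of three instances with binary class probabilities.
pool = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]]
selected = margin_sampling(pool, n_queries=1)
```

Here the second instance is selected because the model is least able to separate its two classes, so a label from the oracle is expected to be most informative for it.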
In an embodiment, the labels are accessed via the user interface 216 from the oracle 116. It shall be noted that an instance can have more than one label but the description is limited to a single label for the sake of brevity.
In an embodiment, instances that are not labeled by the oracle 116 are classified as ambiguous instances and are discarded. In other words, the merchant data fields (e.g., merchant address) corresponding to the ambiguous instances are noisy and hence discarded.
The evaluation engine 228 includes suitable logic and/or interfaces for determining accuracy metrics of the supervised machine learning model after the supervised machine learning model is trained based on the labeled data points of the cluster. The accuracy metrics of the supervised machine learning model may be evaluated using performance evaluating techniques such as, but not limited to, F1 score (which is a harmonic mean of the precision and recall), Receiver Operating Characteristic (ROC) score, and the like.
In one embodiment, the evaluation engine 228 is configured to determine whether the accuracy metrics of the supervised machine learning model at a current iteration have an incremental value compared to the accuracy metrics of the supervised machine learning model at a previous iteration (i.e., the iteration preceding the current iteration). In other words, the evaluation engine 228 checks whether the performance of the supervised machine learning model has improved when trained with labeled data points of a current cluster. When the incremental value exists and is greater than a threshold value, the supervised machine learning model trained at the current iteration is retained or considered for further iterations.
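For illustration only, the F1 score named above can be computed from true and predicted labels as in the following sketch. The label encoding (1 for a noisy instance, 0 otherwise) and the sample labels are assumptions of this sketch.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = noisy merchant data fields, 0 = not noisy.
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```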
Thereafter, the training engine 226 is configured to train the supervised machine learning model using labeled data points of a next/subsequent cluster. Alternatively, when the accuracy metrics of the supervised machine learning model at the current iteration does not have an incremental value greater than the threshold value, all instances of the cluster (including labeled data points) are classified as ambiguous instances and are discarded. In other words, when the supervised machine learning model trained on labeled data points of a current cluster performs badly when compared to the supervised machine learning model generated at the end of the previous iteration (preceding iteration of the current iteration), all instances of the cluster are noisy or corrupted which hampers the performance of the supervised machine learning model and are hence discarded.
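The retain-or-discard decision described above may be sketched as a loop over clusters. In this sketch, train_and_score is a hypothetical callback that stands in for training the supervised machine learning model on the supplied labeled data points and returning its accuracy metric; the starting baseline of 0.0, the parameter min_gain, and the toy scorer are assumptions made purely so that the example runs.

```python
def filter_clusters(clusters, train_and_score, min_gain=0.0):
    """Split clusters into retained and ambiguous ones.

    clusters: dict mapping a cluster name to its labeled data points.
    train_and_score: hypothetical callback; trains a model on the given
        points and returns an accuracy metric (e.g., an F1 score).
    min_gain: threshold that the metric increment must exceed.
    """
    retained, ambiguous = [], []
    training_data = []
    best_score = 0.0  # assumption: baseline before any cluster is used
    for name, points in clusters.items():
        candidate = training_data + list(points)
        score = train_and_score(candidate)
        if score - best_score > min_gain:
            # The cluster improved the model: keep its data points.
            training_data, best_score = candidate, score
            retained.append(name)
        else:
            # No sufficient increment: all instances are treated as ambiguous.
            ambiguous.append(name)
    return retained, ambiguous, best_score


def toy_scorer(points):
    """Stand-in for model training: rewards consistently labeled points."""
    return sum(1 for p in points if p) / (len(points) + 2)


clusters = {"C1": [True, True], "C2": [False, False, False], "C3": [True]}
retained, ambiguous, final_score = filter_clusters(clusters, toy_scorer)
```

With the toy scorer, cluster C2 lowers the metric and is therefore classified as ambiguous, whereas C1 and C3 each raise it and are retained.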
Similarly, the supervised machine learning model is iteratively trained on remaining clusters of the set of clusters to identify ambiguous instances in electronic payment transaction records.
In one embodiment, once the supervised machine learning model is trained with the set of clusters, the processor 206 is configured to again determine, from the set of electronic payment transaction records, the electronic payment transaction records that have matching probability scores less than the predefined threshold score. In one embodiment, these electronic payment transaction records may further include new electronic payment transaction records that the processor 206 accessed from the transaction database 114. These remaining electronic payment transaction records are clustered again and each cluster is then provided to the supervised machine learning model using similar operations as described earlier. The supervised machine learning model is updated based on the clusters. In one embodiment, the clusters are iteratively provided to the supervised machine learning model and an accuracy metrics value of the supervised machine learning model is evaluated after each iteration. In an embodiment, when a cluster improves the accuracy metrics of the supervised machine learning model, the cluster is retained and the supervised machine learning model is updated based on the labeled data points of the cluster. Alternatively, when updating the supervised machine learning model based on data points of a cluster decreases the accuracy metrics of the supervised machine learning model, the cluster is classified as ambiguous and discarded. More specifically, the cluster may include payment transaction records with merchant data fields that are ambiguous, resulting in a performance decrease of the supervised machine learning model.
In one embodiment, when the accuracy metrics of the supervised machine learning model decreases, data points (i.e., merchant data fields associated with payment transaction record) associated with the cluster are sent to the oracle 116. The oracle 116 may label one or more example instances, and the cluster is again provided to the supervised machine learning model. If the accuracy metrics improve, the cluster is retained (i.e., the supervised machine learning model is updated based on the cluster) and if the accuracy metrics decreases, the cluster including payment transaction records is discarded for being noisy.
In an embodiment, after determining all clusters (i.e., ambiguous clusters) which are not able to increase the accuracy metrics of the supervised machine learning model in a sufficient number of iterations, the processor 206 is configured to provide the payment transaction records (associated with those ambiguous clusters) having the most noisy merchant data fields to the oracle 116.
In one embodiment, when the server system 200 receives new transaction records, the server system 200 determines whether the new transaction records are noisy instances based on the trained supervised machine learning model.
FIGS. 3A and 3B, collectively, represent a schematic block diagram representation 300 of a process flow for clustering electronic payment transaction records having low matching probability scores, in accordance with an embodiment of the present disclosure.
At first, the processor 206 is configured to access and perform data pre-processing (see, 304) over the electronic payment transaction records (see 302) associated with payment transaction requests received from the acquirer server 102. The processor 206 is configured to extract (see 306) merchant data fields, in raw text form, from the electronic payment transaction records (see 302) and filter the merchant data fields to remove noise and/or junk characters from the merchant data fields (see 308). The merchant data fields of the electronic payment records are not standardized for processing, hence, the processor 206 is configured to scrub the merchant data fields to obtain standardized merchant data fields (see 310). In other words, the scrubbing process is employed by the processor 206 for merchant location data standardization and/or normalization (see, Table 318).
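A minimal sketch of the filtering and scrubbing steps is given below. The disclosure does not specify the normalization rules, so the rules used here (upper-casing, keeping only letters, digits, spaces, and hyphens, and collapsing repeated whitespace) are assumptions chosen for illustration, as is the function name.

```python
import re


def scrub_field(raw):
    """Standardize one raw merchant data field (illustrative rules only)."""
    text = raw.upper()
    # Replace junk characters with a space; keep letters, digits, hyphens.
    text = re.sub(r"[^A-Z0-9\s\-]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = scrub_field("  1415 north  wayne street#* ")
```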
During matching (see, 312), the processor 206 is configured to match merchant data fields of each electronic payment transaction record with candidate merchant records in the clean merchant database 118. A matching probability score is determined based on the similarity of the merchant data fields with the candidate merchant records (see, Table 320). This process helps in determining merchant data fields of electronic payment transaction records that do not have an exact match in the clean merchant database. Moreover, a set of matching features (see, Table 320) is also generated for each electronic payment transaction record including categorical and numerical values of matched merchant data fields. The processor 206 is configured to identify (see, 314) a set of electronic payment transaction records with matching probability scores less than the predefined threshold (e.g., 0.3) (see, Table 322).
The processor 206 is configured to cluster/group (see, 316) the set of electronic payment transaction records into a set of clusters (see, Table 324) based at least on the set of matching features corresponding to the merchant data fields associated with the electronic payment transaction records. The set of matching features is used to find similarity in between the electronic payment transaction records.
Referring now to FIG. 4, an example representation 400 of matching merchant data fields of a payment transaction record with candidate merchant records stored at the merchant database 118, is illustrated, in accordance with an example embodiment of the present disclosure. The payment transaction record includes, but is not limited to, merchant data fields 402. The merchant data fields 402 may include categorical and numerical data representing merchant information such as merchant name, merchant address, merchant city, merchant state code, merchant zip code, merchant country, contact number, and acquirer merchant identifier (ID).
In one embodiment, the processor 206 is configured to utilize a blocking algorithm for matching the payment transaction record with corresponding candidate merchant records stored at the merchant database 118. The processor 206 is configured to generate an optimal blocking function that selects candidate merchant records based on a set of predicates. Examples of predicates for constructing the blocking function include, but are not limited to, exact match, a common token, a common integer, same integers, differ by one integer, and same first ‘n’ characters.
As shown in FIG. 4, the payment transaction record R1 includes merchant data fields 402. In an embodiment, a query string is generated based on the merchant data fields 402 in the payment transaction record R1. The blocking function matches the query string with candidate merchant records 404 and 406.
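By way of a hedged example, a few of the predicates listed above may be sketched as simple Python functions, with a blocking function that keeps any candidate satisfying at least one predicate. The disclosure does not state how the optimal blocking function is derived, so combining predicates as a plain disjunction is an assumption of this sketch, as are the function names and the candidate names other than those appearing in the description.

```python
import re


def common_token(a, b):
    """True when the two strings share at least one whitespace-separated token."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def common_integer(a, b):
    """True when the two strings share at least one integer substring."""
    return bool(set(re.findall(r"\d+", a)) & set(re.findall(r"\d+", b)))


def same_first_n_chars(a, b, n=3):
    """True when the first n characters match, ignoring case."""
    return a[:n].lower() == b[:n].lower()


def block(query, candidates, predicates):
    """Keep candidates that satisfy at least one predicate against the query."""
    return [c for c in candidates if any(pred(query, c) for pred in predicates)]


# "Wagon Wheel Diner" is a hypothetical merchant name used only for this example.
candidates = ["Book Worms", "Lemon Tree", "Wagon Wheel Diner"]
blocked = block("Book Wagon", candidates, [common_token, same_first_n_chars])
```

Blocking of this kind limits the more expensive field-by-field comparison to a small set of plausible candidate merchant records.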
In an example scenario, the processor 206 determines two entries in the merchant database 118 that are similar to the query string. The candidate merchant records 404 and 406 are identified in the merchant database 118 that are similar to the query string.
The candidate merchant record 404 differs from the merchant data fields 402, namely, merchant name by one token, merchant address by two integers and three tokens, and merchant acquirer identifier by 2 integers. However, the merchant data fields 402 and candidate merchant record 404 have an exact match in merchant city, merchant state code, and merchant country. In this example, the merchant (“Book Wagon”) associated with the payment transaction record R1 has changed a telephone line, for example, acquired a new telephone line that may have earlier belonged to a merchant (“Book Worms”) with a similar merchant name and operating in the same city.
The candidate merchant record 406 differs from the merchant data fields 402, namely, merchant contact number by 10 integers. However, merchant data fields 402 of the payment transaction record R1 and the candidate merchant record 406 match in merchant name, merchant address, merchant city, merchant state code, and merchant country. As explained above, the merchant (“Book Wagon”) associated with the payment transaction record R1 may have changed a telephone line and may not have updated it, thereby differing from the candidate merchant record 406 in one field.
In one embodiment, a set of matching features (i.e., {F = f1, f2, f3, …, fn}) is generated based on the matching of the query string with the candidate merchant record. More specifically, the set of matching features are various string distances based on matching each of the merchant data fields 402 of the payment transaction record R1 to a corresponding attribute in the candidate merchant record. For example, when the merchant names in the merchant data fields 402 and candidate merchant record 404 are matched and differ by just one word (i.e., one token), a corresponding string distance is calculated. This string distance forms a matching feature of the set of matching features. In an embodiment, the set of matching features may be either categorical or numerical data. In a non-limiting example, the string distances are determined using string similarity algorithms. Examples of the string distances include, but are not limited to, Jaccard similarity, Levenshtein distance, Jaro-Winkler distance, etc.
In one embodiment, a matching probability score is determined between the merchant data fields of the electronic payment transaction record R1 and each of the candidate merchant records 404 and 406 based on the set of matching features generated. More specifically, the matching probability score is determined based on the string distances between each of the merchant data fields and corresponding merchant attributes in a candidate merchant record (e.g., candidate merchant record 406). In an example, the matching probability score determined on matching merchant data fields 402 of the electronic payment transaction record R1 and the candidate merchant record 404 is 0.2. The matching probability score between the merchant data fields 402 of the payment transaction record R1 and the candidate merchant record 404 is low because the set of matching features (or the string distance) indicates that the payment transaction record R1 and the candidate merchant record 404 differ in most of the fields (e.g., merchant name, merchant address, acquirer merchant ID). Similarly, the matching probability score (i.e., 0.5) is determined based on a matching of merchant data fields 402 and the candidate merchant record 406. Thus, the candidate merchant record 406 is identified as a possible match for the payment transaction record R1.
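Two of the string distances named above may be sketched in plain Python as follows. The token-level definition of the Jaccard similarity (over whitespace-separated, lower-cased tokens) is an assumption of this sketch; the Levenshtein distance follows the standard dynamic programming formulation.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard_tokens(a, b):
    """Jaccard similarity over lower-cased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Matching features for the merchant name field of the example in FIG. 4.
name_edit_distance = levenshtein("Book Wagon", "Book Worms")
name_jaccard = jaccard_tokens("Book Wagon", "Book Worms")
```

Each such value may serve as one numerical matching feature, while a thresholded or exact-match variant may serve as a categorical matching feature.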
Referring now to FIG. 5A, an example representation 500 of constructing clusters based on a set of matching features associated with merchant data fields associated with a set of payment transactions, is shown, in accordance with an example embodiment of the present disclosure.
As mentioned previously, the processor 206 is configured to construct a set of clusters based at least in part on the set of matching features corresponding to the merchant data fields of each set of electronic payment transaction records. The set of matching features represents matching numerical/categorical values of merchant data fields of each payment transaction.
In an embodiment, the clustering engine 224 is configured to randomly select ‘K’ prototypes from the electronic payment transaction records whose matching probability scores are less than the predefined threshold. These ‘K’ prototypes are referred to as ‘centroids’ of the clusters. The clustering engine 224 is further configured to assign each electronic payment transaction record to one of the ‘K’ clusters based on a categorical dissimilarity measure and a numerical dissimilarity measure. In a non-limiting example, the numerical dissimilarity measure is a Euclidean distance determined between the numerical data of matching features associated with an electronic payment transaction record and each of the ‘K’ prototypes (centroids). A categorical dissimilarity measure is a number of mismatched categorical data of matching features associated with the electronic payment transaction record with the ‘K’ prototypes (centroids). Thereafter, the merchant data fields of the electronic payment transaction record are assigned to a cluster with whose centroids the merchant data fields have the least categorical dissimilarity measure and a least numerical dissimilarity measure. In a similar manner, the merchant data fields of each of the electronic payment transaction records with a matching probability score less than the predefined threshold are assigned to a cluster of the set of ‘K’ clusters.
After the clusters are formed, the ‘centroid’ of each cluster is updated and the merchant data fields of each of the electronic payment transaction records are re-allocated if needed based on the categorical and numerical dissimilarity measures determined with updated centroids. The re-allocation of merchant data fields of electronic payment transaction records between clusters happens till the centroids (‘K’ prototypes) remain unchanged thereby resulting in the convergence of the clustering process.
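The assign-and-update cycle described above may be sketched end to end as follows, using the matching features of Table 1 as input. For reproducibility the sketch seeds the prototypes with the first ‘K’ records instead of a random selection, uses equal record weights when averaging, and stops when assignments no longer change (which implies that the prototypes no longer change); these choices, the weightage value, and all names are assumptions of this sketch.

```python
import math
from collections import Counter


def dissim(record, proto, weight):
    """Eqn. 1 style measure on (numerical_tuple, categorical_tuple) pairs."""
    numeric = math.sqrt(sum((a - b) ** 2 for a, b in zip(record[0], proto[0])))
    categorical = sum(1 for a, b in zip(record[1], proto[1]) if a != b)
    return numeric + weight * categorical


def k_prototypes(records, k, weight=1.0, max_iter=100):
    protos = [records[i] for i in range(k)]  # assumption: first k as seeds
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest prototype for every record.
        new_labels = [min(range(k), key=lambda j: dissim(r, protos[j], weight))
                      for r in records]
        if new_labels == labels:
            break  # assignments are stable, so prototypes are stable too
        labels = new_labels
        # Update step: mean of numerical values, mode of categorical values.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if not members:
                continue  # keep the old prototype for an empty cluster
            nums = tuple(sum(col) / len(members)
                         for col in zip(*(m[0] for m in members)))
            cats = tuple(Counter(col).most_common(1)[0][0]
                         for col in zip(*(m[1] for m in members)))
            protos[j] = (nums, cats)
    return labels, protos


# Table 1: (fn-3 as a fraction,), (fn-1, fn-2) for R1, R2, R3.
records = [((0.72,), ("Yes", "No")),
           ((1.00,), ("No", "Yes")),
           ((0.78,), ("Yes", "No"))]
labels, protos = k_prototypes(records, k=2)
```

On this input, R1 and R3 fall into one cluster and R2 into another, which agrees with the grouping into clusters C1 and C2 described for Table 1.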
As shown in FIG. 5A, each payment transaction record (e.g., R1, R2) includes merchant data fields (e.g., d1_1, d1_2…d1_8). The merchant data fields of each payment transaction record are matched with data stored in the merchant database 118 and a matching probability score is computed based on the match. The processor 206 is configured to generate a set of matching features for each payment transaction record. The set of matching features represents the similarity percentage of each merchant data field with a candidate merchant record stored in the merchant database 118.
In one example, a matching feature denotes categorical similarity measure value (may be in binary form) indicating whether merchant location (e.g., ZIP number) associated with a particular payment transaction exists or not. In another example, a matching feature is a numerical similarity measure value indicating matching between the merchant address and stored merchant address associated with the respective merchant. It shall be noted that the set of matching features explained herein are for example purposes only and lesser or more number of matching features may be generated than those explained above.
Then, payment transaction records with matching probability scores less than a predefined threshold value are selected for clustering. The processor 206 is configured to generate ‘K’ clusters (e.g., cluster 502, cluster 504, cluster 506) based on the set of matching features associated with the payment transaction records by applying the K-prototype clustering algorithm over the payment transaction records. In general, the K-prototype clustering algorithm determines similarity measure values to classify the payment transaction records into the set of clusters 502, 504, 506. In an embodiment, ‘K’ prototypes are randomly selected from the payment transactions whose matching probability scores are less than the predefined threshold score. These ‘K’ prototypes are referred to as ‘centroids’ of the clusters. Each payment transaction is assigned to one of the ‘K’ clusters based on a dissimilarity measure. As depicted by Eqn. 1, the dissimilarity measure is a weighted sum of a categorical dissimilarity measure and a numerical dissimilarity measure. The categorical dissimilarity measure gives a number of mismatched categorical data between the matching features of a payment transaction and a corresponding value of the prototype of a cluster. Similarly, the numerical dissimilarity measure is a Euclidean distance determined between the numerical data of matching features associated with a payment transaction and a prototype (centroid) of a cluster. Thereafter, the payment transaction record R1 is assigned to a cluster (e.g., cluster 504) with whose centroids the merchant data fields d1_1, d1_2…d1_8 have the least dissimilarity measure. In a similar manner, the payment transactions with matching probability scores less than the predefined threshold are assigned to a cluster of the set of clusters (e.g., cluster 502, cluster 504, and cluster 506).
After the clusters (e.g., cluster 502, cluster 504, and cluster 506) are formed, the ‘centroid’ of each cluster is updated and the merchant data fields of each of the payment transactions are re-allocated if needed based on the categorical and numerical dissimilarity measures determined with updated centroids. The re-allocation of merchant data fields of payment transactions between clusters happens till the centroids (the ‘K’ prototypes, three in this example) remain unchanged thereby resulting in the convergence of the clustering process.
Moreover, it shall be noted that the example representation 500 of clustering shown in FIG. 5A is exemplary and only provided for the purposes of explanation. In practice, the clustering algorithm may form fewer or more clusters of merchant data fields from electronic payment transaction records than those depicted in FIG. 5A.
FIG. 5B, in conjunction with FIG. 5A, shows an example representation 520 of a table depicting a description of each cluster, in accordance with an example embodiment of the present disclosure. As mentioned above, merchant data fields of electronic payment transaction records that match with candidate merchant records in merchant name and merchant location but differ in contact number are grouped in the cluster 502 (see, row 522 in FIG. 5B). More specifically, the matching features associated with merchant name, merchant address and merchant contact number of payment transaction records are similar, and hence the merchant data fields associated with corresponding payment transaction records (e.g., R1, R5, R8) are grouped in the cluster 502. In a similar manner, the cluster 504 includes merchant data fields of payment transaction records R3, R6, R7 associated with matching features indicating that merchant names differ by at least one word and merchant address match only in word but phone numbers may or may not match (see, row 524 in FIG. 5B). The cluster 506 includes payment transaction records R2, R4 with an exact match of merchant locations with the merchant database but which differ from corresponding candidate merchant records in merchant name by at least one word, and contact numbers may or may not match (see, row 526 in FIG. 5B).
FIGS. 6A and 6B, collectively, represent a flow chart 600 of a process flow for identifying ambiguous payment transaction records with noisy merchant data fields, in accordance with an example embodiment of the present disclosure. The sequence of operations of the flow chart 600 may not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in form of a single step, or one operation may have several sub-steps that may be performed in parallel or in a sequential manner.
At 602, the server system 200 accesses the plurality of electronic payment transaction records (e.g., R1, R2…Rn) from the transaction database 114. Each electronic payment transaction record is associated with a merchant of a plurality of merchants. These electronic payment transaction records are received from one or more acquirers (e.g., the acquirer 102).
At 604, the server system 200 extracts merchant data fields associated with a merchant from an electronic payment transaction record. The merchant data fields may include categorical data and/or numerical data. The merchant data fields include, but are not limited to, data elements including information of merchant name, merchant acquirer identifier, merchant address, merchant city, merchant zip, merchant state code, and merchant country. An example of an electronic payment transaction record R1 is as follows:
TREOS AWZ1237 1415 North Wayne Street Atlanta GA USA 30344 404-632-5512
The electronic payment transaction record R1 has the following merchant data fields (d1_1, d1_2, …, d1_8):
d1_1 : Merchant Name : TREOS
d1_2 : Acquirer Merchant ID : AWZ1237
d1_3 : Merchant Address: 1415 North Wayne Street
d1_4 : Merchant City: Atlanta
d1_5 : Merchant State Code : GA
d1_6 : Merchant Country : USA
d1_7 : Merchant Zip : 30344
d1_8 : Merchant Contact Number : 404-632-5512
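For illustration, the merchant data fields d1_1 to d1_8 of the record R1 may be held in a small typed structure such as the following. The class name and attribute names are assumptions of this sketch; the values are those listed above, kept as strings so that leading zeros and formatting in identifiers such as the zip code are preserved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MerchantDataFields:
    """Illustrative container for the eight merchant data fields of a record."""
    merchant_name: str          # d1_1
    acquirer_merchant_id: str   # d1_2
    merchant_address: str       # d1_3
    merchant_city: str          # d1_4
    merchant_state_code: str    # d1_5
    merchant_country: str       # d1_6
    merchant_zip: str           # d1_7
    merchant_contact: str       # d1_8


r1 = MerchantDataFields(
    merchant_name="TREOS",
    acquirer_merchant_id="AWZ1237",
    merchant_address="1415 North Wayne Street",
    merchant_city="Atlanta",
    merchant_state_code="GA",
    merchant_country="USA",
    merchant_zip="30344",
    merchant_contact="404-632-5512",
)
```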
At 606, the server system 200 matches each of the plurality of electronic payment transaction records (R1, R2, …, Rn) to at least one candidate merchant record in the merchant database 118 to obtain a matching probability score. In general, the merchant data fields in the electronic payment transaction record R1 are matched to corresponding data element/attribute associated with at least one candidate merchant record (C1) in the merchant database 118 to obtain the matching probability score. More specifically, the server system 200 is configured to search a candidate merchant record in the merchant database that is similar to the one or more merchant data fields in the electronic payment transaction record.
The matching probability score is a measure of similarity between the electronic payment transaction record (e.g., R1) and the candidate merchant record. More particularly, the matching probability score is determined for each of the plurality of electronic payment transaction records based on the set of matching features associated with each of the plurality of electronic payment transaction records.
At 608, the server system 200 identifies a set of electronic payment transaction records of the plurality of electronic payment transaction records with matching probability scores less than a predefined threshold score. At 610, the server system 200 clusters the set of electronic payment transaction records to construct a set of clusters based on a set of matching features associated with each of the set of electronic payment transaction records. The set of matching features are generated based on the matching of merchant data fields of each of the set of electronic payment transaction records with data stored in the merchant database 118
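The identification at 608 amounts to a threshold filter over the matching probability scores, as in the following sketch. The record identifiers and scores are hypothetical, and the default threshold of 0.3 is the example value given elsewhere in this description.

```python
def select_low_confidence(scored_records, threshold=0.3):
    """Return identifiers of records whose matching probability score is
    strictly less than the predefined threshold score."""
    return [record_id for record_id, score in scored_records if score < threshold]


# Hypothetical (record identifier, matching probability score) pairs.
scored = [("R1", 0.20), ("R2", 0.50), ("R3", 0.29), ("R4", 0.30)]
to_cluster = select_low_confidence(scored)
```

Only the selected records are passed on to the clustering at 610; a record whose score equals the threshold is not selected in this sketch.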
At 612, the server system 200 randomly selects a cluster i from the set of K clusters, where ‘i’ is less than K value. When all clusters are selected once, the process flow is stopped.
At 614, the server system 200 samples the cluster to identify one or more instances. It shall be noted that the term ‘instance’ refer to an electronic payment transaction record with merchant data fields. In one embodiment, the cluster is sampled using a pool based sampling technique and the instances in the cluster are queried using margin sampling for identifying the one or more instances. More specifically, the one or more instances are selected using an active learning technique that selects informative instances in the cluster for labeling.
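By way of a non-limiting illustration, margin sampling over a pool of instances may be sketched as follows. The sketch assumes that a current model has already produced a list of class probabilities (with at least two classes) for every instance in the cluster; the function names and the `n_queries` parameter are hypothetical.

```python
def margin(probabilities):
    """Difference between the two highest class probabilities of one instance."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def margin_sampling(pool_probabilities, n_queries):
    """Return indices of the n_queries pool instances with the smallest margins."""
    ranked = sorted(range(len(pool_probabilities)),
                    key=lambda i: margin(pool_probabilities[i]))
    return ranked[:n_queries]
```

A small margin means the model can barely separate its two most likely classes for that instance, which is why such instances are treated as the most informative ones to send for labeling.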
At 616, the server system 200 determines whether labeled data points are available for the one or more instances. In some example embodiments, the server system 200 sends a label request to an oracle for labeling the one or more instances (i.e., ambiguous merchant data fields) of each payment transaction record. The oracle (e.g., the oracle 116) tries to determine labeled data points (i.e., correct values of non-matched merchant data fields) for each instance by referring to an external database and/or third-party sources. If the oracle is able to determine a correct labeled data point for an instance, the server system 200 accesses the labeled data point from the oracle. If the label is not available for a particular instance, operation 618 is performed; otherwise, operation 620 is performed.
At 618, the server system 200 classifies the payment transaction record associated with the instance as an ambiguous instance. Accordingly, the instance classified as the ambiguous instance is discarded.
At 620, the server system 200 labels the instance to generate a labeled data point. If the oracle determines a label for an instance, the server system 200 retrieves the label and assigns the label to the corresponding instance. A labeled instance is hereinafter referred to as ‘a labeled data point’. At 622, the server system 200 trains a supervised machine learning model based on the labeled data points.
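By way of a non-limiting illustration, operations 616 to 620 may be sketched as follows. The oracle is modeled as a hypothetical callable that returns a label, or `None` when no label can be determined; how the oracle 116 actually consults external or third-party sources is not specified in the description and is not modeled here.

```python
from typing import Callable, Optional


def label_instances(instances, oracle: Callable[[dict], Optional[str]]):
    """Split queried instances into labeled data points and ambiguous instances."""
    labeled, ambiguous = [], []
    for instance in instances:
        label = oracle(instance)
        if label is None:
            # Operation 618: no label available, so the instance is ambiguous.
            ambiguous.append(instance)
        else:
            # Operation 620: the instance plus its label is a labeled data point.
            labeled.append((instance, label))
    return labeled, ambiguous
```

Only the `labeled` list would be passed on to training at operation 622; the `ambiguous` list corresponds to the instances that are discarded at operation 618.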
At 624, the server system 200 determines the accuracy metrics of the supervised machine learning model. In one example embodiment, the F1 score of the supervised machine learning model is determined.
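By way of a non-limiting illustration, the F1 score is the harmonic mean of precision and recall. The binary formulation below, including the `positive` parameter and the convention of returning 0.0 when there are no true positives, is an assumption for this sketch; the description does not state whether a binary or multi-class variant is used.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # convention: no true positives means an F1 of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The score would typically be computed on data held out from training, so that a change between iterations reflects the effect of the newly added cluster rather than memorization of its labels.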
At 626, the server system 200 determines whether the accuracy metrics of the supervised machine learning model have improved compared with the accuracy metrics of the supervised machine learning model generated at the previous iteration, i.e., the iteration immediately preceding the current iteration. If the accuracy metrics of the supervised machine learning model at the current iteration have improved, operation 628 is performed and the process restarts with the selection of a new cluster at operation 612 of FIG. 6A; otherwise, operation 630 is performed.
At 628, the server system 200 retains the supervised machine learning model (i.e., M2) generated at the current iteration. More particularly, the labeled data points of the cluster have improved the performance of the trained supervised machine learning model. Thereafter, the supervised machine learning model M2 is trained using labeled data points of the next clusters provided at subsequent/next iterations.
At 630, the server system 200 classifies all instances in the cluster as ambiguous instances and excludes the cluster from training the supervised machine learning model. If the accuracy metrics of the supervised machine learning model in the current iteration have not improved, or have decreased, when compared with the supervised machine learning model (i.e., M1) generated at the previous iteration, this indicates that all or most of the instances in the cluster are noisy and are degrading the performance of the supervised machine learning model.
At 632, the server system 200 discards the supervised machine learning model (i.e., M2) generated at the current iteration. At 634, the server system 200 retrieves the supervised machine learning model generated at the previous iteration from the database 204 and performs operation 612 again (i.e., selecting a new cluster from the K clusters).
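By way of a non-limiting illustration, the retain-or-roll-back logic of operations 612 and 622 to 634 may be sketched as follows. The `train` and `evaluate` callables are hypothetical and supplied by the caller. The sketch further assumes that a baseline score is computed before the first cluster is visited and that the previous model M1 is kept in memory via a deep copy rather than retrieved from the database 204.

```python
import copy
import random


def filter_clusters(clusters, model, train, evaluate, seed=0):
    """Visit each cluster once in random order; keep only clusters that improve the metric.

    train(model, cluster) returns an updated model; evaluate(model) returns an
    accuracy metric such as the F1 score. Returns the final model and the
    indices of clusters classified as ambiguous.
    """
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)  # operation 612: random selection, each cluster once

    best_score = evaluate(model)  # assumed baseline for the first comparison
    ambiguous_clusters = []
    for i in order:
        candidate = train(copy.deepcopy(model), clusters[i])  # operation 622: model M2
        score = evaluate(candidate)                           # operation 624
        if score > best_score:                                # operation 626
            model, best_score = candidate, score              # operation 628: retain M2
        else:
            ambiguous_clusters.append(i)                      # operations 630-634: keep M1
    return model, sorted(ambiguous_clusters)
```

A cluster whose labeled data points leave the metric unchanged is treated the same way as one that lowers it, consistent with operation 630, which covers both the "not improved" and the "decreased" cases.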
In an embodiment, the server system 200 is configured to again determine the number of electronic payment transaction records, which have matching probability scores less than a predefined threshold score from the set of electronic payment transaction records and cluster the number of electronic payment transaction records into a set of clusters. Each cluster is provided iteratively to the supervised machine learning model and the server system 200 performs similar operations 614-634 as described earlier for updating the supervised machine learning model.
In an embodiment, the accuracy metrics of the supervised machine learning model are determined after each iteration to detect an increase or decrease relative to the previous iteration. When a cluster increases the accuracy metrics of the supervised machine learning model, the supervised machine learning model is updated based on the cluster; when the cluster decreases the accuracy metrics, the cluster is classified as ambiguous/noisy and discarded. After determining the clusters that do not improve the accuracy metrics of the supervised machine learning model, the server system 200 is configured to terminate the training process of the supervised machine learning model and provide the clusters classified as ambiguous/noisy to the oracle 116 for annotation.
The sequence of operations of the method 600 need not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped and performed in the form of a single step, or one operation may have several sub-steps that may be performed in a parallel or sequential manner.
FIG. 7 represents a flow diagram of a method 700 for identifying ambiguous instances in electronic payment transaction records, in accordance with an example embodiment. The method 700 depicted in the flow diagram may be executed by at least one server, for example, the server system 108 or the server system 200 explained with reference to FIG. 2, the payment server 106, or the acquirer server 102. Operations of the flow diagram of the method 700, and combinations of operations in the flow diagram, may be implemented by, for example, hardware, firmware, a processor, circuitry and/or a different device associated with the execution of software that includes one or more computer program instructions. It is noted that the operations of the method 700 can be described and/or practiced by using a system other than these server systems. The method 700 starts at operation 702.
At operation 702, the method 700 includes accessing a plurality of electronic payment transaction records associated with a plurality of merchants from the transaction database 114. Each of the plurality of electronic payment transaction records includes merchant data fields associated with a merchant of the plurality of merchants.
At operation 704, the method 700 includes identifying a set of electronic payment transaction records from the plurality of electronic payment transaction records. Each of the set of electronic payment transaction records has a matching probability score less than a predefined threshold score. The matching probability score for an electronic payment transaction record is computed by matching the electronic payment transaction record with corresponding data stored in a merchant database 118.
At operation 706, the method 700 includes applying a clustering algorithm over the set of electronic payment transaction records for constructing a set of clusters. The set of clusters are constructed based, at least in part, on a set of matching features corresponding to merchant data fields associated with the set of electronic payment transaction records. The set of matching features represent matching of the merchant data fields of each electronic payment transaction record with candidate merchant records stored in the merchant database 118.
At operation 708, the method 700 includes determining ambiguous instances from the set of clusters. The ambiguous instances are determined by performing operations 708a-708e, iteratively.
At operation 708a, the method 700 includes selecting a cluster from the set of clusters. At operation 708b, the method 700 includes determining labeled data points associated with one or more instances of the cluster.
At operation 708c, the method 700 includes providing the labeled data points to a supervised machine learning model for training the supervised machine learning model in response to determining the labeled data points for the one or more instances.
At operation 708d, the method 700 includes evaluating accuracy metrics of the supervised machine learning model. At operation 708e, the method 700 includes determining at least one ambiguous instance from the cluster based, at least in part, on the evaluating step.
The sequence of operations of the method 700 need not be necessarily executed in the same order as they are presented. Further, one or more operations may be grouped together and performed in the form of a single step, or one operation may have several sub-steps that may be performed in a parallel or sequential manner.
FIG. 8 is a simplified block diagram of a payment server 800, in accordance with an embodiment of the present disclosure. The payment server 800 is an example of the payment server 106 of FIG. 1. The payment network 104 may be used by the payment server 800, the acquirer server 102, and an issuer server as a payment interchange network. Examples of the payment network 104 may include, but are not limited to, the Mastercard® payment system interchange network. The payment server 800 includes a processing system 805 configured to extract programming instructions from a memory 810 to provide various features of the present disclosure. The components of the payment server 800 provided herein may not be exhaustive and the payment server 800 may include more or fewer components than those depicted in FIG. 8. Further, two or more components may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities. Some components of the payment server 800 may be configured using hardware elements, software elements, firmware elements, and/or a combination thereof.
Via a communication interface 815, the processing system 805 receives an electronic payment transaction record (i.e., “payment transaction data”) from a remote device 820 such as the acquirer server 102. The communication may be achieved through API calls, without loss of generality. The payment server 800 includes a database, such as a transaction database 825. The transaction database 825 may include, but is not limited to, payment transaction data, such as issuer ID, country code, acquirer ID, merchant name, merchant location, etc. In one embodiment, the transaction database 825 stores a plurality of electronic payment transaction records which may include ambiguous instances/entries. The payment server 800 may also perform operations similar to those performed by the server system 108 or the server system 200 for identifying ambiguous instances in electronic payment transaction records. For the sake of brevity, the detailed explanation of the payment server 800 is omitted herein, as it follows the description provided with reference to FIGS. 1 and 2.
FIG. 9 is a simplified block diagram of an acquirer server 900, in accordance with one embodiment of the present disclosure. The acquirer server 900 is associated with an acquirer bank, which may be associated with one or more merchants (e.g., the merchants 112a-112c). The merchant may have established an account to accept payment for the purchase of goods from customers. The acquirer server 900 is an example of the acquirer server 102 of FIG. 1 or may be embodied in the acquirer server 102. Further, the acquirer server 900 is configured to facilitate transactions with an issuer server (not shown) for payment transactions using the payment network 104 of FIG. 1. The acquirer server 900 includes a processing module 905 communicably coupled to a merchant database 910 and a communication module 915. The communication module 915 is configured to receive payment transaction data associated with a payment transaction performed at a merchant terminal. This payment transaction data is stored in the merchant database as electronic payment transaction records and also sent to the payment server 800 via the payment network 104.
The components of the acquirer server 900 provided herein may not be exhaustive, and the acquirer server 900 may include more or fewer components than those depicted in FIG. 9. Further, two or more components may be embodied in one single component, and/or one component may be configured using multiple sub-components to achieve the desired functionalities. Some components of the acquirer server 900 may be configured using hardware elements, software elements, firmware elements, and/or a combination thereof.
Further, the merchant database 910 includes a table which stores one or more merchant parameters, such as, but not limited to, a merchant primary account number (PAN), a merchant name, a merchant ID (MID), a merchant category code (MCC), a merchant city, a merchant postal code, an MAID, a merchant brand name, industry code, merchant URL, merchant ticket size, terminal identification numbers (TIDs) associated with merchant terminals (e.g., the POS terminals or any other merchant electronic devices) used for processing transactions, among others. The processing module 905 is configured to use the MID or any other merchant parameter such as the merchant PAN to identify the merchant during the normal processing of payment transactions, adjustments, chargebacks, end-of-month fees, loyalty programs associated with the merchant and so forth. The processing module 905 may be configured to store and update the merchant parameters in the merchant database 910 for later retrieval. In an embodiment, the communication module 915 is capable of facilitating operative communication with a remote device 920 such as a merchant terminal or a payment server (e.g., the payment server 800).
The methods disclosed with reference to FIGS. 1 to 9, or one or more operations of the methods 600/700, may be implemented using software including computer-executable instructions stored on one or more computer-readable media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (e.g., DRAM or SRAM), or nonvolatile memory or storage components (e.g., hard drives or solid-state nonvolatile memory components, such as Flash memory components)) and executed on a computer (e.g., any suitable computer, such as a laptop computer, net book, Web book, tablet computing device, smart phone, or other mobile computing device). Such software may be executed, for example, on a single local computer or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a remote web-based server, a client-server network (such as a cloud computing network), or other such network) using one or more network computers. Additionally, any of the intermediate or final data created and used during implementation of the disclosed methods or systems may also be stored on one or more computer-readable media (e.g., non-transitory computer-readable media) and are considered to be within the scope of the disclosed technology. Furthermore, any of the software-based embodiments may be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
Although the disclosure has been described with reference to specific exemplary embodiments, it is noted that various modifications and changes may be made to these embodiments without departing from the broad spirit and scope of the disclosure. For example, the various operations, blocks, etc. described herein may be enabled and operated using hardware circuitry (for example, complementary metal oxide semiconductor (CMOS) based logic circuitry), firmware, software and/or any combination of hardware, firmware, and/or software (for example, embodied in a machine-readable medium). For example, the apparatuses and methods may be embodied using transistors, logic gates, and electrical circuits (for example, application specific integrated circuit (ASIC) circuitry and/or in Digital Signal Processor (DSP) circuitry).
Particularly, the server system 200 and its various components such as the computer system and the database may be enabled using software and/or using transistors, logic gates, and electrical circuits (for example, integrated circuit circuitry such as ASIC circuitry). Various embodiments of the disclosure may include one or more computer programs stored or otherwise embodied on a computer-readable medium, wherein the computer programs are configured to cause a processor or computer to perform one or more operations. A computer-readable medium storing, embodying, or encoded with a computer program, or similar language, may be embodied as a tangible data storage device storing one or more software programs that are configured to cause a processor or computer to perform one or more operations. Such operations may be, for example, any of the steps or operations described herein. In some embodiments, the computer programs may be stored and provided to a computer using any type of non-transitory computer readable media. Non-transitory computer readable media include any type of tangible storage media. Examples of non-transitory computer readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g., magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), DVD (Digital Versatile Disc), BD (BLU-RAY® Disc), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash memory, RAM (random access memory), etc.). Additionally, a tangible data storage device may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. In some embodiments, the computer programs may be provided to a computer using any type of transitory computer readable media. 
Examples of transitory computer readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer readable media can provide the program to a computer via a wired communication line (e.g., electric wires, and optical fibers) or a wireless communication line.
Various embodiments of the invention, as discussed above, may be practiced with steps and/or operations in a different order, and/or with hardware elements in configurations, which are different than those which are disclosed. Therefore, although the invention has been described based upon these exemplary embodiments, it is noted that certain modifications, variations, and alternative constructions may be apparent and well within the spirit and scope of the invention.
Although various exemplary embodiments of the invention are described herein in a language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as exemplary forms of implementing the claims.
| # | Name | Date |
|---|---|---|
| 1 | 202041043790-STATEMENT OF UNDERTAKING (FORM 3) [08-10-2020(online)].pdf | 2020-10-08 |
| 2 | 202041043790-POWER OF AUTHORITY [08-10-2020(online)].pdf | 2020-10-08 |
| 3 | 202041043790-FORM 1 [08-10-2020(online)].pdf | 2020-10-08 |
| 4 | 202041043790-FIGURE OF ABSTRACT [08-10-2020(online)].jpg | 2020-10-08 |
| 5 | 202041043790-DRAWINGS [08-10-2020(online)].pdf | 2020-10-08 |
| 6 | 202041043790-DECLARATION OF INVENTORSHIP (FORM 5) [08-10-2020(online)].pdf | 2020-10-08 |
| 7 | 202041043790-COMPLETE SPECIFICATION [08-10-2020(online)].pdf | 2020-10-08 |
| 8 | 202041043790-Correspondence-POA-16-10-2020.pdf | 2020-10-16 |
| 9 | 202041043790-Proof of Right [15-02-2021(online)].pdf | 2021-02-15 |
| 10 | 202041043790-Correspondence, Assignment_19-02-2021.pdf | 2021-02-19 |
| 11 | 202041043790-FORM 18 [01-10-2024(online)].pdf | 2024-10-01 |
| 12 | 202041043790-FER.pdf | 2025-10-13 |
| 1 | 202041043790_SearchStrategyNew_E_Search_strategy_202041043790E_08-10-2025.pdf |