
Methods, Systems And Computer-Readable Media For Distributed Probabilistic Matrix Factorization

Abstract: The present invention provides a method and system for distributed probabilistic matrix factorization. In accordance with a disclosed embodiment, the method may include partitioning a sparse matrix into a first set of blocks on a distributed computer cluster, whereby a dimension of each block is MB rows and NB columns. Further, the method may include initializing a plurality of matrices, including a first mean matrix U, a first variance matrix Ũ, a first prior variance matrix Up, a second mean matrix V, a second variance matrix Ṽ, and a second prior variance matrix Vp, by a set of values from a probability distribution function. The plurality of matrices can be partitioned into further sets of blocks on the distributed computer cluster, whereby each block has the smaller dimension K as its number of columns, and the plurality of matrices can be updated iteratively until a cost function of the sparse matrix converges. (Ref. FIG. 1)


Patent Information

Application #
4292/CHE/2013
Filing Date
23 September 2013
Publication Number
13/2015
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

INFOSYS LIMITED
IP CELL, PLOT NO.44, ELECTRONIC CITY, HOSUR ROAD, BANGALORE - 560 100

Inventors

1. HARI MANASSERY KODUVELY
321, RICHFIELDS APARTMENT, RING ROAD, MARATHAHALLI, BANGALORE - 560 037
2. SARBENDU GUHA
FLAT-SB202, SHRIRAM SMRITHI APARTMENTS, ATTIBELE-SARJAPUR ROAD, BIDARGUPPE VILLAGE, ANEKAL TALUKA, BANGALORE - 562 107
3. ARUN YADAV
101A, MAHAVEER DAZZLE, BASAVANNA NAGAR, MAIN ROAD, HOODI, BANGALORE - 560 048
4. GLADBIN DAVID C
CHUNGATH HOUSE, WADAKKANCHERY ROAD, P.O. KUNNAMKULAM, THRISSUR(DT) - 680 503
5. NAVEEN CHANDRA TEWARI
S/O PRAYAG DATT TEWARI, HOUSE NO: 24, PANCHSHEEL COLONY, MORADABAD 244 001
6. UTKARSH GUPTA
67, PREM NAGAR, BEHIND 20 PREM NAGAR, GWALIOR - 474 002

Specification

METHODS, SYSTEMS AND COMPUTER-READABLE MEDIA FOR DISTRIBUTED PROBABILISTIC MATRIX FACTORIZATION

FIELD OF THE INVENTION

The present invention relates generally to a method and system for distributed collaborative filtering. More specifically, the present invention relates to a method and system for probabilistic matrix factorization in a distributed computing cluster.

BACKGROUND

In an e-commerce scenario, collaborative filtering is a commonly used technology for recommending products to users. In collaborative filtering, similarity between products or similarity between users can be found from the ratings given to the products by the users. Hence, a product which has not been purchased by a user may be recommended to the user based on the ratings given to the product by similar users. Several techniques currently exist for collaborative filtering, such as nearest neighbor methods, probabilistic graphical methods, and matrix factorization methods. However, a limitation of the aforementioned methods lies in recommending products that exist at the long tail of a product spectrum, where the frequency of purchase is low. Due to the low frequency of purchase, historical transaction data in the long tail of the product spectrum is highly sparse, making accurate recommendations difficult. Certain machine learning techniques, such as Bayesian Probabilistic Matrix Factorization and Variational Bayesian Matrix Factorization, attempt to make accurate recommendations for products lying in the long tail of the product spectrum. However, it is usually difficult to scale the aforesaid methods to realistic commercial scenarios, where the numbers of users and products lie in the range of millions. Further, the memory required by a serial processor to execute the Bayesian Probabilistic Matrix Factorization method on a dataset of 1.5 GB is 35 GB of random access memory (RAM), and the time taken for the execution could exceed thirty hours. Hence, unless a retailer in the e-commerce business invests in special hardware, accurate recommendation of the long tail of the product spectrum is difficult. An alternative system and method is therefore required to address the scalability of the Variational Bayesian Matrix Factorization method to the large data sets of the long tail product spectrum in the e-commerce scenario. The alternative system and method must parallelize the algorithm implementing the Variational Bayesian Matrix Factorization method on a distributed computing framework. Thus, a system and method for performing distributed probabilistic matrix factorization is proposed.

SUMMARY

The present invention provides a method and system for distributed probabilistic matrix factorization. In accordance with a disclosed embodiment, the method may include partitioning a sparse matrix into a first set of blocks on a distributed computer cluster, whereby a dimension of each block is MB rows and NB columns. Further, the method may include initializing a first mean matrix U, a first variance matrix Ũ, a first prior variance matrix Up, a second mean matrix V, a second variance matrix Ṽ, and a second prior variance matrix Vp by a set of values from a probability distribution function. The U, the Ũ, and the Up can be partitioned into a second set of blocks on the distributed computer cluster, whereby a dimension of each block is the MB rows and K columns.
The V, the Ṽ, and the Vp can be partitioned into a third set of blocks on the distributed computer cluster, whereby a dimension of each block is the NB rows and the K columns. The U, the Ũ, the Up, the V, the Ṽ, and the Vp can be updated iteratively until a cost function of the sparse matrix converges, whereby each iteration comprises a plurality of MapReduce steps. In an additional embodiment, a system for distributed probabilistic matrix factorization is disclosed. The system comprises an initializing component configured to initialize a plurality of matrices by a set of values from a probability distribution function, whereby the plurality of matrices includes a first mean matrix U, a first variance matrix Ũ, a first prior variance matrix Up, a second mean matrix V, a second variance matrix Ṽ, and a second prior variance matrix Vp. Further, a partitioning component is configured to partition a sparse matrix into a first set of blocks on a distributed computer cluster, whereby a dimension of each block is MB rows and NB columns, and to partition the plurality of matrices into a second set of blocks and a third set of blocks on the distributed computer cluster. The system further includes an updating component configured to update the plurality of matrices iteratively until a cost function of the sparse matrix converges, whereby each iteration comprises a plurality of MapReduce steps. These and other features, aspects, and advantages of the present invention will be better understood with reference to the following description and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating an embodiment of a method for distributed probabilistic matrix factorization. FIG. 2 is a flowchart illustrating a preferred embodiment of a method for distributed probabilistic matrix factorization. FIG. 3 shows an exemplary system for distributed probabilistic matrix factorization. FIG. 4 illustrates a generalized example of a computing environment 400. While systems and methods are described herein by way of example and embodiments, those skilled in the art recognize that systems and methods for distributed probabilistic matrix factorization are not limited to the embodiments or drawings described. It should be understood that the drawings and description are not intended to be limiting to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the appended claims. Any headings used herein are for organizational purposes only and are not meant to limit the scope of the description or the claims. As used herein, the word "may" is used in a permissive sense (i.e., meaning having the potential to) rather than the mandatory sense (i.e., meaning must). Similarly, the words "include", "including", and "includes" mean including, but not limited to.

DETAILED DESCRIPTION

FIG. 1 illustrates a computer-implemented system 100 in accordance with an embodiment of the invention. The system 100 includes a matrix component 102 that represents the sparse matrix X. The matrix X is distributed across a plurality of machines of a distributed computer cluster 104. The distributed computer cluster 104 can be a typical Hadoop framework. However, the Hadoop framework is not to be construed as limiting in any way, as the architecture may be deployed on other suitable frameworks. In order to perform a factorization of the sparse matrix X, the matrix X is partitioned such that parallelism and data locality are maximized.
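To make the partitioning concrete, the following Python sketch (all names and values are illustrative, not taken from the patent) computes which block holds a given rating and which row block holds a given factor row; in a Hadoop deployment the same keying would be done inside map tasks.

```python
# A minimal sketch, assuming 0-based indices (the specification numbers
# blocks from 1): the M x N sparse matrix X is cut into blocks of
# MB rows x NB columns, and the M x K factor matrix U into row blocks
# of MB rows that span all K columns.

def rating_block(i, j, MB, NB):
    """Index (I, J) of the MB x NB block holding observed rating (i, j)."""
    return (i // MB, j // NB)

def factor_row_block(i, MB):
    """Index of the row block of U holding row i; since a block spans
    all K columns, a whole row always lives inside a single block."""
    return i // MB

# Example: with M = N = 1000 and MB = NB = 250, rating (600, 10) and
# row 600 of U land in the same row block, index 2.
print(rating_block(600, 10, MB=250, NB=250))  # -> (2, 0)
print(factor_row_block(600, MB=250))          # -> 2
```

Keying ratings and factor rows by the same row-block index is one way to obtain the data locality mentioned above: tasks that consume a block of X can find the matching rows of U on the same partition.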
Disclosed embodiments provide computer-implemented methods, systems, and computer-program products for distributed probabilistic matrix factorization. More specifically, the disclosed methods and systems implement a Variational Bayesian Probabilistic Matrix Factorization method on a distributed computer cluster, such as a Hadoop cluster, by executing a series of MapReduce operations. FIG. 1 is a flowchart that illustrates a method performed in distributed probabilistic matrix factorization in accordance with an embodiment of the present invention. A set of observable data, also referred to as transaction data, is represented by a sparse matrix X of dimension M rows and N columns. In a particular embodiment, the transaction data may include a plurality of commodities purchased by a plurality of customers, and a plurality of ratings given to the plurality of commodities by the plurality of customers. The plurality of commodities may include products or services intended to be procured by a customer. The sparse matrix X may be approximated as the product of two low rank matrices, U of dimension M rows and K columns and the transpose of a matrix V of dimension N rows and K columns:

$$X \approx U V^{T} \quad \text{(Equation 1)}$$

Here K represents a latent feature space of a lower dimension. Alternatively, K may represent a set of features on whose basis the plurality of products may be categorized, or a set of features depicting categories of customers. In an embodiment, a category of customers shall include customers of like preferences. As per a Variational Bayesian Matrix Factorization method, the probability distributions of the sparse matrix X, of U, and of V may be represented by Equation 2, Equation 3 and Equation 4 as follows:

$$P(X \mid U, V) = \prod_{i=1}^{M} \prod_{j=1}^{N} \mathcal{N}\!\left(x_{ij} \,\Big|\, \sum_{k=1}^{K} u_{ik} v_{jk},\; \sigma_x^{2}\right) \quad \text{(Equation 2)}$$

$$P(U) = \prod_{i=1}^{M} \prod_{k=1}^{K} \mathcal{N}\!\left(u_{ik} \mid \bar{u}_{ik}, \tilde{u}_{ik}\right) \quad \text{(Equation 3)}$$

$$P(V) = \prod_{j=1}^{N} \prod_{k=1}^{K} \mathcal{N}\!\left(v_{jk} \mid \bar{v}_{jk}, \tilde{v}_{jk}\right) \quad \text{(Equation 4)}$$

In the aforesaid equations, the parameters $\bar{u}_{ik}$, $\tilde{u}_{ik}$, $u^{p}_{ik}$, $\bar{v}_{jk}$, $\tilde{v}_{jk}$, $v^{p}_{jk}$, and $\sigma_x^{2}$ are found by solving a set of iterative update equations (Equations 5 through 11). In Equations 1 through 11: $u^{p}_{ik}$ is the (i,k) element of the first prior variance matrix Up, and refers to the prior variance of element (i,k) of the U matrix; $\tilde{u}_{ik}$ is the (i,k) element of the first variance matrix Ũ, and refers to the posterior variance of element (i,k) of the U matrix; $\bar{u}_{ik}$ is the (i,k) element of the first mean matrix U, and refers to the posterior mean of element (i,k) of the U matrix; $v^{p}_{jk}$ is the (j,k) element of the second prior variance matrix Vp, and refers to the prior variance of element (j,k) of the V matrix; $\tilde{v}_{jk}$ is the (j,k) element of the second variance matrix Ṽ, and refers to the posterior variance of element (j,k) of the V matrix; $\bar{v}_{jk}$ is the (j,k) element of the second mean matrix V, and refers to the posterior mean of element (j,k) of the V matrix; and $\sigma_x^{2}$ refers to the observation variance in the data, the X matrix. The aforesaid set of equations can be computed iteratively until a cost function $C_{KL}$ of the sparse matrix X converges to a minimum value. Convergence of the cost function implies that the parameters $\bar{u}_{ik}$, $\tilde{u}_{ik}$, $u^{p}_{ik}$, $\bar{v}_{jk}$, $\tilde{v}_{jk}$, $v^{p}_{jk}$ have reached a fairly accurate approximation of the sparse matrix. Alternatively, the elements of the matrices U and V shall represent an accurate approximation of the sparse matrix X when the cost function converges. Hence, computation of Equations 1 through 11 shall terminate when the cost function converges. The cost function $C_{KL}$ is a cost function due to Kullback-Leibler (KL) divergence, and is a sum of three distinct component costs, $C_{KL}^{X}$, $C_{KL}^{U}$, and $C_{KL}^{V}$.
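The bodies of Equations 5 through 11 are not legible in this copy of the record. For orientation only, the following is a minimal reconstruction of what fully factorized variational Bayesian updates for the model of Equations 2 to 4 typically look like, assuming a zero-mean Gaussian prior with per-element variances $u^{p}_{ik}$ and $v^{p}_{jk}$; the patent's exact equations may differ in detail. Here $\Omega_i$ denotes the set of columns $j$ with $x_{ij}$ observed, $\Omega_j$ the set of rows $i$ with $x_{ij}$ observed, and $\Omega$ all observed cells:

$$\tilde{u}_{ik} = \left( \frac{1}{u^{p}_{ik}} + \frac{1}{\sigma_x^{2}} \sum_{j \in \Omega_i} \left( \bar{v}_{jk}^{2} + \tilde{v}_{jk} \right) \right)^{-1}, \qquad \bar{u}_{ik} = \frac{\tilde{u}_{ik}}{\sigma_x^{2}} \sum_{j \in \Omega_i} \bar{v}_{jk} \left( x_{ij} - \sum_{k' \neq k} \bar{u}_{ik'} \bar{v}_{jk'} \right)$$

$$\tilde{v}_{jk} = \left( \frac{1}{v^{p}_{jk}} + \frac{1}{\sigma_x^{2}} \sum_{i \in \Omega_j} \left( \bar{u}_{ik}^{2} + \tilde{u}_{ik} \right) \right)^{-1}, \qquad \bar{v}_{jk} = \frac{\tilde{v}_{jk}}{\sigma_x^{2}} \sum_{i \in \Omega_j} \bar{u}_{ik} \left( x_{ij} - \sum_{k' \neq k} \bar{u}_{ik'} \bar{v}_{jk'} \right)$$

$$\sigma_x^{2} = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \mathbb{E}\!\left[ \left( x_{ij} - \sum_{k=1}^{K} u_{ik} v_{jk} \right)^{2} \right]$$

Each sum runs only over observed entries, which is what makes the per-block, per-iteration MapReduce computation described next feasible; analogous closed-form updates exist for the prior variances.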
In order to scale the computation of the aforementioned equations to large data, parallelization of the equations is required. As the equations are iterative, the parallelization can be done within the computation of each iteration step. Hence, the computation of the plurality of elements of the matrices U, Ũ, Up, V, Ṽ, and Vp, of the observation variance $\sigma_x^{2}$, and of the cost function $C_{KL}$ at each step of the iteration can be parallelized on a distributed computer cluster, such as one using a Hadoop framework, in order to handle voluminous data. Each iteration can include processing a sequence of MapReduce steps, as sketched after this excerpt.

At step 102, the sparse matrix can be partitioned into a first set of blocks, by a MapReduce step, on a distributed computer cluster such as a Hadoop cluster. A dimension of each block is MB rows and NB columns. Each block may be indexed by parameters I and J, where I can range from 1 to M' = M/MB, and J can range from 1 to N' = N/NB. At step 104, the first mean matrix U, the first variance matrix Ũ, the first prior variance matrix Up, the second mean matrix V, the second variance matrix Ṽ, and the second prior variance matrix Vp are initialized by a set of values from a probability distribution function. Further, at step 106, the U, the Ũ, and the Up are partitioned into a second set of blocks by a MapReduce step, where a dimension of each block can be MB rows and K columns. Typically, each block of this matrix shall have a width of the K columns, implying that each row shall exist within a single block. An index I can represent each partitioned block of the U, the Ũ, and the Up, and an index i
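The per-iteration MapReduce structure referenced above can be pictured with the following self-contained Python sketch of the partitioning step 102, with plain generators standing in for Hadoop map and reduce tasks; all names and toy values are illustrative, not the patent's code:

```python
from itertools import groupby

MB, NB = 2, 2   # toy block dimensions

def map_rating(record):
    """Map task for step 102: key each observed rating (i, j, x)
    by the index (I, J) of the MB x NB block that holds it."""
    i, j, x = record
    yield (i // MB, j // NB), (i, j, x)

def reduce_block(key, values):
    """Reduce task: all ratings sharing a block key form one block
    of the first set of blocks."""
    yield key, sorted(values)

# Simulate the map-shuffle-reduce cycle for a handful of ratings.
ratings = [(0, 0, 5.0), (1, 3, 3.0), (2, 1, 4.0), (3, 3, 1.0)]
pairs = sorted(kv for r in ratings for kv in map_rating(r))
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    for block in reduce_block(key, [v for _, v in group]):
        print(block)   # e.g. ((0, 0), [(0, 0, 5.0)]) ...
```

The subsequent MapReduce steps of each iteration would consume these blocks together with the co-partitioned row blocks of U, Ũ, Up and column blocks of V, Ṽ, Vp, following the same key-by-block pattern.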

Documents

Application Documents

# Name Date
1 4292-CHE-2013 FORM-3 23-09-2013.pdf 2013-09-23
2 abstract4292-CHE-2013.jpg 2014-07-09
3 4292-CHE-2013 FORM-1 30-05-2014.pdf 2014-05-30
4 4292-CHE-2013 FORM-2 23-09-2013.pdf 2013-09-23
5 4292-CHE-2013 ABSTRACT 23-09-2013.pdf 2013-09-23
6 4292-CHE-2013 FORM-1 23-09-2013.pdf 2013-09-23
7 4292-CHE-2013 CLAIMS 23-09-2013.pdf 2013-09-23
8 4292-CHE-2013 DRAWINGS 23-09-2013.pdf 2013-09-23
9 4292-CHE-2013 DESCRIPTION (COMPLETE) 23-09-2013.pdf 2013-09-23
10 4292-CHE-2013 CORRESPONDENCE OTHERS 23-09-2013.pdf 2013-09-23