Determination Of Contiguous And Connected Association Cluster


Abstract: A process for determining or revealing new associations from among a plurality of attributes selected from a set of transactional process data comprises a method of converting attribute data into consistent, itemized data through an interactive bucketing process, computing association rule parameters and further displaying mined associations in an intelligent visualization setup to easily identify associations. Association rules are displayed in a visually engaging two dimensional matrix such that areas or regions showing association are instantly visible. Further, higher level association rules can be uncovered or drilled-down by adopting efficient and fast computational techniques based on the defined rules and the resulting visualization is realized through a contiguous region or connected set.


Patent Information

Filing Date: 27 September 2011
Publication Number: 22/2014
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Grant Date: 2019-01-02

Applicants

TATA CONSULTANCY SERVICES LIMITED

Nirmal Building 9th Floor Nariman Point Mumbai Maharashtra

Inventors

1. MOHANTY Santosh Kumar

TCS Gateway Park Andheri east Mumbai Maharashtra 400093

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10, rule 13)
1. Title of the invention: Determination of Contiguous and Connected Association Cluster
2. Applicant(s)
NAME: TATA CONSULTANCY SERVICES LIMITED
NATIONALITY: INDIA
ADDRESS: Nirmal Building, 9th Floor, Nariman Point, Mumbai Maharashtra 400021 India
3. Preamble to the description
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it
is to be performed.

Field of Invention
[0001] The present invention relates in general to the field of data mining, more particularly towards a method and system for determination of an association cluster and a corresponding visualization of such associations between data attributes identified from a large set of transactional data to aid any entity in their strategic decision making process.
Background of the Invention
[0002] Data mining in the computing environment comprises analyzing or extracting useful information from large datasets to determine patterns, associations and sequence. Association rule mining is one of the many techniques of data mining that has been widely used to derive association between items of a large dataset. As an example, the retail industry has widely been using market basket analysis based on customer transactions. The premise of the market basket analysis is that if customers purchase a certain group of items in their shopping baskets, they are likely to buy another set of items also. The objective of market basket analysis is to arrive at a relationship between items purchased by customers. Analysis of this data can help retailers in deciding placement of goods on shelves, planning store layout, devising marketing campaigns, or rolling out special schemes and prices for goods purchased. In another example, association rule can be used very effectively in clustering web pages based on context. As an information seeker, once we provide content for search, web crawlers can intelligently tap a set of web pages that has content specific association with the prime web page.
[0003] Association rules are of the form A->B, where A and B are sets of attributes or variables. The variable on the left hand side (LHS) is called the antecedent and the variable on the right hand side (RHS) is called the consequent. For example, information that 70% of the customers who buy bread and egg also buy cereal reveals an association between items purchased. The strength of an association is determined based on two parameters: support and confidence. Support is the count of transactions where A and B both appear; confidence is the ratio of count(A and B) to count(A). Based on the example above, a supermarket will try to find associations between the items purchased by a customer in each transaction. Using such information, supermarkets may decide to place bread, eggs and cereal in adjacent areas or may roll out coupons to promote sales of such closely associated items.
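The two parameters can be illustrated with a minimal sketch. The function name `support_and_confidence` and the basket data are hypothetical and invented for illustration; only the definitions of support (count of transactions containing both A and B) and confidence (that count divided by the count of transactions containing A) come from the text.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support count, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions that contain every antecedent item.
    count_a = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain both the antecedent and the consequent items.
    count_ab = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    confidence = count_ab / count_a if count_a else 0.0
    return count_ab, confidence

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"bread", "egg", "cereal"},
    {"bread", "egg", "cereal", "milk"},
    {"bread", "egg"},
    {"bread", "milk"},
    {"egg", "cereal"},
]
support, confidence = support_and_confidence(baskets, {"bread", "egg"}, {"cereal"})
print(support, round(confidence, 2))  # 2 0.67
```

Here three baskets contain both bread and egg, and two of those also contain cereal, so the rule has a support count of 2 and a confidence of 2/3.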
[0004] Most organizations and business entities rely on historical data for forecasting and for predictive analytics. Due to large volumes of data, it is important for organizations to get the correct insight for making the right business decisions. Additionally, stakeholders and decision makers should also be able to observe any change in pattern or shift in values for attributes, view strong trends and correlation, and manage data in discrete sets or ranges to identify associations that occur in large datasets.
[0005] Currently, few methods exist that determine association between contiguous or connected attributes through a computed parameter and generate a visualization grid containing two antecedent attributes and a consequent attribute with its corresponding values to display the association. A common technique is to reproduce associations in a graph (2D or 3D) to indicate one-to-one or one-to-many relationships along with computed parameters such as confidence and support. This type of visualization tends to become cluttered as the number of items increases. There is a need for an intuitive visualization method that is capable of displaying connected regions so that they are easily identifiable and indicate the relative strength of the association.
Summary
[0006] The present invention comprises a system and method for determining contiguous and connected association clusters from among a set of attributes belonging to any business transaction. In one implementation, attribute values belonging to an attribute type are segregated or partitioned into buckets, after which a computation operation is performed for data analysis to determine association between itemized antecedent and consequent attribute values. This association is then displayed in a Decision Right matrix, which in turn is capable of generating different types of views that help in drilling down or deriving higher level associations to facilitate decision making.
Brief Description of Drawings
[0007] Further objects, embodiments, features and advantages of the present invention will be more apparent when read together with the detailed description and the accompanying drawings. The objective of the drawings included herein is to lay more emphasis on understanding the underlying principle of the invention. The manner in which the drawings are presented in no way limits the scope of the invention and the advantages one can garner from the embodiments of the present invention.
[0008] The detailed description is described with reference to the accompanying figure(s). Different numeral references on figures designate corresponding elements throughout different views. In the figure(s), the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference number in different figure(s) indicates similar or identical items.
[0009] Figure 1 illustrates the computing setup comprising computing devices that communicate with a source data over a network for one of the embodiments of the present invention.

[00010] Figure 2 provides details of the association cluster and visualization system that comprises an input data interface, source data interaction module, computational process memory and the display module according to one of the embodiments of the present invention.
[00011] Figure 3 represents a flow diagram for data processing, computation and visualization according to one of the embodiments of the present invention.
[00012] Figure 4 describes the components of the visualization module referred to in figure 2 according to one of the embodiments of the present invention.
Detailed Description
[00013] For the purpose of gaining an understanding into the underlying principles of the invention and its various features, reference will be made to the embodiments illustrated in the drawings. Also it is important to note that the detailed description presented herein does not intend to dilute or limit the scope of the invention. Any alterations and further modifications in the described embodiments and any further application of the principles of the invention as described herein are construed as it would normally occur to one skilled in the art to which the invention relates.
[00014] One embodiment of the invention is realized as a product being used in a computing environment. Software programs for such a product may comprise routines and subroutines / source codes written in a computer language to perform data computation and visualization, described as the embodiment of the present invention. Such a product can be contained on a variety of computer-readable or computer-rewritable media, wherein such media may include, but are not restricted to CD-ROM, DVD-ROM, floppy disks, hard disk drive or diskette drive, other form of storage media or others, on which alterable/upgradable information can be stored.
[00015] Figure 1 represents an exemplary computing environment 100 for implementation of the method and system for the intelligent association visualization of contiguous variables, according to one of the embodiments of the invention. In said embodiment the computing environment 100 comprises a plurality of user devices 101-1, 101-2, ..., 101-n, collectively referred to as user devices 101. The plurality of user devices 101, through an association cluster and visualization (ACV) system 102, interact with a source data 104 over a communication network 103. The ACV system in turn communicates with user devices over a communication network 103. The source data 104 and user devices 101 may be implemented in a variety of conventional computing devices, including but not restricted to devices such as a desktop, notebook or portable computer, tablet computer, a mainframe computer, a mobile computing device, an entertainment device, a computing platform, an internet appliance, servers and similar systems. The association cluster and visualization system 102 transforms data received from the source data so as to determine association rules and renders visualized data. The source data 104 and the user devices 101 can be arranged as a client server configuration, part of a distributed computing environment, a cloud computing environment or any similar such environment apparent to the persons skilled in the art to execute routines, process data, perform visualization as illustrated in the various embodiments of the invention. The source data 104 can also be external such as a server or any such similar external devices, or can also be embedded or residing within the user device. Source data 104 also encompasses data that can be a manual input through the user device, digital, or unstructured forms of data or any other form of data.
[00016] However, a person skilled in the art will acknowledge that the embodiment of the invention is not limited to any particular computing system, architecture or application device, as it may be adapted to take advantage of new computing system, environment and platform as they become accessible.
[00017] The source data 104 is connected to the user device 101 through a communication network 103 that can include a Local Area Network (LAN), Wide Area Network (WAN), a wireless or wired connection or any other type as it would occur to those skilled in the art. Figure 1 is a simplified illustration of the setup; additional components and features described herein are not included for sake of clarity and brevity.
[00018] Association rule mining, a data mining technique, is used to arrive at correlations between existing data. These correlations have been useful especially in the retail sector to identify and cater to purchase patterns of consumers. However, analysis need not be restricted to the retail or ecommerce sector alone - it can be extended to any business scenario. Any type of data can be used by business entities to derive associations in order to make critical business decisions and forecasts.
[00019] In one of the embodiments, data such as size of a business, loan amount borrowed, and repaying capacity can be collected in order to find an association between (size of business, loan amount borrowed) and repaying capacity. Such data will be useful to financial institutions before disbursing loans to businesses. Similarly, correlation between age, number of dependants and vehicle owned will be useful for sales managers in an automobile manufacturing setup to identify the propensity of a part of a population to own a vehicle and subsequently, after making an informed decision, target a group of potential buyers.
[00020] For a set of transactions, each transaction contains values that correspond to certain attributes associated with that transaction. A two dimensional association rule is denoted as {A, B => C}, where A, B and C are distinct attributes that take values A=a, B=b, C=c. Association rule mining is a process of evaluating all rules given the values of the pair of attributes A and B with respect to a fixed value of attribute C. Given the syntax (A, B => C), A and B are called antecedents or the left hand side, and C is called the consequent, business goal or the right hand side.
[00021] Consider a rule A=a, B=b => C=c. Let $N_a$ and $N_b$ represent the sets of values associated with attributes A and B. G denotes a two dimensional grid containing values $N_a$ and $N_b$. Let $g_{ij}$ denote the $(i,j)$-th pixel of the grid, and $u_{ij}$ denote the number of transactions mapped to $g_{ij}$ (i.e., the transactions where A = i and B = j). This is also the base support (BS), the count of transactions where A=i and B=j. Parameter $v_{ij}$ denotes the number of successful transactions mapped to $g_{ij}$, that is, transactions where A = i, B = j and C = c, also called Rule Support (RS).
In order to generate relevant associations, two parameters, minimum support and minimum confidence, have to be factored in. Minimum support Min_Supp is the minimum support, expressed as a percentage, of the count of transactions where A=i, B=j, or $u_{ij}$, for rule generation. Minimum confidence Min_Conf, also expressed as a percentage, is the percentage of the count of all transactions where A=i, B=j and C=c, or $v_{ij}$.
Gain($g_{ij}$) is defined as $\mathrm{Gain}(g_{ij}) = v_{ij} - \mathrm{Min\_Conf} \times u_{ij}$.
Base Support (BS) for a region P of the grid G is defined as $\sum u_{ij}$ for all pairs $(i,j)$ where $g_{ij}$ belongs to P.
Rule Support (RS) for a region P of the grid G is defined as $\sum v_{ij}$ for all pairs $(i,j)$ where $g_{ij}$ belongs to P.
The valid association rules are those (A=i, B=j => C=c) for which the percentage value of $u_{ij}$ meets Min_Supp and the percentage value of $v_{ij}$ meets Min_Conf.
Net Gain is defined as $\sum \mathrm{Gain}(g_{ij}) = \sum (v_{ij} - \mathrm{Min\_Conf} \times u_{ij})$, taken over cells where $u_{ij} > \mathrm{Min\_Supp}$ and $v_{ij} > \mathrm{Min\_Conf}$.
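The definitions above can be sketched in a few lines. This is an illustrative sketch only: the function names, the grid values and the choice of region are hypothetical, Min_Conf is passed as a fraction (0.5 for 50%), and the region is supplied by the caller rather than derived from the thresholds.

```python
def cell_gain(u, v, min_conf):
    """Gain of one grid cell: v - min_conf * u, with min_conf as a fraction."""
    return v - min_conf * u

def region_summary(U, V, region, min_conf):
    """Base support, rule support and summed gain over a region of (i, j) cells."""
    base_support = sum(U[i][j] for i, j in region)
    rule_support = sum(V[i][j] for i, j in region)
    gain = sum(cell_gain(U[i][j], V[i][j], min_conf) for i, j in region)
    return base_support, rule_support, gain

# Hypothetical 2 x 3 grid of counts (not taken from the specification).
U = [[4, 10, 6],
     [2, 8, 5]]   # u_ij: transactions with A=i, B=j
V = [[1, 7, 4],
     [0, 5, 1]]   # v_ij: of those, transactions with C=c
region = [(0, 1), (0, 2), (1, 1)]
bs, rs, gain = region_summary(U, V, region, min_conf=0.5)
print(bs, rs, gain)  # 24 16 4.0
```

Because gain is linear in the counts, the summed gain of a region equals RS minus Min_Conf times BS for that region (16 - 0.5 * 24 = 4 here).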
[00022] As per one of the embodiments of the present invention, consider the following example. Let attribute A represent age with value 30, attribute B represent income with value 6 lacs, and attribute C represent type of car owned with value Basic car. Let Minimum Support and Minimum Confidence be set at 10% and 50% respectively. Put simply, the base support is the count of instances where age is 30 and income is 6 lacs, and the minimum support requires this count to be at least 10% of all instances. A confidence value of 50% requires that the count of instances where 30 year olds having an income of 6 lacs own a Basic car be at least 50% of the base support.
[00023] Figure 2 is a conceptual representation of various components of the ACV system 102 that work in tandem to generate association rules and perform computation necessary to visualize the rule. Association Cluster and Visualization System 200 comprises an Input Data Interface 201 that interfaces with source data 104, and a Source Data Interaction Module 202, which interacts with any source of data, which in turn can be a server, a distributed group of servers or any such mass storage device containing process related or transactional data 202-1. The source data interaction module also contains a Verification Module 203 whose objective is to verify if the attributes or type of attributes selected is present in the source data. Computational process memory 204 performs calculations and computations in order to determine parameters such as base support and rule support, after receiving minimum support and minimum confidence as input. The computational process memory 204 in turn comprises two modules that support calculation of parameters - a frequency distribution processor 205, and gain computation module 206. Frequency distribution processor 205 itemizes attribute data and computes the frequency of occurrence of the itemized attribute data that are greater than the minimum support and minimum confidence levels defined. Output of the frequency distribution processor 205 is communicated to the gain computation module 206, which calculates the gain that will be published in a grid display. These outputs are again communicated to the Display Module 207, where the antecedent and consequent values are visually represented in a plurality of views in a two dimensional matrix.
[00024] Figure 3 elaborates the processing steps 301 to 309 carried out on the input data which in turn is depicted using a flowchart 300. The computer implemented method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. The method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[00025] The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or an alternate method. Additionally, individual blocks may be deleted from the method without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
[00026] In the present embodiment of the invention, in order to extract association rules, in the first step 301, said Association Cluster and Visualization System 200 connects to an identified source data containing a set of business transactions or accepts manual input from the user through the input data interface 201. Input attributes pertaining to the selected business process are selected as antecedent attributes 302. Similarly, another input attribute that serves as a business goal is selected as a consequent 303.

[00027] In order to comprehend one of the embodiments of the present invention, consider the following example where an automobile manufacturer wants to find out which segment of consumers has the propensity to purchase a certain type of car. From a sample set of records, attributes available for analysis are: Age, Yearly Income, Occupation, Residence Area, Dependants and Type of Car. Refer to Table 1 for a sample set of transactions.
Table 1: Sample set of transactions

Rec. No  Age  Income (lacs)  Occupation  Residence  Dependant  Car Type
1        25   4              Govt        Urban      3          Basic
2        63   10             Farmer      Rural      4          Economy
3        47   15             SMB         City       7          SUV
4        34   15             Company     Metro      4          Economy
5        30   10             Company     Metro      3          Economy
6        56   18             Govt        City       6          SUV
7        40   20             SMB         City       5          SUV
8        27   18             Company     Metro      3          SUV
9        42   25             Company     Metro      5          Economy
10       45   30             SMB         Rural      6          SUV
11       48   15             Govt        Urban      6          SUV
12       32   16             Company     Metro      5          Economy
13       37   10             Farmer      Rural      5          Economy
14       65   27             Farmer      Rural      3          Economy
15       35   14             Govt        Urban      4          Economy
16       28   30             Company     Metro      3          SUV
17       24   5              Company     Urban      1          Economy
18       60   19             SMB         Metro      4          Economy
19       47   43             SMB         Metro      5          SUV
20       38   41             Company     Metro      3          SUV
21       42   16             Govt        Urban      4          Economy
22       24   8              Company     City       2          Economy
23       50   27             Farmer      Rural      4          Economy
24       41   15             Govt        Rural      5          Basic
25       33   20             SMB         City       4          Basic
[00028] The association thus generated by selecting a combination of attributes helps an automobile manufacturer to get higher visibility and plan a sales strategy. In one example, the attributes selected may be Age, Income and Type of Car owned, where antecedent attributes are Age and Income, while the consequent is Type of Car owned. This analysis will help determine which age group belonging to a certain income band owns which type of car. An example of such an association may be a rule {Age (30-40), Income (6L-8L) => Car (SUV Car)}. In other words, an association between attributes representing age and income to determine preference in type of vehicle can be determined by assigning values to these attributes. That means, analyzing how many persons belonging to an age group between 30 to 40 years and earning an income between 6 lacs to 8 lacs prefer an SUV over a basic or economy car will enable an automobile manufacturer to align their sales strategy to target such an age and income band.
[00029] Before analyzing associations, an important step lies in data preparation or itemizing data. This is because attributes selected for deriving associations can be from any of the data types - nominal, categorical or continuous. For effective analysis, values of these attributes need to be translated and represented in an itemized form. The itemized form can be a set of contiguous integers that can take any value between 1 to M.
[00030] Nominal values encompass a discrete set of values. Some discrete values can be grouped into a set, and each set can be mapped to any item number that belongs to an item set ranging from 1 to M. For example, consider attribute 'Number of Dependants', for which, based on user specified input, a group of sets can be created, each having a lower limit and an upper limit, such as (1-3), (4-5), (6-7). Each of these ranges can be mapped to an item set comprising elements (1, 2, 3). If a record of the attribute 'Number of Dependants' from a set of transactions has attribute value 6, the item number assigned is 3. Similarly, for a categorical attribute, a non-empty set of categorical values maps to an item number. For example, the category values defined can be Basic, Economy, SUV, where each value maps to item number (1, 2, 3) respectively. For a continuous attribute, the given range is partitioned into a certain number of sub-ranges and each sub-range is mapped to a given item number. For example, attribute age can have a lower limit 20 and upper limit 70, which can be partitioned into non-overlapping sub-ranges (20-30), (30-40), (40-50), (50-60), (60-70), and each sub-range can correspond to item numbers (1, 2, 3, 4, 5) respectively from the item set. Thus, any value belonging to any data type can be itemized.
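A minimal sketch of this itemization for the grouped discrete and categorical cases is given below. The function names are hypothetical; the ranges and categories are the ones used in the paragraph above. Continuous sub-ranges that share end points need an explicit boundary convention and are not handled here.

```python
def itemize_by_range(value, ranges):
    """Return the 1-based item number of the first inclusive range holding value."""
    for item_number, (low, high) in enumerate(ranges, start=1):
        if low <= value <= high:
            return item_number
    raise ValueError(f"{value} falls outside every range")

def itemize_category(value, categories):
    """Return the 1-based item number of a categorical value."""
    return categories.index(value) + 1

dependant_ranges = [(1, 3), (4, 5), (6, 7)]   # ranges from the example above
car_types = ["Basic", "Economy", "SUV"]       # categories from the example above

print(itemize_by_range(6, dependant_ranges))  # 3
print(itemize_category("SUV", car_types))     # 3
```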
[00031] In the example cited in relation to Table 1, the influencing attributes, or antecedents, are age and income. Type of car is the consequent attribute that has a fixed value. Age and Income attributes can be continuous attributes that can have a range of values. Continuous attributes can have a lower limit and an upper limit, which in turn can be user specified inputs received through the Input Data Interface 201. Type of car is a categorical attribute, and in this case the category defined is SUV Car. Step 304 of method 300 performs itemization of the selected attributes.
[00032] Age and income, being continuous attribute types, have an upper limit and a lower limit. Age range considered for an association is between 20 and 70. Income range is between 2 lacs and 100 lacs. This range can be broken into sub-ranges to create synthetic partitions for itemization. This kind of partitioning can be performed using an interactive frequency distribution (IFD) technique. The objective of IFD is to obtain frequency distribution of a given dataset, interactively. As per one of the embodiments of the present invention, three types of data partitioning techniques have been incorporated into the present invention.
[00033] Equi-distance Frequency Distribution technique wherein distribution is obtained by dividing the range into equal subintervals. The number of subintervals is equal to the number of buckets chosen. As an example, the age attribute can be analyzed on the basis of equi-distance distribution, wherein 5 buckets are chosen (1,2,3,4,5) and each bucket has an equal range {(20-30),(30-40),(40-50),(50-60),(60-70)}.
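The equi-distance partition can be sketched as follows. The function names are hypothetical, and the boundary convention (each bucket closed on the left and open on the right, with the last bucket closed on both sides) is taken from the Age mapping listed with Table 2.

```python
def equi_distance_edges(low, high, buckets):
    """Split [low, high] into `buckets` equal-width sub-intervals; return the edges."""
    width = (high - low) / buckets
    return [low + k * width for k in range(buckets + 1)]

def bucket_of(value, edges):
    """1-based bucket for value; intervals are [edge_k, edge_k+1), last one closed."""
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside the range")
    for k in range(1, len(edges) - 1):
        if value < edges[k]:
            return k
    return len(edges) - 1

edges = equi_distance_edges(20, 70, 5)
print(edges)                 # [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
print(bucket_of(43, edges))  # 3
```

With this convention an age of exactly 30 falls in bucket 2 and an age of 70 in bucket 5.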
[00034] Equi-distributed Frequency Distribution technique wherein the technique comprises equally distributing a sample set across all buckets, such that each bucket has the same frequency (subject to integer division) but may have unequal intervals. The number of buckets can be selected by the user.
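An equal-frequency partition of this kind might be sketched as below. The function name is hypothetical, and two details are assumptions not fixed by the text: any remainder from the integer division is given to the earliest buckets, and tied values may end up in adjacent buckets.

```python
def equi_distributed_buckets(values, buckets):
    """Assign sorted values to `buckets` groups of near-equal size.

    Returns a list of (low, high, count) per bucket. Earlier buckets absorb the
    remainder when len(values) is not divisible by buckets (an assumption).
    """
    if buckets < 1 or buckets > len(values):
        raise ValueError("buckets must be between 1 and the number of values")
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), buckets)
    result, start = [], 0
    for k in range(buckets):
        end = start + size + (1 if k < remainder else 0)
        chunk = ordered[start:end]
        result.append((chunk[0], chunk[-1], len(chunk)))
        start = end
    return result

# Ages of the first ten records of Table 1.
ages = [25, 63, 47, 34, 30, 56, 40, 27, 42, 45]
for low, high, count in equi_distributed_buckets(ages, 5):
    print(low, high, count)
```

Each of the five buckets receives two ages, while the spans differ (25 to 27 for the first bucket, 56 to 63 for the last), which is the unequal-interval behaviour described above.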
[00035] The third option, User-defined (Generic) comprises manually inputting the number of buckets and range for each bucket. However, the range selected must be distinct, that is, intersection of ranges of any two distinct buckets is NULL and the union of ranges of all the buckets is the range of the dataset. In addition, the range set satisfies a strict partial order. This is a preferred option if the dataset has some specific distribution characteristics that are known to analysts and the analysis based on this pattern is more meaningful with respect to applicability of results. In the current example, yearly income is distributed manually wherein the sample set is divided into 6 buckets with each bucket comprising a different range. The buckets chosen are (1,2,3,4,5,6) and distribution across each bucket is {(2-6),(6-12),(12-20),(20-30),(30-60),(60-100)}. This type of distribution enables a focus on a particular range based on societal or economic pattern in order to derive a strong association rule.
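The two conditions on user-defined buckets (pairwise disjoint ranges whose union is the full range) can be checked with a short sketch. The function name is hypothetical, and treating each range as half-open [start, end) is an assumption, since the text does not fix how shared end points are handled.

```python
def validate_partition(ranges, low, high):
    """Check that (start, end) ranges tile [low, high] with no gaps or overlaps.

    Each range is treated as half-open [start, end); this boundary convention
    is an assumption made for the sketch.
    """
    if not ranges:
        return False
    ordered = sorted(ranges)
    if ordered[0][0] != low or ordered[-1][1] != high:
        return False
    for (start, end), (next_start, _) in zip(ordered, ordered[1:]):
        # An empty range, a gap, or an overlap all break the partition.
        if start >= end or end != next_start:
            return False
    return ordered[-1][0] < ordered[-1][1]

income_ranges = [(2, 6), (6, 12), (12, 20), (20, 30), (30, 60), (60, 100)]
print(validate_partition(income_ranges, 2, 100))                 # True
print(validate_partition([(2, 6), (5, 12), (12, 100)], 2, 100))  # False
```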
[00036] The input data interface 201 takes in any one of the frequency distribution methods as input. This input is then adopted by step 304 of method 300 to generate itemized data. After selecting a frequency distribution method as input, the number of partitions or buckets is also received as a further input according to step 304. The number of buckets is set to the default value, which can be changed as per user requirement. By default, the IFD computes frequency distribution of a given dataset based on equidistance partition of the range with respect to a fixed number of buckets.
[00037] This computation is performed in the computation memory 204, by the Frequency Distribution Processor 205. The number of buckets is derived from data size internally and it is fixed for a fixed data size. Once the partitioning principle and number of partitions is fed to the method, it generates items under each partition and their counts. This means that for a transaction where a person's age is 43, income is 8L, occupation is farmer, residence is rural, dependants are 5 and car type is Basic, the itemized values assigned to this transaction after selecting a frequency distribution method will be {(Age=3, Income=2, Occupation=1, Residence=1, Dependant=2, Car Type=1)}. Refer to Table 2 where data from the sample set of transactions has been itemized, and readied for further processing, that is, application of the association rule and visualization.
Table 2: Itemized Transaction Dataset

Rec. No  Age  Income  Occupation  Residence  Dependant  Car Type
1        1    1       2           2          1          1
2        5    2       1           1          2          2
3        3    3       4           3          3          3
4        2    3       3           4          2          2
5        2    2       3           4          1          2
6        4    3       2           3          3          3
7        3    4       4           3          2          3
8        1    3       3           4          1          3
9        3    4       3           4          2          2
10       3    4       4           1          3          3
11       3    3       2           2          3          3
12       2    3       3           4          2          2
13       2    2       1           1          2          2
14       5    4       1           1          1          2
15       2    3       2           2          2          2
16       1    5       3           4          1          3
17       1    1       3           2          1          2
18       5    3       4           4          2          2
19       3    5       4           4          2          3
20       2    5       3           4          1          3
21       3    3       2           2          2          2
22       1    2       3           3          1          2
23       4    4       1           1          2          2
24       3    3       2           1          2          1
25       2    4       4           3          2          1

Mapping Description
Age: 1, 2, 3, 4, 5 = [20,30), [30,40), [40,50), [50,60), [60,70]
Yearly Income: 1, 2, 3, 4, 5, 6 = 2-6, 6-12, 12-20, 20-30, 30-60, 60-100
Occupation: 1, 2, 3, 4 = Farmer, Govt., Company, SMB
Residence: 1, 2, 3, 4 = Rural, Urban, City, Metro
Dependant: 1, 2, 3 = [1,3], [4,5], [6,7]
Car Type: 1, 2, 3 = Basic, Economy, SUV
[00038] After obtaining itemized data, as per step 305 of the present embodiment of the invention, the variable input parameters, minimum support and minimum confidence, serve as input to the computation memory 204 and further computations are performed by the Gain Computation Module 206. As an example, let minimum support be set at 5% and minimum confidence be set at 30%. All association rules (Age = i, Income = j) => (Car = Economy Car), for all values of 'i' and 'j' in their ranges where support exceeds the 5% Min_Supp and confidence exceeds the 30% Min_Conf, are identified in step 305.
[00039] Further to receiving the minimum support and minimum confidence as variable input parameters, other parameters that drive application of the rule and thereinafter the visualization are computed 306. These parameters are Base Support and Rule Support. Base Support returns a count of instances where the itemized values of the antecedent attributes occur in a single record from a set of transactions. Base support values are published in a base support table for each itemized attribute as a count value and as a percentage, where percentage is {(count of Base Support)/Total number of records}*100. Similarly, rule support is calculated, where rule support is a count of instances where itemized values of antecedent attributes and consequent attribute occur in a single record from a set of transactions.
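As an illustration, base support and rule support can be counted directly from the itemized records. In this sketch the function name is hypothetical, and the (Age, Income, Car Type) triples are transcribed from Table 2, with the income item of record 10 read as 4.

```python
from collections import Counter

# (Age, Income, Car Type) item numbers for records 1-25, transcribed from Table 2.
records = [
    (1, 1, 1), (5, 2, 2), (3, 3, 3), (2, 3, 2), (2, 2, 2),
    (4, 3, 3), (3, 4, 3), (1, 3, 3), (3, 4, 2), (3, 4, 3),
    (3, 3, 3), (2, 3, 2), (2, 2, 2), (5, 4, 2), (2, 3, 2),
    (1, 5, 3), (1, 1, 2), (5, 3, 2), (3, 5, 3), (2, 5, 3),
    (3, 3, 2), (1, 2, 2), (4, 4, 2), (3, 3, 1), (2, 4, 1),
]

def support_tables(records, consequent_item):
    """Count base support per (age, income) cell and rule support for one consequent."""
    base = Counter((age, income) for age, income, _ in records)
    rule = Counter((age, income) for age, income, car in records if car == consequent_item)
    return base, rule

base, rule = support_tables(records, consequent_item=2)  # 2 = Economy
cell = (2, 3)  # Age bucket 2, Income bucket 3
percent = base[cell] / len(records) * 100
print(base[cell], rule[cell], round(percent, 1))
```

For the cell Age=2, Income=3, three records match (records 4, 12 and 15) and all three have Car Type = Economy, so base support and rule support are both 3, and the base support percentage is 3/25, or 12%.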
[00040] In addition to base support and rule support, gain is calculated in step 307. Gain is defined as {Rule Support - Min_Conf * (Base Support)}. Gain indicates the strength of the association for each range/category of the Age and Income attributes (antecedents) with respect to the Car Type as Economy Car (consequent). That is, the Gain Factor indicates closeness of the statistical data mined from the given dataset for a given rule.
[00041] According to one of the embodiments of the present invention, the defined rule is visualized in a two dimensional Decision Right matrix, as illustrated in step 308. Each axis corresponds to one of the pair of antecedent attributes (A and B), in this case, Age and Income. The number of rows (left hand side attribute) and columns (right hand side attribute) of the matrix is derived from the number of buckets of the itemized data attribute. As an exemplary embodiment of the invention, the matrix displays computed values of the base support, rule support and gain for an association rule. The significance of the matrix along with its arrangement will be better appreciated when read alongside Figure 4 and its associated description in the subsequent paragraphs.
[00042] Step 309 of method 300 allows the system to receive a modified input of the minimum support and minimum confidence parameters. Other computations such as Base Support, Rule Support and Gain are then performed again. These modified parameters help in determining newer associations and in gaining insight into higher dimensions of association rules from among the itemized dataset.
[00043] Figure 4 refers to a schematic that is indicative of the components of the Visualization and Display Module 400. In the context of the present invention, it is important to note that the order of arrangement does not intend to dilute or limit the scope of the underlying concepts of the present invention. The module comprises Decision Right Matrix 401, Antecedent Information 402-1, 402-2, View Type Indicator 403, Consequent Value 404, Modifiable Parameter 405, and Computation Chart 406.

[00044] Consequent Value 404 represents the selected, fixed value of the consequent attribute, in this case, Car Type = Economy Car. Modifiable parameter 405 refers to the percentage value of the minimum support and minimum confidence value that has been selected for computation through the input data interface 201. Values for this parameter can be modified through the Input Data Interface 201, and the values thus received will be re-computed, published and displayed in the Decision Right Matrix 401.
[00045] The Decision Right Matrix 401 assumes significance according to an embodiment of the present invention. The matrix comprises the left antecedent attribute (in this case, Age) arranged in the horizontal axis or row and the right antecedent (in this case, income) arranged across the vertical axis or column. The attribute category of left antecedent and right antecedent coupled with support status of a cell with respect to base value and rule value decides the valid gain functions that can be computed.
[00046] The Decision Right Matrix is configured to visually differentiate between regions of the grid based on conformance to the set rule and the association parameters computed. Once the rule has been applied to a cell, the percentage or indication of conformance to the rule can be visually encoded in a different color or pattern. For example, a cell in the matrix can be shaded in a different color to highlight that the base support and rule support values for that cell are above the specified limits of minimum support and minimum confidence. Computation chart 406 enables assigning color codes or patterns to values published in the Decision Right matrix.
[00047] In addition, the grid displays the numerical parameters computed for each cell along with a visually coded view; this is indicated by the View Type Indicator 403. In one view, the base support values, rule support values, and gain values for the matrix are displayed to indicate association to the rule. The right-most column and the bottom-most row represent an aggregate of the values that have been computed for each cell. This in turn yields the total number of transactions that are present in the selected process file.
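The aggregate row and column described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `with_totals` and the 2 x 3 grid of counts are hypothetical and are not taken from the tables of the specification.

```python
from typing import List, Tuple


def with_totals(counts: List[List[int]]) -> Tuple[List[int], List[int], int]:
    """Return (row sums, column sums, grand total) for a grid of per-cell counts."""
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    return row_sums, col_sums, sum(row_sums)


# Hypothetical grid of support counts (2 rows x 3 columns).
counts = [[4, 2, 0],
          [1, 3, 5]]
r_sum, c_sum, total = with_totals(counts)
print(r_sum, c_sum, total)  # [6, 9] [5, 5, 5] 15
```

The row sums and the column sums each add up to the same grand total, which for a base support grid is the number of transactions in the selected data.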
[00048] These published values help in Row-MAX, Column-MAX and Matrix-MAX analysis - a visual representation of associations determined by itemized data, which in turn is derived from attribute type data. If A is an attribute with continuous value, the itemized partition in the horizontal axis (row partition) is virtual and hence, a contiguous set of cells along a row can be clubbed under certain criteria to represent a stronger rule. Similarly, if B is an attribute with continuous value, the itemized partition in the vertical axis (column partition) is a virtual one and hence, a contiguous set of cells along a column can be clubbed under certain criteria to represent a stronger rule. If the attribute A and B both have continuous value, the itemized partition across horizontal and vertical axis is virtual and hence, a set of connected cells in the grid can be clubbed together under certain criteria to represent a stronger rule. An exemplary embodiment of the invention measures the optimal continuous gain across each row, each column and in a connected path based on 'Positive Support' and 'Non-Negative Gain' criteria that each cell has to satisfy.
[00049] Row-MAX (i) is a set of contiguous cells {g_ij} of positive support and non-negative gain in the i-th row. This also generates the Row-MAX (i) association rule. Similarly, R-MAX Gain (i) is the gain contributed by the set Row-MAX (i), and Net R-MAX Gain = ∑ R-MAX Gain (i), for all i.
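One way to read the Row-MAX definition is as a search for the best run of adjacent qualifying cells in a row. The sketch below follows that reading and is not the patented implementation: the function name `row_max`, the qualification test (support > 0 and gain >= 0) and the sample row are assumptions made for illustration.

```python
from typing import List, Tuple


def row_max(support: List[float], gain: List[float]) -> Tuple[int, int, float]:
    """Return (start, end_exclusive, total_gain) of the best qualifying run in one row.

    Assumed reading of the criteria: a cell qualifies when its support is
    positive and its gain is non-negative. Returns (0, 0, 0.0) if no cell qualifies.
    """
    best = (0, 0, 0.0)
    start = None
    total = 0.0
    # Iterate one step past the end so that a run touching the last cell is closed.
    for j in range(len(gain) + 1):
        ok = j < len(gain) and support[j] > 0 and gain[j] >= 0
        if ok:
            if start is None:
                start, total = j, 0.0
            total += gain[j]
        elif start is not None:
            if best[0] == best[1] or total > best[2]:
                best = (start, j, total)
            start = None
    return best


# Hypothetical row: the run over columns 2..3 has the largest total gain.
start, end, total = row_max([2, 0, 3, 4, 1], [1.4, 0.0, 0.8, 2.1, -0.3])
print(start, end, round(total, 2))
```

Summing the third element of the result over all rows would give the Net R-MAX Gain; applying the same function to each column of the grid gives the Column-MAX counterpart.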
[00050] Column-MAX (j) is a set of contiguous cells {g_ij} of positive support and non-negative gain in the j-th column. This also generates the Column-MAX (j) association rule. Similarly, C-MAX Gain (j) is the gain contributed by the set Column-MAX (j), and Net C-MAX Gain = ∑ C-MAX Gain (j), for all j.
[00051] Matrix-MAX is the set of cells {g_ij} of positive support and non-negative gain in the i-th row and j-th column, which are connected and generate the maximum aggregated gain value. The support and confidence of this connected region are also computed and published in the matrix. Matrix-MAX Gain = ∑ Gain (g_ij), for all (i, j) positions that belong to the Matrix-MAX set. Since every contiguous run within a row or a column is itself a connected set, Matrix-MAX Gain ≥ R-MAX Gain (i) for each i-th row and Matrix-MAX Gain ≥ C-MAX Gain (j) for each j-th column.
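The Matrix-MAX set can be sketched as a connected-region search over the grid. This is an illustration under stated assumptions rather than the method of the specification: cells are treated as connected when they share an edge (4-neighbour connectivity), a cell qualifies when support > 0 and gain >= 0, and the function name `matrix_max` and the sample grid are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def matrix_max(support: List[List[float]], gain: List[List[float]]) -> Tuple[Set[Cell], float]:
    """Return (cells, total_gain) of the connected qualifying region with the largest gain."""
    rows, cols = len(gain), len(gain[0])
    ok = [[support[i][j] > 0 and gain[i][j] >= 0 for j in range(cols)] for i in range(rows)]
    seen: Set[Cell] = set()
    best_cells: Set[Cell] = set()
    best_gain = 0.0
    for i in range(rows):
        for j in range(cols):
            if not ok[i][j] or (i, j) in seen:
                continue
            # Breadth-first search collects one connected region of qualifying cells.
            region: Set[Cell] = set()
            total = 0.0
            queue = deque([(i, j)])
            seen.add((i, j))
            while queue:
                r, c = queue.popleft()
                region.add((r, c))
                total += gain[r][c]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and ok[nr][nc] and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            if not best_cells or total > best_gain:
                best_cells, best_gain = region, total
    return best_cells, best_gain


# Hypothetical 3 x 3 grid of support counts and gains.
support = [[2, 0, 1], [3, 4, 0], [0, 1, 2]]
gains = [[1.4, 0.0, 0.5], [0.8, 2.1, 0.0], [0.0, -0.3, 0.7]]
cells, total = matrix_max(support, gains)
print(sorted(cells), round(total, 2))
```

Because every cell in a qualifying region has non-negative gain, taking the whole region never lowers the total, so the best region is always a complete connected component.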
[00052] However, optimal gain measurement depends on the attribute type selected. If the antecedent attribute type is continuous for both axes, then optimal gain can be validated for Row-MAX, Column-MAX and Matrix-MAX. If either antecedent attribute value is non-continuous, then either Row-MAX or Column-MAX can be used to evaluate aggregate gain value. If both antecedent attributes are non-continuous, then optimal gain cannot be obtained by aggregating Row-MAX, Column-MAX or Matrix-MAX.
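The dependence on attribute type can be summarized as a small lookup. This is a sketch of one reading of the paragraphs above, in which a continuous attribute A on the row axis permits Row-MAX, a continuous attribute B on the column axis permits Column-MAX, and Matrix-MAX requires both; the function name `valid_views` is hypothetical.

```python
from typing import List


def valid_views(a_continuous: bool, b_continuous: bool) -> List[str]:
    """List the aggregate gain views that are meaningful for the given attribute types."""
    views = []
    if a_continuous:
        views.append("Row-MAX")
    if b_continuous:
        views.append("Column-MAX")
    if a_continuous and b_continuous:
        views.append("Matrix-MAX")
    return views


print(valid_views(True, True))    # ['Row-MAX', 'Column-MAX', 'Matrix-MAX']
print(valid_views(True, False))   # ['Row-MAX']
print(valid_views(False, False))  # []
```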
[00053] As per one of the embodiments of this invention, for a minimum support of 5% and a minimum confidence of 30%, from a set of 25 records, the Base Support View is displayed in the Decision Right Matrix as indicated in Table 3. Each cell in the grid shows the support count; the green shade indicates cells that have a base support > Min_Supp. The cumulative value of each cell in a row, termed R-Sum, and the cumulative value of each cell in a column, termed C-Sum, are also published. The total of R-Sum, like the total of C-Sum, yields the total number of records.

[00054] Similarly, rule support for an association is shown in Table 4. Cells that display values greater than the minimum confidence are made visually distinct. In addition, the following valid associations can be determined through the visual identifiers.
(Age [20, 30), Income [2L, 6L)) => Person has an Economy Car, support 8% and confidence 50%
(Age [30, 40), Income [6L, 12L)) => Person has an Economy Car, support 8% and confidence 100%
(Age [30, 40), Income [12L, 20L)) => Person has an Economy Car, support 12% and confidence 100%
(Age [40, 50), Income [20L, 30L)) => Person has an Economy Car, support 8% and confidence 50%
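The support and confidence percentages quoted for these rules can be reproduced from cell counts. The sketch below assumes the standard definitions (support = rule count / total records, confidence = rule count / base count); the function name is hypothetical, and the counts of 2 and 4 are inferred from the quoted 8% and 50% over 25 records rather than read from the tables.

```python
from typing import Tuple


def support_and_confidence(rule_count: int, base_count: int, total_records: int) -> Tuple[float, float]:
    """Return (support, confidence) as fractions for one cell of the matrix."""
    support = rule_count / total_records
    confidence = rule_count / base_count
    return support, confidence


# Inferred counts for the first rule: 2 of 25 records match the full rule,
# and 4 records match the antecedent pair (Age [20, 30), Income [2L, 6L)).
s, c = support_and_confidence(rule_count=2, base_count=4, total_records=25)
print(f"{s:.0%} {c:.0%}")  # 8% 50%
```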

[00055] Further to displaying Base Support and Rule Support, the Visualization and Display Module 400 generates a gain table with published values of gain and net gain across all cells. Cells that show non-negative gain in accordance with the Rule Support table are visually differentiated. The aggregate gain of the highlighted cells is computed and displayed, as referenced in Table 5.

[00056] One of the exemplary embodiments of the present invention is the display of Row-MAX, Column-MAX and Matrix-MAX gain in the Decision Right matrix. All of these measures display continuous non-negative gain across a row, a column or a connected set in the matrix, where each cell under the trace has positive support. Table 6 depicts the Row-MAX, Column-MAX and Matrix-MAX gain table with gain values:
NET Row-MAX GAIN = 6.40
NET Column-MAX GAIN = 7.10
Matrix-MAX GAIN = 7.80

[00057] It can be construed from Table 6 that the Row-MAX gain table has uncovered four valid association rules across a contiguous patch of positive support cells that have non-negative gain. Of the four association rules generated, two are new rules:
(Age [20, 40), Income [6L, 12L)) => Person has an Economy Car, support 12% and confidence 100%.
(Age [40, 70), Income [20L, 30L)) => Person has an Economy Car, support 16% and confidence 75%.
[00058] Similarly, in the Column-MAX view, the sum of the non-negative gains of contiguous positive support cells within any column of the Decision Right Matrix reveals four association rules, three of which are new rules:
(Age [20, 30), Income [2L, 12L)) => Person has an Economy Car, support 12% and confidence 66%.
(Age [30, 40), Income [6L, 20L)) => Person has an Economy Car, support 20% and confidence 100%.
(Age [60, 70), Income [6L, 30L)) => Person has an Economy Car, support 12% and confidence 100%.
[00059] In yet another view, termed Matrix-MAX, the net gain computed is the sum of all non-negative gains of connected positive support cells. In the example cited in Table 6, this value is computed as 7.80. In addition, the base support and rule support values are also published for the highlighted cells; in the current example, the cells shaded in green have support 64% and confidence 75%. Thus, it is evident that the different view types displayed on the View Type Indicator 403 aggregate the gain of contiguous patches of cells, which provides insight into deriving additional association rules and creates an intelligent, intuitive visualization for stakeholders to facilitate decision making.

[00060] In addition, the Decision Right Matrix is also configured to generate a focused cell display. Association rule parameters for each cell in the matrix can be viewed in a separate display area. This focused cell view displays the range or code of the selected antecedent attributes, the base support value, the rule support value and the gain to indicate conformance or relevance to the association rule.
[00061] Thus, the present invention relates to a method and system for itemizing members of a dataset belonging to transactional data through interactive bucketing and further displaying the computed association rule parameters in a Decision Right Matrix that provides analysts a window for easy, intuitive and intelligent visualization to uncover associations among selected attribute data. The method and system described herein enables virtual partitioning of attribute data type, which in turn is capable of generating higher dimension association rules by aggregation of these partitions. Computing and displaying gain in different views of the Decision Right matrix optimizes identification of associations. The process of identifying associations among selected antecedent and consequent attributes can be fine tuned by modifying minimum support and minimum confidence parameters to suit a particular business scenario to uncover new rules.
[00062] The present invention performs a generalized market basket analysis and may also be used across multiple business areas and scenarios, to derive meaningful associations from transaction attributes. Association rule mining can be applied to any business scenario such as predicting production for a manufacturing facility, conducting surveys for market research, determining a target market to sell certain types of insurance policies, understanding loan repaying capacities of industries and individuals based on critical parameters and so on. It is possible to view and compare the strength of a rule with respect to all possible combinations of antecedent attributes by specifying varying values of minimum support and minimum confidence.
[00063] It should be noted at this juncture that, although the present invention has been described in detail, those skilled in the art should understand that they can make various changes, substitutions and alterations herein, without departing from the core principles, to make it appropriate for different practical scenarios.

I/We Claim -
1. A computer implemented method for determining association clusters and visualization wherein said method comprises:
receiving an input dataset from a source data;
preparing an itemized dataset from the input dataset, wherein the input dataset relates to one or more continuous attributes;
receiving a set of input variable parameters;
computing at least one rule parameter from the itemized dataset to determine relevance of association; and
determining association clusters based on at least one of the computed rule parameters and the received set of input variable parameters and visualization of the association clusters to indicate the association.
2. The method as claimed in claim 1, wherein the input dataset comprises selecting a pair of antecedent attributes and a consequent attribute from among a plurality of attributes belonging to the source data.
3. The method as claimed in claim 1, wherein preparing an itemized dataset further comprises:
one of assigning an upper limit value and a lower limit value for the received input dataset comprising a pair of antecedent data attributes and a consequent data attribute and assigning value to the identified antecedent and consequent data attributes, wherein the value is one of a numeric, an alphabetical and an alphanumeric value;
receiving a value for number of buckets to perform partitioning of attribute values associated with the received input dataset;
receiving an option to perform frequency distribution, wherein said options comprise one of equi-distance, equi-distribution and user-defined method; and
generating an itemized dataset, wherein each member of the itemized dataset is mapped to an item number belonging to an item set.
4. The method as claimed in claim 3, wherein a default option for performing frequency distribution is equi-distance partitioning method.
5. The method as claimed in claim 3, wherein the user-defined method further comprises receiving an input value for a number of buckets and another input value for assigning a range of values to each bucket.
6. The method as claimed in claim 1, wherein the input variable parameters received are values pertaining to minimum support and minimum confidence.
7. The method as claimed in claim 1 further comprises computing rule parameters to determine relevance of the association from an itemized dataset, wherein the method comprises:
computing a count of all instances of itemized values derived from a pair of identified antecedent attributes;
computing a count of all instances of itemized values associated with the pair of identified antecedent attributes and a consequent attribute; and
calculating gain factor.
8. The method as claimed in claim 1 further comprises determining the association clusters and a visualization to indicate the association, wherein the method comprises:
arranging the pair of antecedent attributes and a consequent attribute, with each antecedent attribute occupying a first axis and a second axis of a two dimensional display region;
displaying at least one of computed rule parameters for each cell of the two dimensional display region in a plurality of views;
displaying gain factor computed for each cell in the two dimensional display region; and
displaying an aggregate of non-negative gain factor for each cell in three different views to enable determination of additional associations.
9. The method as claimed in claim 8 further comprises providing visual identifiers to differentiate between gain factors corresponding to various regions in said two dimensional display region.
10. A computer implemented association cluster and visualization system, wherein the system comprises:
a source data interaction module to communicate with a plurality of transactional and process data received from the source data;
an input data interface to receive user selected input data;
a computational process memory for derivation of associations; and
a display module for displaying the association in a two dimensional matrix.

11. The computer implemented association cluster and visualization system as claimed in claim 10, wherein the source data further comprises a verification module for verification of attribute data type received from the source data.
12. The computer implemented association cluster and visualization system as claimed in claim 10, wherein the computational process memory comprises:
frequency distribution processor for itemization of attribute values received from the source data; and
gain computation module for calculation of gain factor of itemized attribute value.
13. The computer implemented association cluster and visualization system as claimed in claim 10, wherein the computational process memory communicates with the display module to indicate association between attributes in a two dimensional matrix incorporating visual differentiation.
14. The display module as claimed in claim 10 further comprises a separate focused cell display area for visualizing association rule parameters for certain regions of the two dimensional display region.

Documents

Application Documents

#	Name	Date
1	2729-MUM-2011-FORM 1(12-10-2011).pdf	2011-10-12
2	2729-MUM-2011-CORRESPONDENCE(12-10-2011).pdf	2011-10-12
3	2729-MUM-2011-FORM 4(ii) [03-07-2018(online)].pdf	2018-07-03
4	2729-MUM-2011-OTHERS [02-08-2018(online)].pdf	2018-08-02
5	2729-MUM-2011-FER_SER_REPLY [02-08-2018(online)].pdf	2018-08-02
6	2729-MUM-2011-CORRESPONDENCE [02-08-2018(online)].pdf	2018-08-02
7	2729-MUM-2011-COMPLETE SPECIFICATION [02-08-2018(online)].pdf	2018-08-02
8	2729-MUM-2011-CLAIMS [02-08-2018(online)].pdf	2018-08-02
9	2729-MUM-2011-ABSTRACT [02-08-2018(online)].pdf	2018-08-02
10	Form-3.pdf	2018-08-10
11	Form-1.pdf	2018-08-10
12	Drawings.pdf	2018-08-10
13	ABSTRACT1.jpg	2018-08-10
14	2729-MUM-2011-POWER OF ATTORNEY(9-11-2011).pdf	2018-08-10
15	2729-MUM-2011-FORM 18(29-9-2011).pdf	2018-08-10
16	2729-MUM-2011-FER.pdf	2018-08-10
17	2729-MUM-2011-CORRESPONDENCE(9-11-2011).pdf	2018-08-10
18	2729-MUM-2011-CORRESPONDENCE(29-9-2011).pdf	2018-08-10
19	2729-MUM-2011-HearingNoticeLetter.pdf	2018-11-26
20	2729-MUM-2011-Correspondence to notify the Controller (Mandatory) [17-12-2018(online)].pdf	2018-12-17
21	2729-MUM-2011-Written submissions and relevant documents (MANDATORY) [26-12-2018(online)].pdf	2018-12-26
22	2729-MUM-2011-PatentCertificate02-01-2019.pdf	2019-01-02
23	2729-MUM-2011-IntimationOfGrant02-01-2019.pdf	2019-01-02
24	2729-MUM-2011-RELEVANT DOCUMENTS [29-03-2020(online)].pdf	2020-03-29
25	2729-MUM-2011-RELEVANT DOCUMENTS [28-09-2021(online)].pdf	2021-09-28
26	2729-MUM-2011-RELEVANT DOCUMENTS [27-09-2022(online)].pdf	2022-09-27
27	2729-MUM-2011-RELEVANT DOCUMENTS [26-09-2023(online)].pdf	2023-09-26

Search Strategy

1	Search_22-12-2017.pdf