Abstract: The present technique provides a system and method for constructing an accurate decision tree. The decision tree is constructed by a bottom-up approach. The technique mines a set of frequent itemsets from a dataset and stores them in a prefix tree data structure module. Additionally, the technique computes a set of class distribution values from the dataset at each node of the prefix tree data structure module and builds the decision tree using this data structure.
BACKGROUND
The technique relates generally to decision tree classifiers and the construction of decision trees for accurate classification. In particular, the technique relates to a method for constructing accurate decision trees using frequent itemsets or association rules.
Classification in data mining is useful in multiple areas such as fraud detection, business intelligence, medical diagnosis, and target marketing. The input to the problem is a set of records, where each record has various attributes. Among these attributes, there is one attribute called the class attribute, which is categorical and has a small domain. Values of the class attribute are called class labels. The remaining attributes are called predictor attributes. The goal of classification is to use the input records, or training records, to build a concise model of the distribution of the class labels in terms of the predictor attributes.
Decision trees are one of the classification models and are especially attractive in a data mining environment because of their intuitive representation, which is easy for a human to understand. Decision trees act as an important analysis tool for understanding attribute-attribute relationships. Decision trees are constructed from a set of training records. Most decision tree construction algorithms build the tree in a top-down, breadth-first manner by recursively partitioning the data until a high proportion of each partition's records belong to one class. Such top-down algorithms may not generate optimal decision trees, as their greedy fashion of tree construction assumes independence between domain variables. Hence these techniques may not output optimally accurate decision trees.
Accordingly, there is a need for a technique for constructing a decision tree with high accuracy. The decision tree construction algorithm of the present technique works in a bottom-up approach and uses the frequent itemsets, instead of the training records, in the tree construction phase.
BRIEF DESCRIPTION
In one aspect of the present technique, a method for constructing a decision tree by a bottom-up approach is disclosed. The method mines a set of frequent itemsets from a dataset to store in at least one prefix tree data structure of a prefix tree storage module. The method further computes and stores a set of class distribution values from the dataset at a plurality of nodes of the at least one prefix tree data structure. In addition, the method selects a subset of the set of frequent itemsets using an itemsets selection module. Furthermore, the method constructs the decision tree using the subset of the set of frequent itemsets using a build decision tree module.
In another aspect of the present technique, a system to construct a decision tree by a bottom-up approach is disclosed. The system comprises a frequent itemsets miner module adapted to mine a set of frequent itemsets from a dataset and store in at least one prefix tree data structure. The system further comprises an itemsets selection module adapted to select a subset of the set of frequent itemsets. Furthermore, the system comprises a build decision tree module adapted to construct the decision tree using the subset of the set of frequent itemsets.
DRAWINGS
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
FIG. 1 is a block diagram depicting a complete architecture of the decision tree construction system, in accordance with an aspect of the present technique;
FIG. 2 is a block diagram depicting one or more stages for building the decision tree construction system, in accordance with an aspect of the present technique;
FIG. 3 is a block diagram depicting a process for calculating the best split for the decision tree construction system, in accordance with an aspect of the present technique; and
FIG. 4 is a flow diagram indicative of a process of a decision tree construction system, in accordance with an aspect of the present technique.
DETAILED DESCRIPTION
The following description is a full and informative description of the best method and system presently contemplated for carrying out the present invention, as known to the inventors at the time of filing the patent application. Of course, many modifications and adaptations will be apparent to those skilled in the relevant arts in view of the following description, the accompanying drawings and the appended claims. While the system and method described herein are provided with a certain degree of specificity, the present technique may be implemented with either greater or lesser specificity, depending on the needs of the user. Further, some of the features of the present technique may be used to advantage without the corresponding use of other features described in the following paragraphs. As such, the present description should be considered as merely illustrative of the principles of the present technique and not in limitation thereof, since the present technique is defined solely by the claims.
The present invention relates to decision tree classifiers and the construction of decision trees for accurate classification. In particular, the technique relates to a method for constructing accurate decision trees using frequent itemsets (FI) or association rules.
A decision tree is a predictive model; that is, a mapping of observations about an item to conclusions about the item's target value. Each interior node corresponds to a variable; an arc to a child represents a possible value of that variable. A leaf represents the predicted value of the target variable given the values of the variables represented by the path from the root. Decision trees are used in business planning. The machine learning technique for inducing a decision tree from data is called decision tree learning, or decision trees.
Decision tree construction is a common method used in data mining. A decision tree describes a tree structure wherein leaves represent classifications and branches represent conjunctions of features that lead to those classifications. A decision tree can be learned by splitting the source set into subsets based on an attribute value test. The process is repeated on each derived subset in a recursive manner. The recursion is complete when splitting is no longer feasible or when a single classification can be applied to each element of the derived subset. Decision trees are also a descriptive means for calculating conditional probabilities.
FIG. 1 is a block diagram depicting a complete architecture of the decision tree construction system 300, in accordance with an aspect of the present technique. The system 300 comprises a dataset 310, a frequent itemsets miner module 315, a frequent itemsets storage module 325, a decision tree construction module 380 and a final decision tree storage module 400.
In one embodiment of the present technique, the dataset 310 is the input to the system 300. The dataset 310 comprises records, where each record takes a respective value for each attribute called a predictor attribute. Additionally, each record may take one value for a class attribute. Furthermore, the dataset 310 is a set of data consisting of a list of research subjects and the data vectors associated with them. One dataset may consist of various records. It should be noted that one record may take one value for each of the predictor attributes and, in addition, one value for the class attribute.
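By way of illustration only, such a dataset may be represented in the following minimal form, where each record maps every predictor attribute to a value and additionally carries one value for the class attribute. The attribute names and values shown are hypothetical and not part of the original disclosure.

```python
# Hypothetical toy dataset: each record assigns one value to every
# predictor attribute ("outlook", "wind") and one value to the class
# attribute ("play").  All names and values are illustrative only.
dataset = [
    {"outlook": "sunny",    "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "wind": "strong", "play": "no"},
    {"outlook": "rain",     "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak",   "play": "yes"},
]
CLASS_ATTR = "play"   # the class attribute; all other keys are predictor attributes
```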
In another embodiment of the present technique, the frequent itemsets miner module 315 receives the dataset 310 as input. The module 315 mines the frequent itemsets for a given support threshold measure, called sigma, without considering the class attribute of the records. The frequent itemsets may be mined from the dataset 310 using the frequent itemsets miner module 315. Multiple data mining tasks may be performed to mine frequent itemsets from the dataset 310, and the module 315 may comprise these data mining tasks. By definition, an itemset is a set of items, and a frequent itemset is an itemset that has support above sigma.
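A minimal sketch of one possible mining procedure is given below, assuming records of the toy form shown earlier. It treats each (attribute, value) pair as an item, ignores the class attribute, and retains itemsets whose support meets the threshold sigma using a level-wise (Apriori-style) search; the module 315 may, of course, use any other frequent itemset mining algorithm.

```python
from itertools import combinations

def mine_frequent_itemsets(records, class_attr, sigma):
    """Return {itemset: support count} for all itemsets with support >= sigma.

    An item is an (attribute, value) pair taken from the predictor
    attributes only; the class attribute is ignored.  Level-wise
    (Apriori-style) search, for illustration only.
    """
    # One transaction per record: the set of its predictor items.
    transactions = [
        frozenset((a, v) for a, v in rec.items() if a != class_attr)
        for rec in records
    ]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in items if support(frozenset([i])) >= sigma}
    result = dict(frequent)

    # Levels k = 2, 3, ...: candidates are unions of frequent (k-1)-itemsets.
    k = 2
    while frequent:
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        frequent = {c: support(c) for c in candidates if support(c) >= sigma}
        result.update(frequent)
        k += 1
    return result

# Example (with the toy dataset and class attribute shown earlier):
# itemsets = mine_frequent_itemsets(dataset, CLASS_ATTR, sigma=2)
```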
In another embodiment of the present technique, the frequent itemsets storage module 325 holds the frequent itemsets generated by the module 315. The frequent itemsets may be stored either on the hard disk or in random access memory (RAM), in accordance with application requirements. Additionally, the module 325 may store the mined frequent itemsets used for building the decision tree.
In another embodiment of the present technique, the decision tree construction module 380 receives the frequent itemsets as input from the module 325. In addition, the module 380 constructs an accurate decision tree.
In yet another embodiment of the present technique, the final decision tree storage module 400 stores the constructed decision tree. The stored decision tree may serve as a classifier for classifying new records for which the class value is not known. Additionally, the decision tree so constructed may be optimal under various conditions.
FIG. 2 is a block diagram depicting one or more stages for building the decision tree system 500, in accordance with an aspect of the present technique. The system 500 may describe the module 380. The diagram 500 comprises the frequent itemsets storage module 325, a prefix tree storage module 550, an itemsets selection module 575, a build decision tree module 600 and the final decision tree storage module 400.
In one embodiment of the present technique, the frequent itemsets storage module 325 as explained earlier holds the frequent itemsets.
In another embodiment of the present technique, the prefix tree storage module 550 receives the frequent itemsets as input. The frequent itemsets may be stored in a prefix tree form. In addition, for each node of the prefix tree that represents a frequent itemset, the class-distribution values may be computed and stored at that particular node.
In one embodiment of the present technique, the prefix tree stores the frequent itemsets in the following way. Initially, a total order between the attributes is fixed a priori, so that given two attributes it can be determined which of them has higher importance. Based on this order, the constituent items of every frequent itemset are ordered into a sequence, so that each frequent itemset corresponds to one sequence. In a prefix tree, every node is labeled with an item, and every node represents one sequence, which can be obtained by concatenating the items appearing from the root node to that particular node. A sequence is inserted into the prefix tree in the following way: (1) start at the root node of the prefix tree; (2) select the child of the current node whose label matches the first unvisited item of the sequence being stored (if this child node does not exist, create it); (3) make the child node the current node and repeat step (2) with the next item of the sequence; (4) carry on these iterations until all the items of the sequence have been visited; (5) the final node reached represents the sequence being stored, and the frequency of the itemset corresponding to this sequence is stored at this node. Also, the class distribution corresponding to this frequent itemset is stored at this node. In this fashion, all the frequent itemsets are first converted into sequences based on the a priori-fixed ordering, and the resultant sequences are inserted into the prefix tree. It should be noted that the above illustration describes only one possible way; as will be appreciated by a person skilled in the art, other similar methods may also be implemented using the present technique.
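The insertion procedure described above might be sketched as follows. The node layout, the choice of alphabetical ordering of attribute names as the a priori-fixed order, and the way the class distribution is counted (over the records containing the node's itemset) are assumptions made for illustration only.

```python
from collections import Counter

class PrefixTreeNode:
    def __init__(self, item=None):
        self.item = item        # (attribute, value) label; None at the root
        self.children = {}      # item -> PrefixTreeNode
        self.frequency = None   # support of the itemset ending here, if any
        self.class_dist = None  # Counter of class labels over matching records

def insert_itemset(root, itemset, frequency, records, class_attr):
    """Insert one frequent itemset into the prefix tree.

    The items are first ordered by an a priori-fixed total order
    (alphabetical on attribute name here, as an assumed convention),
    so that every itemset maps to exactly one sequence.
    """
    node = root
    for item in sorted(itemset):                       # the fixed ordering
        node = node.children.setdefault(item, PrefixTreeNode(item))
    node.frequency = frequency
    # Class distribution over the records that contain this itemset.
    node.class_dist = Counter(
        rec[class_attr] for rec in records
        if all(rec.get(a) == v for a, v in itemset)
    )

def build_prefix_tree(frequent_itemsets, records, class_attr):
    root = PrefixTreeNode()
    for itemset, freq in frequent_itemsets.items():
        insert_itemset(root, itemset, freq, records, class_attr)
    return root
```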
Furthermore, the resultant prefix tree may be the input to the next module. It should be noted that a prefix tree may be a tree structure wherein all of the sequences likely to be encountered by the system 500 are represented in the tree structure.
In another embodiment of the present technique, the itemsets selection module 575 processes the prefix tree and selects a subset of frequent itemsets that may be mutually exclusive, collectively exhaustive and highly accurate. In addition, the selected frequent itemsets and their relevant class-distribution values may be the input to the next module.
In another embodiment of the present technique, the build decision tree module 600 builds the decision tree from the selected frequent itemsets.
In yet another embodiment of the present technique, the final decision tree storage module 400 as explained earlier stores the constructed decision tree.
FIG. 3 is a block diagram depicting a process for calculating the best split for the decision tree construction system 700, in accordance with an aspect of the present technique. The system 700 comprises the prefix tree storage module 550, a best split calculation module 740, a prefix tree traversing module 760, a selected itemsets storage module 780 and the build decision tree module 600.
In one embodiment of the present technique, the prefix tree storage module 550 as explained earlier stores the frequent itemsets and gives the resultant prefix tree obtained as input to the prefix tree traversing module 760.
In another embodiment of the present technique, the prefix tree traversing module 760 receives as input the prefix tree obtained from the module 550. The module 760 traverses the tree and finds the best split attribute at each node using the best split calculation module 740. It should be noted that in this case a bottom-up traversal, which may be a depth-first traversal, is used. Furthermore, the module 760 may traverse the tree repeatedly starting from the root node, following the best splits, and may collect all the leaf-level frequent itemsets beyond which no further traversal is possible. The selected frequent itemsets may be the input to the next module, as sketched below.
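Assuming each node's best split attribute has already been computed and stored (a sketch of that calculation follows below), the collection of leaf-level itemsets might proceed roughly as follows. The `best_split` field is a hypothetical attribute added to the PrefixTreeNode sketch above, and for simplicity the sketch treats a node's prefix tree children as its immediate supersets, which holds only when the added item comes later in the fixed attribute order.

```python
def collect_selected_itemsets(root):
    """Follow the stored best splits from the root and collect the itemsets
    at which no further split is available (the leaf-level itemsets),
    together with their class distributions.
    """
    selected = []

    def follow(node, itemset):
        split_attr = getattr(node, "best_split", None)   # hypothetical field
        if split_attr is None:
            selected.append((frozenset(itemset), node.class_dist))
            return
        for item, child in node.children.items():
            if item[0] == split_attr:     # only children produced by the best split
                follow(child, itemset | {item})

    follow(root, frozenset())
    return selected
```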
In another embodiment of the present technique, the best split calculation module 740 determines the best split attribute at a given node. The determination may be done by finding the immediate supersets of the node, grouping them into sub-groups based on the attribute split that may have generated them from the current node, calculating the accuracy of each of these attribute splits, and selecting and storing the attribute split with the highest accuracy (the best split) at the node, as sketched below.
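One possible realization of this calculation is sketched below. The immediate supersets are assumed to be grouped by the attribute that extends the current node's itemset, and the accuracy of a split is estimated as the support-weighted fraction of records falling in the majority class of each superset; this particular accuracy measure is an assumption for illustration, not necessarily the measure used in the original disclosure.

```python
from collections import Counter

def best_split_at_node(supersets_by_attribute):
    """Pick the best split attribute at a node.

    `supersets_by_attribute` maps each candidate split attribute to the
    class distributions (Counter objects) of the immediate supersets
    obtained by adding one value of that attribute.  Accuracy of a split
    is taken as majority-class records over total records across its
    supersets (an assumed measure).
    """
    best_attr, best_accuracy = None, -1.0
    for attr, distributions in supersets_by_attribute.items():
        correct = sum(max(d.values()) for d in distributions if d)
        total = sum(sum(d.values()) for d in distributions)
        if total and correct / total > best_accuracy:
            best_attr, best_accuracy = attr, correct / total
    return best_attr, best_accuracy

# Example: at some node, splitting on "wind" separates the classes
# perfectly, while splitting on "outlook" does not.
groups = {
    "wind":    [Counter({"no": 3}), Counter({"yes": 2})],
    "outlook": [Counter({"no": 2}), Counter({"yes": 2, "no": 1})],
}
print(best_split_at_node(groups))   # -> ('wind', 1.0)
```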
In another embodiment of the present technique, the selected itemsets storage module 780 stores the selected itemsets either on a hard disk or in random access memory (RAM), in accordance with the system requirements.
In yet another embodiment of the present technique, the build decision tree module 600, as explained earlier, builds the decision tree from the selected frequent itemsets in the following way: find one attribute such that every frequent itemset has one attribute-value of this attribute appearing in it; fill the decision tree node with this attribute; divide the frequent itemsets into subgroups such that the frequent itemsets belonging to a subgroup have the same value for this attribute; create one unique child of the decision tree node for each unique value of the attribute; and repeat the above process for each child with its corresponding attribute-value's subgroup of frequent itemsets, as sketched below.
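A recursive sketch of this procedure follows. The input is assumed to be the selected, mutually exclusive frequent itemsets paired with their class distributions; the rule used to label a leaf (the majority class of the pooled distributions) and the alphabetical tie-breaking choice of the common attribute are assumptions for illustration.

```python
from collections import Counter

def build_decision_tree(itemsets_with_dist):
    """Build a decision tree from selected frequent itemsets.

    `itemsets_with_dist` is a non-empty list of (itemset, class distribution)
    pairs, where an itemset is a frozenset of (attribute, value) items and
    the distribution is a Counter of class labels.
    """
    # Attributes that appear (with some value) in every selected itemset.
    common = set.intersection(*[{a for a, _ in s} for s, _ in itemsets_with_dist])
    if not common:
        # No attribute is shared by all itemsets: make a leaf labelled with
        # the majority class of the pooled distributions (assumed leaf rule).
        pooled = sum((d for _, d in itemsets_with_dist), Counter())
        return {"leaf": pooled.most_common(1)[0][0]}

    attr = sorted(common)[0]   # any common attribute; alphabetical tie-break (assumption)

    # Partition by the value taken for this attribute, drop the consumed
    # item from each itemset, and recurse on every subgroup.
    partitions = {}
    for s, dist in itemsets_with_dist:
        value = next(v for a, v in s if a == attr)
        remaining = frozenset(item for item in s if item[0] != attr)
        partitions.setdefault(value, []).append((remaining, dist))
    return {"attribute": attr,
            "children": {value: build_decision_tree(group)
                         for value, group in partitions.items()}}

# Example with two mutually exclusive itemsets over the toy attributes:
tree = build_decision_tree([
    (frozenset({("outlook", "sunny")}), Counter({"no": 2})),
    (frozenset({("outlook", "rain")}),  Counter({"yes": 1, "no": 2})),
])
# -> {'attribute': 'outlook', 'children': {'sunny': {'leaf': 'no'}, 'rain': {'leaf': 'no'}}}
```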
FIG. 4 is a flow diagram indicative of a process of a decision tree construction system, in accordance with an aspect of the present technique.
The method begins at step 900, wherein the root of a prefix tree data structure is initially taken as the current node. The method continues at step 902, wherein the set of nodes corresponding to immediate supersets of the current node may be identified. It should be noted that step 900 to step 906 may be repeated for each sub-tree, with the condition that if any sub-tree consists of a single node then step 900 to step 906 are not repeated for that particular sub-tree. It should also be noted that this accuracy may be used while calculating the best split attribute, using the best split calculation module, at the parent node of the current node.
The method continues at step 904, wherein a set of sub-trees at the current node may be traversed from left to right. During the first phase of the best split calculation module, the children of the current node are visited from left to right, and each child's immediate supersets may be computed by traversing the sub-tree rooted at the current node in a bottom-up or depth-first manner. In these cases, the items of the child may be added to the child's parent to obtain the resultant child. Furthermore, each immediate superset of the node may be expanded using the items of the child.
Furthermore, the method ends at step 906, wherein the best split attribute at the current node may be calculated from the accuracies of the immediate supersets. During the second phase of the best split calculation module, the best split attribute at each node may be computed and stored in the respective node.
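A post-order (depth-first, bottom-up) pass of this kind might look as follows, reusing the hypothetical PrefixTreeNode and best_split_at_node sketches given earlier. As before, the sketch simplifies the expansion steps of FIG. 4 by treating a node's prefix tree children as its immediate supersets.

```python
def compute_best_splits(node):
    """Post-order pass over the prefix tree: children first, then this node.

    Stores the chosen attribute in the hypothetical `best_split` field of
    each node.  Treating a node's children as its immediate supersets is a
    simplification of the expansion steps described for FIG. 4.
    """
    # Phase 1: visit the sub-trees from left to right, depth first.
    for child in node.children.values():
        compute_best_splits(child)

    if not node.children:
        node.best_split = None           # prefix tree leaf: nothing to split on
        return

    # Phase 2: group the children by the attribute whose value extends this
    # node's itemset, then choose the split with the highest accuracy.
    groups = {}
    for (attr, _value), child in node.children.items():
        if child.class_dist:
            groups.setdefault(attr, []).append(child.class_dist)
    node.best_split = best_split_at_node(groups)[0] if groups else None
```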
As will be appreciated by people skilled in the art, to best understand the present invention it is important to be familiar with the environment in which it is used and the basic terminologies.
As will be appreciated by a person skilled in the art, the various implementations of the present technique provide a variety of advantages. For example, in the present technique the frequent itemsets are mined from the training records and the tree is constructed in an efficient manner; hence it can be a useful tool for understanding attribute-attribute relationships by a human in a data mining environment.
The system as described in the present technique, or any of its components, may be embodied in the form of a computer system. Typical examples of a computer system include a general-purpose computer, a programmed microprocessor, a microcontroller, a peripheral integrated circuit element, and other devices or arrangements of devices that are capable of implementing the steps that constitute the method of the present technique.
The computer system comprises a computer, an input device, a display unit and/or the Internet. The computer further comprises a microprocessor. The microprocessor is connected to a communication bus. The computer also includes a memory. The memory may include Random Access Memory (RAM) and Read Only Memory (ROM). The computer system further comprises a storage device. The storage device can be a hard disk drive or a removable storage drive such as a floppy
disk drive, optical disk drive, etc. The storage device can also be other similar means for loading computer programs or other instructions into the computer system. The computer system also includes a communication unit. The communication unit allows the computer to connect to other databases and the Internet through an I/O interface. The communication unit allows the transfer as well as reception of data from other databases. The communication unit may include a modem, an Ethernet card, or any similar device which enables the computer system to connect to databases and networks such as LAN, MAN, WAN and the Internet. The computer system facilitates inputs from a user through input device, accessible to the system through I/O interface.
The computer system executes a set of instructions that are stored in one or more storage elements, in order to process input data. The storage elements may also hold data or other information as desired. The storage element may be in the form of an information source or a physical memory element present in the processing machine.
The set of instructions may include various commands that instruct the processing machine to perform specific tasks such as the steps that constitute the method of the present technique. The set of instructions may be in the form of a software program. Further, the software may be in the form of a collection of separate programs, a program module with a larger program or a portion of a program module, as in the present technique. The software may also include modular programming in the form of object-oriented programming. The processing of input data by the processing machine may be in response to user commands, results of previous processing or a request made by another processing machine.
The following description is presented to enable a person of ordinary skill in the art to make and use the invention and is provided in the context of the requirement for obtaining a patent. The present description is the best presently-contemplated method for carrying out the present invention. Various modifications to the preferred embodiment will be readily apparent to those skilled in the art, the generic principles of the present invention may be applied to other embodiments, and some features of the present invention may be used without the corresponding use of other features. Accordingly, the present invention is not intended to be limited to the embodiment shown but is to be accorded the widest scope consistent with the principles and features described herein.
Many modifications of the present invention will be apparent to those skilled in the arts to which the present invention applies. Further, it may be desirable to use some of the features of the present invention without the corresponding use of other features.
Accordingly, the foregoing description of the present invention should be considered as merely illustrative of the principles of the present invention and not in limitation thereof.
We Claim:
1. A method for constructing a decision tree by a bottom-up approach, the method comprising:
mining a set of frequent itemsets from a dataset to store in at least one prefix tree data structure of a prefix tree storage module;
computing and storing a set of class distribution values from the dataset at a plurality of nodes of the at least one prefix tree data structure;
selecting a subset of the set of frequent itemsets using an itemsets selection module; and
constructing the decision tree using the subset of the set of frequent itemsets using a build decision tree module.
2. The method as recited in claim 1, wherein construction of the decision tree comprises the bottom-up approach using the set of frequent itemsets.
3. The method as recited in claim 1, wherein the decision tree is optimal for an accuracy measure and with a condition that a plurality of frequencies of a plurality of leaves of the decision tree are greater than a threshold measure.
4. The method as recited in claim 1, further comprising processing the plurality of nodes of the at least one prefix tree data structure in a depth first path.
5. The method as recited in claim 1, wherein the set of class distribution values is computed for the plurality of nodes of the at least one prefix tree data structure and respectively stored in the plurality of nodes thereof.
6. The method as recited in claim 1, further comprising determining a best split of the plurality of nodes of the at least one prefix tree data structure using a prefix tree traversing module.
7. The method as recited in claim 6, wherein the prefix tree traversing module traverses the prefix tree data structure in the depth first path starting from a root node.
8. A system for constructing a decision tree by a bottom-up approach, the system comprising:
a frequent itemset miner module adapted to mine a set of frequent itemsets from a dataset and store in at least one prefix tree data structure;
an itemsets selection module adapted to select a subset of the set of frequent itemsets; and
a build decision tree module adapted to construct the decision tree using the subset of the set of frequent itemsets.
9. The system as recited in claim 8, further comprising a prefix tree traversing module adapted to determine a best split attribute of the plurality of nodes of the at least one prefix tree data structure.
10. A tangible computer-readable medium having stored thereon computer executable instructions for constructing a decision tree by a bottom-up approach, the computer-readable medium comprising:
program code adapted to mine a set of frequent itemsets from a dataset to store in at least one prefix tree data structure of a prefix tree storage module;
program code adapted to compute and store a set of class distribution values from the dataset at a plurality of nodes of the at least one prefix tree data structure;
program code adapted to select a subset of the set of frequent itemsets using an itemsets selection module; and
program code adapted to construct the decision tree using the subset of the set of frequent itemsets using a build decision tree module.
11. The computer-readable medium as recited in claim 10, further comprising program code adapted to construct the decision tree using the bottom-up approach using a decision tree construction module.
12. The computer-readable medium as recited in claim 10, further comprising program code adapted to determine a best split of the plurality of nodes of the at least one prefix tree data structure using a prefix tree traversing module.