Predictive Data Mining System And Methods Thereof

Abstract: A method and system for building a tool that automates a predictive data mining process. The method automates one or more tasks for selecting among multiple predictive data mining models using a set of meta-heuristic search techniques. Furthermore, the method updates at least one of the classification algorithms with at least one predictive model building parameter, which is stored in the tool. Similarly, the system comprises a set of meta-heuristic search techniques adapted to automate one or more tasks for selecting a set of predictive data mining models.

Patent Information

Application #

Filing Date

28 September 2006

Publication Number

48/2008

Publication Type

INA

Invention Field

NO SUBJECT

Status

Email

Parent Application

Applicants

INFOSYS TECHNOLOGIES LIMITED

PLOT NO. 44, ELECTRONICS CITY, HOSUR ROAD, BANGALORE, KARNATAKA - 560 100, INDIA

Inventors

1. SUREKA, ASHISH

B-401 MEENAKSHI CLASSICS, #471, 27 MAIN, HSR LAYOUT SECTOR 1, BANGALORE, KARNATAKA - 560 034, INDIA

Specification

Predictive Data Mining System and Methods Thereof

BACKGROUND

The present technique relates generally to building a computer implemented software tool to automate a data mining process. In particular, the technique relates to automating predictive data mining modeling and the process of building a predictive model.

Predictive modeling is considered to be one of the most widely used data mining technologies and has been applied to various engineering and scientific disciplines. The objective of predictive modeling is to build a model from historical data that assigns records into various classes or categories based on their attributes. A model may be learned using the historical data and then be used to predict the membership of new records. This is the process of building a predictive model and using it to score, or predict the class membership of, unseen records. One of the fields in the historical data may be designated as the target or class variable, and the other fields in the dataset may be referred to as the independent variables (inputs or predictors). If the target variable is categorical, then the predictive modeling technique to use is classification. In contrast, if the target variable is continuous, then the most suitable form of predictive modeling technique is regression. The technique pertains to classification based predictive modeling. However, the technique also applies to regression based predictive modeling.

Additionally, various predictive modeling algorithms have been proposed earlier. The range of classification methods may include decision trees, neural networks, support-vector machines, Bayes methods, lazy learning techniques and the nearest neighbor approach. Amongst the various classification methods, the decision tree may be the most popular technique for predictive modeling.
In addition, there are various decision tree inducing algorithms that may be used to build predictive models. These algorithms differ from each other in terms of their splitting criteria, pruning techniques, and handling of missing values and continuous variables. Examples of algorithms used to build decision trees include ID3, C4.5, C5.0, SLIQ, classification and regression trees (CART) and chi-square automatic interaction detector (CHAID). Each of these algorithms has several tuning parameters that may be varied and that may have an effect on the generated model. By way of example, the J48 algorithm, provided in the popular open source machine learning tool Weka for generating an un-pruned or a pruned C4.5 decision tree, has various tuning parameters. A user may set the value of a parameter called binary splits to true or false. Similarly, the user may set values for other tuning parameters, such as the use of the Laplace option or sub-tree raising.

However, no single algorithm performs well in all types of situations. As a result, a data mining practitioner relies on a trial and error approach, whereby the practitioner applies various algorithms to the dataset to build several predictive models. The practitioner may also tune the modifiable parameters of the algorithms with the aim of finding a high accuracy predictive model. Once many predictive models have been built, the practitioner selects the model that best suits the business needs and has good predictive accuracy. Based on past experience, a practitioner may also make a direct judgment as to which algorithms from a library of available algorithms may suit the business problem and the dataset. Further, there may be some pre-defined rules that map dataset characteristics and business needs to an algorithm, but these rules may not be exhaustive and cannot cover every possible case.
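The trial and error approach described above can be pictured as an exhaustive walk over every combination of algorithm and parameter setting. The following is a minimal sketch only, not the practitioner's or Weka's actual procedure. The search space, the parameter names (`pruned`, `binary_splits`, `use_laplace`, `subtree_raising`, `k`) and the `evaluate` callback are hypothetical names chosen for illustration; they mirror the wording of the text and are not Weka option flags.

```python
from itertools import product

# Hypothetical search space: names follow the text above, not any real tool's flags.
SEARCH_SPACE = {
    "decision_tree": {
        "pruned": [True, False],
        "binary_splits": [True, False],
        "use_laplace": [True, False],
        "subtree_raising": [True, False],
    },
    "nearest_neighbor": {"k": [1, 3, 5]},
}


def enumerate_configurations(space):
    """Yield every (algorithm, parameter setting) pair in the search space."""
    for algorithm, params in space.items():
        names = sorted(params)
        for values in product(*(params[name] for name in names)):
            yield algorithm, dict(zip(names, values))


def trial_and_error(space, evaluate):
    """Score every configuration with `evaluate` and return the best one found."""
    best = None
    for algorithm, setting in enumerate_configurations(space):
        score = evaluate(algorithm, setting)
        if best is None or score > best[2]:
            best = (algorithm, setting, score)
    return best
```

Even this small space contains 19 configurations, and each one would require building and testing a model, which illustrates why the manual process becomes costly as algorithms and parameters are added.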
Hence, in present systems the task of selecting a predictive modeling algorithm and its parameters is largely a manual process that requires a great deal of human judgment, skill and effort. Furthermore, in present systems and tools, the task of algorithm selection and of finding the best parameter setting to build a predictive data mining model is not automated and is mostly done using a trial and error approach. Accordingly, there is a need for a technique to bring automation to the task of finding an appropriate predictive data mining model.

BRIEF DESCRIPTION

A method to build a computer implemented software tool for automating a data mining process is disclosed. The method automates a set of tasks for building and selecting a set of predictive data mining models using a set of meta-heuristic search techniques. Furthermore, the method updates at least one classification algorithm with at least one predictive model building parameter to store into the tool.

A system to build a computer implemented software tool for automating a data mining process is also disclosed. The system comprises a set of meta-heuristic search techniques adapted to automate a set of tasks for building and selecting a set of predictive data mining models. Furthermore, the system comprises a tool adapted to store and update at least one classification algorithm with at least one predictive model building parameter.

DRAWINGS

These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings, in which like characters represent like parts throughout the drawings, wherein:

FIG. 1 is a block diagram of a system depicting a process of building and scoring a predictive data mining model, in accordance with an aspect of the present technique;

FIG. 2 is a block diagram of a system depicting a process of building the predictive data mining model using an automatic algorithm and parameter selection module (AAPSM), in accordance with an aspect of the present technique;

FIG. 3 is a block diagram of a system depicting the architecture of the automatic algorithm and parameter selection module (AAPSM), in accordance with an aspect of the present technique; and

FIG. 4 is a flow diagram illustrating the process of building a predictive data mining model, in accordance with an aspect of the present technique.

DETAILED DESCRIPTION

The following description is a full and informative description of the best method and system presently contemplated for carrying out the present invention which is known to the inventors at the time of filing the patent application. Of course, many modifications and adaptations will be apparent to those skilled in the relevant arts in view of the following description, the accompanying drawings and the appended claims. While the system and method described herein are provided with a certain degree of specificity, the present technique may be implemented with either greater or lesser specificity, depending on the needs of the user. Further, some of the features of the present technique may be used to advantage without the corresponding use of other features described in the following paragraphs. As such, the present description should be considered as merely illustrative of the principles of the present technique and not in limitation thereof, since the present technique is defined solely by the claims.

As a preliminary matter, the term "or", for the purpose of the following discussion and the appended claims, is intended to be an inclusive "or".
That is, the term "or" is not intended to differentiate between two mutually exclusive alternatives. Rather, the term "or", when employed as a conjunction between two elements, is defined as including one element by itself, the other element by itself, and combinations and permutations of the elements. For example, a discussion or recitation employing the terminology "A" or "B" includes "A" by itself, "B" by itself and any combination thereof, such as "AB" and "BA." It is worth noting that the present discussion relates to exemplary embodiments, and the appended claims should not be limited to the embodiments discussed herein.

As will be appreciated by people skilled in the art, to best understand the present technique it is important to be familiar with the environment in which it is used and with the basic terminology. Predictive modeling builds a classification or regression model that may accurately predict the value of a target column by observing the values of the input attributes. The process of building the predictive model is iterative in nature and requires specialized knowledge in the field of data mining and predictive modeling to build better predictive models. Moreover, one of the most critical tasks to be performed while building a predictive model is selecting the best algorithm and its tuning parameters from a library of a large number of predictive model building algorithms.

Data mining was once a new concept for many people. Data mining products were new and marred by unpolished interfaces, and only the most innovative or daring early adopters tried to apply the emerging tools. Today's products have matured, and data mining is accessible to a much wider audience. Specialized vertical market data mining products are even emerging. Data mining may extract new information from data.
In addition, data mining tools do more than query and analysis tools, online analytical processing (OLAP) tools, or statistical techniques such as analysis of variance, to name just a few examples. Understanding the kinds of questions data mining tools can answer is the best way to appreciate how they differ from other approaches. It should be noted that OLAP is an approach to quickly providing answers to analytical queries that are dimensional in nature. OLAP is part of the broader category of business intelligence, which also includes extract transform load (ETL), relational reporting and data mining. The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).

Data mining, also known as knowledge discovery in databases (KDD) or knowledge discovery and data mining, is the process of automatically searching large volumes of data for patterns. In addition, data mining may be defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large datasets or databases." Furthermore, predictive modeling is the process by which a model is created or chosen to try to best predict the probability of an outcome. In many cases, the model is chosen on the basis of detection theory to try to guess the probability of a signal given a set amount of input data, for example, given an e-mail, determining how likely it is to be spam. Additionally, models can use one or more classifiers in trying to determine the probability of a set of data belonging to another set, say 'spam' or 'ham'.
A classifier is a mapping from a discrete or continuous feature space X to a discrete set of labels Y. Predictive modeling is considered to be one of the most widely used data mining technologies and has been applied to multiple engineering and scientific disciplines. The objective of predictive modeling is to build a model from historical data that assigns records into various classes or categories based on their attributes. Additionally, a model may be learned using the historical data and may then be used to predict the membership of new records.

The present technique relates to building a tool to automate a data mining process. In particular, the technique relates to automating predictive data mining modeling and the process of building a predictive model. The present technique aims to build an intelligent system that automates the task of selecting an optimal predictive model building algorithm and its parameters so that human intervention may be reduced. The technique automates the task of algorithm selection, thereby increasing the overall productivity of a data mining practitioner and enabling large scale adoption of the technology due to a reduction in the skill required to use it. In addition, the technique provides an automatic algorithm and parameter selection module (AAPSM), built using meta-heuristic search techniques such as genetic algorithms, that performs the required automation. Furthermore, the AAPSM module requires minimal user interaction, finds a near-optimal algorithm and automatically sets the various algorithm parameters in a manner that is suitable for the dataset and the particular business problem.

FIG. 1 is a block diagram of a system 100 depicting a process for building and scoring a predictive data mining model, in accordance with an aspect of the present technique.
The system 100 comprises a historical database module 102, a classifier module 104, a predictive model module 106 and a database of classification algorithms module 108, in which resides a data miner 110. The historical database module 102 consists of a set of known records kept for future reference. The classifier module 104 may involve multiple classifiers of various classes. The classifier module 104 may predict the class memberships or values from sources of unusable or unseen records.

In one embodiment, predictive modeling is considered to be one of the most widely used data mining technologies and has been applied to many engineering and scientific disciplines. The objective of predictive modeling may be to build a predictive model module 106 from the historical data module 102 that assigns records into various classes or categories based on their attributes. An attribute may be defined as a parameter of an object or other kind of entity. A model may be learned using the historical data module 102 and is then used to predict the membership of new records. The process of building a predictive model and using it to score, or predict the class membership of, unseen records is illustrated in the system 100. One of the fields in the historical data module 102 may be designated as the target or class variable, and the other fields in the dataset may be referred to as independent variables, such as inputs or predictors.

In another embodiment, if the target variable is categorical, then the predictive modeling technique to use is classification, and if the target variable is continuous, then the most suitable form of predictive modeling technique is regression. Regression may be defined as a technique in which unknown values of a continuous variable are predicted based on known values of one or more continuous and/or discrete variables.
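The rule relating the type of the target variable to the modeling technique can be sketched in a few lines. This is an illustrative sketch under a stated assumption, not part of the described system: the function name `choose_modeling_task` is hypothetical, and a target is treated as continuous only when every value is a non-boolean number, so integer-coded class labels would need to be converted to strings first.

```python
from numbers import Number


def choose_modeling_task(target_values):
    """Return 'classification' for a categorical target, 'regression' for a continuous one."""
    # Assumption: a target counts as continuous only if every value is a non-boolean number.
    is_numeric = all(
        isinstance(value, Number) and not isinstance(value, bool)
        for value in target_values
    )
    return "regression" if is_numeric else "classification"
```

For instance, a target column holding the labels "yes" and "no" would be routed to classification, while a column of measured temperatures would be routed to regression.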
Additionally, the technique pertains to classification based predictive modeling, and hence the idea presented in this technique applies to classification based predictive modeling. In another embodiment, there may be various classification based predictive modeling algorithms. The range of classification methods includes decision trees, neural networks, support-vector machines, Bayes methods, lazy learning techniques and the nearest neighbor approach. It should be noted that, amongst the various classification methods, decision trees may be the most popular technique for predictive modeling. In addition, there are various decision tree inducing algorithms that may be used to build predictive models, and these algorithms differ from each other in terms of their splitting criteria, pruning techniques, and handling of missing values and continuous variables.

In another embodiment, some of the algorithms used to build decision trees include ID3 (a decision tree inducing algorithm), C4.5 (a decision tree generating algorithm based on the ID3 algorithm that contains various improvements, especially those needed for software implementation), C5.0 (a data mining tool for discovering patterns in extensive databases), SLIQ (a decision tree classifier that can handle both numeric and categorical data), CART (a classification and regression decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships) and CHAID (chi-squared automatic interaction detector, used to construct non-binary trees, which performs a more thorough merging and testing of predictor variables and requires more computing time). In another embodiment, each of the said algorithms may have several tuning parameters that can be varied and that have an effect on the generated model.
By way of example, the J48 algorithm (an implementation of the C4.5 decision tree learner that uses a greedy technique to induce decision trees for classification), provided in the popular open source machine learning tool Weka (machine learning software written in Java that implements various machine learning algorithms from various learning paradigms) for generating an un-pruned or a pruned C4.5 decision tree, has multiple tuning parameters. In yet another embodiment, a user may set the value of a parameter called binary splits to true or false. Similarly, the user may set values for other tuning parameters, such as the use of the Laplace option or sub-tree raising.

A data miner 110 relies on a trial and error approach, wherein the miner 110 applies various algorithms to the dataset module 112 to build one or more predictive models. The miner 110 may also fine tune the modifiable parameters of the algorithms with the aim of finding a high accuracy predictive model module 114. Additionally, once multiple predictive models have been built, the miner 110 selects the model 116 that best suits the business needs and has good predictive accuracy. Moreover, based on past experience, the miner 110 may also judge which algorithms from a library of available algorithms 108 will suit the business problem and the dataset module 112. In addition, there may be some pre-defined rules that can map dataset 112 characteristics and business needs to an algorithm, but these rules may not be exhaustive and cannot cover every possible case.

FIG. 2 is a block diagram of a system 200 depicting a process of building the predictive data mining model using an automatic algorithm and parameter selection module (AAPSM), in accordance with an aspect of the present technique. The system 200 explains the role of the AAPSM module in reducing the requirement for human intervention and increasing productivity.
The system 200 comprises input modules 205, further comprising an input dataset module 210, a classification algorithms and tuning parameters module 212, and an objective function module 214. The modules 210, 212 and 214 are fed as the inputs to the predictive model building process. The dataset module 210 may be, by way of example, the weather dataset from the Weka machine learning tool. The classification algorithms and tuning parameters module 212 allows users to fine tune the modifiable parameters of the algorithms with the aim of finding a high accuracy predictive model. The objective function module 214 allows users to select and submit the required objective functions, along with the dataset from the dataset module 210, to build and test multiple predictive models. In addition, before deploying a predictive model in an application, a data mining analyst builds multiple models and selects the best model according to the objective function module 214. In one embodiment, an automation module 216 reduces the amount of human intervention to increase productivity. In addition, the task of predictive modeling algorithm and its parameter selection automation to decrease the manual process
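The three inputs named above (a dataset, a library of algorithms with tuning parameters, and an objective function) are what a meta-heuristic search such as a genetic algorithm needs. The following is a minimal sketch of such a search and should not be read as the AAPSM implementation, which this text does not disclose. The search space, all function names, the mutation rate and the population settings are hypothetical, and the dataset is assumed to be captured inside the caller's `objective(algorithm, setting)` function, which returns a score to maximize.

```python
import random

# Hypothetical library of algorithms and tuning parameters (illustrative names only).
SEARCH_SPACE = {
    "decision_tree": {
        "binary_splits": [True, False],
        "use_laplace": [True, False],
        "subtree_raising": [True, False],
    },
    "nearest_neighbor": {"k": [1, 3, 5, 7]},
}


def random_candidate(space, rng):
    """Pick a random algorithm and a random value for each of its parameters."""
    algorithm = rng.choice(sorted(space))
    setting = {name: rng.choice(values) for name, values in sorted(space[algorithm].items())}
    return algorithm, setting


def crossover(a, b, rng):
    """Mix parameter values when both parents share an algorithm, else copy one parent."""
    if a[0] != b[0]:
        return rng.choice([a, b])
    return a[0], {name: rng.choice([a[1][name], b[1][name]]) for name in a[1]}


def mutate(candidate, space, rng, rate=0.3):
    """Occasionally switch algorithm; otherwise resample one parameter value."""
    algorithm, setting = candidate
    if rng.random() < rate:
        return random_candidate(space, rng)
    mutated = dict(setting)
    name = rng.choice(sorted(mutated))
    mutated[name] = rng.choice(space[algorithm][name])
    return algorithm, mutated


def genetic_search(space, objective, population_size=8, generations=10, seed=0):
    """Return the best (algorithm, setting) found and its objective score."""
    rng = random.Random(seed)
    population = [random_candidate(space, rng) for _ in range(population_size)]
    best = max(population, key=lambda c: objective(*c))
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: objective(*c), reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), space, rng)
            for _ in range(population_size - 1)
        ]
        population = [ranked[0]] + children  # elitism: keep the current best unchanged
        best = max(population + [best], key=lambda c: objective(*c))
    return best, objective(*best)
```

In practice the objective function would build a model with the given algorithm and setting and return, for example, its cross-validated accuracy on the input dataset. Because the best candidate is always carried forward, the reported score cannot get worse as the number of generations grows.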

Documents

Application Documents

#	Name	Date
1	1792-CHE-2006 FORM-18 06-10-2009.pdf	2009-10-06
2	1792-CHE-2006-AbandonedLetter.pdf	2018-01-10
3	1792-CHE-2006-FER.pdf	2017-05-05
4	1792-CHE-2006 FORM-13 28-10-2009.pdf	2009-10-28
5	1792-che-2006-form 5.pdf	2011-09-03
6	1792-CHE-2006 AMENDED PAGES OF SPECIFICATION 03-06-2015.pdf	2015-06-03
7	1792-che-2006-form 3.pdf	2011-09-03
8	1792-CHE-2006 CORRESPONDENCE OTHERS 03-06-2015.pdf	2015-06-03
9	1792-che-2006-form 1.pdf	2011-09-03
10	1792-CHE-2006 FORM-1 03-06-2015.pdf	2015-06-03
11	1792-che-2006-drawings.pdf	2011-09-03
12	1792-CHE-2006 FORM-13 03-06-2015.pdf	2015-06-03
13	1792-che-2006-description(complete).pdf	2011-09-03
14	1792-che-2006-abstract.pdf	2011-09-03
15	1792-che-2006-correspondnece-others.pdf	2011-09-03
16	1792-che-2006-claims.pdf	2011-09-03

Search Strategy

1	searchstrategy_04-01-2017.pdf