
Method And System For Waterfall Segmented Clustering For Classifying Mislabeled Data

Abstract: The performance of state-of-the-art classifiers deteriorates when the data has missing labels or mislabeled information. Embodiments provide a method and system for a Waterfall Segmented Clustering (WSC) technique for classifying mislabeled data, which is an automated clustering approach to identify patterns and classes in a labeled dataset that may include incorrectly labeled data items along with correctly labeled ones. Clustering metrics such as a Completeness Score are used along with a Clustering Evaluation Matrix with configurable hyperparameters that enable elimination of the mislabeled data by invoking a pruning criterion on clustering. The WSC technique disclosed herein provides stable and consistent class identification of mislabeled data and is data type agnostic. Further, the WSC technique generates a waterfall segmented classifier, which is self-trained and can classify any new data item to a prior identified class, identify outliers and/or generate a new class.


Patent Information

Application #
Filing Date
27 September 2021
Publication Number
13/2023
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
kcopatents@khaitanco.com
Parent Application
Patent Number
Legal Status
Grant Date
2024-03-08
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. GUPTA, Ashit
Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013
2. RUNKANA, Venkataramana
Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013
3. MUKHERJEE, Tathagata
Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013
4. DEODHAR, Anirudh
Tata Consultancy Services Limited Tata Research Development & Design Centre, 54-B, Hadapsar Industrial Estate, Hadapsar, Pune Maharashtra India 411013

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR WATERFALL SEGMENTED CLUSTERING FOR CLASSIFYING MISLABELED DATA
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD [001] The embodiments herein generally relate to data classifiers and, more particularly, to a method and system for Waterfall Segmented Clustering (WSC) for classifying mislabeled data.
BACKGROUND
[002] Data mining and pattern recognition have enabled businesses to be proactive and take smart decisions based on data-derived knowledge over the past two decades. The most common and perhaps the most challenging type of application is classification, which can range from binary to multiclass classification. Supervised classification techniques work well on data rich in quality and quantity, but their performance declines as the richness of the data drops. The issue could be due to lack of sufficient data, partially labeled data or even mislabeling of data. Unsupervised clustering may help solve some of these issues and assist in identifying patterns from the datasets. Although they are effective in checking for coherence in the data, clustering models are not indicative in nature and lack specific outcome measures. Identifying the correct set of attributes for effective clustering is sometimes a challenge. Semi-supervised classification (SSC) has emerged as a viable alternative for learning using partially labeled data. SSC techniques exploit the presence of unlabeled data to enhance the accuracy of the initial classifier built using the labeled data. Some of the SSC approaches include self-training, co-training, transductive support vector machines, generative models, and graph-based methods. Self-training initially trains a supervised classifier using only the labeled data and then identifies a label for unlabeled data using the classifier. The high-confidence labeled data is used again to re-train the classifier. This iterative process yields a better classifier model than the initial one.
[003] However, the accuracy of the final classifier often deteriorates when the data space occupied by the labeled data is not representative of the entire data space. Semi-supervised clustering has also been proposed, where the clustering technique is provided with observation-linking constraints to force specific observations to fall under one cluster or separate out into different clusters. However, predefining the constraints may not always be possible. Utilizing clustering outcomes to improve the performance of semi-supervised classification has also been proposed previously.
[004] Although handling of missing labels (unlabeled data) during classification and improving the classification accuracy using clustering have been effectively addressed by the existing methods mentioned above, there is relatively limited work on handling mislabeled data. Identifying mislabeled data is a technically challenging task. When such mislabeled data is used for guiding the classification/clustering in semi-supervised classification techniques, it may hamper the learning of the underlying data space. Secondly, despite recent advances in Automated Machine Learning, especially for supervised learning, studies on the automation of semi-supervised classification are limited.
[005] Recently, Neural Network (NN) based approaches have been used for appropriate classification of mislabeled data to improve the quality of ground truth data or training data for building classifiers. However, NN based approaches inherently require large datasets. This limitation introduces a challenge while applying NN based techniques for mislabeled data identification or outlier detection when the available datasets are limited. Further, NN based approaches consume more computational resources. Moreover, existing approaches focus on only a specific type of data, such as image data. Thus, state-of-the-art techniques for handling mislabeled data perform well for only a specific type of data, limiting the scope of the application areas. Furthermore, the existing NN based approaches cannot handle new classes on the fly and require retraining for handling new data items belonging to a new or unknown class.
SUMMARY [006] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.

[007] For example, in one embodiment, a method for Waterfall Segmented Clustering (WSC) for classifying mislabeled data is provided. The method includes receiving a labeled dataset, wherein each data item in the labeled data set has a correct label or an incorrect label among a plurality of labels, and preprocessing the labeled data set to generate a preprocessed labeled data set.
[008] The method further includes applying, iteratively until a stopping criteria is satisfied, a waterfall segmented clustering (WSC) technique on the preprocessed labeled dataset, via the one or more hardware processors, to obtain a multi-level parent-child cluster tree structure. The WSC technique comprises: a) creating a plurality of unique sets of features representing a plurality of combinations of features associated with the preprocessed labeled data set using a feature selection technique; b) performing an unsupervised clustering on the preprocessed labeled data set based on each of the plurality of sets of features to generate a plurality of clustering results, wherein the number of clusters in each of the plurality of clustering results is determined by an optimum cluster criteria, and wherein each cluster among each of the plurality of clustering results comprises data items associated with one or more labels from among the plurality of associated labels; c) computing a Completeness Score (CS) for each of the plurality of clustering results; d) identifying a complete clustering result from among the plurality of clustering results having a maximum CS (CSmax); and e) comparing the CSmax of the complete clustering result with a completeness threshold (λCS) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree. Further, (i) clustering of the child clusters within the complete clustering result is terminated if the CSmax is less than the completeness threshold (λCS), indicating that further separation of data items is not meaningful, and all the data items associated with the complete clustering result are assigned to a new class, and (ii) clustering of the child clusters within the complete clustering result is further evaluated based on a Cluster Evaluation Matrix (CEM) if the CSmax is higher than the completeness threshold (λCS), wherein the evaluation of the complete clustering result based on the CEM comprises: 1) generating the CEM for the complete clustering result; 2) comparing each element of the CEM with an outlier threshold (λOL), where, if any element of the CEM is less than the outlier threshold (λOL), the data items associated with the corresponding elements of the CEM are termed outliers, indicating the presence of mislabeled data, and are removed from subsequent processing; 3) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as another new class if at least one element of the CEM associated with the child cluster is greater than the CEM threshold (λCEM); 4) comparing each element of the CEM with the CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as a separable child cluster if none of the elements of the CEM in the child cluster is greater than the CEM threshold (λCEM); 5) identifying a plurality of separable child clusters generated based on the CEM threshold (λCEM); and 6) separating the data items associated with the plurality of separable child clusters and repeating the WSC technique until the stopping criteria is satisfied for generating the multi-level parent-child cluster tree structure comprising the plurality of separable child clusters, wherein the stopping criteria comprises terminating the WSC technique if all the data items associated with the plurality of separable child clusters are assigned to at least one class among a plurality of classes or removed as the outliers, wherein one or more child clusters are assigned as classes based on the WSC technique, and wherein information associated with each child cluster of the WSC technique and corresponding data items is stored for future reference.
[009] Furthermore, the method includes generating a waterfall segmented classifier for identifying a class of a new data item, based on the multi-level parent-child tree structure obtained during the WSC technique.
[0010] In another aspect, a system for Waterfall Segmented Clustering (WSC) for classifying mislabeled data is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to receive a labeled dataset, wherein each data item in the labeled data set has a correct label or an incorrect label among a plurality of labels, and to preprocess the labeled data set to generate a preprocessed labeled data set.
[0011] Further, the system iteratively applies, until a stopping criteria is satisfied, the waterfall segmented clustering (WSC) technique on the preprocessed labeled dataset, via the one or more hardware processors, to obtain a multi-level parent-child cluster tree structure. The WSC technique comprises: a) creating a plurality of unique sets of features representing a plurality of combinations of features associated with the preprocessed labeled data set using a feature selection technique; b) performing an unsupervised clustering on the preprocessed labeled data set based on each of the plurality of sets of features to generate a plurality of clustering results, wherein the number of clusters in each of the plurality of clustering results is determined by an optimum cluster criteria, and wherein each cluster among each of the plurality of clustering results comprises data items associated with one or more labels from among the plurality of associated labels; c) computing a Completeness Score (CS) for each of the plurality of clustering results; d) identifying a complete clustering result from among the plurality of clustering results having a maximum CS (CSmax); and e) comparing the CSmax of the complete clustering result with a completeness threshold (λCS) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree. Further, (i) clustering of the child clusters within the complete clustering result is terminated if the CSmax is less than the completeness threshold (λCS), indicating that further separation of data items is not meaningful, and all the data items associated with the complete clustering result are assigned to a new class, and (ii) clustering of the child clusters within the complete clustering result is further evaluated based on a Cluster Evaluation Matrix (CEM) if the CSmax is higher than the completeness threshold (λCS), wherein the evaluation of the complete clustering result based on the CEM comprises: 1) generating the CEM for the complete clustering result; 2) comparing each element of the CEM with an outlier threshold (λOL), where, if any element of the CEM is less than the outlier threshold (λOL), the data items associated with the corresponding elements of the CEM are termed outliers, indicating the presence of mislabeled data, and are removed from subsequent processing; 3) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as another new class if at least one element of the CEM associated with the child cluster is greater than the CEM threshold (λCEM); 4) comparing each element of the CEM with the CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as a separable child cluster if none of the elements of the CEM in the child cluster is greater than the CEM threshold (λCEM); 5) identifying a plurality of separable child clusters generated based on the CEM threshold (λCEM); and 6) separating the data items associated with the plurality of separable child clusters and repeating the WSC technique until the stopping criteria is satisfied for generating the multi-level parent-child cluster tree structure comprising the plurality of separable child clusters, wherein the stopping criteria comprises terminating the WSC technique if all the data items associated with the plurality of separable child clusters are assigned to at least one class among a plurality of classes or removed as the outliers, wherein one or more child clusters are assigned as classes based on the WSC technique, and wherein information associated with each child cluster of the WSC technique and corresponding data items is stored for future reference.
[0012] Furthermore, the system generates a waterfall segmented classifier for identifying a class of a new data item, based on the multi-level parent-child tree structure obtained during the WSC technique.
[0013] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors cause a method for Waterfall Segmented Clustering (WSC) for classifying mislabeled data to be performed.
[0014] The method includes receiving a labeled dataset, wherein each data item in the labeled data set has a correct label or an incorrect label among a plurality of labels, and preprocessing the labeled data set to generate a preprocessed labeled data set.
[0015] The method further includes applying, iteratively until a stopping criteria is satisfied, a waterfall segmented clustering (WSC) technique on the preprocessed labeled dataset, via the one or more hardware processors, to obtain a multi-level parent-child cluster tree structure. The WSC technique comprises: a) creating a plurality of unique sets of features representing a plurality of combinations of features associated with the preprocessed labeled data set using a feature selection technique; b) performing an unsupervised clustering on the preprocessed labeled data set based on each of the plurality of sets of features to generate a plurality of clustering results, wherein the number of clusters in each of the plurality of clustering results is determined by an optimum cluster criteria, and wherein each cluster among each of the plurality of clustering results comprises data items associated with one or more labels from among the plurality of associated labels; c) computing a Completeness Score (CS) for each of the plurality of clustering results; d) identifying a complete clustering result from among the plurality of clustering results having a maximum CS (CSmax); and e) comparing the CSmax of the complete clustering result with a completeness threshold (λCS) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree. Further, (i) clustering of the child clusters within the complete clustering result is terminated if the CSmax is less than the completeness threshold (λCS), indicating that further separation of data items is not meaningful, and all the data items associated with the complete clustering result are assigned to a new class, and (ii) clustering of the child clusters within the complete clustering result is further evaluated based on a Cluster Evaluation Matrix (CEM) if the CSmax is higher than the completeness threshold (λCS), wherein the evaluation of the complete clustering result based on the CEM comprises: 1) generating the CEM for the complete clustering result; 2) comparing each element of the CEM with an outlier threshold (λOL), where, if any element of the CEM is less than the outlier threshold (λOL), the data items associated with the corresponding elements of the CEM are termed outliers, indicating the presence of mislabeled data, and are removed from subsequent processing; 3) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as another new class if at least one element of the CEM associated with the child cluster is greater than the CEM threshold (λCEM); 4) comparing each element of the CEM with the CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as a separable child cluster if none of the elements of the CEM in the child cluster is greater than the CEM threshold (λCEM); 5) identifying a plurality of separable child clusters generated based on the CEM threshold (λCEM); and 6) separating the data items associated with the plurality of separable child clusters and repeating the WSC technique until the stopping criteria is satisfied for generating the multi-level parent-child cluster tree structure comprising the plurality of separable child clusters, wherein the stopping criteria comprises terminating the WSC technique if all the data items associated with the plurality of separable child clusters are assigned to at least one class among a plurality of classes or removed as the outliers, wherein one or more child clusters are assigned as classes based on the WSC technique, and wherein information associated with each child cluster of the WSC technique and corresponding data items is stored for future reference.
[0016] Furthermore, the method includes generating a waterfall segmented classifier for identifying a class of a new data item, based on the multi-level parent-child tree structure obtained during the WSC technique.
[0017] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[0019] FIG. 1 is a functional block diagram of a system, for Waterfall Segmented Clustering (WSC) technique for classifying mislabeled data, in accordance with some embodiments of the present disclosure.
[0020] FIG. 2 is a flow diagram illustrating a method for the WSC technique for classifying mislabeled data, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.

[0021] FIGS. 3A and 3B (collectively referred to as FIG. 3) illustrate a process for classifying a new data item using a self-trained waterfall segmented classifier generated during the WSC technique based classification of the mislabeled data, in accordance with some embodiments of the present disclosure.
[0022] FIGS. 4A, 4B and 4C depict an example of the WSC technique applied on a sample labeled dataset comprising correctly labeled data items and incorrectly labeled data items, in accordance with some embodiments of the present disclosure.
[0023] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS [0024] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[0025] Embodiments of the present disclosure provide a method and system for a Waterfall Segmented Clustering (WSC) technique for classifying mislabeled data. The WSC technique disclosed herein is an automated clustering approach to identify patterns and classes in a labeled dataset that may include incorrectly labeled data items along with correctly labeled data items, interchangeably referred to as mislabeled data. Clustering metrics such as a Completeness Score (CS) are used along with a Clustering Evaluation Matrix (CEM) with configurable hyperparameters that enable elimination of the mislabeled data by invoking a pruning criterion on clustering. The WSC technique disclosed herein provides stable and consistent class identification of mislabeled data, is data type agnostic, and can handle image data, sensor data, NLP data and the like. Unlike Neural Network (NN) based approaches, the WSC performs well irrespective of the size of the data, large or small. Further, the WSC technique generates a waterfall segmented classifier, interchangeably referred to as the WSC classifier, which is self-trained and can classify any new data item to a prior identified class, identify outliers and/or generate a new class.
[0026] Referring now to the drawings, and more particularly to FIGS. 1 through 4C, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0027] FIG. 1 is a functional block diagram of a system 100, for the Waterfall Segmented Clustering (WSC) technique for classifying the mislabeled data, in accordance with some embodiments of the present disclosure.
[0028] In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
[0029] Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100

can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like.
[0030] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface to display the generated target images and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular and the like. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices.
[0031] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0032] Further, the memory 102 includes a database 108 to store data related to the WSC technique and the waterfall segmented classifier. Further, the memory 102 includes a WSC module (not shown) executing, via the one or more hardware processors, the WSC technique, the waterfall segmented classifier (not shown), and the like. Further, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106. Functions of the components of the system 100 are explained in conjunction with the flow diagrams of FIG. 2 and FIG. 3 and the example depicted in FIGS. 4A through 4C.
[0033] FIG. 2 is a flow diagram illustrating a method 200 for the WSC technique for classifying mislabeled data, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.

[0034] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of the flow diagrams as depicted in FIG. 2 and FIG. 3. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[0035] Referring to the steps of the method 200, at step 202 of the method 200, the one or more hardware processors 104 receive a labeled dataset, wherein each data item (corresponding to a recording or an observation) in the labeled data set may have a correct label or an incorrect label (mislabeled), and wherein the labeled data set may have data items tagged with a label from amongst a plurality of labels.
[0036] At step 204 of the method 200, the one or more hardware processors 104 preprocess the labeled data set to generate a preprocessed labeled data set. The preprocessing can include cleaning, transforming (scaling the data) and the like, and is a well-understood process in the domain.
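As an illustration only, a minimal preprocessing sketch for tabular data is shown below, assuming the features are held in a pandas DataFrame with the label column kept aside; the specific cleaning and scaling steps are typical choices, not steps mandated by the specification.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(features: pd.DataFrame) -> pd.DataFrame:
    # cleaning: drop incomplete and duplicated observations
    features = features.dropna().drop_duplicates()
    # transforming: scale every feature to zero mean and unit variance
    scaled = StandardScaler().fit_transform(features)
    return pd.DataFrame(scaled, columns=features.columns, index=features.index)
```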
[0037] At step 206 of the method 200, the one or more hardware processors 104 apply a waterfall segmented clustering (WSC) technique on the preprocessed labeled dataset to obtain a multi-level parent-child tree structure based on a waterfall segmented clustering approach. The WSC technique iterates until a stopping criteria is reached. The clustering process for the labeled dataset (dataset) with "n" features can be considered as observing the data from a frame of reference with "n" coordinates, aggregating the points close to each other and separating the points that are away from each other. However, true separation and aggregation may not be possible by looking at the dataset from only one frame of reference in a single go.

Thus, the method 200 disclosed herein applies a waterfall segmented clustering approach that observes and separates data items in the dataset at multiple levels while maximizing the clustering effectiveness at every level by changing the frame of reference using feature combinations and number of clusters.
[0038] The WSC technique is a semi-supervised clustering approach that builds or trains the waterfall segmented classifier for mislabeled data classification. The WSC technique is automated by measuring the clustering effectiveness at every level using the available knowledge through labels and identifying the stopping criterion via the clustering evaluation matrix (CEM). The steps of the WSC technique are described below; a consolidated sketch of a single WSC iteration is provided after the step list.
a) Create a plurality of unique sets of features representing a plurality of combinations of features associated with the preprocessed labeled data set using a feature selection technique (206a). Entropy based feature selection (EFS) or the like can be used to select a set of features for each unique set.
b) Perform unsupervised clustering on the preprocessed labeled data set based on each of the plurality of sets of features to generate a plurality of clustering results, wherein the number of clusters in each of the plurality of clustering results is determined by an optimum cluster criteria. Each cluster among each of the plurality of clustering results comprises data items associated with one or more labels from among the plurality of associated labels (206b). In an embodiment, clustering utilizes a distance-based clustering technique such as k-means or k-medoids. However, any other clustering approach, such as density-based clustering and the like, can be used. Although the data items have labels, the labels are not considered at this point. Thus, multiple sets of combinations of features, or multi-dimensional frames of reference, are created, with the number of combinations = S. A Silhouette score or a similar score can be used as the optimum cluster criteria for identifying the number of clusters for a given combination of features (from the set S). In one implementation of the method 200, the maximum number of clusters is limited by the number of labels, as this limitation provides the best result for the specific dataset under consideration.

c) Compute the Completeness Score (CS) for each of the plurality of clustering results (206c). The semi-supervised WSC is initiated by computing the CS for all sets (S) using the available labels. A higher CS (CSmax) indicates that the maximum number of observations under the same label are assigned to a single cluster. It can be noted that the WSC technique assumes there is scope for two or more labels forming a single class (or a cluster). The mathematical formulation of the Completeness Score (CS) is given below:

$$CS = 1 - \frac{H(K \mid C)}{H(K)}$$

$$H(K \mid C) = -\sum_{c=1}^{C}\sum_{k=1}^{K}\frac{n_{ck}}{N}\log\!\left(\frac{n_{ck}}{\sum_{k'=1}^{K} n_{ck'}}\right), \qquad H(K) = -\sum_{k=1}^{K}\frac{\sum_{c=1}^{C} n_{ck}}{N}\log\!\left(\frac{\sum_{c=1}^{C} n_{ck}}{N}\right)$$

where K is the total number of clusters, C is the total number of labels, H(K|C) represents the conditional entropy of the cluster distribution given the labels, n_ck is the absolute count of data items for a given cluster number k and label number c, and N is the sum of all elements of the [K, C] matrix.
d) Identify the complete clustering result from among the plurality of clustering results having the maximum CS (CSmax) (206d). Thus, the clustering result (among the S combinations) that has the highest completeness score is selected as the best result (the complete clustering result).
e) Compare the CSmax of the complete clustering result with a completeness threshold (λCS) (206e) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree. The λCS is one among the plurality of configurable hyperparameters provided by the WSC technique. The value of λCS is defined based on the type of dataset to be classified and can be defined by a subject matter expert. Based on the comparison, the WSC technique iterates to identify the best clusters and tags the corresponding class. The evaluation of the clustering result having the maximum CS based on the completeness threshold (λCS) is as provided below:

(i) Terminate clustering of the child clusters within the complete clustering result if the CSmax is less than the completeness threshold (λCS), indicating that further separation of data items is not meaningful, and assign all the data items associated with the complete clustering result to a new class.
(ii) Further evaluate the clustering of the child clusters within the complete clustering result based on a Cluster Evaluation Matrix (CEM) if the CSmax is higher than the completeness threshold (λCS). However, the decision on whether further clustering is required cannot be taken based on the Completeness Score (CS) alone. Therefore, the method 200 discloses a matrix called the Cluster Evaluation Matrix (CEM) that helps in deciding whether a particular cluster should be separated further or not. The CEM enables determination of a cluster as a class and separation of outliers of a class (probable mislabeled data items or observations). It serves as a stopping criterion for pruning of the cascading architecture. Each element A of the CEM is computed using equation (4), where a first logarithmic term inside a max function accounts for the number of data items associated with a label in each cluster, a second exponential term in the max function accounts for the relative proportion of data items associated with that label across different clusters, and a term outside the max function provides a measure of the dominance of a particular label in the cluster. The dimensions of the generated CEM are [K, C], where K is the total number of clusters and C is the total number of labels. Often, the observations belonging to a specific label get separated into multiple clusters during clustering due to outliers and mislabeled data under that label. The relative proportion of how the observations (data items in the labeled dataset) are separated across different clusters and the absolute number of observations separated in a given cluster are two important aspects for deciding on continuing the clustering process. Two configurable hyperparameter thresholds, a CEM threshold (λCEM) and an outlier threshold (λOL), are designed for this decision making and can be fine-tuned based on the nature of the dataset, prior knowledge about the labels and the desired number of classes. If any element of the CEM is less than the outlier threshold (λOL), then the observations corresponding to that cluster and label are termed outliers and are removed from subsequent clustering; these are probably the mislabeled observations associated with each of the labels. Then, for each cluster, all the elements of the CEM are checked against the CEM threshold (λCEM). If any element of a chosen cluster is greater than the CEM threshold (λCEM), that cluster is designated as a class; the calculation of the CEM is done in such a way that this designation holds irrespective of the values of other elements in that cluster. If no value in a given cluster of the CEM is greater than the CEM threshold (λCEM), the cluster is assigned for further clustering, and the data items or observations associated with that cluster are taken forward in a cascaded manner for re-clustering using the same methodology. The cascading of clusters and their re-clustering creates a tree-like structure (the multi-level parent-child cluster tree) until all branches of the tree are halted based on the CS and the CEM. The bottommost or leaf nodes of the tree are the identified classes. The evaluation of the complete clustering result based on the CEM is depicted in the steps below:
1) Generate the CEM for the complete clustering result.
2) Compare each element of the CEM with the outlier threshold (λOL); if any element of the CEM is less than the outlier threshold (λOL), then the data items associated with the corresponding elements of the CEM are termed outliers, indicating the presence of mislabeled data, and are removed from subsequent processing.
3) Compare each element of the CEM with the CEM threshold (λCEM), and assign a child cluster among the complete clustering result as another new class if at least one element of the CEM associated with the child cluster is greater than the CEM threshold (λCEM).
4) Compare each element of the CEM with the CEM threshold (λCEM), and assign a child cluster among the complete clustering result as a separable child cluster if none of the elements of the CEM in the child cluster is greater than the CEM threshold (λCEM).
5) Identify a plurality of separable child clusters generated based on the CEM threshold (λCEM).
6) Separate the data items associated with the plurality of separable child clusters and repeat the WSC technique, until the stopping criteria is satisfied, for generating the multi-level parent-child cluster tree structure comprising the plurality of separable child clusters. The one or more child clusters are assigned as classes based on the WSC technique, wherein the stopping criteria comprises terminating the WSC technique if all the data items associated with the plurality of separable child clusters are assigned to at least one class among the plurality of classes or removed as the outliers. Further, information associated with each child cluster of the WSC technique and corresponding data items is stored.
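The step list above walks through one full pass of the waterfall. A minimal, runnable sketch of a single iteration is given below, assuming X is a NumPy feature matrix, y an integer label vector, and scikit-learn's KMeans, silhouette score and completeness score. The threshold values, the exhaustive subset enumeration (an entropy-based feature selection would prune it), and, in particular, the `build_cem` stand-in are illustrative assumptions; the patented equation (4) for the CEM elements is not reproduced in this text.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score, silhouette_score


def build_cem(clusters, y):
    """Stand-in for equation (4), which is not reproduced in this text.

    Each element here is simply the fraction of a cluster occupied by a
    label; the patented CEM instead combines a logarithmic count term, an
    exponential proportion term and a dominance term.
    """
    cluster_ids, label_ids = np.unique(clusters), np.unique(y)
    cem = np.zeros((len(cluster_ids), len(label_ids)))
    for i, k in enumerate(cluster_ids):
        in_k = clusters == k
        for j, c in enumerate(label_ids):
            cem[i, j] = np.sum(in_k & (y == c)) / np.sum(in_k)
    return cem


def wsc_iteration(X, y, lambda_cs=0.8, lambda_cem=0.7, lambda_ol=0.05):
    n_labels = len(np.unique(y))
    best = None  # (CS, cluster assignment, feature subset)
    # a) enumerate unique feature subsets (exhaustive here for clarity)
    for size in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), size):
            Xs = X[:, subset]
            # b) pick the number of clusters by the silhouette criterion,
            #    capped by the number of labels as in the method
            candidates = []
            for k in range(2, n_labels + 1):
                labels_k = KMeans(n_clusters=k, n_init=10).fit_predict(Xs)
                candidates.append((silhouette_score(Xs, labels_k), labels_k))
            _, clusters = max(candidates, key=lambda t: t[0])
            # c) Completeness Score of this clustering result
            cs = completeness_score(y, clusters)
            if best is None or cs > best[0]:
                best = (cs, clusters, subset)  # d) complete clustering result
    cs_max, clusters, subset = best
    report = {
        "features": subset, "cs_max": cs_max, "single_new_class": False,
        "outliers": np.zeros(len(y), dtype=bool),
        "class_clusters": [], "separable_clusters": [],
    }
    # e) the completeness threshold decides: stop here, or evaluate the CEM
    if cs_max < lambda_cs:
        report["single_new_class"] = True  # all items form one new class
        return clusters, report
    cem = build_cem(clusters, y)
    label_ids = np.unique(y)
    for i, k in enumerate(np.unique(clusters)):
        for j, c in enumerate(label_ids):
            if 0 < cem[i, j] < lambda_ol:  # probable mislabeled items
                report["outliers"] |= (clusters == k) & (y == c)
        if np.any(cem[i] > lambda_cem):
            report["class_clusters"].append(k)      # designated as a class
        else:
            report["separable_clusters"].append(k)  # re-clustered next level
    return clusters, report
```

Each separable cluster would then re-enter `wsc_iteration` with only its own rows of X, which is what produces the cascading, tree-like structure described above.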
[0039] The information associated with each child cluster of the WSC technique and corresponding data items comprises: the features associated with each cluster at every level; measures of central tendency for each cluster, comprising a median and a mean; basic descriptive statistics, comprising the standard deviation of each cluster; data point labels associated with each cluster; and the clustering results along with corresponding metrics, comprising the CS and the CEM, at every level.
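The stored per-cluster information can be pictured as a small record per tree node; the following sketch is illustrative, and its field names (including `member_distances`, retained here for the closeness check used during classification) are assumptions rather than names from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class ClusterNode:
    level: int                      # depth in the parent-child tree
    feature_subset: tuple           # features used for clustering at this level
    mean: np.ndarray                # measures of central tendency
    median: np.ndarray
    std: np.ndarray                 # basic descriptive statistics
    member_labels: np.ndarray       # data point labels within this cluster
    cs: float                       # Completeness Score of the clustering result
    cem_row: np.ndarray             # this cluster's row of the CEM
    member_distances: Optional[np.ndarray] = None  # member-to-mean distances,
                                                   # kept for the IQR check
    assigned_class: Optional[int] = None  # set once the cluster becomes a class
    children: List["ClusterNode"] = field(default_factory=list)
```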
[0040] At step 208 of the method 200, the one or more hardware processors 104 generate a waterfall segmented classifier based on the parent-child tree structure obtained during the WSC technique. This generated waterfall segmented classifier is then utilized for identifying the class of a new data item that is to be added to the labeled data set, classifying it to the most appropriate class. The class identification steps for a new data entry are depicted in FIG. 3 and include the following (a sketch of this classification walk is provided after the steps):
a) Receive (302) the new data item.
b) Preprocess (304) the new data item using steps similar to the ones used prior to applying the WSC technique to the labeled dataset. This generates a preprocessed new data item.
c) Classify (306) the preprocessed new data item into one of the plurality of classes. The class identification steps include:
1) Identifying (306a) the closest cluster among the plurality of child clusters at the lowest level of the multi-level parent-child tree structure obtained by the WSC technique. The closest cluster is identified successively at every level of the multi-level parent-child tree structure using a distance metric and a measure of central tendency of the respective cluster, based on the feature set used from among the plurality of feature sets at the respective level of the WSC technique. A lowest-level child cluster is assigned as a class. This is done using any distance-based metric, such as the Euclidean distance, calculated based on only the features used at that level of clustering. A central measure such as the mean or median of each cluster is used to calculate this distance. The choice of distance and central measure can vary depending on the distribution of the data and the clustering technique used during waterfall segmented clustering. This process is repeated at each level until the nearest class is identified for the new observation or data item. If the new observation is found to be dissimilar with respect to the existing classes or clusters at any level, a new label is assigned to it temporarily. If a sufficient number of such new labels are obtained at any point during the classification exercise, a new class can be formulated by the waterfall segmented classifier by combining the initial training set and the newly labeled test set used during classification. This type of classifier can be easily trained even with partially mislabeled data because it inherits the benefits of the WSC technique.
2) Assigning (306b) the new data item to the closest cluster at the lowest level of the multi-level parent-child tree structure obtained by the WSC technique.
3) Computing (306c) a Probability of Closeness (PC) measure of the new data item to the assigned closest cluster, indicating whether the new data item is within an acceptable limit of the cluster or is an outlier to the cluster. In an example implementation, the PC is computed in terms of the Inter-Quartile Range (IQR), and the new data item is considered an outlier if the distance metric is greater than three times the IQR (306e). Any other closeness measure may be used in an alternative implementation.
4) Assigning (306d) the new data item to the class associated with the assigned closest cluster if the PC is found to be within a PC threshold. The PC threshold can be configured based on the dataset and the expected clustering. The new data item is added to the definition of the assigned class for future use. The addition of classified new data items to the class definition enables automatic update of the waterfall segmented classifier, making it a robust classifier.
5) Assigning (306e) the new data item as an outlier of the class associated with the assigned cluster if PC is beyond the PC threshold.
6) Creating (306f) a new class if the number of data items assigned as outliers of a class exceeds a set threshold, wherein the information related to the data items is updated into the waterfall segmented classifier, enabling it to update itself using new data items. The information could be a new class name, the associated data items, the position in the parent-child tree (new child created) and so on.
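A minimal sketch of this classification walk (steps 306a through 306e) is shown below, assuming the illustrative ClusterNode record given earlier, Euclidean distance to cluster means, and the three-times-IQR rule from the example implementation. Returning None for an outlier, and omitting the new-class creation of step 306f, are simplifications for illustration only.

```python
import numpy as np


def classify_new_item(item, root, pc_multiplier=3.0):
    # 306a: descend the tree, choosing the closest child at every level
    # using only the features that were used for clustering at that level
    node = root
    while node.children:
        node = min(
            node.children,
            key=lambda c: np.linalg.norm(item[list(c.feature_subset)] - c.mean),
        )
    # 306b/306c: closest lowest-level cluster reached; compute the
    # Probability of Closeness via the IQR of stored member distances
    d = np.linalg.norm(item[list(node.feature_subset)] - node.mean)
    q1, q3 = np.percentile(node.member_distances, [25, 75])
    if d > pc_multiplier * (q3 - q1):
        return None               # 306e: beyond the PC threshold, an outlier
    return node.assigned_class    # 306d: class of the assigned closest cluster
```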
[0041] FIGS. 4A, 4B and 4C depict an example of the WSC technique applied on a sample labeled dataset (actual data), in accordance with some embodiments of the present disclosure. As depicted, a labeled dataset is received comprising multiple data items per label (number of observations per label). At the first level, the labeled data is clustered based on various combinations of features to generate the plurality of clustering results. A clustering result having the maximum CS, which has a feature combination set of [x1, x2, x3, x4], is selected as shown in FIGS. 4A and 4B. Further, the CEM is computed for the selected clustering result, which is the complete clustering result. The CSmax of the complete clustering result at cluster level 1 is compared with the completeness threshold (λCS) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree. Since the CEM elements associated with cluster C3 satisfy the CEM criterion, this cluster is identified as a class, while C1 and C2 are considered for further creation of child clusters, as the CEM elements associated with them do not satisfy the criterion. Similarly, the unique feature set based clustering is repeated at the child cluster level, and a clustering result at level 2 is selected based on the CSmax at level 2. Further, the CEM is computed at level 2 to check whether further clustering is possible based on (λCEM) and whether any outliers (mislabeled data) are present based on (λOL). FIG. 4C depicts the multi-level parent-child cluster tree generated for the received labeled dataset with the appropriate classes tagged in.
[0042] RESULTS AND DISCUSSION: The effectiveness of the WSC technique and the generated waterfall segmented classifier, also referred to hereinafter as the classification model, was evaluated on four datasets, including publicly available datasets as shown in Table 1.
TABLE 1

Dataset      Data items (# observations)   # labels   # features   Mislabel (%)
Ecoli        337                           7          7            10
Wine         179                           3          13           10
Eucalyptus   680                           13         12           15
[0043] Each dataset was divided into train data (for the WSC technique, 85-90%) and test data (for the classifier, 10-15%). Multiple variations of the datasets were created by random mislabeling of the training part of each dataset (10-30% mislabeling). The test data for each dataset was free from mislabeling and was kept constant across the variations of the datasets. The WSC technique was run on the training data for each dataset, and the corresponding classifier was run on the respective test dataset. The Support Vector Machine (SVM) was selected as a comparative technique. The SVM-based classifier was trained on both perfectly labeled and mislabeled training datasets. A comparison of the SVM classifier against the waterfall segmented classifier on the selected datasets was performed.
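A small sketch of how such mislabeled dataset variations can be generated is shown below; the default fraction, the random seed, and the function name are illustrative choices, not values from the specification.

```python
import numpy as np


def inject_mislabels(y, fraction=0.10, seed=0):
    # randomly replace a fraction of training labels with a different label
    rng = np.random.default_rng(seed)
    y = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    labels = np.unique(y)
    for i in idx:
        wrong = labels[labels != y[i]]  # any label other than the true one
        y[i] = rng.choice(wrong)
    return y
```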
[0044] The Waterfall Segmented Clustering (WSC) technique was run on all the perfectly labeled and mislabeled variations of all selected datasets, namely the publicly available Ecoli, Wine and Eucalyptus datasets. The k-medoids technique was used for the Ecoli and Wine datasets, while the k-means technique was used for the Eucalyptus dataset. Cascading cluster architectures were obtained for each of the perfectly labeled and mislabeled datasets of all types.
[0045] The waterfall segmented classifier was compared with state-of-the-art SVM classifiers. Table 2 presents the comparison of classification results for the waterfall segmented classifier and the SVM classifier on all variations of the datasets, in terms of percentage accuracy.
TABLE 2

Dataset                               SVM (% accuracy)              WSC (test)
                                      train   validation   test    (% accuracy)
Perfectly labeled ecoli data          90.7    88.2         77.3    81.1
10% mislabeled ecoli data             84.8    93.8         72.7    81.1
Perfectly labeled wine data           99.7    100          100     100
15% mislabeled wine data              86.5    81.2         100     100
Perfectly labeled eucalyptus data     89.6    81.5         90.1    99.1
15% mislabeled eucalyptus data        80.7    77.2         83.7    97.5
[0046] The quality of the SVM classifier deteriorated with the addition of mislabeled information in the training data. A drop in accuracy was observed for all test datasets for mislabeled data when compared with the corresponding perfectly labeled data. In contrast, the WSC classifier tends to provide stable and consistent results with mislabeled information as well. This is because the WSC classifier filters out the mislabeled information through the CS and CEM criteria and the waterfall segmented clustering architecture, or the multi-level parent-child architecture, of the WSC technique. The WSC classifier is relevant for many industrial use cases, for example, classification of coal types in thermal power plants, where coal is burnt to produce power. Correct classification of the type of incoming coal in terms of its properties and composition can greatly assist in maximizing plant efficiency, reducing emissions of pollutants like NOx, and improving the health of the plant. The number of coal labels is large and, often, there is mislabeling in the recorded coal types. The WSC classifier (waterfall segmented classifier), when tested on the coal dataset, was able to classify coals into appropriate classes. The number of classes generated can be controlled by tuning the WSC hyperparameters. The classifier also successfully removed the mislabeled coal observations during clustering and hence performed perfectly when tested as a classifier on a new test dataset. The WSC classifier can be utilized for online clustering and real-time classification of coals or any other materials in industrial settings. Moreover, since the waterfall segmented classifier is generic, it can be effectively used for any real-world scenario where classification is important and the data is likely to be mislabeled.
[0047] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0048] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination

thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0049] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0050] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such

item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0051] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0052] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor implemented method (200), the method comprising:
receiving, via one or more hardware processors, a labeled dataset, wherein each data item in the labeled data set has a correct label or an incorrect label among a plurality of labels (202);
preprocessing, via the one or more hardware processors, the labeled data set to generate a preprocessed labeled data set (204); and
iteratively applying, until a stopping criteria is satisfied, a waterfall segmented clustering (WSC) technique on the preprocessed labeled dataset, via the one or more hardware processors, to obtain a multi-level parent-child cluster tree structure (206), the WSC technique comprising:
a) creating a plurality of unique sets of features representing a
plurality of combinations of features associated with the
preprocessed labeled data set using a feature selection
technique;
b) performing an unsupervised clustering on the preprocessed labeled data set based on each of the plurality of sets of features to generate a plurality of clustering results wherein number of clusters in each of the plurality of clustering results is determined by an optimum cluster criteria, and wherein each cluster among each of the plurality of clustering results comprises data items associated with one or more labels from among the plurality of associated labels;
c) computing a Completeness Score (CS) for each of the plurality of clustering results;
d) identifying a complete clustering result from among the plurality of clustering results having a maximum CS (CSmax); and
e) comparing the CSmax of the complete clustering result with a completeness threshold (λCS) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree, wherein
(i) clustering of the child clusters within the complete clustering result is terminated if the CSmax is less than the completeness threshold (λCS), indicating that further separation of data items is not meaningful, and all the data items associated with the complete clustering result are assigned to a new class, and
(ii) clustering of the child clusters within the complete clustering result is further evaluated based on a Cluster Evaluation Matrix (CEM) if the CSmax is higher than the completeness threshold (λCS), wherein evaluation of the complete clustering result based on the CEM comprises:
1) generating the CEM for the complete clustering result;
2) comparing each element of the CEM with an outlier threshold (λOL), wherein, if any element of the CEM is less than the outlier threshold (λOL), the data items associated with the corresponding elements of the CEM are termed outliers, indicating the presence of mislabeled data, and are removed from subsequent processing;
3) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as another new class, if at least one element of the CEM associated with the child cluster is greater than the CEM threshold (λCEM);

4) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as a separable child cluster, if none of the elements of the CEM in the child cluster is greater than the CEM threshold (λCEM);
5) identifying a plurality of separable child clusters generated based on the CEM threshold (λCEM); and
6) separating the data items associated with the plurality of separable child clusters and repeating the WSC technique until the stopping criterion is satisfied for generating the multi-level parent-child cluster tree structure comprising the plurality of separable child clusters, wherein the stopping criterion comprises terminating the WSC technique if all the data items associated with the plurality of separable child clusters are assigned to at least one class among a plurality of classes or removed as the outliers, wherein one or more child clusters are assigned as classes based on the WSC technique, and wherein information associated with each child cluster of the WSC technique and corresponding data items is stored for future reference; and
generating (208), via the one or more hardware processors, a waterfall segmented classifier for identifying a class of a new data item, based on the multi-level parent-child tree structure obtained during the WSC technique.
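For illustration only: a minimal sketch of one level of the claimed WSC technique, assuming k-means clustering and scikit-learn's completeness_score. The two-feature subsets, the fixed cluster count, the threshold value, and the returned structure are hypothetical choices for the sketch, not the claimed parameterization; the CEM evaluation of step (ii) is deferred to the caller, and a hedged reconstruction of a CEM element is discussed after claim 2.

from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score


def wsc_level(X, labels, n_clusters=2, lambda_cs=0.6):
    """One WSC level: cluster on every two-feature subset, keep the
    clustering with the highest Completeness Score (CSmax), and decide
    whether the parent cluster should be split further."""
    best = None
    # Step a): unique sets of features (here, all two-feature combinations).
    for subset in combinations(range(X.shape[1]), 2):
        # Step b): unsupervised clustering on each feature subset.
        assignments = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=0).fit_predict(X[:, list(subset)])
        # Step c): Completeness Score of this clustering result.
        cs = completeness_score(labels, assignments)
        # Step d): retain the complete clustering result with maximum CS.
        if best is None or cs > best[0]:
            best = (cs, subset, assignments)

    cs_max, subset, assignments = best
    # Step e)(i): below the completeness threshold, stop splitting and
    # assign all data items of this parent cluster to a new class.
    if cs_max < lambda_cs:
        return {"action": "new_class", "cs_max": cs_max}
    # Step e)(ii): otherwise hand the child clusters to CEM-based evaluation.
    return {"action": "evaluate_cem", "cs_max": cs_max,
            "features": subset, "children": assignments}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])
    y = np.array([0] * 50 + [1] * 50)
    print(wsc_level(X, y))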
2. The method as claimed in claim 1, wherein each element (A) of the CEM is computed using an equation comprising:


where K is the total number of clusters, C is the total number of labels, and nck is the absolute value of an element for a given cluster number and label number; a first logarithmic term inside a max function in the equation accounts for the number of data items associated with a label in each cluster, a second exponential term in the max function accounts for the relative proportion of data items associated with that label across different clusters, and a term outside the max function provides a measure of the dominance of a particular label in the cluster.
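The equation itself is reproduced only as an image in the filing and does not survive in this text. Purely as a hedged reconstruction consistent with the prose description above (an assumption, not the granted equation), one form would be:

A_{ck} = \frac{n_{ck}}{\sum_{c'=1}^{C} n_{c'k}} \cdot \max\left( \log\left(1 + n_{ck}\right),\; \exp\left(\frac{n_{ck}}{\sum_{k'=1}^{K} n_{ck'}}\right) \right)

where the outer ratio measures the dominance of label c in cluster k, the logarithmic term grows with the count of label c in cluster k, and the exponential term grows with the proportion of label c's data items that fall in cluster k relative to all K clusters.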
3. The method as claimed in claim 1, wherein utilizing the generated waterfall segmented classifier for identifying the class of the new data item comprises:
receiving (302) the new data item;
preprocessing (304) the new data item to generate a preprocessed new data item;
classifying (306) the preprocessed new data item into one of the plurality of classes, wherein the classification comprises:
identifying (306a) the closest cluster among the plurality of child clusters at a lowest level of the multi-level parent-child tree structure obtained by the WSC technique, wherein the closest cluster is identified successively at every level of the multi-level parent-child tree structure using a distance metric and a measure of central tendency of the respective cluster, based on the feature set used from among the plurality of feature sets at the respective level of the WSC technique, wherein a lowest level child cluster is assigned as a class;

assigning (306b) the new data item to the closest cluster at the lowest level of the multi-level parent-child tree structure obtained by the WSC technique;
computing (306c) a Probability of Closeness (PC) measure of the new data item to the assigned closest cluster indicating whether the new data item is within an acceptable limit of the cluster or is an outlier to the cluster;
assigning (306d) the new data item to the class associated with the assigned closest cluster, if the PC is found to be within a PC threshold, wherein the new data item is added to the definition of the assigned class for future use, and wherein the addition of classified new data items to the class definition enables automatic update of the waterfall segmented classifier;
assigning (306e) the new data item as an outlier of the class associated with the assigned cluster, if PC is beyond the PC threshold; and
creating (306f) a new class, if the number of data items assigned as outliers of a class exceeds a set threshold, wherein the information related to the data items is updated into the waterfall segmented classifier, enabling it to update itself using new data items.
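A minimal sketch of the classification flow recited in claim 3 follows, assuming each node of the multi-level parent-child tree stores the feature subset used at its level together with per-cluster centroids (a measure of central tendency) and standard deviations saved during the WSC technique. The Probability of Closeness here is a hypothetical Gaussian-style stand-in, since the claim does not fix its functional form.

import numpy as np


def classify(node, x, pc_threshold=0.05):
    """Descend the parent-child tree to the lowest-level (class) cluster,
    then test the Probability of Closeness (PC) of the new data item."""
    # Steps 306a/306b: at every level, pick the closest child cluster
    # using a distance metric on that level's feature set.
    while node.get("children"):
        feats = list(node["features"])
        dists = [np.linalg.norm(x[feats] - child["centroid"])
                 for child in node["children"]]
        node = node["children"][int(np.argmin(dists))]
    # Step 306c: stand-in PC from the z-distance to the class centroid.
    feats = list(node["features"])
    z = np.linalg.norm((x[feats] - node["centroid"]) / node["std"])
    pc = float(np.exp(-0.5 * z ** 2))
    # Steps 306d/306e: within the PC threshold, assign the class;
    # beyond it, flag the new data item as an outlier of that class.
    if pc >= pc_threshold:
        return node["class"], pc
    return "outlier", pc


if __name__ == "__main__":
    leaf = {"features": (0, 1), "centroid": np.zeros(2),
            "std": np.ones(2), "class": "A"}
    root = {"features": (0, 1), "children": [leaf]}
    print(classify(root, np.array([0.2, -0.1, 3.0, 1.0])))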
4. The method as claimed in claim 1, wherein information associated with each child cluster of the WSC technique and corresponding data items comprises features associated with each cluster at every level, measures of central tendency for each cluster, comprising a median and a mean, basic descriptive statistics comprising standard deviation of each cluster, data point labels associated with each cluster, and the clustering results along with corresponding metrics comprising the CS and the CEM, at every level.
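A sketch of the per-cluster record whose contents claim 4 recites as stored at every level; the field names are illustrative, not drawn from the specification.

from dataclasses import dataclass, field

import numpy as np


@dataclass
class ClusterRecord:
    level: int                 # depth in the multi-level parent-child tree
    features: tuple            # feature set used at this level
    mean: np.ndarray           # measures of central tendency ...
    median: np.ndarray         # ... comprising a mean and a median
    std: np.ndarray            # basic descriptive statistics
    labels: list               # data point labels in this cluster
    cs: float                  # Completeness Score of the clustering result
    cem: np.ndarray            # Cluster Evaluation Matrix at this level
    children: list = field(default_factory=list)  # child clusters, if split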

5. A system (100), the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a labeled dataset, wherein each data item in the labeled dataset has a correct label or an incorrect label among a plurality of labels;
preprocess the labeled dataset to generate a preprocessed labeled dataset; and
iteratively apply, until a stopping criterion is satisfied, a waterfall segmented clustering (WSC) technique on the preprocessed labeled dataset to obtain a multi-level parent-child cluster tree structure, the WSC technique comprising:
a) creating a plurality of unique sets of features representing a plurality of combinations of features associated with the preprocessed labeled dataset using a feature selection technique;
b) performing an unsupervised clustering on the preprocessed labeled dataset based on each of the plurality of sets of features to generate a plurality of clustering results, wherein the number of clusters in each of the plurality of clustering results is determined by an optimum cluster criterion, and wherein each cluster among each of the plurality of clustering results comprises data items associated with one or more labels from among the plurality of associated labels;
c) computing a Completeness Score (CS) for each of the plurality of clustering results;
d) identifying a complete clustering result from among the plurality of clustering results having a maximum CS (CSmax); and

e) comparing the CSmax of the complete clustering result with a completeness threshold (λCS) to split, further in each iteration, one or more parent clusters of the complete clustering result into child clusters for generating the multi-level parent-child cluster tree, wherein
(i) clustering of the child clusters within the complete clustering result is terminated if the CSmax is less than the completeness threshold (λCS), indicating that further separation of data items is not meaningful, and all the data items associated with the complete clustering result are assigned to a new class, and
(ii) clustering of the child clusters within the complete clustering result is further evaluated based on a Cluster Evaluation Matrix (CEM) if the CSmax is higher than the completeness threshold (λCS), wherein evaluation of the complete clustering result based on the CEM comprises:
1) generating the CEM for the complete clustering result;
2) comparing each element of the CEM with an outlier threshold (λOL), wherein, if any element of the CEM is less than the outlier threshold (λOL), the data items associated with the corresponding elements of the CEM are termed outliers, indicating the presence of mislabeled data, and are removed from subsequent processing;
3) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as another new class, if at least one element of the CEM associated with the child cluster is greater than the CEM threshold (λCEM);
4) comparing each element of the CEM with a CEM threshold (λCEM), and assigning a child cluster among the complete clustering result as a separable child cluster, if none of the elements of the CEM in the child cluster is greater than the CEM threshold (λCEM);
5) identifying a plurality of separable child clusters generated based on the CEM threshold (λCEM); and
6) separating the data items associated with the plurality of separable child clusters and repeating the WSC technique until the stopping criterion is satisfied for generating the multi-level parent-child cluster tree structure comprising the plurality of separable child clusters, wherein the stopping criterion comprises terminating the WSC technique if all the data items associated with the plurality of separable child clusters are assigned to at least one class among a plurality of classes or removed as the outliers, wherein one or more child clusters are assigned as classes based on the WSC technique, and wherein information associated with each child cluster of the WSC technique and corresponding data items is stored for future reference; and
generate a waterfall segmented classifier for identifying a class of a new data item, based on the multi-level parent-child tree structure obtained during the WSC technique.

6. The system as claimed in claim 5, wherein the one or more hardware processors are configured to compute each element (A) of the CEM using an equation comprising:

where K is the total number of clusters, C is the total number of labels, and nck is the absolute value of an element for a given cluster number and label number; a first logarithmic term inside a max function in the equation accounts for the number of data items associated with a label in each cluster, a second exponential term in the max function accounts for the relative proportion of data items associated with that label across different clusters, and a term outside the max function provides a measure of the dominance of a particular label in the cluster.
7. The system as claimed in claim 5, wherein the one or more hardware processors are configured to utilize the generated waterfall segmented classifier for identifying the class of the new data item by:
receiving the new data item;
preprocessing the new data item to generate a preprocessed new data item;
classifying the preprocessed new data item into one of the plurality of classes, wherein the classification comprises:
identifying the closest cluster among the plurality of child clusters at a lowest level of the multi-level parent-child tree structure obtained by the WSC technique, wherein the closest cluster is identified successively at every level of the multi-level parent-child tree structure using a distance metric and a measure of central tendency of the respective cluster, based on the feature set used from among the plurality of feature sets at the respective level of the WSC technique, wherein a lowest level child cluster is assigned as a class;
assigning the new data item to the closest cluster at the lowest level of the multi-level parent-child tree structure obtained by the WSC technique;
computing a Probability of Closeness (PC) measure of the new data item to the assigned closest cluster, indicating whether the new data item is within an acceptable limit of the cluster or is an outlier to the cluster;
assigning the new data item to the class associated with the assigned closest cluster, if the PC is found to be within a PC threshold, wherein the new data item is added to the definition of the assigned class for future use, and wherein the addition of classified new data items to the class definition enables automatic update of the waterfall segmented classifier;
assigning the new data item as an outlier of the class associated with the assigned cluster, if the PC is beyond the PC threshold; and
creating a new class, if the number of data items assigned as outliers of a class exceeds a set threshold, wherein the information related to the data items is updated into the waterfall segmented classifier, enabling it to update itself using new data items.
8. The system as claimed in claim 5, wherein information associated with each child cluster of the WSC technique and corresponding data items comprises features associated with each cluster at every level, measures of central tendency for each cluster, comprising a median and a mean, basic descriptive statistics comprising standard deviation of each cluster, data point labels associated with each cluster, and the clustering results along with corresponding metrics comprising the CS and the CEM, at every level.

Documents

Application Documents

# Name Date
1 202121043781-STATEMENT OF UNDERTAKING (FORM 3) [27-09-2021(online)].pdf 2021-09-27
2 202121043781-REQUEST FOR EXAMINATION (FORM-18) [27-09-2021(online)].pdf 2021-09-27
3 202121043781-FORM 18 [27-09-2021(online)].pdf 2021-09-27
4 202121043781-FORM 1 [27-09-2021(online)].pdf 2021-09-27
5 202121043781-FIGURE OF ABSTRACT [27-09-2021(online)].jpg 2021-09-27
6 202121043781-DRAWINGS [27-09-2021(online)].pdf 2021-09-27
7 202121043781-DECLARATION OF INVENTORSHIP (FORM 5) [27-09-2021(online)].pdf 2021-09-27
8 202121043781-COMPLETE SPECIFICATION [27-09-2021(online)].pdf 2021-09-27
9 202121043781-FORM-26 [21-10-2021(online)].pdf 2021-10-21
10 Abstract1.jpg 2021-12-01
11 202121043781-Proof of Right [11-01-2022(online)].pdf 2022-01-11
12 202121043781-FER.pdf 2023-08-23
13 202121043781-OTHERS [10-11-2023(online)].pdf 2023-11-10
14 202121043781-FER_SER_REPLY [10-11-2023(online)].pdf 2023-11-10
15 202121043781-CLAIMS [10-11-2023(online)].pdf 2023-11-10
16 202121043781-PatentCertificate08-03-2024.pdf 2024-03-08
17 202121043781-IntimationOfGrant08-03-2024.pdf 2024-03-08

Search Strategy

1 202121043781E_11-08-2023.pdf
2 202121043781AE_01-02-2024.pdf

ERegister / Renewals

3rd: 15 Mar 2024 (from 27/09/2023 to 27/09/2024)

4th: 21 Aug 2024 (from 27/09/2024 to 27/09/2025)

5th: 13 Aug 2025 (from 27/09/2025 to 27/09/2026)