Abstract: The present disclosure provides a system to classify data streams, comprising a data reception module for receiving a data stream, and a data partitioning module to partition the received data stream into data blocks. Features are extracted from the data blocks by a feature extraction module, and a model is initialized based on the extracted features. A labelling determination module determines if data within a sliding window is labelled, while an active learning module applies an active learning strategy to unlabelled data. A plurality of base classifiers classifies the data, and an ensemble classifier formation module combines outputs of the base classifiers. A prediction module generates prediction and probability scores from the ensemble classifier, a concept drift detection module applies a concept drift detection algorithm to said scores, and an update module updates the ensemble classifier upon detection of the concept drift.
Description: Field of the Invention
The present disclosure relates to data mining technologies, particularly, to an architecture for adaptive concept drift classification systems for real-time stream data mining.
Background
The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
An architecture for adaptive concept drift classification plays a pivotal role in the realm of data mining from real-time stream data. The processing of real-time stream data involves continuous analysis and extraction of information from data that dynamically changes over time. Said analysis is crucial for a wide range of applications, from financial market monitoring to social media analytics and beyond, where the underlying patterns and relationships within the data evolve. Traditional data mining techniques, while effective for static datasets, face challenges when applied to streaming data due to the phenomenon known as concept drift. Concept drift refers to the change in statistical properties of the target variable, which affects the predictive model's accuracy over time.
The utilization of static models in the face of concept drift results in decreased predictive accuracy as the models fail to adapt to the evolving data streams. Said models are designed based on historical data and assume that the data will follow the same patterns and distributions. However, in real-time stream data, said assumption often proves incorrect, as the nature of the data can change due to various factors such as shifts in user behavior, seasonal effects, or emerging trends. The inability of static models to adapt to said changes without manual intervention limits their applicability in dynamic environments.
Furthermore, another significant challenge associated with conventional systems is the handling of the trade-off between adaptability and stability. Adaptive systems aim to adjust to concept drift promptly, yet they must avoid overfitting to recent data, which might not be representative of long-term trends. Said balance is difficult to achieve with traditional methods, which either react too slowly to changes, thereby retaining outdated models for too long, or adapt too quickly, losing the ability to generalize from the broader dataset.
Moreover, the computational efficiency of processing real-time stream data presents another obstacle. The need for models to update in real-time or near-real-time necessitates algorithms that can operate with minimal latency. Many existing approaches require significant computational resources for continuous retraining or model adjustments, which is impractical for applications requiring immediate responses based on the latest data.
Prior art solutions cannot offer robust adaptability to evolving data patterns without sacrificing predictive accuracy or computational efficiency. Thus, there exists an urgent need for a system to classify data streams that overcomes the problems associated with conventional systems and techniques for adaptive concept drift classification in real-time stream data mining.
Summary
The following presents a simplified summary of various aspects of this disclosure to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects and is intended to neither identify key or critical elements nor delineate the scope of such aspects. Its purpose is to present some concepts of this disclosure in a simplified form as a prelude to the more detailed description that is presented later.
The following paragraphs provide additional support for the claims of the subject application.
The present disclosure showcases a system to classify data streams. Said system comprises a data reception module that receives a data stream, a data partitioning module configured to partition the received data stream into data blocks, and a feature extraction module configured to extract features from said data blocks.
Furthermore, a model initialization module is configured to initialize a model based on the extracted features, and a labeling determination module is configured to determine if data within a sliding window is labelled. An active learning module applies an active learning strategy to unlabelled data. Additionally, a plurality of base classifiers is configured to classify the data, and an ensemble classifier formation module is configured to combine outputs of the base classifiers.
A prediction module generates prediction and probability scores from the ensemble classifier, a concept drift detection module applies a concept drift detection algorithm to the prediction and probability scores, and an update module updates the ensemble classifier upon detection of the concept drift. The data reception module preprocesses the data stream to normalize or standardize the data before partitioning. The ensemble classifier formation module employs a weighted voting feature to combine the outputs of the base classifiers. The ensemble classifier is updated by retraining the base classifiers with new data identified by the concept drift detection algorithm.
The system enables accurate classification of data streams by employing an ensemble classifier that integrates the strengths of multiple base classifiers. The utilization of an active learning strategy for unlabelled data optimizes the learning process and enhances the efficiency of the classification system. By preprocessing the data stream, the system ensures consistency and reliability in the data classification process.
The incorporation of a concept drift detection algorithm allows the system to adapt to changes in the data stream dynamically, ensuring that the classification remains accurate over time. The update mechanism for the ensemble classifier, based on the detection of concept drift, ensures that the system remains effective in the face of changing data characteristics, thereby maintaining the accuracy and reliability.
The present disclosure provides a method for classifying data streams. The method encompasses receiving a data stream, partitioning the received data stream into data blocks, and extracting features from said data blocks. A model is initialized based on the extracted features. Determination is made as to whether the data within a sliding window is labeled. An active learning strategy is applied to unlabeled data. Data is classified using a plurality of base classifiers. The outputs of the base classifiers are combined to form an ensemble classifier. Prediction and probability scores are generated from the ensemble classifier. A concept drift detection algorithm is applied to the prediction and probability scores. The ensemble classifier is updated upon detection of concept drift. A summary of the classified data is generated if no concept drift is detected.
The active learning strategy involves querying a user to label the selected data. The base classifiers are selected from a group consisting of decision trees, support vector machines, neural networks, and k-nearest neighbour algorithms. The active learning strategy includes a confidence-based selection process to identify the informative data points for labelling. The prediction and probability scores are used to determine a confidence level for each classification decision made by the ensemble classifier.
The method enhances the accuracy of data stream classification by incorporating an active learning strategy that efficiently utilizes human input for labelling selected data, ensuring that the most informative data points are prioritized. By employing a diverse array of base classifiers, the method leverages the strengths of different classification techniques, thereby improving the robustness and reliability of the classification process. The formation of an ensemble classifier from the outputs of multiple base classifiers further enhances the classification accuracy.
The application of a concept drift detection algorithm ensures that the method remains adaptive to changes in data patterns over time, allowing for the dynamic updating of the ensemble classifier to maintain the effectiveness. The generation of a summary for the classified data, in the absence of concept drift, provides a concise overview of the classification results, facilitating the understanding and analysis of the data streams.
Brief Description of the Drawings
The features and advantages of the present disclosure would be more clearly understood from the following description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a system to classify data streams, in accordance with the embodiments of the present disclosure.
FIG. 2 illustrates a method for classifying data streams, in accordance with the embodiments of the present disclosure.
FIG. 3 illustrates an accuracy analysis of different classifiers on various datasets, in accordance with the embodiments of the present disclosure.
FIG. 4 illustrates a kappa analysis, which measures the agreement between classifiers beyond chance, in accordance with the embodiments of the present disclosure.
FIG. 5 represents an exemplary system for online learning and classification of data streams, in accordance with the embodiments of the present disclosure.
Detailed Description
In the following detailed description of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims and equivalents thereof.
The use of the terms “a” and “an” and “the” and “at least one” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Pursuant to the "Detailed Description" section herein, whenever an element is explicitly associated with a specific numeral for the first time, such association shall be deemed consistent and applicable throughout the entirety of the "Detailed Description" section, unless otherwise expressly stated or contradicted by the context.
Disclosed herein is a system 100 to classify data streams. Referring to a pictorial illustration of FIG. 1, showcasing an architectural paradigm of the system 100 that can comprise functional elements, yet not limited to, a data reception module 102, a data partitioning module 104, a feature extraction module 106, a model initialization module 108, a labelling determination module 110, an active learning module 112, a plurality of base classifiers 114, an ensemble classifier formation module 116, a prediction module 118, a concept drift detection module 120, and an update module 122. A person ordinarily skilled in the art would appreciate that those elements or components of the system 100 are functionally or operationally coupled with each other, in accordance with the embodiments of the present disclosure.
In an embodiment, the data reception module 102 is designed for the intake of data streams and is responsible for receiving them. Said module 102 serves as the entry point for data within the system 100, ensuring that incoming data streams are captured for further processing. The operation of the data reception module 102 facilitates the initial handling of data, which is crucial for the subsequent stages of data classification.
In an embodiment, the data partitioning module 104 can be configured to divide received data streams into smaller, manageable data blocks. Following the data reception, the data partitioning module 104 partitions the received data stream into data blocks. Said segmentation is essential for detailed analysis and processing of data, allowing the feature extraction module 106 to operate more effectively on manageable portions of the entire data stream.
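By way of example and not limitation, the segmentation performed by the data partitioning module 104 may be sketched as follows; the fixed block size is an illustrative assumption, as the present disclosure does not mandate a particular size or segmentation policy:

```python
def partition_stream(stream, block_size=4):
    """Yield successive fixed-size data blocks from an iterable data stream."""
    block = []
    for item in stream:
        block.append(item)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # emit the final, possibly partial, block
        yield block

# Ten stream items become two full blocks and one partial block.
blocks = list(partition_stream(range(10), block_size=4))
```

Fixed-size blocks are only one segmentation policy; time-based or landmark windows would serve equally under the same interface.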
In an embodiment, the feature extraction module 106 is engineered to identify and extract significant features from the data blocks. The feature extraction module 106 extracts features from said data blocks. The extracted features are critical for the model initialization module 108, as said features provide the necessary input for model training and refinement.
In an embodiment, the model initialization module 108 is designed to prepare a model for data classification based on the extracted features. The model initialization module 108 initializes a model based on the extracted features. Said step is pivotal in setting the groundwork for the ability of the system 100 to classify data accurately.
In an embodiment, the labelling determination module 110 is designed to assess if data within a certain range is labelled. The labelling determination module 110 determines if data within a sliding window is labelled. Said determination is crucial for the active learning module 112, guiding the focus towards unlabelled data for enhanced learning and classification accuracy.
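By way of illustration only, the sliding-window labelling check of the labelling determination module 110 may be sketched as below; the window length of three and the use of `None` as the unlabelled marker are assumptions made for the example:

```python
from collections import deque

def unlabelled_windows(labels, window=3):
    """Return start indices of sliding windows that contain unlabelled data."""
    win = deque(maxlen=window)
    hits = []
    for i, lab in enumerate(labels):
        win.append(lab)
        # Once the window is full, flag it if any record lacks a label.
        if len(win) == window and any(l is None for l in win):
            hits.append(i - window + 1)
    return hits

# Windows starting at positions 0, 1, and 2 cover the unlabelled record.
hits = unlabelled_windows(["a", "b", None, "a", "b", "c"])
```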
In an embodiment, the active learning module 112 can be configured to employ a strategy for learning from unlabelled data. The active learning module 112 applies an active learning strategy to unlabelled data. By focusing on unlabelled data, the module improves the learning efficiency of the system 100 and classification performance over time.
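A minimal confidence-based variant of the active learning strategy applied by the active learning module 112 can be sketched as follows; the confidence threshold of 0.6 is an assumed value, not one prescribed by the disclosure:

```python
def select_for_labelling(probabilities, threshold=0.6):
    """Return indices of predictions too uncertain to trust, i.e. those
    whose maximum class probability falls below the threshold."""
    return [i for i, p in enumerate(probabilities) if max(p) < threshold]

# Four predictions over two classes; the 2nd and 4th are near-ties,
# so they would be queued for labelling (e.g., by querying a user).
probs = [(0.95, 0.05), (0.55, 0.45), (0.80, 0.20), (0.51, 0.49)]
queries = select_for_labelling(probs)
```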
In an embodiment, the plurality of base classifiers 114 may be designed to perform initial classification of data. The system 100 includes a plurality of base classifiers 114, which are responsible for the primary classification of the data. The diversity among said classifiers 114 enhances the robustness of the classification process.
In an embodiment, the ensemble classifier formation module 116 can be designed to integrate outputs from multiple classifiers to form a single, more accurate classifier. The ensemble classifier formation module 116 combines outputs of the base classifiers 114. Said combination leverages the strengths of individual classifiers to achieve a higher classification accuracy.
In an embodiment, the prediction module 118 can be configured to generate predictions and associated probability scores based on the ensemble classifier. The prediction module 118 generates prediction and probability scores from said ensemble classifier. Said scores are indicative of the classification outcomes and their confidence levels, providing valuable insights into the data classification process.
In an embodiment, the concept drift detection module 120 can be engineered to identify shifts in the data stream's characteristics over time. The concept drift detection module 120 applies a concept drift detection algorithm to the prediction and probability scores. The detection of concept drift is vital for maintaining the accuracy and relevance of the classification model in dynamic environments.
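One possible realization of the concept drift detection algorithm applied by the concept drift detection module 120 is sketched below, loosely following the error-rate monitoring idea of the Drift Detection Method (DDM); the warm-up length and 3-sigma sensitivity are assumed settings, and the disclosure does not prescribe this particular detector:

```python
class DriftDetector:
    """Monitors the running error rate of the ensemble and signals drift
    when it degrades well beyond its best observed level."""
    WARM_UP = 30  # assumed minimum sample count before testing for drift

    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, prediction_correct):
        """Feed one prediction outcome; return True when drift is signalled."""
        self.n += 1
        self.errors += 0 if prediction_correct else 1
        p = self.errors / self.n                # running error rate
        s = (p * (1 - p) / self.n) ** 0.5       # its standard deviation
        if p + s < self.p_min + self.s_min:     # remember the best level seen
            self.p_min, self.s_min = p, s
        return self.n >= self.WARM_UP and p + s > self.p_min + 3 * self.s_min

# During a stable phase (~10% error) no drift is signalled; a sustained
# run of errors afterwards pushes the error rate past the 3-sigma bound.
detector = DriftDetector()
stable_phase = ([False] + [True] * 9) * 10
stable_flags = [detector.update(ok) for ok in stable_phase]
```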
In an embodiment, the update module 122 may be designed to refresh the ensemble classifier in response to detected changes. The update module 122 updates the ensemble classifier upon detection of concept drift. Said updating process ensures that the classification model remains effective and accurate in the face of evolving data patterns.
In an embodiment, the data reception module 102 preprocesses the data stream to normalize or standardize the data before partitioning. Said preprocessing step, performed by the data reception module 102, enhances the quality of data entering the system 100, leading to more reliable data partitioning and feature extraction.
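By way of example and not limitation, the standardization performed by the data reception module 102 before partitioning may be sketched as a z-score rescaling of a numeric block, so that features on different scales contribute comparably downstream:

```python
def standardize(values):
    """Return values rescaled to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0  # guard against constant blocks
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
```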
In another embodiment, the ensemble classifier formation module 116 employs a weighted voting feature to combine the outputs of the base classifiers 114. The ensemble classifier formation module 116 utilizes the weighted voting, enhancing the decision-making process by attributing varying degrees of importance to different classifiers' outputs.
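The weighted voting feature of the ensemble classifier formation module 116 can be sketched as below; deriving each weight from a classifier's recent accuracy is an illustrative assumption:

```python
def weighted_vote(votes, weights):
    """Combine class votes from base classifiers using per-classifier
    weights; the class with the largest weighted total wins."""
    totals = {}
    for label, w in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Two weaker classifiers say "spam" (0.3 + 0.3 = 0.6), but the single
# stronger classifier saying "ham" (0.9) carries the vote.
winner = weighted_vote(["spam", "ham", "spam"], [0.3, 0.9, 0.3])
```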
In a further embodiment, the ensemble classifier is updated by retraining the base classifiers 114 with new data identified by the concept drift detection algorithm. Said retraining approach, facilitated by the update module 122, ensures that the ensemble classifier adapts to new data patterns, maintaining the accuracy and robustness over time.
Disclosed herein is a method 200 for classifying data streams. Referring to a diagrammatic depiction put forth in FIG. 2, representing a flow diagram of the method 200 that can comprise steps of, yet not restricted to, (At step 202) receiving a data stream, (At step 204) partitioning the received data stream into data blocks, (At step 206) extracting features from the data blocks, (At step 208) initializing a model based on the extracted features, (At step 210) determining whether the data within a sliding window is labelled, (At step 212) applying an active learning strategy, (At step 214) classifying the data, (At step 216) combining the outputs of the base classifiers 114, (At step 218) generating prediction and probability scores, (At step 220) applying a concept drift detection algorithm, (At step 222) updating the ensemble classifier, and (At step 224) generating a summary of the classified data. Said steps of the method 200 can be performed or executed, collectively or selectively, randomly, sequentially, or in a combination thereof, in accordance with the embodiments of the current disclosure.
In an embodiment, at step 202, the method 200 involves receiving a data stream, which relates to the initial intake of data for classification purposes. Said step 202 is crucial in marking the beginning of the data classification process, ensuring that data is made available for partitioning, feature extraction, and subsequent analysis.
In an embodiment, following the reception of the data stream, "partitioning the received data stream into data blocks" is conducted at step 204. Said partitioning, at step 204, facilitates the management and analysis of data by breaking down the stream into more manageable segments. Such segmentation is essential for effective feature extraction and further processing.
In an embodiment, at step 206, "extracting features from the data blocks" is performed. Feature extraction is pivotal in identifying the characteristics of the data that are most relevant for classification. The extracted features provide the basis for model initialization, enabling the method 200 to effectively learn from the data.
In an embodiment, at step 208, "initializing a model based on the extracted features" follows feature extraction. Said initializing step 208 involves preparing a model that can utilize the identified features to classify data accurately. The initialized model is central to the ability of the method 200 to understand and categorize the data stream effectively.
In an embodiment, at step 210, the method 200 includes "determining whether the data within a sliding window is labelled." Said determination is crucial for identifying which segments of the data require labelling and which can be directly used for training the classification model.
In an embodiment, "applying an active learning strategy to unlabelled data" occurs at step 212. Said strategy is aimed at efficiently utilizing human resources to label the most informative data points, thereby enhancing the learning process and improving classification accuracy.
In an embodiment, the classification of data is conducted at step 214 using "a plurality of base classifiers 114." Said approach leverages the strengths of various classification algorithms to achieve a robust and accurate classification outcome. At step 216, "combining the outputs of the base classifiers 114 to form an ensemble classifier" is performed. The ensemble classifier integrates the decisions from said base classifiers 114 to improve overall classification performance.
In an embodiment, "generating prediction and probability scores from the ensemble classifier" is executed at step 218. Said scores provide insights into the confidence of the classification decisions and are essential for assessing the reliability of the results. The application of "a concept drift detection algorithm to the prediction and probability scores" takes place at step 220. Said algorithm identifies shifts in the data stream's characteristics, ensuring that the classification model remains accurate over time.
In an embodiment, at step 222, "updating the ensemble classifier upon detection of concept drift" ensures the method 200 adapts to changes in the data stream's nature, maintaining the effectiveness and relevance. Further, "generating a summary of the classified data if no concept drift is detected" is detailed at step 224. Said summarization provides a concise overview of the classification outcomes, facilitating interpretation and further analysis.
In an embodiment, the active learning strategy involves querying a user to label selected data. Said user interaction is crucial for accurately labelling the most informative data points, enhancing the model's learning efficiency. The base classifiers 114 are selected from a group consisting of decision trees, support vector machines, neural networks, and k-nearest neighbors algorithms. Said selection ensures a diverse and robust approach to data classification, leveraging the strengths of various algorithms.
Referring to one or more preceding embodiments, the active learning strategy includes a confidence-based selection process, to identify the most informative data points for labelling. Said process prioritizes data points that the model is least certain about, optimizing the use of human labelling efforts. The prediction and probability scores are used to determine a confidence level for each classification decision made by the ensemble classifier. Said determination of confidence levels provides valuable feedback on the reliability of the classification outcomes, aiding in the continuous improvement of the method 200.
In an embodiment, the system 100 is configured to monitor and adapt to changes in data streams by detecting concept drift, which is indicated by shifts in the mean values of sample data. To support said monitoring, the system 100 implements a variety of strategies to adjust the base classifiers, enhancing overall performance. The ensemble classifier comprises multiple base learners, specifically selected to tackle different challenges in data stream classification. In an aspect, the Hoeffding Tree is chosen for its quick decision-making capabilities and memory efficiency, ideal for fast-paced data streams.
The Hoeffding Adaptive Tree (HAT) builds on said decision-making capabilities by providing additional adaptability to gradual shifts in data, handling evolving concepts with greater precision. In another aspect, combining the OzaBag method with the ADWIN technique creates a robust ensemble that maintains classifier diversity and adapts in real-time to changes in data distribution, with ADWIN actively monitoring and adjusting the classifiers as needed. Together, said chosen classifiers and techniques are optimized for efficiency and adaptability in environments where data distributions are dynamic and subject to frequent change.
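The core of OzaBag-style online bagging is that each arriving example is presented to each base learner k ~ Poisson(1) times, approximating bootstrap resampling on a stream. A sketch of drawing those per-learner replication counts is given below; the sampler uses the standard inversion method, since Python's standard library lacks a Poisson generator, and the learner count is an assumed parameter:

```python
import math
import random

def oza_bag_weights(n_learners, rng):
    """Draw per-learner Poisson(1) replication counts for one example."""
    threshold = math.exp(-1.0)  # e^-1, the inversion cutoff for lambda = 1

    def poisson1():
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1

    return [poisson1() for _ in range(n_learners)]

# Each incoming example would be trained on count[i] times by learner i.
rng = random.Random(42)
counts = oza_bag_weights(5, rng)
```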
FIG. 3 presents an accuracy analysis of different classifiers on various datasets. The classifiers include OzaBag with Hoeffding Adaptive Tree (HAT), OzaBag with Hoeffding Tree (HT), and AdaBoost with HT. For the Airlines dataset, OzaBag+HAT outperforms the others with an accuracy of approximately 66.39%, while for the SEA dataset, said accuracy reaches around 89.76%. In the LED dataset, OzaBag+HT shows the highest accuracy at approximately 87.70%. The ConceptDriftRealStream dataset sees a high accuracy from AdaBoost+HT at about 95.04%, and for the ElectNewNorm dataset, OzaBag+HAT again leads with around 89.62% accuracy. Said chart demonstrates the varying effectiveness of each classifier depending on the dataset.
FIG. 4 displays a kappa analysis, which measures the agreement between classifiers beyond chance. A higher kappa indicates better performance. In the Airlines dataset, OzaBag+HAT has a kappa of approximately 22.58, suggesting moderate agreement. For the SEA dataset, the kappa value peaks with OzaBag+HT at around 94.51. In the LED dataset, OzaBag+HT again shows superiority with a kappa of approximately 97.74. The ConceptDriftRealStream dataset registers the highest kappa for AdaBoost+HT at about 89.81, indicating strong agreement. Lastly, the ElectNewNorm dataset's highest kappa is around 83.07 for OzaBag+HAT, showing good classifier reliability. This analysis is critical for understanding the consistency of each classifier.
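The kappa measure reported in FIG. 4 can be computed as in the sketch below, assuming the conventional Cohen's kappa formulation over predicted and true label sequences (the figure appears to report the value on a 0-100 scale):

```python
def kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    Assumes the two sequences do not agree purely by chance (expected < 1)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels
    )
    return (observed - expected) / (1 - expected)

# 3 of 4 labels agree (observed 0.75) against chance agreement of 0.5.
k = kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```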
FIG. 5 represents an exemplary system (such as said system 100) for online learning and classification of data streams. The system 100 is designed to handle continuous, possibly non-stationary data in real-time through a process that includes both an online learning phase and an offline update phase. During the online learning phase, the system 100 processes incoming data streams by dividing them into data blocks and extracting relevant features. A model is then initialized based on these features. A sliding window approach is used to determine if the incoming data is already labeled. If not, an active learning strategy is applied where the system 100 may query for labels. The data is then classified using a series of base classifiers, the results of which are combined to form an ensemble classifier. This ensemble classifier generates predictions with associated probability scores. The system 100 continuously checks for concept drift, a change in the statistical properties of the data stream, using a dedicated detection algorithm. If drift is detected, the ensemble classifier is updated to adapt to the new data distribution. If no drift is detected, the system 100 generates a summary of the data.
In the offline update phase, said summary can be used to refine the model and prepare it for subsequent data streams, ensuring the system 100 remains accurate over time. The process is cyclic, allowing the system 100 to dynamically adjust to the evolving nature of the data processed, and ultimately predict class labels for new data.
Example embodiments herein have been described above with reference to block diagrams and flowchart illustrations of methods and apparatuses. It will be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, can be implemented by various means including hardware, software, firmware, and a combination thereof. For example, in one embodiment, each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations can be implemented by computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks.
Throughout the present disclosure, the term ‘processing means’ or ‘microprocessor’ or ‘processor’ or ‘processors’ includes, but is not limited to, a general purpose processor (such as, for example, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a microprocessor implementing other types of instruction sets, or a microprocessor implementing a combination of types of instruction sets) or a specialized processor (such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), or a network processor).
The term “non-transitory storage device” or “storage” or “memory,” as used herein relates to a random access memory, read only memory and variants thereof, in which a computer can store data or software for any duration.
Operations in accordance with a variety of aspects of the disclosure described above need not be performed in the precise order described. Rather, various steps can be handled in reverse order, simultaneously, or not at all.
While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.
Claims
I/We claim:
1. A system 100 to classify data streams, comprising:
a data reception module 102 configured to receive a data stream;
a data partitioning module 104 configured to partition the received data stream into data blocks;
a feature extraction module 106 configured to extract features from said data blocks;
a model initialization module 108 configured to initialize a model based on the extracted features;
a labeling determination module 110 configured to determine if data within a sliding window is labeled;
an active learning module 112 configured to apply an active learning strategy to unlabeled data;
a plurality of base classifiers 114 configured to classify the data;
an ensemble classifier formation module 116 configured to combine outputs of the base classifiers;
a prediction module 118 configured to generate prediction and probability scores from said ensemble classifier;
a concept drift detection module 120 configured to apply a concept drift detection algorithm to the prediction and probability scores; and
an update module 122 configured to update the ensemble classifier upon detection of the concept drift.
2. The system of claim 1, wherein the data reception module preprocesses the data stream to normalize or standardize the data before partitioning.
3. The system of claim 1, wherein the ensemble classifier formation module employs a weighted voting feature to combine the outputs of the base classifiers.
4. The system of claim 1, wherein the ensemble classifier is updated by retraining the base classifiers with new data identified by the concept drift detection algorithm.
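The weighted-voting combination of claim 3 can be illustrated with a minimal sketch. The class and method names below (e.g. `WeightedVotingEnsemble`, `predict_proba`) are hypothetical illustrations, not terms defined by the claims; the sketch merely shows one way base-classifier outputs could be combined into a prediction with a probability score.

```python
# Illustrative sketch only: weighted voting over base classifiers in the
# spirit of claims 1-4. All names are hypothetical, not from the claims.
from collections import defaultdict

class WeightedVotingEnsemble:
    def __init__(self, base_classifiers, weights=None):
        # base_classifiers: objects exposing predict_proba(x) -> {label: prob}
        self.base_classifiers = base_classifiers
        self.weights = weights or [1.0] * len(base_classifiers)

    def predict(self, x):
        # Combine base-classifier outputs by weighted voting (claim 3) and
        # return both the prediction and its probability score (claim 1).
        scores = defaultdict(float)
        for clf, w in zip(self.base_classifiers, self.weights):
            for label, prob in clf.predict_proba(x).items():
                scores[label] += w * prob
        total = sum(self.weights)
        label = max(scores, key=scores.get)
        return label, scores[label] / total
```

In this sketch, updating the ensemble upon drift (claim 4) would amount to retraining the base classifiers and adjusting the weights; the claims do not prescribe a particular weighting scheme.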
5. A method 200 for classifying data streams, comprising:
(At step 202) receiving a data stream;
(At step 204) partitioning the received data stream into data blocks;
(At step 206) extracting features from the data blocks;
(At step 208) initializing a model based on the extracted features;
(At step 210) determining whether the data within a sliding window is labeled;
(At step 212) applying an active learning strategy to unlabeled data;
(At step 214) classifying the data using a plurality of base classifiers;
(At step 216) combining the outputs of the base classifiers to form an ensemble classifier;
(At step 218) generating prediction and probability scores from the ensemble classifier;
(At step 220) applying a concept drift detection algorithm to the prediction and probability scores;
(At step 222) updating the ensemble classifier upon detection of concept drift; and
(At step 224) generating a summary of the classified data if no concept drift is detected.
6. The method of claim 5, wherein the active learning strategy involves querying a user to label the selected data.
7. The method of claim 5, wherein the base classifiers are selected from a group consisting of decision trees, support vector machines, neural networks, and k-nearest neighbors algorithms.
8. The method of claim 5, wherein the active learning strategy includes a confidence-based selection process to identify the informative data points for labeling.
9. The method of claim 5, wherein the prediction and probability scores are used to determine a confidence level for each classification decision made by the ensemble classifier.
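The drift-monitoring loop of steps 218 through 222 can be sketched with a DDM-style error-rate test, which is one possible concept drift detection algorithm; the claims do not mandate this particular test, and all names below are illustrative.

```python
# Illustrative sketch only: a DDM-style error-rate monitor as one possible
# concept drift detection algorithm for steps 218-222. All names are
# hypothetical assumptions, not terms from the claims.
import math

class DriftMonitor:
    def __init__(self, warn_level=2.0, drift_level=3.0):
        self.n = 0                  # prediction outcomes seen
        self.errors = 0             # misclassifications seen
        self.p_min = float("inf")   # lowest observed error rate
        self.s_min = float("inf")   # standard deviation at that point
        self.warn_level = warn_level
        self.drift_level = drift_level

    def update(self, correct):
        # Feed one prediction outcome; return "drift", "warning", or "stable".
        self.n += 1
        self.errors += 0 if correct else 1
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"          # step 222: trigger ensemble update
        if p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"             # step 224: no drift detected
```

On a "drift" result, the ensemble would be updated (step 222), for example by retraining the base classifiers on recent data; on "stable", a summary of the classified data would be generated (step 224).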
ARCHITECTURE FOR ADAPTIVE CONCEPT DRIFT CLASSIFICATION SYSTEM FOR REAL-TIME STREAM DATA MINING
The present disclosure provides a system to classify data streams, comprising a data reception module for receiving a data stream, and a data partitioning module to partition the received data stream into data blocks. Features are extracted from the data blocks by a feature extraction module, and a model is initialized based on the extracted features. A labelling determination module determines if data within a sliding window is labelled, while an active learning module applies an active learning strategy to unlabelled data. A plurality of base classifiers classifies the data, and an ensemble classifier formation module combines outputs of the base classifiers. A prediction module generates prediction and probability scores from the ensemble classifier, a concept drift detection module applies a concept drift detection algorithm to said scores, and an update module updates the ensemble classifier upon detection of the concept drift.
Drawings
FIG. 1
FIG. 2
FIG. 3
FIG. 4
FIG. 5
| # | Name | Date |
|---|---|---|
| 1 | 202421033123-OTHERS [26-04-2024(online)].pdf | 2024-04-26 |
| 2 | 202421033123-FORM FOR SMALL ENTITY(FORM-28) [26-04-2024(online)].pdf | 2024-04-26 |
| 3 | 202421033123-FORM 1 [26-04-2024(online)].pdf | 2024-04-26 |
| 4 | 202421033123-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [26-04-2024(online)].pdf | 2024-04-26 |
| 5 | 202421033123-EDUCATIONAL INSTITUTION(S) [26-04-2024(online)].pdf | 2024-04-26 |
| 6 | 202421033123-DRAWINGS [26-04-2024(online)].pdf | 2024-04-26 |
| 7 | 202421033123-DECLARATION OF INVENTORSHIP (FORM 5) [26-04-2024(online)].pdf | 2024-04-26 |
| 8 | 202421033123-COMPLETE SPECIFICATION [26-04-2024(online)].pdf | 2024-04-26 |
| 9 | 202421033123-FORM-9 [07-05-2024(online)].pdf | 2024-05-07 |
| 10 | 202421033123-FORM 18 [08-05-2024(online)].pdf | 2024-05-08 |
| 11 | 202421033123-FORM-26 [12-05-2024(online)].pdf | 2024-05-12 |
| 12 | 202421033123-FORM 3 [13-06-2024(online)].pdf | 2024-06-13 |
| 13 | 202421033123-RELEVANT DOCUMENTS [17-04-2025(online)].pdf | 2025-04-17 |
| 14 | 202421033123-POA [17-04-2025(online)].pdf | 2025-04-17 |
| 15 | 202421033123-FORM 13 [17-04-2025(online)].pdf | 2025-04-17 |
| 16 | 202421033123-FER.pdf | 2025-07-30 |
| 17 | 202421033123-FORM-8 [27-10-2025(online)].pdf | 2025-10-27 |
| 18 | 202421033123-FER_SER_REPLY [27-10-2025(online)].pdf | 2025-10-27 |
| 19 | 202421033123-DRAWING [27-10-2025(online)].pdf | 2025-10-27 |
| 20 | 202421033123-CORRESPONDENCE [27-10-2025(online)].pdf | 2025-10-27 |
| 21 | 202421033123-COMPLETE SPECIFICATION [27-10-2025(online)].pdf | 2025-10-27 |
| 22 | 202421033123-CLAIMS [27-10-2025(online)].pdf | 2025-10-27 |
| 1 | 202421033123_SearchStrategyNew_E_202421033123E_18-03-2025.pdf | |