
System/Method To Enhance Intrusion Detection Using Correlation Feature Selection

Abstract: Firewalls, cryptographic methods, and antivirus scanners are becoming increasingly ineffective against increasingly complex threats. Protecting network channels and machines requires stronger defensive barriers, and an intrusion detection system (IDS) can serve as an extra line of defense alongside the policies already in place. This work focuses on anomaly-based intrusion detection systems, which distinguish typical from abnormal user actions. The challenge for ML is to make the learning process less complex by reducing the dimensionality of the data. Dimensionality reduction yields a set of features that is representative of the domain and can be used to build a classification model. To extract useful and non-redundant information from a dataset containing 41 attributes, this study concentrates on filter-based correlation feature selection (CFS). The fundamental CFS hypothesis is that good features in a feature set have low inter-correlation and high correlation with the target class. To put this idea into practice, the merit of each candidate subset is computed. CFS is a filter-based algorithm that combines this evaluation formula with a suitable correlation measure and a number of heuristic search strategies. To build an IDS model, the subsets found using a mix of six search strategies are assessed with two machine learning algorithms: J48 (a C4.5 decision tree learner) and random forest.


Patent Information

Application #
Filing Date: 24 April 2024
Publication Number: 18/2024
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

MLR Institute of Technology
MLR Institute of Technology, Hyderabad

Inventors

1. Dr. Mahalakshmi
Department of Information Technology, MLR Institute of Technology, Hyderabad
2. Dr. Venkata Nagaraju Thatha
Department of Information Technology, MLR Institute of Technology, Hyderabad
3. Mr. V. Gopikrishna
Department of Information Technology, MLR Institute of Technology, Hyderabad
4. Mr. Ch. Upendar
Department of Information Technology, MLR Institute of Technology, Hyderabad

Specification

Description:
Field of the Invention
An intrusion detection system (IDS) monitors typical user behavior in order to identify anomalous or unusual activity on a network or workstation, and it raises an alert when it detects an anomaly or misuse. Both centralized and distributed implementations are possible. Essentially, an IDS watches network logs to see what is happening on networks, and application, system, and data server logs to see what is happening on machines. The focus of this study is on identifying network-based and host-based attacks that can affect any remote system, whether it belongs to a person, a business, or a government agency, or is a machine outside the network. Finding intrusions in host and network systems as soon as they happen is a top priority for researchers and administrators in the field of information security. The ever-changing nature of the threats to target systems, along with the wide variety of computer hardware and operating systems, makes efficient intrusion detection a challenging task.
Objective of the Invention
The objective is to use filter-based CFS to choose feature subsets that substantially improve the overall performance of ML algorithms.
Background of the Invention
One potential role of an intrusion detection system is to safeguard network channels and computers housing sensitive information against unusual user actions and database abuse. One such model is a rule-based pattern matching system that is built and tested against the system's actual usage, with any notable deviation from the norm flagged as abnormal. The primary goal of an IDS is to identify suspicious or malicious actions that compromise the security and authentication protocols of computer networks. In the past, system administrators analyzed user log information manually to carry out detection. Today the primary focus is on preparing for potential security breaches, given the ever-present dangers the internet poses to private, public, and institutional computers. Despite the abundance of intrusion detection models, older IDSs have not kept up with the exponential growth in online activity, the ease of access to the internet, and the sophistication of cyberattacks over the past few years. A further critical concern is the identification of attacks that occur rarely: a computer may be severely compromised if even the most infrequent attack goes unnoticed. Research has shown that many misuse detection algorithms are poor at spotting rare, low-frequency attack types such as U2R (user-to-root) and R2L (remote-to-local) attacks. In addition, some novel attacks in the testing dataset pass unnoticed because their signatures are uncorrelated with those in the training dataset (KR101794733B1).
In the last several decades, feature selection has grown increasingly important in the field of machine learning. This is because, given an M-dimensional feature space R^M, the target class can often be predicted best using a subset of m features (m ≤ M). To classify the records of a dataset into their respective categories, the redundant and irrelevant features and the noisy information must first be removed. The goals of feature selection are improved predictive accuracy, faster and cheaper classifiers, and a better understanding of the ML process that produced the predictions. Feature selection is essential when dealing with complex classification problems. Incorporating even a single irrelevant variable into the ML process can introduce spurious correlations and make detection more difficult.
When detecting intrusions, older classical IDSs used every data feature. The literature notes, however, that not all features help IDS performance. Irrelevant features tend to lead the ML algorithm astray, which lowers the IDS's overall performance. As a result, applying feature selection techniques to reduce the number of features became an urgent necessity (US10008107B2). During the pre-processing step, a feature selection technique picks out relevant features and removes irrelevant or redundant ones. In addition to reducing computational complexity, run time, and storage, feature selection prior to learning has demonstrated considerable improvement in ML outcomes. The choice of classifier is also an essential concern because of the imbalanced class problem, in which some classifier families fail to detect attacks from rare classes. Performance on the rare classes has been observed to increase noticeably with classifiers such as Decision Trees and Naïve Bayes.
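The imbalanced class problem described above can be illustrated with a minimal Python sketch. The labels, the record counts, and the function names `accuracy` and `per_class_detection_rate` are hypothetical and chosen only for illustration; they are not taken from the experiments in this specification. The sketch shows how a high overall accuracy can hide a poor detection rate on a rare class.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_detection_rate(y_true, y_pred):
    """Map each true class to the fraction of its records predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data: 96 "normal" records and 4 rare "u2r" records.
y_true = ["normal"] * 96 + ["u2r"] * 4
# A hypothetical classifier that catches only one of the four rare records.
y_pred = ["normal"] * 96 + ["u2r"] + ["normal"] * 3

overall = accuracy(y_true, y_pred)
rates = per_class_detection_rate(y_true, y_pred)
print(f"overall accuracy: {overall:.2f}")
print(f"detection rate per class: {rates}")
```

Reporting the per-class rate next to the overall accuracy makes the weakness on the rare class visible, which is why rare classes are usually evaluated separately.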
Summary of the Invention
Because CFS selects features based on feature correlation alone, it is compatible with most machine learning techniques. To compare two families of ML algorithms, a variety of feature subset selection (FSS) methods were applied to Decision Tree and PART. The most promising FSS technique is the correlation-based feature subset selection algorithm, which gives a clearer picture of how the features relate to the predictive attack class attribute. Redundancy is measured by the degree of correlation among the features. A correlation coefficient measures how closely two attributes are related; larger values indicate that the attributes are highly related or carry nearly identical information. This work studies how filter-based CFS is used to choose feature subsets, which substantially enhances the overall performance of ML algorithms. Two popular ML algorithms, J48 and Random Forest, are used to assess the correlation-based feature selection technique. Results are reported only for U2R, since it is a rare class affected by class imbalance. CFS is combined with several search strategies to pick a high-quality subset of features for attack category classification.
Brief Description of Drawings
Figure 1: Block diagram of Correlation Feature Selection
Detailed Description of the Invention
Correlation-based feature selection (CFS) is a filter-based technique for evaluating the feature subsets produced by different search algorithms. Because it is a filter strategy, it does not depend on which induction algorithm is used; it depends only on characteristics of the features themselves. In this study that characteristic is correlation, measured on the principle of Pearson's correlation coefficient: the features' intercorrelation and their correlation with the predictive class attribute. CFS specifies that features should be weakly coupled with one another but strongly coupled with the target class attribute. Statistical dependence, in layman's terms, measures the degree to which two attributes depend on each other, and hence how well they predict each other and the attack class attribute. The correlation measure can be applied to feature set evaluation in several ways, for instance ranking individual features by correlation coefficient or ranking subsets by merit. Figure 1 shows a schematic representation of the CFS method flow.
Three heuristic search techniques are available in the experimental implementation of CFS: best first, sequential backward elimination, and sequential forward selection. Sequential forward selection starts from an empty set and adds features one at a time until no single addition yields a better evaluation. Sequential backward elimination starts from the full set of features and removes them one at a time as long as the evaluation does not degrade. Best first search can start from either the empty set or the complete set of features: in the first case it moves forward through the search space by adding features, and in the second it moves backward by eliminating them. A stopping criterion is enforced so that best first search does not traverse the whole feature subset search space.
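The sequential forward selection strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the experimental implementation: the function name `forward_selection`, the feature names `f1` to `f4`, and the scoring function `toy_evaluate` with its made-up relevance and redundancy numbers are all hypothetical stand-ins for a real subset evaluator such as the CFS merit.

```python
def forward_selection(features, evaluate):
    """Greedy forward search: add the best single feature until nothing improves."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no single addition yields a better evaluation
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical per-feature relevance and pairwise redundancy (illustration only).
relevance = {"f1": 0.2, "f2": 0.6, "f3": 0.55, "f4": 0.1}
redundancy = {frozenset(("f2", "f3")): 0.5}


def toy_evaluate(subset):
    """Reward relevance, penalize redundant pairs and subset size."""
    gain = sum(relevance[f] for f in subset)
    penalty = sum(v for pair, v in redundancy.items() if pair <= set(subset))
    return gain - penalty - 0.15 * len(subset)


selected, score = forward_selection(relevance.keys(), toy_evaluate)
print(selected, round(score, 2))
```

In this toy run the search keeps `f2`, adds `f1`, and then stops, because adding the redundant `f3` or the weak `f4` would lower the score. Backward elimination follows the same pattern, starting from the full set and removing one feature per step.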
Next, the subsets generated by the different search techniques are assessed with a correlation-based feature subset evaluation approach. Existing feature ranking filter models rest on three main principles: dependency, distance, and consistency. Information gain, maximum relevance, least redundancy, and Pearson's correlation all fall under the dependency model. This study uses the well-known correlation-based feature selection (CFS) method with Pearson's correlation. According to the CFS hypothesis, the features most relevant for ML are those that are highly correlated with the target class and only weakly correlated with one another. Each of the k feature subsets produced by the correlation-based feature subset selection method is evaluated for its relative worth. The merit of a subset serves as the prospective performance metric: it weights the chosen subset by the correlations between the features and the target class and by the correlations among the features, and the subset with the highest merit is taken as the best. The feature-class correlation coefficient indicates how closely a feature is related to the predictive class attribute, and the amount of dependence or correlation measures the degree of coupling among the features and between the features and the class. Consequently, features ought to be loosely coupled among themselves but tightly coupled with the target class. The correlation between a feature subset and the target class attribute generally increases as more features are added, provided that the newly added features are less correlated with the previously selected ones while remaining highly correlated with the class.
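The merit evaluation can be sketched as follows. The specification does not state the merit formula, so this sketch assumes the commonly cited CFS form, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is here the number of features in the subset, r_cf is the mean absolute feature-class correlation, and r_ff is the mean absolute feature-feature correlation. The function names and the toy columns are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (norm_x * norm_y)


def cfs_merit(columns, target):
    """Assumed CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(columns)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy columns: x1 and x2 are uncorrelated; the target depends on both.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
target = [0, 1, 1, 2]

merit_single = cfs_merit([x1], target)
merit_complementary = cfs_merit([x1, x2], target)   # second feature adds information
merit_redundant = cfs_merit([x1, list(x1)], target)  # second feature is a copy
print(merit_single, merit_complementary, merit_redundant)
```

With these toy columns, adding the uncorrelated feature `x2` raises the merit, while adding a copy of `x1` leaves it unchanged. This matches the hypothesis above: a feature helps only if it correlates with the class without duplicating features already selected.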
The dataset and problem domain dictate the choice of ML method used to construct an efficient and effective IDS. To make decisions about new network traffic records, ML algorithms apply logic learned from patterns in earlier data. Binary classifiers, multiclass classifiers, and multi-level classifiers are only a few examples of the many ML techniques available, each suited to a certain kind of dataset. Although certain classifiers perform adequately in both the anomaly detection and misuse detection domains, neural networks are superior in the former. Because of their inherent speed, neural networks are also seen as one of the most promising ML algorithms for real-time intrusion detection systems. In this work, the ML algorithms used are J48 and random forest. Both of these classifiers are based on decision trees.
Random forest is a meta-algorithm for classification that is also used in clustering, outlier analysis, and many other data mining applications. Random forests are a variant of bagging with decision trees that differs in how samples are selected and bootstrapped. A random forest trains a collection of decision trees, each on a random subset of the training dataset or on randomly generated vectors, and computes the final score from the outputs of these trees. Put another way, there are two ways to build a random forest: random input selection and random split selection. Although the Random Forest algorithm constructs many decision trees, it differs from the classical decision tree algorithm in two respects. First, it builds each tree from a bootstrap sample of the initial training dataset. Second, at each node it considers only a random subset of the features and chooses the best feature among them. These two ideas increase the forest's diversity by introducing randomness into the decision trees during the learning phase. The random forest classifier is reported to be an effective and easy-to-understand tool for spotting outliers.
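The two sources of randomness in a random forest, together with the final vote, can be sketched in Python. This is not a full random forest and not the classifier used in this work: tree induction is omitted, and the helper names `bootstrap_sample`, `candidate_features`, and `majority_vote` are hypothetical. The default of taking the integer square root of the feature count as the per-node subset size is an assumption borrowed from common practice.

```python
import random
from collections import Counter
from math import isqrt


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement; each tree gets its own sample."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(n_features, rng, k=None):
    """Pick the random subset of feature indices examined at one tree node."""
    if k is None:
        k = max(1, isqrt(n_features))  # assumed default subset size
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Combine the labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)             # fixed seed so the sketch is repeatable
rows = list(range(10))             # stand-ins for ten training records
sample = bootstrap_sample(rows, rng)
subset = candidate_features(41, rng)  # 41 attributes, as in the dataset described
label = majority_vote(["attack", "normal", "attack"])
print(sample, subset, label)
```

Because each tree sees a different bootstrap sample and each node sees a different feature subset, the trees make different errors, and the majority vote averages those errors out.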
Claims: The scope of the invention is defined by the following claims:

Claim:
1. The System/method to Enhance Intrusion Detection using Correlation Feature Selection, comprising the steps of:
a) Adopting a method to identify the appropriate subset of features so that overall accuracy is retained at the same level as without reducing the features.
b) Adopting a method for reducing the feature space which significantly enhances the classifier performance.
c) Identifying the appropriate classifier for an effective intrusion detection system which can detect a wide range of attacks with a significant detection rate and a low false alarm rate.
2. The System/method to Enhance Intrusion Detection using Correlation Feature Selection as claimed in claim 1, wherein pre-processing is performed on the datasets.
3. The System/method to Enhance Intrusion Detection using Correlation Feature Selection as claimed in claim 1, wherein a correlation-based feature subset selection technique is designed.
4. The System/method to Enhance Intrusion Detection using Correlation Feature Selection as claimed in claim 1, wherein two classifiers, Random Forest and J48, are used.

Documents

Application Documents

# Name Date
1 202441032319-REQUEST FOR EARLY PUBLICATION(FORM-9) [24-04-2024(online)].pdf 2024-04-24
2 202441032319-FORM-9 [24-04-2024(online)].pdf 2024-04-24
3 202441032319-FORM FOR SMALL ENTITY(FORM-28) [24-04-2024(online)].pdf 2024-04-24
4 202441032319-FORM 1 [24-04-2024(online)].pdf 2024-04-24
5 202441032319-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [24-04-2024(online)].pdf 2024-04-24
6 202441032319-EVIDENCE FOR REGISTRATION UNDER SSI [24-04-2024(online)].pdf 2024-04-24
7 202441032319-EDUCATIONAL INSTITUTION(S) [24-04-2024(online)].pdf 2024-04-24
8 202441032319-DRAWINGS [24-04-2024(online)].pdf 2024-04-24
9 202441032319-COMPLETE SPECIFICATION [24-04-2024(online)].pdf 2024-04-24