Abstract: The present invention relates to an algorithm for enhanced phishing detection through integrated feature engineering and machine learning optimization. This invention introduces an optimized feature selection algorithm that combines Lasso Regression and Decision Tree classifiers in both serial and parallel configurations. By systematically identifying and prioritizing the most relevant features, the algorithm enhances detection accuracy, reduces false positives, and improves computational efficiency. The parallel processing aspect allows for the simultaneous evaluation of multiple feature subsets, ensuring adaptability to the dynamic nature of phishing attacks. This innovative approach not only streamlines the model development process but also fortifies the overall resilience of phishing detection systems.
Description:
TECHNICAL FIELD OF INVENTION
The present invention is related to the field of cybersecurity. More specifically, it relates to machine learning-based detection systems for identifying phishing attacks. It addresses the intersection of data science and cybersecurity, aiming to fortify defences against deceptive online threats.
BACKGROUND OF THE INVENTION
The background information herein below relates to the present disclosure but is not necessarily prior art.
Phishing attacks have emerged as a formidable threat in the digital landscape, leading to substantial financial losses and compromising sensitive information. Traditional detection methods, such as blacklisting known malicious URLs, heuristic-based analyses, and manual feature engineering, have proven inadequate against the evolving sophistication of phishing tactics. Blacklisting struggles with zero-day attacks due to its reactive nature, while heuristic methods are labor-intensive and often lack scalability. Machine learning approaches, though promising, frequently grapple with challenges related to feature selection, resulting in models that are computationally expensive and prone to high false positive rates. These limitations underscore the pressing need for an innovative solution that enhances detection accuracy, reduces false positives, and operates efficiently in real-time environments.
OBJECTIVE OF THE INVENTION
The primary objective of the present invention is to provide an algorithm for enhanced phishing detection through integrated feature engineering and machine learning optimization.
Another objective of the invention is to introduce a novel feature selection algorithm that systematically identifies and optimizes the most pertinent features for phishing detection. By integrating both serial and parallel feature selection strategies, the invention aims to achieve the further objectives set out below.
Yet another objective of the invention is to improve the precision of phishing detection systems by focusing on the most informative features.
Yet another objective of the invention is to minimize the incidence of legitimate activities being incorrectly flagged as malicious.
A further objective of the invention is to streamline the model training process, enabling real-time application without substantial resource expenditure.
SUMMARY OF THE INVENTION
Accordingly, the present invention provides an algorithm for enhanced phishing detection through integrated feature engineering and machine learning optimization. In addressing the deficiencies of existing phishing detection methodologies, this invention introduces an optimized feature selection algorithm that combines Lasso Regression and Decision Tree classifiers in both serial and parallel configurations. By systematically identifying and prioritizing the most relevant features, the algorithm enhances detection accuracy, reduces false positives, and improves computational efficiency.
The parallel processing aspect allows for the simultaneous evaluation of multiple feature subsets, ensuring adaptability to the dynamic nature of phishing attacks. This innovative approach not only streamlines the model development process but also fortifies the overall resilience of phishing detection systems.
BRIEF DESCRIPTION OF DRAWING
This invention is described by way of example with reference to the following drawings, in which:
Figure 1 of sheet 1 illustrates the flowchart depicting the sequential application of Lasso Regression followed by Decision Tree classification for feature selection.
Figure 2 of sheet 2 illustrates the parallel processing framework utilizing multiple Decision Trees to evaluate diverse feature subsets concurrently.
Figure 3 of sheet 3 illustrates the system architecture showcasing the integration of the feature selection algorithm within an existing phishing detection framework.
Figure 4 of sheet 4 shows the performance comparison graphs of detection accuracy and false positive rates.
DETAILED DESCRIPTION OF THE INVENTION
As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The present invention is related to an algorithm for enhanced phishing detection through integrated feature engineering and machine learning optimization. The invention presents a feature selection algorithm that employs a two-pronged approach:
Lasso Regression Implementation: Utilizes Lasso (Least Absolute Shrinkage and Selection Operator) Regression to eliminate irrelevant and redundant features, thereby simplifying the model and enhancing interpretability.
Decision Tree Classifier Refinement: Applies Decision Tree classifiers for further feature refinement, capitalizing on their hierarchical structure to assess feature importance and interactions.
The algorithm incorporates a parallel processing mechanism using multiple Decision Trees. This parallelization explores diverse feature subsets simultaneously, bolstering the model's robustness and adaptability to various phishing tactics.
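By way of illustration only, the Lasso estimator referred to above is commonly written in the penalized least-squares form below. The notation is introduced here for clarity and does not appear in the specification: $y_i$ is the label of the $i$-th URL, $x_i$ its feature vector of length $p$, $\beta_0$ and $\beta$ the intercept and coefficients, $n$ the number of training records, and $\lambda$ the regularization strength. The $1/(2n)$ scaling is one common convention among several.

```latex
(\hat{\beta}_0, \hat{\beta}) \;=\; \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
\qquad \lambda \ge 0 .
```

Larger values of $\lambda$ drive more coefficients exactly to zero; a feature $j$ with $\hat{\beta}_j = 0$ is one that this stage would eliminate.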
The invention's core lies in its innovative feature selection algorithm, designed to enhance the efficacy of phishing detection systems. The algorithm operates through the following stages:
Data Collection: Aggregates a comprehensive dataset comprising phishing and legitimate URLs from reputable sources such as OpenPhish and PhishTank. These datasets encompass over 25 million records, providing a robust foundation for model training and evaluation.
Feature Extraction: Derives an extensive array of features from the collected URLs, including but not limited to URL length, presence of subdomains, use of HTTPS, domain age, and page rank. These features encapsulate various characteristics that may indicate phishing attempts.
Lasso Regression Application: Implements Lasso Regression to perform feature selection by imposing a constraint on the sum of the absolute values of the model parameters. This constraint effectively shrinks some coefficients to zero, facilitating the elimination of non-contributory features and resulting in a more parsimonious model.
Decision Tree Classifier Refinement: Employs Decision Tree classifiers to further refine the feature set. Decision Trees assess feature importance by evaluating the reduction in impurity they provide, enabling the identification of features that most significantly influence the model's predictive capability.
Parallel Processing Implementation: Integrates a parallel processing framework wherein multiple Decision Trees operate concurrently on different feature subsets. This approach accelerates the feature selection process and enhances the model's adaptability to various phishing strategies by exploring a diverse combination of features simultaneously.
Model Training and Evaluation: Trains the optimized model using the refined feature set and evaluates its performance based on metrics such as detection accuracy and false positive rate.
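As a minimal sketch of the Feature Extraction stage described above, the following Python function derives three of the named features (URL length, presence of subdomains, use of HTTPS) from a URL string. The function name `extract_url_features`, the feature keys, and the example URL are hypothetical and chosen for illustration; domain age and page rank are omitted because they would require external lookups that the specification does not detail.

```python
from urllib.parse import urlsplit


def extract_url_features(url: str) -> dict:
    """Return a few lexical features for one URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    # Assumption: everything left of the last two labels is a subdomain.
    # This is a simplification; it miscounts suffixes such as "co.uk"
    # and hosts given as raw IP addresses.
    num_subdomains = max(len(labels) - 2, 0)
    return {
        "url_length": len(url),
        "num_subdomains": num_subdomains,
        "has_subdomain": int(num_subdomains > 0),
        "uses_https": int(parts.scheme == "https"),
    }


# Hypothetical example URL, not taken from any dataset.
example_url = "https://login.example.com/account?id=1"
features = extract_url_features(example_url)
print(features)
```

Each record in the collected dataset could be passed through such a function to build the feature matrix that the later selection stages operate on.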
Figure 1 shows a flowchart of the sequential application of Lasso Regression followed by Decision Tree classification for feature selection. The flowchart proceeds through the following steps:
1. Data Collection: Aggregate phishing and legitimate URLs from sources such as OpenPhish and PhishTank.
2. Feature Extraction: Extract relevant features from the collected URLs.
3. Lasso Regression Application: Apply Lasso Regression to eliminate irrelevant features by shrinking some coefficients to zero.
4. Decision Tree Classification: Use Decision Tree classifiers to further refine the feature set by assessing feature importance.
5. Model Training and Evaluation: Train the model with the optimized feature set and evaluate its performance.
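The serial steps above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the data are synthetic (generated with `make_classification`) rather than real URL features, and the values `alpha=0.02`, `max_depth=5`, and the top-5 cut-off are arbitrary choices that the specification does not fix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2 (stand-in): synthetic matrix in place of extracted URL features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Lasso is scale-sensitive, so standardize using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 3: fit Lasso on the 0/1 labels and keep features with nonzero weight.
lasso = Lasso(alpha=0.02).fit(X_train_s, y_train)
lasso_keep = np.flatnonzero(lasso.coef_)
if lasso_keep.size == 0:
    # Fallback (assumption): if everything was shrunk to zero, keep all.
    lasso_keep = np.arange(X.shape[1])

# Step 4: rank the surviving features by impurity-based tree importance.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train_s[:, lasso_keep], y_train)
top_k = min(5, lasso_keep.size)
order = np.argsort(tree.feature_importances_)[::-1][:top_k]
selected = lasso_keep[order]

# Step 5: train and evaluate a model on the refined feature set.
final_model = DecisionTreeClassifier(max_depth=5, random_state=0)
final_model.fit(X_train_s[:, selected], y_train)
accuracy = final_model.score(X_test_s[:, selected], y_test)
print("selected feature indices:", selected.tolist())
print("held-out accuracy:", round(accuracy, 3))
```

Because the Decision Tree only sees features that survived the Lasso stage, the second stage refines rather than repeats the first; the printed accuracy reflects the synthetic data only and says nothing about real phishing datasets.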
Figure 2 shows the parallel processing framework, which utilizes multiple Decision Trees to evaluate diverse feature subsets concurrently. The framework comprises the following steps:
1. Feature Subset Generation: Divide the feature set into multiple subsets.
2. Parallel Decision Trees: Assign each feature subset to a separate Decision Tree classifier running in parallel.
3. Aggregation of Results: Combine the outputs of all Decision Trees to form a comprehensive feature selection model.
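One possible reading of the three parallel steps above is sketched below. Several details are assumptions made for illustration, since the specification does not fix them: the features are split into three disjoint random subsets, each tree is scored on a held-out validation split, and results are aggregated by weighting each tree's feature importances by its validation accuracy. The data are synthetic, and a thread pool is used only to keep the sketch short; a process pool may suit CPU-bound workloads better.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted URL feature matrix (assumption).
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Step 1: divide the feature indices into disjoint subsets.
rng = np.random.default_rng(1)
subsets = np.array_split(rng.permutation(X.shape[1]), 3)


def evaluate_subset(cols):
    """Fit one tree on one feature subset; return importances and accuracy."""
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_tr[:, cols], y_tr)
    return cols, tree.feature_importances_, tree.score(X_val[:, cols], y_val)


# Step 2: one Decision Tree per subset, submitted to run concurrently.
with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
    results = list(pool.map(evaluate_subset, subsets))

# Step 3: aggregate (assumed rule) by accuracy-weighted importance.
scores = np.zeros(X.shape[1])
for cols, importances, val_acc in results:
    scores[cols] += importances * val_acc

ranking = np.argsort(scores)[::-1]
selected = ranking[:4]  # keep the four highest-scoring features (arbitrary)
print("aggregated feature ranking:", ranking.tolist())
```

Weighting by validation accuracy means that features from a subset whose tree generalizes poorly contribute less to the final ranking; other aggregation rules, such as voting across overlapping subsets, would fit the same three-step outline.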
Figure 3 shows the system architecture, in which the proposed feature selection algorithm is integrated into an existing phishing detection framework. The architecture comprises the following modules:
1. Input Layer: Receive incoming URLs for analysis.
2. Feature Extraction Module: Extract features from the URLs.
3. Feature Selection Module: Apply the proposed algorithm to select the most relevant features.
4. Classification Module: Use the refined feature set to classify URLs as phishing or legitimate.
5. Output Layer: Provide the classification result to the user.
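The module chain above might be wired together as in the following sketch. The class `PhishingPipeline`, the helper functions, the feature names, and the example URLs are all hypothetical; in particular, `toy_classify` is a hand-written placeholder rule standing in for a trained classifier and is not a usable detector.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.parse import urlsplit

FeatureDict = Dict[str, float]


@dataclass
class PhishingPipeline:
    extract: Callable[[str], FeatureDict]    # Feature Extraction Module
    selected_features: List[str]             # output of Feature Selection Module
    classify: Callable[[List[float]], bool]  # Classification Module

    def predict(self, url: str) -> str:
        """Input Layer to Output Layer for a single URL."""
        features = self.extract(url)
        # Keep only the features chosen by the selection step, in order.
        vector = [features[name] for name in self.selected_features]
        return "phishing" if self.classify(vector) else "legitimate"


def toy_extract(url: str) -> FeatureDict:
    """Placeholder extractor producing three lexical features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": float(len(url)),
        "uses_https": float(parts.scheme == "https"),
        "num_dots": float(host.count(".")),
    }


def toy_classify(vector: List[float]) -> bool:
    """Placeholder rule (assumption): no HTTPS and a deeply nested host."""
    uses_https, num_dots = vector
    return uses_https == 0.0 and num_dots >= 3


pipeline = PhishingPipeline(
    extract=toy_extract,
    selected_features=["uses_https", "num_dots"],
    classify=toy_classify,
)
result_a = pipeline.predict("http://secure.login.example.test/verify")
result_b = pipeline.predict("https://example.com/")
print(result_a, result_b)
```

Keeping the selected feature names as plain data means the Feature Selection Module can be re-run offline and its output swapped in without changing the extraction or classification code.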
Figure 4 shows the performance comparison graphs for detection accuracy and false positive rates. These graphs compare the two performance metrics, detection accuracy and false positive rate, before and after implementing the proposed feature selection algorithm:
● Detection Accuracy: Illustrates the improvement in correctly identifying phishing URLs post-implementation.
● False Positive Rates: Shows the reduction in legitimate URLs being incorrectly flagged as phishing.
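For clarity, the two metrics plotted in these graphs can be computed from predicted and true labels as sketched below, using their standard definitions: accuracy is the share of all URLs classified correctly, and false positive rate is the share of legitimate URLs flagged as phishing. The function name and the label vectors are made up for illustration and are not results of the invention.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy and false positive rate, with 1 = phishing, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "false_positive_rate": false_positive_rate}


# Made-up labels for eight URLs: 2 TP, 4 TN, 1 FP, 1 FN.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)  # accuracy 6/8 = 0.75, false positive rate 1/5 = 0.2
```

Computing both metrics on the same held-out data before and after feature selection gives the kind of paired comparison the graphs describe.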
While various embodiments of the present disclosure have been illustrated and described herein, it will be clear that the disclosure is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the disclosure, as described in the claims.
Claims: An algorithm for enhanced phishing detection through integrated feature engineering and machine learning optimization, comprising:
| # | Name | Date |
|---|---|---|
| 1 | 202521044466-REQUEST FOR EARLY PUBLICATION(FORM-9) [07-05-2025(online)].pdf | 2025-05-07 |
| 2 | 202521044466-FORM-9 [07-05-2025(online)].pdf | 2025-05-07 |
| 3 | 202521044466-FORM 1 [07-05-2025(online)].pdf | 2025-05-07 |
| 4 | 202521044466-DRAWINGS [07-05-2025(online)].pdf | 2025-05-07 |
| 5 | 202521044466-COMPLETE SPECIFICATION [07-05-2025(online)].pdf | 2025-05-07 |
| 6 | Abstract.jpg | 2025-05-24 |