Abstract: The present invention relates to a system for phishing detection. The invention integrates Lasso Regression for initial feature elimination, removing irrelevant and redundant data points, followed by Decision Tree classifiers for refined feature importance assessment. Additionally, a parallel processing framework is employed wherein multiple Decision Trees concurrently evaluate diverse feature subsets, improving adaptability to evolving phishing tactics. The system comprises interconnected modules for data collection, feature extraction, Lasso-based pruning, Decision Tree refinement, parallel evaluation, and model training. By systematically prioritizing the most informative features, the invention improves detection accuracy, reduces false positives, and optimizes computational efficiency, enabling real-time phishing detection. This dual-stage selection methodology ensures robust, scalable, and interpretable model development for secure web environments. The proposed invention significantly advances the effectiveness and responsiveness of machine learning-based phishing detection frameworks.
Description:
TECHNICAL FIELD OF INVENTION
The present invention is related to the field of computer science and engineering. More specifically, it relates to a system for phishing detection.
BACKGROUND OF THE INVENTION
The background information herein below relates to the present disclosure but is not necessarily prior art.
Phishing attacks have emerged as a formidable threat in the digital landscape, leading to substantial financial losses and compromising sensitive information. Traditional detection methods, such as blacklisting known malicious URLs, heuristic-based analyses, and manual feature engineering, have proven inadequate against the evolving sophistication of phishing tactics. Blacklisting struggles with zero-day attacks due to its reactive nature, while heuristic methods are labor-intensive and often lack scalability. Machine learning approaches, though promising, frequently grapple with challenges related to feature selection, resulting in models that are computationally expensive and prone to high false positive rates. These limitations underscore the pressing need for an innovative solution that enhances detection accuracy, reduces false positives, and operates efficiently in real-time environments.
CN114095278B relates to a method for detecting phishing websites based on a hybrid feature selection framework. The method adopts a new design strategy that determines the optimal feature cut-off position and generates target features based on the preset features of each primary selection type together with a model prediction time index and an accuracy rate index. The target feature group is then sent to a decision tree classifier for model adjustment and model training, yielding a phishing webpage detection model for use in a phishing website detection system. The scheme as a whole builds on the hybrid feature selection framework, which improves the stability of feature selection and overcomes the imbalance between accuracy and system detection rate caused by earlier methods that set thresholds manually, thereby improving the efficiency and accuracy of phishing website detection and strengthening the overall protection capability of the network.
US20160344770A1 relates to a comprehensive scheme to detect phishing emails using features that are invariant and fundamentally characterize phishing. Multiple embodiments are described based on combinations of text analysis, header analysis, and link analysis, and these embodiments operate between a user's mail transfer agent (MTA) and mail user agent (MUA). One embodiment, PhishNet-NLP™, utilizes natural language techniques along with all information present in an email, namely the header, links, and text in the body. Another embodiment, PhishSnag™, uses information extracted from the embedded links in the email and the email headers to detect phishing. A further embodiment, Phish-Sem™, uses natural language processing and statistical analysis on the body of labeled phishing and non-phishing emails to design four variants of an email-body-text only classifier. The scheme is designed to detect phishing at the email level.
OBJECTIVE OF THE INVENTION
The primary objective of the present invention is to provide a system for phishing detection.
Yet another objective of the invention is to improve the ability to accurately identify phishing attempts by selecting the most informative and relevant features.
Yet another objective of the invention is to reduce the time and resources required for model training and inference, enabling real-time application.
Yet another objective of the invention is to evaluate multiple feature subsets simultaneously, enhancing adaptability to various phishing strategies.
SUMMARY OF THE INVENTION
Accordingly, the present invention provides a system for phishing detection. The invention integrates Lasso Regression for initial feature elimination, removing irrelevant and redundant data points, followed by Decision Tree classifiers for refined feature importance assessment. Additionally, a parallel processing framework is employed wherein multiple Decision Trees concurrently evaluate diverse feature subsets, improving adaptability to evolving phishing tactics. The system comprises interconnected modules for data collection, feature extraction, Lasso-based pruning, Decision Tree refinement, parallel evaluation, and model training.
By systematically prioritizing the most informative features, the invention improves detection accuracy, reduces false positives, and optimizes computational efficiency, enabling real-time phishing detection. This dual-stage selection methodology ensures robust, scalable, and interpretable model development for secure web environments. The proposed invention significantly advances the effectiveness and responsiveness of machine learning-based phishing detection frameworks.
BRIEF DESCRIPTION OF DRAWING
Figure 1 of Sheet 1 illustrates the block diagram of the present invention.
Figure 2 of Sheet 1 illustrates the flowchart of the present invention.
Figure 3 of Sheet 2 illustrates the graphs of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The present invention is related to a system for phishing detection. The primary objective of this invention is to introduce a novel feature selection algorithm that systematically identifies and optimizes the most relevant features for phishing detection. By integrating both serial and parallel feature selection strategies, the invention enhances detection accuracy by focusing on the most informative attributes, reduces false positives by minimizing incorrect flagging of legitimate activities, and optimizes computational efficiency to support real-time application without significant resource consumption.
The invention introduces a feature selection algorithm employing a dual approach comprising Lasso Regression and Decision Tree classifiers. Lasso Regression is used to eliminate irrelevant and redundant features, simplifying the model and improving interpretability, while Decision Tree classifiers further refine features by leveraging their hierarchical structure to evaluate feature importance and interactions. Additionally, the algorithm integrates a parallel processing mechanism involving multiple Decision Trees to simultaneously explore diverse feature subsets, enhancing the model’s robustness and adaptability to varied phishing strategies.
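By way of non-limiting illustration, the two selection criteria described above may be written in their standard textbook forms. The specification does not fix a particular formulation, so the symbols below are assumptions: n training samples, p candidate features, feature vector x_i with label y_i, coefficient vector β, penalty weight λ, and Gini impurity as the impurity measure I.

```latex
% Lasso objective: the L1 penalty drives some coefficients exactly to zero
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}}
  \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Impurity reduction when a tree node S is split into S_L and S_R
\Delta I(S) = I(S) - \frac{|S_L|}{|S|} I(S_L) - \frac{|S_R|}{|S|} I(S_R),
\qquad I(S) = 1 - \sum_{k} p_k^{2}
```

Under this reading, a feature j is discarded in the first stage when its fitted coefficient is zero, and the surviving features are ranked in the second stage by the impurity reduction accumulated over the splits that use them.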
In addressing the deficiencies of existing phishing detection methodologies, this invention introduces an optimized feature selection algorithm that combines Lasso Regression and Decision Tree classifiers in both serial and parallel configurations. By systematically identifying and prioritizing the most relevant features, the algorithm enhances detection accuracy, reduces false positives, and improves computational efficiency.
The parallel processing aspect allows for the simultaneous evaluation of multiple feature subsets, ensuring adaptability to the dynamic nature of phishing attacks. This approach not only streamlines the model development process but also fortifies the overall resilience of the phishing detection system. Figures 1 and 2 illustrate the sequential and parallel approaches employed in the proposed feature selection algorithm.
Figure 1 outlines the step-by-step process: initial data collection from phishing and legitimate URL sources such as OpenPhish and PhishTank; feature extraction; application of Lasso Regression to eliminate irrelevant features by shrinking their coefficients to zero; Decision Tree classification to further refine the feature set based on feature importance; and, finally, model training and evaluation using the optimized features.
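As a non-limiting sketch of the Lasso-based elimination step, the following pure-Python example fits a Lasso model by cyclic coordinate descent and keeps only the features whose coefficients remain non-zero. The solver, the penalty weight `alpha`, the toy data, and the feature names are all assumptions made for illustration; the specification does not prescribe them, and a practical embodiment would more likely rely on an established numerical library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of targets, both assumed already centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Correlation of feature j with the residual that excludes its own term.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Hypothetical toy data: the target depends strongly on the first feature,
# weakly on the second, and not at all on the third.
feature_names = ["url_length", "has_subdomain", "page_rank"]
X = [[1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [1.05, 0.95, -0.95, -1.05]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
kept = [name for name, w in zip(feature_names, weights) if w != 0.0]
print(kept)
```

In this toy case the weak and the unrelated features fall below the penalty threshold and are removed, which is the behaviour the first selection stage relies on. A larger `alpha` prunes more aggressively, so the value would need tuning on real data.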
Figure 2 presents a parallel processing framework in which the complete feature set is divided into multiple subsets, each evaluated independently by separate Decision Tree classifiers running concurrently. The results from these trees are then aggregated to build a comprehensive and robust feature selection model.
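The parallel evaluation may be sketched, purely for illustration, as follows. Each worker scores one feature subset and the per-subset scores are then merged. Several simplifications are assumed here and are not taken from the specification: a depth-1 tree (a single threshold split) stands in for a full Decision Tree, Gini impurity is the impurity measure, aggregation keeps the best score seen for each feature, and the data and feature names are hypothetical. Threads are used for brevity; for CPU-bound work in CPython a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_gini_reduction(column, labels):
    """Largest impurity reduction achievable by one threshold split on a column."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for threshold in sorted(set(column))[:-1]:
        left = [lab for val, lab in zip(column, labels) if val <= threshold]
        right = [lab for val, lab in zip(column, labels) if val > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def score_subset(subset, columns, labels):
    """Worker: score every feature of one subset with a single-split tree."""
    return {name: best_gini_reduction(columns[name], labels) for name in subset}


def parallel_feature_scores(subsets, columns, labels, max_workers=4):
    """Evaluate the subsets concurrently, then merge the per-subset scores."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_subset, s, columns, labels) for s in subsets]
        for future in futures:
            for name, score in future.result().items():
                merged[name] = max(score, merged.get(name, 0.0))
    return merged


# Hypothetical data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "url_length": [90, 85, 70, 30, 25, 40],
    "uses_https": [0, 0, 1, 1, 1, 0],
    "has_subdomain": [1, 0, 1, 0, 1, 0],
    "domain_age_days": [10, 20, 15, 900, 1200, 700],
}
subsets = [["url_length", "uses_https"], ["has_subdomain", "domain_age_days"]]

scores = parallel_feature_scores(subsets, columns, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

Because each subset is scored independently, the workers share no mutable state and the merge step is the only point of coordination.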
The system architecture diagram illustrates the integration of the proposed feature selection algorithm into an existing phishing detection framework, beginning with an input layer that receives incoming URLs for analysis. A feature extraction module processes these URLs to derive relevant attributes, followed by a feature selection module that applies the proposed algorithm to identify the most significant features. The refined feature set is then used by the classification module to categorize URLs as either phishing or legitimate, with the output layer presenting the final classification result to the user.
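For illustration only, the feature extraction module might derive simple lexical attributes from an incoming URL as sketched below. The function name, the output keys, and the two-label rule used to detect a subdomain are assumptions of this sketch. Attributes that require external lookups, such as registration data or ranking services, are deliberately left out.

```python
from urllib.parse import urlparse


def extract_url_features(url):
    """Derive simple lexical attributes from a URL string (no network access)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    host_labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        # Naive assumption: the registered domain is the last two labels, so any
        # extra label counts as a subdomain. This misfires for suffixes such as
        # "co.uk" and for raw IP addresses; a public-suffix list would fix that.
        "has_subdomain": int(len(host_labels) > 2),
    }


# Hypothetical inputs using the reserved example.com domain.
suspicious = extract_url_features("https://login.example.com/account")
plain = extract_url_features("http://example.com/")
print(suspicious)
print(plain)
```

The resulting dictionaries can be stacked into a feature matrix for the feature selection module that follows in the architecture.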
The invention centers on a feature selection algorithm aimed at improving the performance of phishing detection systems through several key stages. It begins with data collection, aggregating a large dataset of phishing and legitimate URLs from reputable sources such as OpenPhish and PhishTank, encompassing over 25 million records. Feature extraction follows, deriving a broad set of attributes from the URLs, including URL length, subdomain presence, HTTPS usage, domain age, and page rank. Lasso Regression is then applied to eliminate non-contributory features by shrinking their coefficients to zero, creating a simplified model. This feature set is further refined using Decision Tree classifiers that assess feature importance based on impurity reduction. A parallel processing framework enhances efficiency by running multiple Decision Trees on different feature subsets simultaneously, improving adaptability to diverse phishing tactics. Finally, the optimized model is trained and evaluated using metrics such as detection accuracy and false positive rate.
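The two evaluation metrics named above can be computed from the confusion counts, as in the following minimal sketch. The label convention (1 for phishing, 0 for legitimate), the function name, and the sample predictions are assumptions made for illustration and are not results reported by the invention.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy and false positive rate, with 1 = phishing and 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as phishing
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "false_positive_rate": false_positive_rate}


# Hypothetical ground truth and predictions for eight URLs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

The false positive rate is taken over legitimate URLs only, so it directly measures how often legitimate activity is incorrectly flagged.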
While various embodiments of the present disclosure have been illustrated and described herein, it will be clear that the disclosure is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the disclosure, as described in the claims.
Claims:
1. A system for phishing detection.
| # | Name | Date |
|---|---|---|
| 1 | 202521035836-REQUEST FOR EARLY PUBLICATION(FORM-9) [12-04-2025(online)].pdf | 2025-04-12 |
| 2 | 202521035836-FORM-9 [12-04-2025(online)].pdf | 2025-04-12 |
| 3 | 202521035836-FORM 1 [12-04-2025(online)].pdf | 2025-04-12 |
| 4 | 202521035836-DRAWINGS [12-04-2025(online)].pdf | 2025-04-12 |
| 5 | 202521035836-COMPLETE SPECIFICATION [12-04-2025(online)].pdf | 2025-04-12 |
| 6 | 202521035836-FORM-5 [21-04-2025(online)].pdf | 2025-04-21 |
| 7 | 202521035836-FORM 3 [21-04-2025(online)].pdf | 2025-04-21 |
| 8 | 202521035836-ENDORSEMENT BY INVENTORS [21-04-2025(online)].pdf | 2025-04-21 |