Abstract: METHOD AND SYSTEM FOR VULNERABILITY MANAGEMENT. A method for managing vulnerabilities in a system is provided. A first training dataset is created by a server. The first training dataset includes a first plurality of vulnerabilities labeled with a first label and a second plurality of vulnerabilities, selected from an unlabeled dataset, labeled with a second label. The first label is assigned to a first vulnerability type and the second label is assigned to a second vulnerability type. A machine learning model is trained in a first stage of training, using the first training dataset. The server mines, from the unlabeled dataset, using the trained machine learning model, a third plurality of vulnerabilities to be pseudo-labeled with the second label. A second training dataset is created using the first plurality of vulnerabilities and the pseudo-labeled third plurality of vulnerabilities. The machine learning model is re-trained, in a second stage of training, using the second training dataset. [FIG. 1]
Description: METHOD AND SYSTEM FOR VULNERABILITY MANAGEMENT
BACKGROUND
FIELD OF THE DISCLOSURE
Various embodiments of the disclosure relate generally to vulnerability management. More specifically, various embodiments of the disclosure relate to methods and systems for detection and classification of vulnerabilities.
DESCRIPTION OF THE RELATED ART
Advancements in technology have led to large-scale digitization of operations by businesses. Operations such as trading stocks on the stock market, credit card payments, tracking of shipments, or the like, are now conducted digitally, for example, by way of computers. Various mechanisms have been developed to safeguard digital operations such as electronic storage of data, electronic transfer of data, execution of electronic transactions, or the like. Companies and governments have developed compliance policies that enable pre-emptive detection and mitigation of vulnerabilities in computer systems or computer networks.
A vulnerability may be a weakness in hardware or software associated with a computer system or a computer network. A vulnerability, if exploited, may result in loss of confidentiality of data, loss of integrity of data, loss of availability of computing resources, or the like. However, given the scale of operations at medium-sized or large-sized entities (e.g., companies, organizations, governments, or the like), it may not be feasible to detect all vulnerabilities and deal with all such vulnerabilities immediately upon detection. Therefore, classification of vulnerabilities into various classes or types (e.g., serious and non-serious, severe and non-severe, or the like) may enable efficient allocation of resources to deal with these vulnerabilities. For example, vulnerabilities that are deemed to cause compliance issues or violation of digital policy may need to be dealt with before vulnerabilities that are deemed less serious or minor. However, classification of vulnerabilities as severe or non-severe is context-specific and differs, for example, from company to company, from domain to domain, or the like. For example, a vulnerability that is deemed serious by a company operating in the medical industry may not be deemed serious by a company operating in the retail industry. Conventional solutions for facilitating context-specific classification of vulnerabilities include manual labeling of vulnerabilities (e.g., by human experts) and training machine learning models based on the labeled vulnerabilities. However, manual labeling of vulnerabilities is a labour-intensive and time-intensive activity, and is, therefore, inefficient.
In light of the foregoing, there exists a need for a technical and reliable solution that overcomes the abovementioned problems and facilitates context-specific classification of vulnerabilities.
SUMMARY
Methods and systems for managing vulnerabilities are provided substantially as shown in, and described in connection with, at least one of the figures, as set forth more completely in the claims.
In an embodiment of the present disclosure, a method for managing vulnerabilities in a system is provided. The method includes creating a first training dataset. The first training dataset includes a first plurality of vulnerabilities, each correctly labeled with a first label that is assigned to a first vulnerability type. The first training dataset further includes a second plurality of vulnerabilities selected from an unlabeled dataset and labeled with a second label assigned to a second vulnerability type, irrespective of whether the second plurality of vulnerabilities correspond to the first vulnerability type or the second vulnerability type. A machine learning model is trained in a first stage of training, using the first training dataset. A third plurality of vulnerabilities to be pseudo-labeled with the second label are mined, from the unlabeled dataset, using the trained machine learning model. A second training dataset that includes the labeled first plurality of vulnerabilities and the pseudo-labeled third plurality of vulnerabilities is created. The trained machine learning model is re-trained in a second stage of training, using the second training dataset.
In another embodiment of the present disclosure, a system for managing vulnerabilities is provided. The system includes a server configured to create a first training dataset. The first training dataset includes a first plurality of vulnerabilities, each correctly labeled with a first label that is assigned to a first vulnerability type. The first training dataset further includes a second plurality of vulnerabilities selected from an unlabeled dataset and labeled with a second label assigned to a second vulnerability type, irrespective of whether the second plurality of vulnerabilities correspond to the first vulnerability type or the second vulnerability type. The server is further configured to train a machine learning model in a first stage of training, using the first training dataset. The server is further configured to mine, using the trained machine learning model, a third plurality of vulnerabilities, from the unlabeled dataset, to be pseudo-labeled with the second label. The server is further configured to create a second training dataset that includes the labeled first plurality of vulnerabilities and the pseudo-labeled third plurality of vulnerabilities. The server is further configured to re-train the trained machine learning model in a second stage of training, using the second training dataset.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a block diagram that illustrates a system environment for managing vulnerabilities, in accordance with an exemplary embodiment of the present disclosure;
FIG. 2 represents a table that illustrates a labeled dataset, in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 represents a table that illustrates an unlabeled dataset, in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 represents a table that illustrates a first training dataset, in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram that illustrates training of a first machine learning model, in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram that illustrates a second stage of training of the trained first machine learning model, in accordance with an exemplary embodiment of the disclosure;
FIG. 7 represents a table that illustrates a second training dataset, in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 is a process flow diagram that illustrates classification of vulnerabilities, in accordance with an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram that illustrates an application server of FIG. 1, in accordance with an exemplary embodiment of the present disclosure;
FIG. 10 is a block diagram that illustrates a system architecture of a computer system, in accordance with an embodiment of the present disclosure;
FIGS. 11A-11C, collectively, represent a flowchart that illustrates a method for training the first machine learning model for vulnerability classification, in accordance with an embodiment of the present disclosure;
FIG. 12 is a flowchart that illustrates a method for changing a label of a vulnerability, in accordance with an embodiment of the present disclosure; and
FIG. 13 is a high-level flowchart that illustrates a method for managing vulnerabilities in a system, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
The present disclosure is best understood with reference to the detailed figures and description set forth herein. Various embodiments are discussed below with reference to the figures. However, those skilled in the art will readily appreciate that the detailed descriptions given herein with respect to the figures are simply for explanatory purposes as the methods and systems may extend beyond the described embodiments. In one example, the teachings presented and the needs of a particular application may yield multiple alternate and suitable approaches to implement the functionality of any detail described herein. Therefore, any approach may extend beyond the particular implementation choices in the following embodiments that are described and shown.
References to “an embodiment”, “another embodiment”, “yet another embodiment”, “one example”, “another example”, “yet another example”, “for example”, and so on, indicate that the embodiment(s) or example(s) so described may include a particular feature, structure, characteristic, property, element, or limitation, but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element or limitation. Furthermore, repeated use of the phrase “in an embodiment” does not necessarily refer to the same embodiment.
OVERVIEW
Various embodiments of the present disclosure provide a method and a system for managing vulnerabilities. The system includes a server (e.g., an application server) that creates a first training dataset. The first training dataset includes a first plurality of vulnerabilities, each correctly labeled with a first label that is assigned to a first vulnerability type. The first training dataset further includes a second plurality of vulnerabilities selected from an unlabeled dataset and labeled with a second label assigned to a second vulnerability type, irrespective of whether the second plurality of vulnerabilities correspond to the first vulnerability type or the second vulnerability type. The server trains a machine learning model in a first stage of training, using the first training dataset. Using the trained machine learning model, the server mines, from the unlabeled dataset, a third plurality of vulnerabilities to be pseudo-labeled with the second label. The server creates a second training dataset that includes the labeled first plurality of vulnerabilities and the pseudo-labeled third plurality of vulnerabilities. The server re-trains the trained machine learning model for vulnerability classification in a second stage of training using the second training dataset. The server may implement ‘n’ stages of training and iteratively re-train the trained machine learning model until a desired accuracy level (for example, an accuracy level greater than or equal to a threshold accuracy) is achieved.
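In a non-limiting example, the staged training described above may be outlined in Python as follows. The `train_fn` and `accuracy_fn` callables, the model's `score` method, and the threshold values are illustrative assumptions of this sketch, not requirements of the disclosed method:

```python
import random

FIRST_LABEL, SECOND_LABEL = 1, 0  # e.g., 1 = severe, 0 = non-severe

def staged_training(labeled, unlabeled, train_fn, accuracy_fn,
                    threshold_accuracy=0.95, max_stages=10,
                    pseudo_threshold=0.25, sample_size=None, seed=0):
    """Iteratively train and re-train a vulnerability classifier.

    `train_fn(dataset)` returns a model exposing `model.score(v) -> [0, 1]`;
    `accuracy_fn(model)` returns a held-out accuracy. Both stand in for the
    actual model and evaluation procedure, which this sketch assumes.
    """
    rng = random.Random(seed)
    sample_size = sample_size or len(labeled)
    # First stage: correctly labeled first-type vulnerabilities plus a random
    # sample of unlabeled ones blanket-labeled with the second label.
    dataset = ([(v, FIRST_LABEL) for v in labeled]
               + [(v, SECOND_LABEL) for v in rng.sample(unlabeled, sample_size)])
    model = train_fn(dataset)
    for _ in range(max_stages - 1):
        if accuracy_fn(model) >= threshold_accuracy:
            break  # desired accuracy level achieved
        # Mine low-scoring vulnerabilities and pseudo-label them second-type.
        mined = [(v, SECOND_LABEL) for v in unlabeled
                 if model.score(v) <= pseudo_threshold]
        dataset = [(v, FIRST_LABEL) for v in labeled] + mined
        model = train_fn(dataset)  # subsequent stage of training
    return model
```

Each pass through the loop corresponds to one additional stage of training; the loop exits once the accuracy check is satisfied or the stage budget is exhausted.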
Thus, the present disclosure utilizes the principles of semi-supervised learning to pseudo-label vulnerabilities included in the unlabeled dataset and train the machine learning model for vulnerability classification. Since the machine learning model is iteratively re-trained, the accuracy of the machine learning model may be continually improved, leading to accurate classification of vulnerabilities. Since manual labeling of vulnerabilities is not required after reception of the labeled dataset having only one labeled class, there may be significant savings of time and effort in generating training datasets and training the machine learning model for vulnerability classification.
TERMS DESCRIPTION (in addition to plain and dictionary meaning)
A vulnerability is a weakness, a threat, or a flaw present in a system (e.g., a server, a user device, a communication network, or the like) associated with an entity. Vulnerabilities may compromise the security (e.g., digital security) of the entity. In other words, a vulnerability may correspond to a security threat or flaw that, if exploited, may compromise the security of the entity. Vulnerabilities may be of various types such as, but not limited to, hardware vulnerabilities, software vulnerabilities, network vulnerabilities, or the like. Vulnerabilities may be of a first vulnerability type (e.g., severe) or a second vulnerability type (e.g., non-severe).
Vulnerability type of a vulnerability is indicative of a severity level of the vulnerability. Vulnerabilities that correspond to the first vulnerability type (e.g., severe) have a severity level greater than or equal to a threshold severity level. Vulnerabilities that correspond to the second vulnerability type (e.g., non-severe) have a severity level less than the threshold severity level.
Identifier of a vulnerability refers to a numeric or an alpha-numeric code that uniquely identifies the vulnerability. For example, the identifier of the vulnerability may be a common vulnerabilities and exposures (CVE) identifier that uniquely identifies the vulnerability.
Description of a vulnerability refers to a natural language description (e.g., English description) of the vulnerability. For example, the description of the vulnerability may be indicative of a cause and an effect of the vulnerability, a set of risks associated with the vulnerability, or the like.
Label refers to a numeric or an alpha-numeric code associated with a vulnerability for qualifying the vulnerability as the first vulnerability type or the second vulnerability type. The first label is assigned to the first vulnerability type. The second label is assigned to the second vulnerability type. When a vulnerability is labeled with the first label, it is assumed that the vulnerability is of the first vulnerability type. When a vulnerability is labeled with the second label, it is assumed that the vulnerability is of the second vulnerability type.
Unlabeled dataset refers to a dataset that is indicative of a plurality of vulnerabilities but excludes a label for the plurality of vulnerabilities. For example, the plurality of vulnerabilities included in the unlabeled dataset may not be associated with any label that indicates whether each of the plurality of vulnerabilities is a severe vulnerability or a non-severe vulnerability.
Machine learning model refers to a statistical model trained using semi-supervised learning techniques for vulnerability classification. The machine learning model, when trained, classifies vulnerabilities into the first vulnerability type or the second vulnerability type. The machine learning model further generates vulnerability scores for vulnerabilities and classifies the vulnerabilities as the first vulnerability type or the second vulnerability type, based on the vulnerability scores. Examples of the machine learning model may include a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN) such as a long short-term memory (LSTM) network, or an artificial neural network that combines RNN and CNN architectures.
A training dataset is a collection or aggregation of data used to train a machine learning model for vulnerability classification. The training dataset includes identifiers of vulnerabilities, descriptions of the vulnerabilities, and labels assigned to the vulnerabilities. The machine learning model learns correlations between the identifiers, the descriptions, and labels of the vulnerabilities based on the training dataset that is provided as input to the machine learning model.
Vulnerability score refers to an output of a trained machine learning model for a vulnerability. The vulnerability score is indicative of a likelihood of the vulnerability being the first vulnerability type or the second vulnerability type. For example, the vulnerability score may be expressed as a probability (e.g., a percentage) that indicates whether the vulnerability is the first vulnerability type or the second vulnerability type. In another example, the vulnerability score may be expressed as a numeric value between a first vulnerability score limit and a second vulnerability score limit.
Pseudo-labeling refers to labeling of vulnerabilities (e.g., the first label or the second label), based on vulnerability scores determined by a trained machine learning model for the vulnerabilities. For example, a first vulnerability may be pseudo-labeled with the second label if a first vulnerability score for the first vulnerability is less than or equal to a first threshold vulnerability score. Similarly, a second vulnerability may be pseudo-labeled with the first label if a second vulnerability score for the second vulnerability is greater than or equal to a second threshold vulnerability score.
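In a non-limiting example, this two-threshold pseudo-labeling rule may be sketched as follows; the threshold values `0.25` and `0.9` and the label values are illustrative assumptions:

```python
def pseudo_label(score, low_threshold=0.25, high_threshold=0.9,
                 first_label=1, second_label=0):
    """Pseudo-label a vulnerability from its vulnerability score.

    Scores at or below `low_threshold` receive the second label; scores at
    or above `high_threshold` receive the first label; scores in between
    are left unlabeled (None). Threshold values are illustrative.
    """
    if score <= low_threshold:
        return second_label
    if score >= high_threshold:
        return first_label
    return None
```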
FIG. 1 is a block diagram that illustrates a system environment 100 for managing vulnerabilities, in accordance with an exemplary embodiment of the present disclosure. The system environment 100 includes an application server 102, a database server 104, and a plurality of user devices 106. The plurality of user devices 106 may include first through nth user devices 106a-106n. The application server 102, the database server 104, and the plurality of user devices 106 may interact with each other via a communication network 108.
The application server 102 may include suitable logic, circuitry, interfaces, and/or code, executable by the circuitry that may be configured to perform one or more operations for detection and/or management of vulnerabilities. Examples of the one or more operations include, but are not limited to, detection of vulnerabilities, classification of vulnerabilities, generation of patches and/or fixes for vulnerabilities, or the like. In a non-limiting example, the application server 102 may be configured to detect and classify vulnerabilities that may be present in devices or systems associated with an entity. Examples of the entity may include, but are not limited to, an information technology (IT) company, a financial institution (such as an issuer, an acquirer, and a payment network), a payments company, a digital security solutions provider, a governmental organization, a non-governmental organization, or the like.
The term “vulnerability” refers to a weakness, a flaw, a threat, or a bug present in any system (e.g., the application server 102, the plurality of user devices 106, the communication network 108, or the like) associated directly or indirectly with the entity. The term “vulnerability” may also refer to existence, within any system associated with the entity, of a flaw or an issue that contravenes a compliance policy, for example, a digital security policy, of the entity. In other words, a vulnerability may be any weakness, flaw, or bug that may compromise a security of the entity. Vulnerabilities may be of various types such as, but not limited to, hardware vulnerabilities, software vulnerabilities, network vulnerabilities, or the like. Examples of hardware vulnerabilities include, but are not limited to, flaws in hardware devices used for encryption, weaknesses in firmware, or the like. Examples of software vulnerabilities include, but are not limited to, structured query language (SQL) injection, privilege-confusion bugs, or the like. Examples of network vulnerabilities include, but are not limited to, lack of protection/encryption of communication lines, poor authentication processes, or the like. Vulnerabilities and types of vulnerabilities are well known to those of skill in the art. Hence, further details regarding the vulnerabilities and the types of vulnerabilities are omitted for the sake of brevity.
In one embodiment, the application server 102 may be configured to train a machine learning model (e.g., a first machine learning model) to classify vulnerabilities into various types, for example, a first vulnerability type, a second vulnerability type, or the like. In an example, the first machine learning model may be trained to classify vulnerabilities as severe (e.g., the first type of vulnerability) or non-severe (e.g., the second type of vulnerability) based on a determined severity level of each of the vulnerabilities. In one embodiment, the severity level of a vulnerability may refer to a potential degree of damage that may be inflicted on the application server 102 and/or the plurality of user devices 106, when the vulnerability is exploited. In other words, the severity level of the vulnerability may refer to an extent or degree to which the vulnerability contravenes a digital policy (e.g., compliance policy) of the entity that manages the application server 102. For the sake of brevity, the terms “first type of vulnerability” and “second type of vulnerability” are interchangeably referred to as “first vulnerability type” and “second vulnerability type” throughout the disclosure. A vulnerability may be classified as the first vulnerability type (e.g., severe) if a severity level of the vulnerability is greater than or equal to a threshold severity level. Similarly, the vulnerability may be classified as the second vulnerability type (e.g., non-severe), if the severity level of the vulnerability is less than the threshold severity level. For the sake of brevity, the terms “severe” and “serious” are used interchangeably throughout the disclosure. Similarly, the terms “non-severe” and “non-serious” are also used interchangeably throughout the disclosure.
The application server 102 may store, therein, a first dataset indicative of a first plurality of vulnerabilities. In other words, the first dataset includes information pertaining to the first plurality of vulnerabilities, for example, information associated with the first plurality of vulnerabilities or details of the first plurality of vulnerabilities. The first dataset may include a first plurality of descriptions of the first plurality of vulnerabilities and a first plurality of identifiers of the first plurality of vulnerabilities. Each of the first plurality of descriptions may include a natural language description (e.g., description in English) of a corresponding vulnerability.
For example, each of the first plurality of descriptions may include a definition of the corresponding vulnerability, information associated with a cause and effect of the corresponding vulnerability, details of the corresponding vulnerability, or the like. Each of the first plurality of identifiers may include a numeric code or an alpha-numeric code that uniquely identifies a corresponding vulnerability of the first plurality of vulnerabilities. The first dataset is a labeled dataset that includes a label for each of the first plurality of vulnerabilities. The label for each of the first plurality of vulnerabilities indicates whether a corresponding vulnerability is of the first vulnerability type or the second vulnerability type. In the current embodiment, it is assumed that the first plurality of vulnerabilities are labeled with a single label, for example, a first label that is assigned to the first vulnerability type. It is further assumed that the first plurality of vulnerabilities are correctly labeled with the first label and that each of the first plurality of vulnerabilities corresponds to (e.g., belongs to or is of) the first vulnerability type. In other words, the first dataset (e.g., the labeled dataset) includes the first plurality of descriptions, the first plurality of identifiers, and the first label for the first plurality of vulnerabilities that correspond to the first vulnerability type. For the sake of brevity, “the first dataset” is interchangeably referred to as “the labeled dataset” throughout the disclosure.
The application server 102 trains/re-trains the first machine learning model for vulnerability classification, using techniques of semi-supervised learning. Operations of the application server 102 are explained in conjunction with an “in operation” section of FIG. 1, FIG. 5, and FIG. 6.
Examples of the application server 102 may include, but are not limited to, a personal computer, a laptop, a mini-computer, a mainframe computer, a cloud-based server, a network of computer systems, or a non-transient and tangible machine executing a machine-readable code.
The database server 104 may include suitable logic, circuitry, interfaces, and/or code, executable by the circuitry that may be configured to store a database and perform one or more database operations. The database, stored in the database server 104, may include a second dataset that is indicative of (e.g., includes) a second plurality of vulnerabilities. The second dataset includes information pertaining to the second plurality of vulnerabilities. For example, the second dataset may include a second plurality of identifiers and/or a second plurality of descriptions for the second plurality of vulnerabilities. The second plurality of identifiers includes an identifier of each of the second plurality of vulnerabilities. The second plurality of descriptions includes a description for each of the second plurality of vulnerabilities. The second dataset may be an unlabeled dataset that excludes any label for the second plurality of vulnerabilities. In other words, the second plurality of vulnerabilities included in the second dataset may not be labeled. The second plurality of identifiers and the second plurality of descriptions may be similar to the first plurality of identifiers and the first plurality of descriptions, respectively.
The second plurality of vulnerabilities may include vulnerabilities that correspond to the first vulnerability type and vulnerabilities that correspond to the second vulnerability type. For the sake of brevity, the terms “second dataset” and “unlabeled dataset” are used interchangeably throughout the disclosure.
In a non-limiting example, the unlabeled dataset may be a public database such as, but not limited to, the national vulnerability database (NVD), the open-source vulnerability database (OSVDB), or the like. Similarly, in a non-limiting example, the first and second pluralities of identifiers may correspond to common vulnerabilities and exposures (CVE) identifiers known to those of ordinary skill in the art. Similarly, in a non-limiting example, each of the first and second pluralities of descriptions may correspond to a textual description of a corresponding CVE. The second plurality of vulnerabilities, indicated by the unlabeled dataset, may be an aggregation of vulnerabilities known to the public. The unlabeled dataset may be accessible to the application server 102, the plurality of user devices 106, or the like.
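In a non-limiting example, (identifier, description) pairs may be extracted from records shaped like the public NVD JSON feed as follows; the field names below follow the NVD 2.0 feed layout and are an assumption of this sketch:

```python
import json

def extract_cves(nvd_json_text):
    """Extract (CVE identifier, English description) pairs from an NVD-style
    JSON document. The field names follow the public NVD 2.0 feed layout
    and are an assumption of this sketch."""
    document = json.loads(nvd_json_text)
    pairs = []
    for item in document.get("vulnerabilities", []):
        cve = item.get("cve", {})
        cve_id = cve.get("id")
        # Keep the first English-language description, if any.
        description = next((d["value"] for d in cve.get("descriptions", [])
                            if d.get("lang") == "en"), None)
        if cve_id and description:
            pairs.append((cve_id, description))
    return pairs
```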
Examples of the database operations may include, but are not limited to, storing the unlabeled dataset, processing the unlabeled dataset, updating the unlabeled dataset based on reception of information associated with the second plurality of vulnerabilities or new vulnerabilities (e.g., new CVE identifiers), or the like. Examples of the database server 104 may include, but are not limited to, a personal computer, a laptop, a mini-computer, a mainframe computer, a cloud-based server, a network of computer systems, or the like.
The plurality of user devices 106 may include suitable logic, circuitry, interfaces, and/or code, executable by the circuitry, that may be configured to perform various operations. The plurality of user devices 106 may be associated with a plurality of users. Examples of the plurality of users may include, but are not limited to, employees of the entity that manages the application server 102, customers of the entity, or the like. Examples of the plurality of user devices 106 may include, but are not limited to, personal computers, smartphones, tablets, phablets, smart watches, automated teller machines (ATMs), point-of-sale (PoS) devices, or the like.
In operation, the labeled dataset may be received by the application server 102 from the plurality of user devices 106. For example, the labeled dataset may be received from the first user device 106a of a first user who is an employee of the entity (e.g., the financial institution) and is responsible for digital compliance at the entity. In another embodiment, the application server 102 may generate the labeled dataset based on historical data. The historical data may include vulnerabilities that have previously affected the application server 102 or the plurality of user devices 106, and were deemed to have been of the first vulnerability type (e.g., severe). The application server 102 may store, in a memory thereof, the received labeled dataset. As described in the foregoing, the labeled dataset is indicative of the first plurality of vulnerabilities and includes the information pertaining to the first plurality of vulnerabilities that are correctly labeled with the first label assigned to the first vulnerability type.
The application server 102 may retrieve the unlabeled dataset from the database server 104. The application server 102 may then select or sample a plurality of vulnerabilities at random, from the unlabeled dataset. For the sake of brevity, the sampled plurality of vulnerabilities is interchangeably referred to as “third plurality of vulnerabilities”. The application server 102 may further retrieve, from the database server 104, the information pertaining to the third plurality of vulnerabilities that are selected at random from the unlabeled dataset. For example, the application server 102 may retrieve, from the unlabeled dataset stored in the database server 104, identifiers and/or descriptions of the third plurality of vulnerabilities. For the sake of brevity, the identifiers of the third plurality of vulnerabilities are referred to as “third plurality of identifiers” and the descriptions of the third plurality of vulnerabilities are referred to as “third plurality of descriptions”.
The application server 102 may create (e.g., generate) a first training dataset for training the first machine learning model. The first training dataset is indicative of (e.g., includes) the first plurality of vulnerabilities and the third plurality of vulnerabilities. The first training dataset may include the first plurality of identifiers of the first plurality of vulnerabilities, the first plurality of descriptions of the first plurality of vulnerabilities, the first label of the first plurality of vulnerabilities, and the information pertaining to the third plurality of vulnerabilities. The information pertaining to the third plurality of vulnerabilities may include, but is not limited to, the third plurality of identifiers and the third plurality of descriptions. Each of the third plurality of vulnerabilities may be one of the first vulnerability type (e.g., severe) or the second vulnerability type (e.g., non-severe) that is different from the first vulnerability type. A second label, different from the first label, may be assigned to the second vulnerability type. The application server 102 may label the third plurality of vulnerabilities, included in the first training dataset, with the second label irrespective of whether each of the third plurality of vulnerabilities corresponds to the first vulnerability type or the second vulnerability type. In other words, the third plurality of vulnerabilities may be randomly labeled with the second label assigned to the second vulnerability type irrespective of whether a corresponding vulnerability is of the first vulnerability type or the second vulnerability type. Therefore, the first training dataset includes the first label for the first plurality of vulnerabilities and the second label for the third plurality of vulnerabilities.
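In a non-limiting example, the creation of the first training dataset may be sketched as follows; the record layout, the label values, and the function name are assumptions of this sketch:

```python
import random

def create_first_training_dataset(labeled_vulns, unlabeled_vulns, sample_size,
                                  first_label=1, second_label=0, seed=0):
    """Build the first training dataset described above.

    `labeled_vulns` and `unlabeled_vulns` are (identifier, description)
    pairs. The randomly sampled unlabeled vulnerabilities all receive the
    second label, irrespective of their true type. The record layout and
    label values are assumptions of this sketch.
    """
    rng = random.Random(seed)
    rows = [{"id": vid, "description": desc, "label": first_label}
            for vid, desc in labeled_vulns]
    for vid, desc in rng.sample(unlabeled_vulns, sample_size):
        rows.append({"id": vid, "description": desc, "label": second_label})
    return rows
```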
In a non-limiting example, it is assumed that the first label is “1” and the second label is “0”. Therefore, the first plurality of vulnerabilities in the first training dataset are labeled with the first label “1” and the third plurality of vulnerabilities in the first training dataset are labeled with the second label “0”. The application server 102 may train the first machine learning model in a first stage of training, using the first training dataset. The first machine learning model is trained to classify vulnerabilities as the first vulnerability type or the second vulnerability type. For the classification of a vulnerability, the first machine learning model is further trained to generate a prediction score, a confidence score, a probability score, or a vulnerability score. In other words, there is a confidence score, a vulnerability score, or a prediction score associated with each classification performed by the trained first machine learning model. For the sake of brevity, the terms “prediction score”, “confidence score”, “probability score”, and “vulnerability score” are used interchangeably throughout the disclosure. A vulnerability score for a vulnerability is indicative of a probability (e.g., likelihood) of the vulnerability being the first vulnerability type or the second vulnerability type. In a non-limiting example, a higher vulnerability score for the classification corresponds to a higher likelihood of the vulnerability being the first vulnerability type. Similarly, a lower vulnerability score for the vulnerability corresponds to a higher likelihood of the vulnerability being the second vulnerability type. In other words, a lower vulnerability score corresponds to a lower likelihood of the vulnerability being the first vulnerability type.
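For illustration only, the dataset-creation and random-labeling steps described above may be sketched as follows (a minimal Python sketch; the dictionary-based record structure and the function name are hypothetical and not part of the disclosure):

```python
import random

def build_first_training_dataset(labeled, unlabeled, sample_size, seed=42):
    """Combine the known first-type vulnerabilities (label 1) with a
    random sample from the unlabeled dataset that is force-labeled 0,
    irrespective of each sampled vulnerability's true (unknown) type."""
    rng = random.Random(seed)
    positives = [{**v, "label": 1} for v in labeled]
    # Random labeling with the second label introduces some label noise:
    # a sampled vulnerability may in fact be of the first type.
    negatives = [{**v, "label": 0} for v in rng.sample(unlabeled, sample_size)]
    return positives + negatives
```

Note that the force-labeled sample may contain mislabeled items; the mining and re-training stages described below are what correct for this noise.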
For example, the trained first machine learning model may classify a first vulnerability as the first vulnerability type with a first vulnerability score of “0.75”. Similarly, the trained first machine learning model may classify a second vulnerability as the first vulnerability type with a second vulnerability score of “0.95”. The first vulnerability score (“0.75”) and the second vulnerability score (“0.95”) indicate that a confidence of the trained first machine learning model in classifying the second vulnerability as the first vulnerability type is greater than a confidence of the trained first machine learning model in classifying the first vulnerability as the first vulnerability type. In other words, the first and second vulnerability scores indicate that a likelihood of the second vulnerability being (e.g., belonging to) the first vulnerability type is greater than a likelihood of the first vulnerability being (e.g., belonging to) the first vulnerability type.
The application server 102 mines, from the unlabeled dataset, using the trained first machine learning model, a plurality of vulnerabilities that are to be pseudo-labeled with the second label. The mined plurality of vulnerabilities are interchangeably referred to as “fourth plurality of vulnerabilities”. The fourth plurality of vulnerabilities may refer to vulnerabilities, of the second plurality of vulnerabilities indicated by (e.g., included in) the unlabeled dataset, that are classified by the trained first machine learning model as the second vulnerability type with a vulnerability score less than a threshold vulnerability score (e.g., “0.40”). In other words, the vulnerability score associated with the classification of each of the fourth plurality of vulnerabilities is less than the threshold vulnerability score. Vulnerabilities, of the second plurality of vulnerabilities, that are classified by the trained first machine learning model with a vulnerability score greater than or equal to the threshold vulnerability score are not to be pseudo-labeled with the second label.
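The mining step may be illustrated as follows, assuming the trained model's outputs are available as a mapping from vulnerability identifiers to vulnerability scores (a hypothetical representation):

```python
def mine_pseudo_negatives(scored_vulnerabilities, threshold=0.40):
    """Mine, from the unlabeled dataset, the vulnerabilities to be
    pseudo-labeled with the second label: only those whose vulnerability
    score falls below the threshold are kept; vulnerabilities at or
    above the threshold are left unlabeled."""
    return [vid for vid, score in scored_vulnerabilities.items()
            if score < threshold]
```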
Consequently, the application server 102 creates a second training dataset that includes the information pertaining to the first plurality of vulnerabilities and information pertaining to the fourth plurality of vulnerabilities. The information pertaining to the fourth plurality of vulnerabilities may include, but is not limited to, an identifier of each of the fourth plurality of vulnerabilities, a description of each of the fourth plurality of vulnerabilities, or the like. The second training dataset includes the first label for the first plurality of vulnerabilities and the second label (e.g., pseudo-label) for the fourth plurality of vulnerabilities. The pseudo-label (e.g., the second label) for the fourth plurality of vulnerabilities is treated like a “true label” for the fourth plurality of vulnerabilities.
The application server 102 re-trains the trained first machine learning model for vulnerability classification (e.g., classification of vulnerabilities), in a second stage of training, using the second training dataset. An accuracy of the re-trained first machine learning model for vulnerability classification (as the first vulnerability type or the second vulnerability type) is greater than an accuracy of the trained first machine learning model (trained in the first stage of training) in classification of vulnerabilities. Operations of mining, creation of new training datasets, and re-training of the trained first machine learning model may be repeated any number (e.g., “n”) of times to improve an accuracy of the first machine learning model for vulnerability classification.
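The overall two-stage, repeatable procedure may be sketched as below, where `train` is a hypothetical callable standing in for the first machine learning model's training routine (it fits a model on (item, label) pairs and returns a scoring function):

```python
import random

def self_training_loop(positives, unlabeled, train, n_rounds=2,
                       sample_size=2, threshold=0.40, seed=0):
    """Sketch of the two-stage procedure: (1) train on positives plus a
    random sample force-labeled 0; (2) repeatedly mine confident
    pseudo-negatives with the current model and re-train on them."""
    rng = random.Random(seed)
    # First stage: random sample labeled with the second label.
    sampled = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
    score = train([(v, 1) for v in positives] + [(v, 0) for v in sampled])
    # Second stage, repeated n times: mine pseudo-negatives, re-train.
    for _ in range(n_rounds):
        mined = [v for v in unlabeled if score(v) < threshold]
        score = train([(v, 1) for v in positives] + [(v, 0) for v in mined])
    return score
```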
FIG. 2 represents a table 200 that illustrates the labeled dataset, in accordance with an exemplary embodiment of the present disclosure.
The table 200 includes first through fourth columns 202a-202d and first through third rows 204a-204c. The first column 202a is indicative of the first plurality of vulnerabilities. The second column 202b is indicative of the first plurality of descriptions. The third column 202c is indicative of the first plurality of identifiers. The fourth column 202d is indicative of a label (e.g., the first label) of (e.g., associated with or assigned to) each of the first plurality of vulnerabilities.
The first row 204a includes a first description (“Description_1”) of a first vulnerability (“Vulnerability_1”) of the first plurality of vulnerabilities, a first identifier (“ID_1”) of the first vulnerability, and a label (e.g., the first label; “1”) of the first vulnerability. The first label (“1”) indicates that the first vulnerability corresponds to the first vulnerability type.
The second row 204b includes a second description (“Description_2”) of a second vulnerability (“Vulnerability_2”) of the first plurality of vulnerabilities, a second identifier (“ID_2”) of the second vulnerability, and the first label (“1”) that is assigned to the second vulnerability.
The third row 204c includes a third description (“Description_3”) of a third vulnerability (“Vulnerability_3”) of the first plurality of vulnerabilities, a third identifier (“ID_3”) of the third vulnerability, and the first label (“1”) that is assigned to the third vulnerability.
For the sake of brevity, the first plurality of vulnerabilities is shown to include only three vulnerabilities (e.g., the first through third vulnerabilities). However, in an actual implementation, the first plurality of vulnerabilities may include additional vulnerabilities (e.g., tens of vulnerabilities, hundreds of vulnerabilities, thousands of vulnerabilities, or the like) without deviating from the scope of the disclosure. Further, the first label and the second label are shown to be numeric values (e.g., “1” and “0”, respectively). However, the first and second labels may be represented by any alphabetic, numeric, or alphanumeric value without deviating from the scope of the disclosure.
In the current embodiment, the labeled dataset (e.g., the table 200) is shown to include the first plurality of identifiers and the first plurality of descriptions of the first plurality of vulnerabilities. However, in another embodiment, the labeled dataset may include only the first plurality of descriptions and the first label for the first plurality of vulnerabilities. In other words, the labeled dataset may not include the first plurality of identifiers. In another embodiment, the labeled dataset may include only the first plurality of identifiers and the first label for the first plurality of vulnerabilities. In such a scenario, the labeled dataset may not include the first plurality of descriptions.
Further, it is assumed that the received labeled dataset includes the first label for the first plurality of vulnerabilities. However, in another embodiment, the received labeled dataset may not include any label for the first plurality of vulnerabilities. In such a scenario, the received labeled dataset may be updated to include the first label for the first plurality of vulnerabilities.
FIG. 3 represents a table 300 that illustrates the unlabeled dataset, in accordance with an exemplary embodiment of the present disclosure. The table 300 includes the information pertaining to the second plurality of vulnerabilities.
The table 300 includes first through third columns 302a-302c and first through nth rows 304a-304n. The first column 302a is indicative of the second plurality of vulnerabilities. The second column 302b includes a description of each of the second plurality of vulnerabilities. In other words, the second column 302b is indicative of the second plurality of descriptions of the second plurality of vulnerabilities. The third column 302c includes an identifier of each of the second plurality of vulnerabilities. In other words, the third column 302c is indicative of the second plurality of identifiers of the second plurality of vulnerabilities. As shown in FIG. 3, the table 300 (e.g., the unlabeled dataset) excludes (e.g., does not include) any label for the second plurality of vulnerabilities.
In a non-limiting example, it is assumed that the second plurality of vulnerabilities also includes the first plurality of vulnerabilities. In other words, the first plurality of vulnerabilities is a subset of the second plurality of vulnerabilities. Therefore, the second plurality of descriptions and the second plurality of identifiers include the first plurality of descriptions and the first plurality of identifiers, respectively.
The first row 304a includes the first description (“Description_1”) and the first identifier (“ID_1”) of the first vulnerability (“Vulnerability_1”). The second row 304b includes the second description (“Description_2”) and the second identifier (“ID_2”) of the second vulnerability (“Vulnerability_2”). The third row 304c includes the third description (“Description_3”) and the third identifier (“ID_3”) of the third vulnerability (“Vulnerability_3”).
The fourth row 304d includes a fourth description (“Description_4”) and a fourth identifier (“ID_4”) of a fourth vulnerability (“Vulnerability_4”) of the second plurality of vulnerabilities.
The fifth row 304e includes a fifth description (“Description_5”) and a fifth identifier (“ID_5”) of a fifth vulnerability (“Vulnerability_5”) of the second plurality of vulnerabilities.
The sixth row 304f includes a sixth description (“Description_6”) and a sixth identifier (“ID_6”) of a sixth vulnerability (“Vulnerability_6”) of the second plurality of vulnerabilities.
Similarly, each of the seventh through nth rows 304g-304n may include a description and an identifier of a corresponding vulnerability of the second plurality of vulnerabilities (e.g., the seventh through nth vulnerabilities). It will be apparent to those of skill in the art that the table 300, shown in FIG. 3, is merely exemplary. In an actual implementation, the table 300 (e.g., the unlabeled dataset) may include any number of rows (e.g., hundreds of rows, thousands of rows, or the like) that include information pertaining to any number of vulnerabilities (e.g., hundreds of vulnerabilities, thousands of vulnerabilities, or the like) without deviating from the scope of the disclosure. In other words, the second plurality of vulnerabilities may include any number of vulnerabilities (e.g., hundreds of vulnerabilities, thousands of vulnerabilities, or the like).
In the current embodiment, it is assumed that the second plurality of vulnerabilities includes the first plurality of vulnerabilities. However, in another embodiment, the second plurality of vulnerabilities may be entirely distinct from the first plurality of vulnerabilities (e.g., the unlabeled dataset may not include any of the first plurality of vulnerabilities).
The application server 102 may select (e.g., sample) the third plurality of vulnerabilities from the unlabeled dataset. The third plurality of vulnerabilities may be selected at random (e.g., sampled at random) from the second plurality of vulnerabilities. In a non-limiting example, it is assumed that the third plurality of vulnerabilities include the fourth through sixth vulnerabilities of the second plurality of vulnerabilities. The selected third plurality of vulnerabilities may exclude the first plurality of vulnerabilities (e.g., the first through third vulnerabilities).
Prior to the selection of the third plurality of vulnerabilities, the application server 102 may determine (e.g., identify) vulnerabilities (e.g., the first through third vulnerabilities) included in the first plurality of vulnerabilities. The application server 102 may further determine (e.g., identify) the information pertaining to the first plurality of vulnerabilities. Consequently, the application server 102 may identify, from the unlabeled dataset (e.g., the table 300), one or more vulnerabilities that correspond to the first plurality of vulnerabilities. In other words, the application server 102 may parse the unlabeled dataset (e.g., the table 300) to determine (e.g., identify) one or more vulnerabilities that have a description and/or an identifier that correspond to at least one of the first plurality of identifiers and/or the first plurality of descriptions. In the current embodiment, since the unlabeled dataset is indicative of the first through third vulnerabilities, the application server 102 may determine that the first through third vulnerabilities, indicated by the unlabeled dataset, correspond to the first plurality of vulnerabilities. In other words, the application server 102 may determine that a description and/or an identifier of each of the first through third vulnerabilities, included in the second plurality of vulnerabilities indicated by the unlabeled dataset, matches (e.g., “corresponds to”/“is similar to”/“is same as”) a description of the first plurality of descriptions or an identifier of the first plurality of identifiers. The application server 102 may exclude the determined (e.g., identified) one or more vulnerabilities from the selection of the third plurality of vulnerabilities. Therefore, in the current embodiment, the application server 102 may exclude the first through third vulnerabilities in the unlabeled dataset from selection as the third plurality of vulnerabilities. 
The application server 102 may select the third plurality of vulnerabilities from a remaining plurality of vulnerabilities (e.g., the fourth through nth vulnerabilities) of the second plurality of vulnerabilities in the unlabeled dataset.
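The exclusion-then-sampling logic described above may be sketched as follows (identifier-based matching only; description matching is omitted for brevity, and the dictionary-based record structure and function name are hypothetical):

```python
import random

def sample_third_plurality(unlabeled, labeled_identifiers, k, seed=1):
    """Select the third plurality at random from the unlabeled dataset,
    excluding any vulnerability whose identifier matches one already
    present in the labeled dataset."""
    candidates = [v for v in unlabeled if v["id"] not in labeled_identifiers]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))
```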
For the sake of brevity, it is assumed that the third plurality of vulnerabilities include only three vulnerabilities (e.g., the fourth through sixth vulnerabilities of the second plurality of vulnerabilities). However, in an actual implementation, the selected third plurality of vulnerabilities may include any number of vulnerabilities (e.g., tens of vulnerabilities, hundreds of vulnerabilities, or the like) without deviating from the scope of the disclosure.
Based on the selection (e.g., the random sampling or selection) of the third plurality of vulnerabilities, the application server 102 may retrieve, from the unlabeled dataset (e.g., the table 300), a description and/or an identifier of each of the selected third plurality of vulnerabilities. In the current embodiment, the application server 102 may retrieve the fourth through sixth descriptions of the fourth through sixth vulnerabilities that are included in the third plurality of vulnerabilities. Further, the application server 102 may retrieve the fourth through sixth identifiers of the fourth through sixth vulnerabilities. In other words, the application server 102 may retrieve the information pertaining to the fourth through sixth vulnerabilities.
The application server 102 may create the first training dataset based on the information pertaining to the first plurality of vulnerabilities and the information pertaining to the third plurality of vulnerabilities. Therefore, the created first training dataset is a combination of the information pertaining to the first plurality of vulnerabilities and the information pertaining to the third plurality of vulnerabilities. Each of the third plurality of vulnerabilities (e.g., the fourth through sixth vulnerabilities) may correspond to the first vulnerability type or the second vulnerability type. However, the application server 102 may label each of the third plurality of vulnerabilities, indicated by the first training dataset, with the second label, irrespective of whether a corresponding vulnerability corresponds to (e.g., belongs to) the first vulnerability type or the second vulnerability type. In other words, each of the fourth through sixth vulnerabilities is labeled with the second label, which is assigned to the second vulnerability type, irrespective of whether a corresponding vulnerability corresponds to the first vulnerability type or the second vulnerability type. It will be apparent to those of skill in the art that one or more vulnerabilities of the third plurality of vulnerabilities may be incorrectly labeled with the second label, thereby, introducing some label noise in the first training dataset.
FIG. 4 represents a table 400 that illustrates the first training dataset, in accordance with an exemplary embodiment of the present disclosure. The table 400 includes the information pertaining to the first plurality of vulnerabilities and the information pertaining to the third plurality of vulnerabilities.
The table 400 includes first through fourth columns 402a-402d and first through sixth rows 404a-404f. The first column 402a is indicative of the first plurality of vulnerabilities and the third plurality of vulnerabilities. In other words, the first column 402a is indicative of the first through sixth vulnerabilities. The second column 402b includes a description of each of the first plurality of vulnerabilities and the third plurality of vulnerabilities. Therefore, the second column 402b includes the first plurality of descriptions of the first plurality of vulnerabilities and a third plurality of descriptions of the selected third plurality of vulnerabilities. The third column 402c includes an identifier of each of the first plurality of vulnerabilities and the third plurality of vulnerabilities. In other words, the third column 402c includes the first plurality of identifiers and a third plurality of identifiers of the selected third plurality of vulnerabilities. The fourth column 402d includes a label of each of the first plurality of vulnerabilities and the third plurality of vulnerabilities (e.g., the first label for the first plurality of vulnerabilities and the second label for the third plurality of vulnerabilities).
The first row 404a includes the first description (e.g., “Description_1”) and the first identifier (e.g., “ID_1”) of the first vulnerability (e.g., “Vulnerability_1”). The first row 404a further includes the first label (e.g., “1”) of the first vulnerability.
The second row 404b includes the second description (e.g., “Description_2”) and the second identifier (e.g., “ID_2”) of the second vulnerability (e.g., “Vulnerability_2”). The second row 404b further includes the first label (e.g., “1”) of the second vulnerability.
The third row 404c includes the third description (e.g., “Description_3”) and the third identifier (e.g., “ID_3”) of the third vulnerability (e.g., “Vulnerability_3”). The third row 404c further includes the first label (e.g., “1”) of the third vulnerability.
The fourth row 404d includes the fourth description (e.g., “Description_4”) of the fourth vulnerability (e.g., “Vulnerability_4”), the fourth identifier (e.g., “ID_4”) of the fourth vulnerability. The fourth row 404d further includes the second label (e.g., “0”) of the fourth vulnerability.
The fifth row 404e includes the fifth description (e.g., “Description_5”) of the fifth vulnerability (e.g., “Vulnerability_5”) and the fifth identifier (e.g., “ID_5”) of the fifth vulnerability. The fifth row 404e further includes the second label (e.g., “0”) of the fifth vulnerability.
The sixth row 404f includes the sixth description (e.g., “Description_6”) of the sixth vulnerability (e.g., “Vulnerability_6”) and the sixth identifier (e.g., “ID_6”) of the sixth vulnerability. The sixth row 404f further includes the second label (e.g., “0”) of the sixth vulnerability.
It is assumed that each of the first plurality of vulnerabilities is of the first vulnerability type and is correctly labeled with the first label. As described in the foregoing, the third plurality of vulnerabilities are labeled with the second label, irrespective of whether each of the third plurality of vulnerabilities is of the first vulnerability type or the second vulnerability type.
FIG. 5 is a block diagram 500 that illustrates training of the first machine learning model, in accordance with an exemplary embodiment of the present disclosure. The block diagram 500 includes the labeled dataset and the unlabeled dataset. The labeled dataset and the unlabeled dataset are hereinafter designated and referred to as “the labeled dataset 502” and the “unlabeled dataset 504”, respectively. The block diagram 500 further includes the first training dataset and the first machine learning model. The labeled dataset 502 is indicative of the first plurality of vulnerabilities, and the unlabeled dataset 504 is indicative of the second plurality of vulnerabilities. For the sake of brevity, the first plurality of vulnerabilities and the second plurality of vulnerabilities are interchangeably referred to as “the first plurality of vulnerabilities 502” and “the second plurality of vulnerabilities 504”, respectively. The third plurality of vulnerabilities, selected from the second plurality of vulnerabilities 504, are hereinafter designated and referred to as “the third plurality of vulnerabilities 506”. The first training dataset is hereinafter designated and referred to as “the first training dataset 508”.
The first training dataset 508 may correspond to the table 400 shown in FIG. 4. The first training dataset 508 thus includes (i) the first plurality of descriptions and the first plurality of identifiers of the first plurality of vulnerabilities 502, and (ii) the third plurality of descriptions and the third plurality of identifiers of the third plurality of vulnerabilities 506. The first training dataset 508 further includes a label for each of the first plurality of vulnerabilities 502 and the third plurality of vulnerabilities 506. For example, the first training dataset 508 includes the first label for the first plurality of vulnerabilities 502 and the second label for the third plurality of vulnerabilities 506.
In the first stage of training, the first machine learning model (hereinafter, designated and referred to as “the first machine learning model 510”) is trained by the application server 102 using the first training dataset 508. In the first stage of training, the first machine learning model 510 is trained to classify vulnerabilities as one of the first vulnerability type (e.g., severe) or the second vulnerability type (e.g., non-severe).
Prior to training the first machine learning model 510, the application server 102 may pre-process the first training dataset 508. Pre-processing of the first training dataset 508 may include, but is not limited to, lemmatization of the first plurality of descriptions and the third plurality of descriptions, stemming of the first plurality of descriptions and the third plurality of descriptions, removal of stop-words from the first plurality of descriptions and the third plurality of descriptions, or the like. Techniques that may be used for the pre-processing of the first training dataset 508 may be well known to those of skill in the art.
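A minimal pre-processing sketch is shown below; it performs tokenization and stop-word removal only, with an illustrative stop-word subset (lemmatization and stemming, as noted above, would typically be delegated to a library such as NLTK or spaCy):

```python
import re

# Illustrative subset only; real stop-word lists are much larger.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and", "or"}

def preprocess(description):
    """Lowercase a vulnerability description, strip punctuation, and
    drop stop-words, returning the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```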
The pre-processed first training dataset 508 may be provided as input to a second machine learning model (not shown) that may be trained to generate a first plurality of embeddings (e.g., word embeddings) that correspond to the first plurality of descriptions and the third plurality of descriptions. The second machine learning model may be trained, using natural language dictionaries and a network security corpus. The network security corpus may include articles/research papers associated with security vulnerabilities, books associated with security vulnerabilities, or the like. The network security corpus enables the second machine learning model to learn word embeddings of security-related or vulnerability-related terms/words that do not usually occur in a regular language (e.g., natural language) dictionary. Examples of the second machine learning model include, but are not limited to, Word2Vec, Global Vectors for Word Representation (GloVe), Doc2Vec, or the like.
The trained second machine learning model may be configured to generate the first plurality of embeddings based on the first plurality of descriptions and the third plurality of descriptions. For each word included in each of the first plurality of descriptions and the third plurality of descriptions, the trained second machine learning model may generate an embedding (e.g., a vector) in an n-dimensional embedding space. Each embedding is representative of a corresponding word included in the first plurality of descriptions and the third plurality of descriptions. In the n-dimensional embedding space, embeddings of words that are similar in meaning or context are close together. Therefore, a distance between the embeddings of any two words is a function of a degree of correlation between the two words. If the degree of correlation between the two words is greater than or equal to a similarity threshold, the distance between the embeddings (e.g., first and second embeddings) of the two words may be less than a distance threshold. The greater the degree of correlation between two words, the smaller the distance between their embeddings. Embeddings and methods of generation of embeddings (e.g., the first plurality of embeddings) are well known to those of skill in the art.
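The distance/correlation relationship described above may be illustrated with cosine similarity, one common measure of how close two embeddings are in the embedding space (the threshold value is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Degree of correlation between two word embeddings; values near
    1.0 indicate words that are close in the embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def are_similar(u, v, similarity_threshold=0.8):
    """Two words are deemed correlated when their embeddings' similarity
    meets or exceeds the similarity threshold."""
    return cosine_similarity(u, v) >= similarity_threshold
```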
During the first stage of training, the application server 102 may train the first machine learning model 510, using a set of features associated with the first training dataset 508. The set of features may include the first plurality of embeddings, the first and third pluralities of identifiers, the first label for the first plurality of vulnerabilities 502, and the second label for the third plurality of vulnerabilities 506.
Examples of the first machine learning model 510 may include, but are not limited to, a bidirectional long short-term memory (LSTM), a unidirectional LSTM, or the like. For the sake of brevity, it is assumed that the first machine learning model 510 is a bidirectional LSTM. However, it will be apparent to those of skill in the art that the first machine learning model 510 may correspond to any type of machine learning model (e.g., a neural network) without deviating from a scope of the disclosure.
The trained first machine learning model 510 may be configured to classify vulnerabilities as the first vulnerability type or the second vulnerability type. During an execution of the trained first machine learning model 510, the trained first machine learning model 510 may classify a vulnerability as one of the first vulnerability type or the second vulnerability type based on a set of features that correspond to the vulnerability. The set of features that correspond to the vulnerability may include a set of embeddings of a description of the vulnerability (e.g., generated by the second machine learning model) and/or an identifier of the vulnerability. The set of features may be provided as input to the trained first machine learning model 510. Based on the provided input, the trained first machine learning model 510 may be configured to generate (e.g., determine) a vulnerability score for the vulnerability. In one embodiment, the vulnerability score may correspond to a probability or likelihood (e.g., determined by the trained first machine learning model 510) of the vulnerability being the second vulnerability type or the first vulnerability type. The vulnerability score of the vulnerability may indicate whether the vulnerability corresponds to the first vulnerability type or the second vulnerability type.
In one embodiment, the vulnerability score may be expressed in percentage terms (e.g., “50%”, “60%”, “95%”, or the like). For example, a vulnerability score of “70%” for the vulnerability indicates a “70%” likelihood of the vulnerability being the second vulnerability type and a “30%” likelihood of the vulnerability being the first vulnerability type. However, in another embodiment, the vulnerability score may be expressed as a numerical value between a set of vulnerability score limits (e.g., between a first vulnerability score limit “0” and a second vulnerability score limit “1”). In one embodiment, if an absolute value of a difference between a vulnerability score (e.g., “0.95”) and the second vulnerability score limit (“1”) is less than an absolute value of a difference between the vulnerability score and the first vulnerability score limit (“0”), a corresponding vulnerability may be more likely to correspond to the first vulnerability type than the second vulnerability type. Similarly, if an absolute value of a difference between a vulnerability score (e.g., “0.25”) and the first vulnerability score limit (“0”) is less than an absolute value of a difference between the vulnerability score and the second vulnerability score limit (“1”), a corresponding vulnerability may be more likely to correspond to the second vulnerability type than the first vulnerability type. For the sake of brevity, it is assumed that the vulnerability scores are expressed as a numerical value between the first vulnerability score limit and the second vulnerability score limit.
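The nearest-limit comparison described above may be sketched as follows (the limit values and function name are illustrative):

```python
def likely_type(score, lower_limit=0.0, upper_limit=1.0):
    """Return the vulnerability type a score leans toward: scores nearer
    the upper limit suggest the first (e.g., severe) type; scores nearer
    the lower limit suggest the second (e.g., non-severe) type."""
    if abs(score - upper_limit) < abs(score - lower_limit):
        return "first"
    return "second"
```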
The trained first machine learning model 510 may classify a vulnerability as one of the first vulnerability type or the second vulnerability type based on the vulnerability score for the vulnerability. The trained first machine learning model 510 may classify the vulnerability as the second vulnerability type if the vulnerability score for the vulnerability is less than or equal to a first threshold vulnerability score (e.g., “0.4”, “0.3”, or the like). For example, if the first threshold vulnerability score is “0.3” and the vulnerability score for the vulnerability is “0.25”, the trained first machine learning model 510 may classify the vulnerability as the second vulnerability type. However, it will be apparent to those of skill in the art that the trained first machine learning model 510 may not necessarily classify the vulnerability as the first vulnerability type if the vulnerability score for the vulnerability is greater than the first threshold vulnerability score (e.g., “0.3”).
During the execution of the trained first machine learning model 510, the application server 102 mines, from the unlabeled dataset 504, a plurality of vulnerabilities that are to be pseudo-labeled with the second label. To mine the plurality of vulnerabilities, the application server 102 may provide a set of features, which correspond to the second plurality of vulnerabilities 504, as input to the trained first machine learning model 510. The set of features may include embeddings of the second plurality of descriptions of the second plurality of vulnerabilities 504 included in the unlabeled dataset 504. The set of features may further include (e.g., may be indicative of) the second plurality of identifiers of the second plurality of vulnerabilities 504. Based on the inputted set of features, the trained first machine learning model 510 may determine (e.g., generate) a vulnerability score for each of the second plurality of vulnerabilities 504 (e.g., the first through nth vulnerabilities). In other words, the trained first machine learning model 510 may determine the vulnerability score for each of the second plurality of vulnerabilities 504, based on the information pertaining to the second plurality of vulnerabilities 504 (e.g., information included in the unlabeled dataset 504).
As described in the foregoing, the vulnerability score for each of the second plurality of vulnerabilities 504 is indicative of a probability (e.g., likelihood) of a corresponding vulnerability being (e.g., belonging to) the first vulnerability type or the second vulnerability type. The trained first machine learning model 510 may classify a vulnerability of the second plurality of vulnerabilities 504 as the second type of vulnerability, if the determined vulnerability score for the vulnerability is less than or equal to the first threshold vulnerability score.
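The score-thresholding rule described above may be sketched as follows. This is a minimal sketch for illustration only; the function name and the default threshold value are assumptions, not part of the disclosure.

```python
# Illustrative sketch: decide whether a vulnerability score indicates
# the second vulnerability type. A score at or below the first threshold
# vulnerability score indicates the second type; a score above it does
# not automatically indicate the first type.
def indicates_second_type(vulnerability_score: float,
                          first_threshold: float = 0.3) -> bool:
    return vulnerability_score <= first_threshold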
For example, if the determined vulnerability score for the fourth vulnerability, of the second plurality of vulnerabilities 504, is less than or equal to the first threshold vulnerability score, the trained first machine learning model 510 may classify the fourth vulnerability as the second vulnerability type. However, if the determined vulnerability score for the first vulnerability, of the second plurality of vulnerabilities 504, is greater than the first threshold vulnerability score, the trained first machine learning model 510 may not necessarily classify the first vulnerability as the first vulnerability type. In a non-limiting example, the determined vulnerability score for each of a fourth plurality of vulnerabilities 512, of the second plurality of vulnerabilities 504, may be less than or equal to the first threshold vulnerability score. The fourth plurality of vulnerabilities 512 are to be pseudo-labeled with the second label. In other words, the fourth plurality of vulnerabilities 512 are mined by the application server 102 from the unlabeled dataset 504, using the trained first machine learning model 510. In the current embodiment, it is assumed that the fourth plurality of vulnerabilities 512 includes each vulnerability, of the second plurality of vulnerabilities 504, classified by the trained first machine learning model 510 as the second vulnerability type.
However, in another embodiment, one or more vulnerabilities classified by the trained first machine learning model 510 as the second vulnerability type may be excluded from the fourth plurality of vulnerabilities 512. In other words, not all vulnerabilities classified as the second vulnerability type by the trained first machine learning model 510 may be included in the fourth plurality of vulnerabilities 512. In such a scenario, vulnerabilities, of the second plurality of vulnerabilities 504, associated with a vulnerability score less than a first threshold pseudo-labeling score (e.g., “0.25”) may be included in the fourth plurality of vulnerabilities 512. The first threshold pseudo-labeling score (e.g., “0.25”) may be less (e.g., lower) than or equal to the first threshold vulnerability score. Therefore, only vulnerabilities with the highest likelihood (e.g., vulnerability score less than the first threshold pseudo-labeling score) of being the second vulnerability type may be included in the fourth plurality of vulnerabilities 512. In other words, only vulnerabilities with the highest likelihood (e.g., vulnerability score less than the first threshold pseudo-labeling score) of being the second vulnerability type may be pseudo-labeled (e.g., selected for pseudo-labeling) with the second label assigned to the second vulnerability type.
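The mining step under the stricter pseudo-labeling threshold may be sketched as follows. The data layout (a mapping from vulnerability identifier to vulnerability score), the function name, and the threshold value are assumptions for illustration.

```python
# Illustrative sketch: mine, from scored unlabeled vulnerabilities,
# those to be pseudo-labeled with the second label. Only vulnerabilities
# with scores below the first threshold pseudo-labeling score qualify.
def mine_for_pseudo_labeling(scores: dict,
                             pseudo_labeling_threshold: float = 0.25) -> list:
    return [vuln_id for vuln_id, score in scores.items()
            if score < pseudo_labeling_threshold]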
One or more vulnerabilities, in the unlabeled dataset 504, may be excluded from being pseudo-labeled with the second label based on the vulnerability score associated with each of the one or more vulnerabilities. For example, the application server 102 may determine the vulnerability score for the seventh vulnerability of the second plurality of vulnerabilities 504, based on an output of the trained first machine learning model 510 for the seventh vulnerability. Based on the vulnerability score of the seventh vulnerability being greater than the first threshold vulnerability score, the application server 102 may exclude the seventh vulnerability from being pseudo-labeled with the second label.
In a non-limiting example, it is assumed that the fourth plurality of vulnerabilities 512 includes only two vulnerabilities (e.g., the fourth and fifth vulnerabilities of the second plurality of vulnerabilities 504). Therefore, only the fourth and fifth vulnerabilities (e.g., the fourth plurality of vulnerabilities 512) are mined for pseudo-labeling with the second label, by the application server 102 from the unlabeled dataset 504, using the trained first machine learning model 510. Based on the mining of the fourth plurality of vulnerabilities 512, the application server 102 may retrieve from the unlabeled dataset 504, the information pertaining to the fourth plurality of vulnerabilities 512. For example, the application server 102 may retrieve from the unlabeled dataset 504, the fourth description of the fourth vulnerability, the fourth identifier of the fourth vulnerability, the fifth description of the fifth vulnerability, and the fifth identifier of the fifth vulnerability.
FIG. 6 is a block diagram 600 that illustrates the second stage of training of the trained first machine learning model 510, in accordance with an exemplary embodiment of the disclosure. FIG. 6 is explained in conjunction with FIG. 5.
As shown in FIG. 6, the application server 102 creates (e.g., generates) the second training dataset (hereinafter, designated and referred to as “the second training dataset 602”). The application server 102 creates the second training dataset 602, based on the mined fourth plurality of vulnerabilities 512 and the labeled dataset 502 that is indicative of the first plurality of vulnerabilities 502. The second training dataset 602 includes the information pertaining to the first plurality of vulnerabilities 502 and the information pertaining to the fourth plurality of vulnerabilities 512. Further, the second training dataset 602 includes a label for each of the first plurality of vulnerabilities 502 and the fourth plurality of vulnerabilities 512. In other words, the second training dataset 602 (e.g., table 700 shown in FIG. 7) includes the first label for the first plurality of vulnerabilities 502 and the second label (e.g., pseudo-label) for the mined fourth plurality of vulnerabilities 512.
Referring now to FIG. 7, the table 700 that illustrates the second training dataset 602, in accordance with an exemplary embodiment of the present disclosure, is shown.
The table 700 includes first through fourth columns 702a-702d and first through fifth rows 704a-704e. The first column 702a is indicative of the first plurality of vulnerabilities 502 (e.g., the first through third vulnerabilities) and the mined fourth plurality of vulnerabilities 512 (e.g., the fourth and fifth vulnerabilities). The second column 702b includes the description of each of the first plurality of vulnerabilities 502 and the fourth plurality of vulnerabilities 512. The third column 702c includes the identifier of each of the first plurality of vulnerabilities 502 and the fourth plurality of vulnerabilities 512. The fourth column 702d includes the label for each of the first plurality of vulnerabilities 502 and the fourth plurality of vulnerabilities 512.
The first row 704a includes the first description (e.g., “Description_1”) of the first vulnerability (e.g., “Vulnerability_1”), the first identifier (e.g., “ID_1”) of the first vulnerability, and the first label (e.g., “1”) of the first vulnerability.
The second row 704b includes the second description (e.g., “Description_2”), the second identifier (e.g., “ID_2”) of the second vulnerability (e.g., “Vulnerability_2”), and the first label (e.g., “1”) of the second vulnerability.
The third row 704c includes the third description (e.g., “Description_3”) of the third vulnerability, and the third identifier (e.g., “ID_3”) of the third vulnerability (e.g., “Vulnerability_3”). The third row 704c further includes the first label (e.g., “1”) of the third vulnerability.
The fourth row 704d includes the fourth description (e.g., “Description_4”) of the fourth vulnerability (e.g., “Vulnerability_4”), the fourth identifier (e.g., “ID_4”) of the fourth vulnerability, and the second label (e.g., “0”) of the fourth vulnerability.
The fifth row 704e includes the fifth description (e.g., “Description_5”) of the fifth vulnerability (e.g., “Vulnerability_5”), the fifth identifier (e.g., “ID_5”) of the fifth vulnerability, and the second label (e.g., “0”) of the fifth vulnerability.
The table 700 indicates that, in the second training dataset 602, the fourth vulnerability and the fifth vulnerability are pseudo-labeled with the second label (e.g., “0”) assigned to the second vulnerability type (e.g., non-severe vulnerabilities). For the sake of brevity, it is assumed that a new training dataset (e.g., the second training dataset 602) is created (e.g., generated) by the application server 102, following the mining of the fourth plurality of vulnerabilities 512. However, in another embodiment, the second training dataset 602 may correspond to an updated version of an existing training dataset (e.g., the first training dataset 508) that is updated based on the mined fourth plurality of vulnerabilities 512. For example, the first training dataset 508 may be updated to exclude the sixth vulnerability (e.g., the information pertaining to the sixth vulnerability), based on the mined fourth plurality of vulnerabilities 512. In such a scenario, the updated first training dataset 508 may include the information pertaining to the first plurality of vulnerabilities 502 (e.g., the first through third vulnerabilities) and the information pertaining to the fourth and fifth vulnerabilities (e.g., the mined fourth plurality of vulnerabilities 512).
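The creation of the second training dataset 602 may be sketched as a concatenation of labeled and pseudo-labeled rows. The tuple layout and the label encodings ("1" for the first label, "0" for the second label) follow the table 700; the function name is an assumption for illustration.

```python
# Illustrative sketch: combine the labeled dataset (first label, "1")
# with the mined, pseudo-labeled vulnerabilities (second label, "0")
# into rows of (description, identifier, label).
def build_training_dataset(labeled_rows, pseudo_labeled_rows):
    dataset = [(desc, ident, 1) for desc, ident in labeled_rows]
    dataset += [(desc, ident, 0) for desc, ident in pseudo_labeled_rows]
    return dataset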
Referring back to FIG. 6, the application server 102 may re-train the trained first machine learning model 510 in the second stage of training, using the second training dataset 602. The process of re-training the trained first machine learning model 510 in the second stage of training may be similar to the training of the first machine learning model 510 in the first stage of training, as described in the foregoing description of FIG. 5.
In the second stage of training, the application server 102 may provide a set of features, associated with the second training dataset 602, as input to the trained first machine learning model 510. The set of features may include a plurality of embeddings associated with a description (e.g., the first through fifth descriptions) of each of the first plurality of vulnerabilities 502 and the fourth plurality of vulnerabilities 512. The set of features may further include a plurality of identifiers (e.g., the first through fifth identifiers) of the first plurality of vulnerabilities 502 and the fourth plurality of vulnerabilities 512. The set of features may further include the first label for the first plurality of vulnerabilities 502 and the second label for the fourth plurality of vulnerabilities 512.
The trained first machine learning model 510 may be re-trained (for vulnerability classification) in the second stage of training, using the second training dataset 602 (e.g., the set of features associated with the second training dataset 602). The re-trained first machine learning model 510 is trained to classify vulnerabilities as the first vulnerability type or the second vulnerability type. It will be apparent to those of skill in the art that an accuracy of the re-trained first machine learning model 510 in vulnerability classification may be greater than an accuracy of the trained first machine learning model 510 (trained in the first stage of training) in vulnerability classification.
The mined fourth plurality of vulnerabilities 512 are determined to have a vulnerability score less than or equal to the first threshold vulnerability score or the first threshold pseudo-labeling score. As a result, it is presumed that the second training dataset 602 has lower label noise than the first training dataset 508. Therefore, the re-trained first machine learning model 510 after the second stage of training has a higher accuracy in vulnerability classification than the trained first machine learning model 510 trained in the first stage of training.
The application server 102 may iteratively perform various operations described in FIG. 6 for improving an accuracy of the first machine learning model 510. For example, the application server 102 may mine a plurality of vulnerabilities (e.g., a fifth plurality of vulnerabilities 604), from the unlabeled dataset 504, using the re-trained first machine learning model 510, in a subsequent stage of training. The mining of the fifth plurality of vulnerabilities 604 may be similar to the mining of the fourth plurality of vulnerabilities 512 from the unlabeled dataset 504 (as described in the foregoing description of FIG. 5). The mined fifth plurality of vulnerabilities 604 are to be pseudo-labeled with the second label assigned to the second vulnerability type. The application server 102 may generate a new training dataset (e.g., a third training dataset; not shown) to re-train the first machine learning model 510 in the subsequent stage of training (e.g., a third stage of training). The new training dataset (e.g., the third training dataset) may be indicative of the fifth plurality of vulnerabilities 604, mined from the unlabeled dataset 504, and the first plurality of vulnerabilities 502. The new training dataset may include the information pertaining to the first plurality of vulnerabilities 502 and information pertaining to each of the fifth plurality of vulnerabilities 604. The new training dataset may include the first plurality of descriptions, the first plurality of identifiers, a plurality of descriptions of the fifth plurality of vulnerabilities 604, and a plurality of identifiers of the fifth plurality of vulnerabilities 604.
A process of re-training the first machine learning model 510, in subsequent stages of training (e.g., the third stage of training, the fourth stage of training, the nth stage of training, or the like) and mining vulnerabilities from the unlabeled dataset 504 may be repeated until a requisite accuracy of the first machine learning model 510 is obtained (e.g., accuracy of the first machine learning model 510 in vulnerability classification).
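The repeated mine/re-train cycle described above may be sketched as the following loop. The callables for training, mining, and accuracy evaluation are placeholders, since the disclosure does not prescribe particular implementations of these operations.

```python
# Illustrative sketch of the iterative training process: train in a
# first stage, then alternately mine vulnerabilities for pseudo-labeling
# and re-train, until the requisite accuracy (or a stage limit) is
# reached.
def self_training(train, mine, evaluate, labeled, unlabeled,
                  requisite_accuracy: float, max_stages: int = 10):
    model = train(labeled)                       # first stage of training
    for _ in range(max_stages):
        if evaluate(model) >= requisite_accuracy:
            break                                # requisite accuracy obtained
        pseudo_labeled = mine(model, unlabeled)  # mine for pseudo-labeling
        model = train(labeled + pseudo_labeled)  # subsequent stage of training
    return model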
In one embodiment, the application server 102 may determine an accuracy of the first machine learning model 510 in vulnerability classification, using a loss function. As is known to those of skill in the art, the loss function may be used to measure a difference between a value estimated by the first machine learning model 510 and a true value. The application server 102 may, based on the loss function (e.g., using the loss function), determine a loss associated with the first machine learning model 510 at each stage of training (e.g., the first stage of training, the second stage of training, the third stage of training, or the like). The application server 102 may update a set of weights associated with the first machine learning model 510 (e.g., the bidirectional LSTM), based on the determined loss at each stage of training of the first machine learning model 510. As is known to those of skill in the art, the loss associated with the first machine learning model 510 may be associated with a test dataset. The test dataset may be indicative of a plurality of vulnerabilities. The test dataset may be provided, by the application server 102, as input to the trained/re-trained first machine learning model 510. The application server 102 may compare an output, of the trained/re-trained first machine learning model 510, to known/correct outputs. In the current embodiment, the loss function may include a supervised loss component and a semi-supervised loss component. In a non-limiting example, the loss function may be represented as “K1(Ls) + K2(Lps)”, where “Ls” represents supervised loss, “Lps” represents semi-supervised loss, “K1” is a coefficient associated with the supervised loss, and “K2” is a coefficient associated with the semi-supervised loss.
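The loss "K1(Ls) + K2(Lps)" described above may be sketched as follows. Binary cross-entropy is assumed here as one possible choice for the per-example loss, and the coefficient values are illustrative hyperparameters; neither is mandated by the disclosure.

```python
import math

# Illustrative sketch: a per-example binary cross-entropy loss and the
# combined loss "K1(Ls) + K2(Lps)" described above.
def binary_cross_entropy(predicted: float, target: float) -> float:
    eps = 1e-12  # guard against log(0)
    predicted = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(predicted)
             + (1.0 - target) * math.log(1.0 - predicted))

def combined_loss(supervised_loss: float, semi_supervised_loss: float,
                  k1: float = 1.0, k2: float = 0.5) -> float:
    # K1 weights the supervised component; K2 weights the
    # semi-supervised component
    return k1 * supervised_loss + k2 * semi_supervised_loss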
The supervised loss is determined, by the application server 102, based on a correctness of a classification, by the first machine learning model 510, of vulnerabilities as the first vulnerability type and/or the second vulnerability type. The semi-supervised loss is determined based on a correctness of a classification, by the first machine learning model 510, of vulnerabilities (e.g., the fourth plurality of vulnerabilities 512) that are pseudo-labeled with the second label as the second vulnerability type. The application server 102 may determine that the accuracy of the first machine learning model 510 is greater than or equal to a threshold accuracy if the determined loss is less than or equal to a loss threshold. If the determined loss is greater than the loss threshold, the first machine learning model 510 may be re-trained in subsequent stages until the loss is less than or equal to the loss threshold.
The application server 102 may not be limited to mining vulnerabilities (e.g., the fourth plurality of vulnerabilities 512, the fifth plurality of vulnerabilities 604, or the like) that are to be pseudo-labeled with the second label (e.g., vulnerabilities that correspond to the second vulnerability type). In an actual implementation, the application server 102 may also mine, from the unlabeled dataset 504, using the trained/re-trained first machine learning model 510, vulnerabilities that are to be pseudo-labeled with the first label assigned to the first vulnerability type (e.g., severe). In such a scenario, the vulnerabilities that are pseudo-labeled with the first label may also be included in a training dataset (e.g., the second training dataset 602, the third training dataset, or the like) created, by the application server 102, for a subsequent stage of training. The process of selection of vulnerabilities for pseudo-labeling with the first label may be similar to the selection of vulnerabilities for pseudo-labeling with the second label. In other words, the mining of the vulnerabilities to be pseudo-labeled with the first label may be similar to the mining of vulnerabilities to be pseudo-labeled with the second label.
The application server 102 may host a service application (not shown) that is accessible by the plurality of user devices 106. For example, the service application may be executed on the plurality of user devices 106. The service application may be a standalone application installed and/or executed on the plurality of user devices 106. In another embodiment, the service application may be a web application that is accessible by way of a web browser installed on the plurality of user devices 106. The service application may enable users to provide information pertaining to (e.g., associated with) one or more vulnerabilities for classification of the one or more vulnerabilities. Functionality of the service application is explained in conjunction with FIG. 8.
FIG. 8 is a process flow diagram 800 that illustrates classification of vulnerabilities, in accordance with an exemplary embodiment of the present disclosure. For the sake of brevity, it is assumed that the first machine learning model 510 has already been trained and/or re-trained in various stages of training (e.g., the first through nth stages of training). The process flow diagram 800 includes the first user device 106a, of the plurality of user devices 106, and the application server 102.
The service application that is executed/installed on the first user device 106a may be accessed by the first user (as shown by arrow 802). Upon access of the service application, the service application may render a user interface (UI) on a display screen of the first user device 106a (as shown by arrow 804). The rendered UI enables users (e.g., the first user) to access a vulnerability classification service offered by the application server 102. Consequently, details of a set of vulnerabilities may be provided by the first user in the rendered UI (as shown by arrow 806). The rendered UI may enable users (e.g., the first user) to provide (e.g., upload) details of vulnerabilities for vulnerability classification. The details of the set of vulnerabilities may include an identifier of each of the set of vulnerabilities, a description of each of the set of vulnerabilities, or the like. For the sake of brevity, it is assumed in a non-limiting example that the details of the set of vulnerabilities include the identifier of each of the set of vulnerabilities.
In a non-limiting example, the details of the set of vulnerabilities may be provided by the first user by providing a comma-separated values (CSV) file that includes the identifier of each of the set of vulnerabilities. In other words, the CSV file may include a set of identifiers of the set of vulnerabilities. However, provision of the details of the set of vulnerabilities is not limited to CSV files. It will be apparent to those of skill in the art that the details of the set of vulnerabilities may be provided (e.g., by the first user) in other formats (e.g., a text document, a word document, a portable document format, or the like) without deviating from the scope of the disclosure.
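Parsing such an uploaded CSV file may be sketched as follows. The single-column layout with a header row, and the function name, are assumptions for illustration.

```python
import csv
import io

# Illustrative sketch: extract the set of identifiers from the text of
# an uploaded CSV file whose first (assumed) row is a header.
def read_identifiers(csv_text: str) -> list:
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the assumed header row
    return [row[0] for row in reader if row]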
In the current embodiment, each of the set of vulnerabilities may correspond to a vulnerability detected in at least one of the plurality of user devices 106, the communication network 108, the application server 102, or the like. The set of vulnerabilities may be detected by a set of vulnerability scanners that are installed on the plurality of user devices 106 and/or the application server 102. Functioning of the set of vulnerability scanners is well known to those of skill in the art and is not explained to avoid obscuring the disclosure.
Based on the provision of the details of the set of vulnerabilities (e.g., the detected set of vulnerabilities), the service application that is executed on the first user device 106a may generate a vulnerability classification request (as shown by arrow 808). The first user device 106a may communicate the vulnerability classification request to the application server 102 (as shown by arrow 810). The vulnerability classification request includes the details of the set of vulnerabilities. For example, the vulnerability classification request may include the CSV file provided by the first user. The application server 102 may receive the vulnerability classification request. The application server 102 may classify each of the set of vulnerabilities as one of the first vulnerability type or the second vulnerability type, using the trained/re-trained first machine learning model 510 (as shown by arrow 812). For the classification of the set of vulnerabilities, the application server 102 may generate a set of features based on the details of the set of vulnerabilities (as described in the foregoing descriptions of FIGS. 5 and 6). In the current embodiment, the set of features (e.g., word embeddings) may be indicative of the set of identifiers of the set of vulnerabilities. The application server 102 may provide the generated set of features as input to the first machine learning model 510. The first machine learning model 510 may, based on the provided input, generate a score (e.g., a vulnerability score) for each of the set of vulnerabilities.
The generated score for each of the set of vulnerabilities may indicate whether a corresponding vulnerability, of the set of vulnerabilities, is of the first vulnerability type or the second vulnerability type. For example, if the generated score for a vulnerability, of the set of vulnerabilities, is greater than or equal to a second threshold vulnerability score (e.g., “0.7”), the vulnerability may be classified as the first vulnerability type. In another example, if the generated score for a vulnerability, of the set of vulnerabilities, is less than the first threshold vulnerability score (e.g., “0.4”), the vulnerability may be classified as the second vulnerability type.
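The two-threshold classification described above may be sketched as follows. The threshold values mirror the examples in the description; the function name and return labels are illustrative assumptions.

```python
# Illustrative sketch: classify a generated score with two thresholds.
# Scores at or above the second threshold vulnerability score indicate
# the first vulnerability type; scores below the first threshold
# vulnerability score indicate the second type; scores in between
# remain unclassified (and may prompt manual labeling).
def classify_score(score: float,
                   first_threshold: float = 0.4,
                   second_threshold: float = 0.7) -> str:
    if score >= second_threshold:
        return "first_type"    # e.g., severe
    if score < first_threshold:
        return "second_type"   # e.g., non-severe
    return "unclassified"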
The application server 102 may generate and communicate a vulnerability classification response to the first user device 106a (as shown by arrow 814). The vulnerability classification response may include the score, generated by the first machine learning model 510, for each of the set of vulnerabilities. Further, the vulnerability classification response may be indicative of a classification of each of the set of vulnerabilities as the first vulnerability type or the second vulnerability type, based on a generated score for a corresponding vulnerability. The first user device 106a may receive the vulnerability classification response. Based on the vulnerability classification response, the service application presents a set of scores on the rendered UI (as shown by arrow 816). The set of scores includes the score generated by the first machine learning model 510 for each of the set of vulnerabilities. The service application may further present, on the rendered UI, a message indicating a classification of each of the set of vulnerabilities. The message may be indicative of a label (e.g., the first label or the second label) for each of the set of vulnerabilities. In other words, the message may indicate whether each vulnerability of the set of vulnerabilities is a severe vulnerability (e.g., the first vulnerability type) or a non-severe vulnerability (e.g., the second vulnerability type).
In some embodiments, the application server 102 may generate a priority list (e.g., a ranking list) based on the set of scores generated for the set of vulnerabilities. In a non-limiting example, the priority list may include the set of vulnerabilities ordered based on the score generated for each of the set of vulnerabilities. For example, the priority list may be indicative of the set of vulnerabilities, in descending order of scores (e.g., the set of scores) generated for the set of vulnerabilities. A position of a vulnerability in the priority list may be indicative of a severity of the vulnerability with respect to other vulnerabilities of the set of vulnerabilities. In a non-limiting example, a vulnerability, of the set of vulnerabilities, that occupies a first position in the priority list may be deemed to be most severe among the set of vulnerabilities. Similarly, a vulnerability, of the set of vulnerabilities, that occupies a second position in the priority list may be deemed to be less severe than the vulnerability that occupies the first position in the priority list. Similarly, a vulnerability, of the set of vulnerabilities, that occupies a last position in the priority list may be deemed to be least severe among the set of vulnerabilities.
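Generation of the priority list can be sketched as an ordering by descending score. The data layout (a mapping from vulnerability identifier to score) and the function name are assumptions for illustration.

```python
# Illustrative sketch: order scored vulnerabilities into a priority
# list, most severe (highest score) first; the first position holds the
# vulnerability deemed most severe.
def build_priority_list(scores: dict) -> list:
    return sorted(scores, key=scores.get, reverse=True)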
The vulnerability classification response may further include the generated priority list. Consequently, the priority list may be presented by the service application on the rendered UI for viewing by the first user. Accordingly, the detected set of vulnerabilities may be patched based on the priority list.
In one embodiment, a classification of a vulnerability (e.g., the first vulnerability) of the set of vulnerabilities may be changed by the first user. For example, the first label assigned to the first vulnerability may be changed by the first user. For changing a label, one or more options included in the rendered UI may be selected by the first user to generate a label change request for changing the first label of the first vulnerability to the second label. The label change request may be generated by the service application and communicated by the first user device 106a to the application server 102. The label change request may indicate that the first label of the first vulnerability is incorrect and is to be changed to the second label. Based on the reception of the label change request by the application server 102, the application server 102 may create a new training dataset or update an existing training dataset (e.g., the second training dataset 602, the third training dataset, the nth training dataset, or the like) to change the first label of the first vulnerability to be the second label. The application server 102 may re-train the first machine learning model 510, in a subsequent stage of training (e.g., the third stage of training, the fourth stage of training, or the like), using the new training dataset or the updated existing training dataset.
In one embodiment, one or more scores of the set of scores may be less than the second threshold vulnerability score (e.g., “0.7”), but greater than the first threshold vulnerability score (e.g., “0.4”). In such a scenario, one or more vulnerabilities corresponding to the one or more scores may not be labeled with either of the first and second label. The service application may present, on the rendered UI, a message that indicates that the one or more vulnerabilities are not classified as the first vulnerability type or the second vulnerability type. The service application may further prompt the first user to provide a manual label for each of the one or more vulnerabilities. Based on the manual label received from the first user for each of the one or more vulnerabilities, the application server 102 may re-train the first machine learning model 510 in a subsequent stage of training. Process of training and re-training has already been explained in the foregoing description of FIGS. 5 and 6.
In the current embodiment, it is assumed that a new training dataset (e.g., the second training dataset 602, the third training dataset, or the like) is created for each stage of training (e.g., the second stage of training, the third stage of training, or the like). However, in another embodiment, an existing training dataset may be updated for each subsequent stage of training, without deviating from the scope of the disclosure. For example, instead of creating the second training dataset 602, the first training dataset 508 may be updated based on the mined fourth plurality of vulnerabilities 512. As a result, the updated first training dataset 508 may be the same as the second training dataset 602.
In the current embodiment, it is assumed that an existing machine learning model (e.g., the first machine learning model 510) is re-trained in each subsequent stage of training (e.g., the second stage of training, the third stage of training, or the like). However, in another embodiment, a new machine learning model (e.g., a second machine learning model similar to the first machine learning model 510; not shown) may be trained based on a corresponding dataset (e.g., the second training dataset 602, the third training dataset, or the like) in each subsequent stage of training (e.g., the second stage of training, the third stage of training, or the like) without deviating from the scope of the disclosure.
In one embodiment, the application server 102 may determine, based on the mined fifth plurality of vulnerabilities 604, that one or more vulnerabilities in the second training dataset 602 were incorrectly pseudo-labeled with the second label. For example, the application server 102 may determine that the fourth vulnerability was incorrectly pseudo-labeled with the second label, based on an output of the re-trained first machine learning model 510 to a provided input (e.g., the set of features associated with the second training dataset 602). For example, the mined fifth plurality of vulnerabilities 604 may not include the fourth vulnerability. Based on the determination that the fourth vulnerability was incorrectly pseudo-labeled with the second label, the application server 102 may update the second training dataset 602 to exclude the fourth vulnerability from the second training dataset 602. Alternatively, the application server 102 may update the second training dataset 602 to change the second label of the fourth vulnerability to the first label. Alternatively, a training dataset (e.g., the third training dataset) created for a subsequent stage of training (e.g., the third stage of training) may not include the fourth vulnerability. The application server 102 may re-train the trained/re-trained first machine learning model 510 in the third stage of training, using the updated second training dataset 602 or the created new third training dataset.
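The correction of incorrect pseudo-labels may be sketched as a pruning pass over the training dataset. The (identifier, label) row layout, the label encoding ("0" for the second label), and the function name are assumptions for illustration.

```python
# Illustrative sketch: drop rows that were pseudo-labeled with the
# second label (encoded "0") but are no longer mined by the re-trained
# model, on the assumption that they were incorrectly pseudo-labeled.
def prune_pseudo_labels(dataset, newly_mined_ids, pseudo_label=0):
    return [(vuln_id, label) for vuln_id, label in dataset
            if label != pseudo_label or vuln_id in newly_mined_ids]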
FIG. 9 is a block diagram that illustrates the application server 102, in accordance with an exemplary embodiment of the present disclosure. The application server 102 may include processing circuitry 902, a memory 904, and a network interface 906. The processing circuitry 902, the memory 904, and the network interface 906 may communicate with each other by way of a communication bus 908. The processing circuitry 902 may include an application host 910 and a machine learning engine 912.
The processing circuitry 902 includes suitable logic, circuitry, interfaces, and/or code for executing a set of instructions stored in a suitable data storage device (for example, the memory 904) for vulnerability management. Examples of the processing circuitry 902 may include, but are not limited to, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computer (RISC) processor, a complex instruction set computer (CISC) processor, a field programmable gate array (FPGA), a central processing unit (CPU), or the like. The processing circuitry 902 executes various operations for training/re-training the first machine learning model 510 and classifying vulnerabilities by way of the application host 910 and the machine learning engine 912.
The application host 910 is configured to host the service application that is executable on the plurality of user devices 106. The application host 910 may be further configured to render the UI on the plurality of user devices 106 (e.g., the first user device 106a). The application host 910 may be further configured to receive the vulnerability classification request, from the first user device 106a, and communicate the vulnerability classification response to the first user device 106a. The application host 910 may be further configured to receive the label change request from the first user device 106a.
The machine learning engine 912 is configured to train/re-train the first machine learning model 510 in various stages of training (e.g., the first through nth stages of training). For the various stages of training, the machine learning engine 912 may generate various training datasets (e.g., the first through nth training datasets). The machine learning engine 912 may be further configured to mine vulnerabilities from the unlabeled dataset 504. For example, the machine learning engine 912 may be configured to mine the fourth plurality of vulnerabilities 512, the fifth plurality of vulnerabilities 604, or the like.
The memory 904 may include suitable logic, circuitry, and/or interfaces for storing the set of instructions to be executed by one or more components of the processing circuitry 902 for vulnerability management. The memory 904 may be configured to store data that is required by one or more components of the processing circuitry 902 for executing the set of instructions. For example, the memory 904 may be configured to store the first machine learning model 510, the second machine learning model (hereinafter, designated and referred to as “the second machine learning model 916”), and training datasets (e.g., the first through nth training datasets).
Examples of the memory 904 may include a random-access memory (RAM), a read-only memory (ROM), a removable storage drive, a hard disk drive (HDD), a flash memory, a solid-state memory, and the like. It will be apparent to a person skilled in the art that the scope of the disclosure is not limited to realizing the memory 904 in the application server 102, as described herein. In another embodiment, the memory 904 may be realized in form of a database server or a cloud storage working in conjunction with the application server 102, without departing from the scope of the disclosure.
The network interface 906 may include suitable logic, circuitry, interfaces, and/or code, executable by the circuitry, to transmit and receive data over the communication network 108 using one or more communication network protocols. The network interface 906 may receive messages and data (e.g., the labeled dataset 502, the vulnerability classification request, the label change request, or the like) from the database server 104 and the plurality of user devices 106. The network interface 906 may transmit messages and data (e.g., the vulnerability classification response) to the plurality of user devices 106. Examples of the network interface 906 may include, but are not limited to, an antenna, a radio frequency transceiver, a wireless transceiver, a Bluetooth transceiver, an ethernet port, or any other device configured to transmit and receive data.
FIG. 10 is a block diagram that illustrates a system architecture of a computer system 1000, in accordance with an embodiment of the present disclosure. An embodiment of the present disclosure, or portions thereof, may be implemented as computer readable code on the computer system 1000. In one example, the application server 102, the database server 104, and the plurality of user devices 106 may be implemented as the computer system 1000. Hardware, software, or any combination thereof may embody modules and components used to implement the methods of FIGS. 11A-11C, 12, and 13.
The computer system 1000 includes a CPU 1002 that may be a special-purpose or a general-purpose processing device. The CPU 1002 may be a single processor, multiple processors, or combinations thereof. The CPU 1002 may have one or more processor cores. Further, the CPU 1002 may be connected to a communication infrastructure 1004, such as a bus, message queue, multi-core message-passing scheme, and the like. The computer system 1000 may further include a main memory 1006 and a secondary memory 1008. Examples of the main memory 1006 may include RAM, ROM, and the like. The secondary memory 1008 may include a hard disk drive or a removable storage drive, such as a floppy disk drive, a magnetic tape drive, a compact disc, an optical disk drive, a flash memory, and the like.
The computer system 1000 further includes an input/output (I/O) interface 1010 and a communication interface 1012. The I/O interface 1010 includes various input and output devices that are configured to communicate with the CPU 1002. Examples of the input devices may include a keyboard, a mouse, a joystick, a touchscreen, a microphone, and the like. Examples of the output devices may include a display screen, a speaker, headphones, and the like. The communication interface 1012 may be configured to allow data to be transferred between the computer system 1000 and various devices that are communicatively coupled to the computer system 1000. Examples of the communication interface 1012 may include a modem, a network interface card (e.g., an Ethernet card), a communication port, and the like. Data transferred via the communication interface 1012 may correspond to signals, such as electronic, electromagnetic, optical, or other signals as will be apparent to a person skilled in the art.
FIGS. 11A-11C, collectively, represent a flowchart 1100 that illustrates a method for training the first machine learning model 510 for vulnerability classification, in accordance with an exemplary embodiment of the present disclosure. FIGS. 11A-11C are described in conjunction with FIGS. 1-7.
With reference to FIG. 11A, at step 1102, the application server 102 receives the labeled dataset 502, which is indicative of the first plurality of vulnerabilities 502, from the first user device 106a of the plurality of user devices 106. The labeled dataset 502 includes the first plurality of identifiers and the first plurality of descriptions of the first plurality of vulnerabilities 502. The labeled dataset 502 further includes the first label for the first plurality of vulnerabilities 502. Each of the first plurality of vulnerabilities 502 corresponds to the first vulnerability type and is correctly labeled with the first label that is assigned to the first vulnerability type.
At step 1104, the application server 102 selects the third plurality of vulnerabilities 506, from the unlabeled dataset 504 that is indicative of the second plurality of vulnerabilities 504. The third plurality of vulnerabilities 506 may be selected at random from the second plurality of vulnerabilities 504.
At step 1106, the application server 102 retrieves, from the unlabeled dataset 504, information pertaining to the selected third plurality of vulnerabilities 506. As described in the foregoing descriptions of FIGS. 1, 3, and 5, the application server 102 may retrieve the third plurality of identifiers and the third plurality of descriptions from the second plurality of identifiers and the second plurality of descriptions, respectively.
At step 1108, the application server 102 creates the first training dataset 508. The first training dataset 508 includes the information pertaining to the first plurality of vulnerabilities 502 and the third plurality of vulnerabilities 506. The first training dataset 508 includes the first plurality of identifiers, the first plurality of descriptions, the third plurality of identifiers, and the third plurality of descriptions. The first training dataset 508 further includes the first label for the first plurality of vulnerabilities 502.
At step 1110, the application server 102 labels the third plurality of vulnerabilities 506 with the second label assigned to the second vulnerability type. Each of the third plurality of vulnerabilities 506 is labeled with the second label that is assigned to the second vulnerability type irrespective of whether a corresponding vulnerability is of the first vulnerability type or the second vulnerability type.
At step 1112, the application server 102 trains the first machine learning model 510 in the first stage of training, using the first training dataset 508. The first machine learning model 510 is trained to classify vulnerabilities as the first vulnerability type or the second vulnerability type (as described in the foregoing descriptions of FIGS. 1, 5, and 6).
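For illustration, the creation of the first training dataset 508 (steps 1102 through 1112) may be sketched in Python as follows. The function name, the numeric labels (1 for the first label and 0 for the second label), and the example identifiers are illustrative assumptions and are not part of the claimed method.

```python
import random

def create_first_training_dataset(labeled_vulns, unlabeled_vulns, sample_size, seed=0):
    """Build the stage-one training dataset: each labeled vulnerability keeps
    the first label (here 1), while a random selection from the unlabeled
    dataset is provisionally given the second label (here 0), irrespective
    of its true vulnerability type."""
    rng = random.Random(seed)
    third_plurality = rng.sample(unlabeled_vulns, sample_size)   # step 1104
    dataset = [(vuln, 1) for vuln in labeled_vulns]              # first plurality
    dataset += [(vuln, 0) for vuln in third_plurality]           # step 1110
    return dataset, third_plurality

dataset, third = create_first_training_dataset(
    ["VULN-A", "VULN-B"], ["VULN-C", "VULN-D", "VULN-E"], sample_size=2)
```

The resulting dataset may then be passed to any binary classifier for the first stage of training (step 1112).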
Referring now to FIG. 11B, at step 1114, the application server 102 mines, from the unlabeled dataset 504, using the trained first machine learning model 510, the fourth plurality of vulnerabilities 512 to be pseudo-labeled with the second label. The application server 102 mines the fourth plurality of vulnerabilities 512, based on the vulnerability score for each of the plurality of vulnerabilities. The mining of the fourth plurality of vulnerabilities 512 is explained in the foregoing description of FIG. 5.
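The mining operation of step 1114 may be sketched as follows. In this illustrative sketch, score_fn stands in for the trained first machine learning model 510, and the threshold and score values are assumed for the example; they do not reflect any particular threshold recited in the disclosure.

```python
def mine_for_pseudo_labeling(unlabeled_vulns, score_fn, threshold):
    """Mine the vulnerabilities whose vulnerability score, produced by the
    trained model (represented here by score_fn), falls below the threshold;
    these vulnerabilities are pseudo-labeled with the second label."""
    return [vuln for vuln in unlabeled_vulns if score_fn(vuln) < threshold]

# Toy vulnerability scores output by the trained model for the unlabeled dataset.
scores = {"VULN-C": 0.12, "VULN-D": 0.91, "VULN-E": 0.27}
mined = mine_for_pseudo_labeling(sorted(scores), scores.get, threshold=0.5)
```

Here "VULN-D" scores above the threshold and is therefore excluded from pseudo-labeling, mirroring the exclusion of the sixth vulnerability described at step 1116.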
At step 1116, the application server 102 creates the second training dataset 602 that includes the labeled first plurality of vulnerabilities 502 and the pseudo-labeled fourth plurality of vulnerabilities 512 (e.g., the fourth and fifth vulnerabilities). In another embodiment, the first training dataset 508 may be updated to include the pseudo-labeled fourth plurality of vulnerabilities 512 (e.g., the fourth and fifth vulnerabilities). Based on the mining of the fourth plurality of vulnerabilities 512, the application server 102 may determine that one or more vulnerabilities of the third plurality of vulnerabilities 506 were incorrectly labeled with the second label in the first training dataset 508. For example, the vulnerability score associated with the sixth vulnerability may be greater than the first threshold vulnerability score. Accordingly, the application server 102 may exclude the sixth vulnerability from being pseudo-labeled with the second label based on the vulnerability score associated with the sixth vulnerability.
At step 1118, the application server 102 re-trains the trained first machine learning model 510 in a subsequent stage of training (e.g., the second stage of training), using the second training dataset 602.
At step 1120, the application server 102 mines, from the unlabeled dataset 504, using the re-trained first machine learning model 510, a plurality of vulnerabilities (e.g., the fifth plurality of vulnerabilities 604) that are to be pseudo-labeled with the second label.
At step 1122, the application server 102 creates a new training dataset (e.g., the third training dataset) indicative of the first plurality of vulnerabilities 502 that are labeled with the first label and the fifth plurality of vulnerabilities 604 that are pseudo-labeled with the second label. In another embodiment, the application server 102 updates the first training dataset 508 to be indicative of the first plurality of vulnerabilities 502 that are labeled with the first label and the fifth plurality of vulnerabilities 604 that are pseudo-labeled with the second label.
At step 1124, the application server 102 may re-train the first machine learning model 510 in a subsequent stage of training (e.g., the third stage of training), using the created new training dataset (e.g., the third training dataset). At step 1126, the application server 102 determines an accuracy of the re-trained first machine learning model 510, using the loss function. For example, the application server 102 may determine a loss associated with the re-trained first machine learning model 510, using test data. The application server 102 may compare known outputs, associated with the test data, to outputs of the re-trained first machine learning model 510 for the test data. The application server 102 may determine the loss based on the comparison. At step 1128, the application server 102 determines whether the determined loss is less than the loss threshold. If, at step 1128, it is determined that the determined loss is not less than the loss threshold, step 1120 is performed again. In other words, if the determined loss is greater than or equal to the loss threshold, the mining and re-training of steps 1120 through 1124 are repeated.
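The iterative loop of steps 1120 through 1128 may be sketched as follows. In this illustrative sketch, train_fn, mine_fn, and loss_fn stand in for the re-training, mining, and loss-evaluation operations; the toy closures, the loss values, and the added max_stages safety stop are assumptions for the example.

```python
def iterative_training(train_fn, mine_fn, loss_fn, loss_threshold, max_stages=10):
    """Alternate mining (step 1120) and re-training (step 1124) until the
    loss measured on test data (steps 1126-1128) falls below the loss
    threshold; max_stages is an added safety stop, not part of the flow."""
    model = None
    for stage in range(1, max_stages + 1):
        pseudo_labeled = mine_fn(model)      # mine vulnerabilities to pseudo-label
        model = train_fn(pseudo_labeled)     # re-train on the new dataset
        if loss_fn(model) < loss_threshold:  # stop once accuracy is sufficient
            break
    return model, stage

# Toy loss sequence: the loss drops below the threshold at the third stage.
losses = iter([0.8, 0.4, 0.1])
model, stages = iterative_training(
    train_fn=lambda data: data, mine_fn=lambda m: [],
    loss_fn=lambda m: next(losses), loss_threshold=0.2)
```

With the toy loss sequence above, the loop terminates after the third stage, once the measured loss (0.1) is below the threshold (0.2).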
FIG. 12 is a flowchart 1200 that illustrates a method for changing a label of a vulnerability, in accordance with an exemplary embodiment of the present disclosure. For the sake of brevity, FIG. 12 is explained in conjunction with FIG. 6.
At step 1202, the application server 102 receives, from the first user device 106a, a label change request for change of the first label of a vulnerability. For example, the application server 102 receives a label change request for change of the first label, which is assigned to the first vulnerability, to the second label. At step 1204, the application server 102 updates the second training dataset 602 to change the first label of the first vulnerability to the second label. For the sake of brevity, it is assumed that the application server 102 updates an existing training dataset (e.g., the second training dataset 602) to change the first label assigned to the first vulnerability to the second label. In another embodiment, the application server 102 may create a new training dataset (e.g., the third training dataset) which indicates that the second label is assigned to the first vulnerability.
At step 1206, the application server 102 re-trains the first machine learning model 510 in the third stage of training, using the updated second training dataset 602. At step 1208, the application server 102 communicates a label change response to the first user device 106a. The label change response may include a message indicating that the second label is now successfully assigned to the first vulnerability. The service application, executed on the first user device 106a, may present the message on the rendered UI.
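The dataset update of step 1204 may be sketched as follows. This illustrative sketch assumes the same tuple representation and numeric labels (1 for the first label, 0 for the second label) used in the earlier sketches; the function name and identifiers are assumptions.

```python
def apply_label_change(training_dataset, vuln_id, new_label):
    """Update the training dataset so that the requested vulnerability now
    carries the new label (step 1204); the returned dataset is then used
    for the next re-training stage (step 1206)."""
    return [(vuln, new_label if vuln == vuln_id else label)
            for vuln, label in training_dataset]

# The label change request asks for "VULN-A" to carry the second label (0).
updated = apply_label_change([("VULN-A", 1), ("VULN-C", 0)], "VULN-A", new_label=0)
```

Creating a new training dataset instead of updating the existing one, as in the alternative embodiment, would simply apply the same relabeling while leaving the original dataset intact.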
FIG. 13 is a high-level flowchart 1300 that illustrates a method for managing vulnerabilities in a system, in accordance with an exemplary embodiment of the present disclosure.
At step 1302, the application server 102 creates the first training dataset 508. The first training dataset 508 is indicative of the first plurality of vulnerabilities 502, each correctly labeled with the first label that is assigned to the first vulnerability type. The first training dataset 508 is further indicative of the third plurality of vulnerabilities 506 that are selected from the unlabeled dataset 504 and labeled with the second label, irrespective of whether the third plurality of vulnerabilities 506 correspond to the first vulnerability type or the second vulnerability type. At step 1304, the application server 102 trains the first machine learning model 510 in the first stage of training, using the first training dataset 508. The application server 102 mines, from the unlabeled dataset 504, using the trained first machine learning model 510, the fourth plurality of vulnerabilities 512 to be pseudo-labeled with the second label. At step 1306, the application server 102 creates the second training dataset 602 that is indicative of the labeled first plurality of vulnerabilities 502 and the pseudo-labeled fourth plurality of vulnerabilities 512. At step 1308, the application server 102 re-trains, for vulnerability classification, the trained first machine learning model 510 in the second stage of training, using the second training dataset 602.
Embodiments in the disclosure enable the application server 102 to train, using techniques of semi-supervised learning, the first machine learning model 510 for vulnerability classification. The embodiments in the disclosure facilitate training of the first machine learning model 510 when a labeled data sample exists for only one class of vulnerabilities (e.g., the first vulnerability type) and labeled data for the other class of vulnerabilities (e.g., the second vulnerability type) is absent. Through iterative training of the first machine learning model 510 and repeated mining of vulnerabilities for pseudo-labeling, the accuracy of the first machine learning model 510 may be improved up to a requisite accuracy level. Manual labeling of vulnerabilities as the first vulnerability type or the second vulnerability type is not required after the reception of the labeled dataset 502. This results in effective and efficient utilization of the time of the employees of the entity. Accurate labeling/classification of vulnerabilities facilitates timely patching and/or mitigation of the vulnerabilities. The application server 102 is configured to generate the priority list for the set of vulnerabilities indicated by the vulnerability classification request, enabling the employees of the entity to prioritize patching of the set of vulnerabilities according to a severity level of each of the set of vulnerabilities. Further, the application server 102 enables users (e.g., the first user) to change or modify a label of a vulnerability. This allows the application server 102 to re-train the first machine learning model 510 based on changing requirements or compliance policies of the entity.
Techniques consistent with the present disclosure provide, among other features, systems and methods for managing vulnerabilities. While various exemplary embodiments of the disclosed system and method have been described above, it should be understood that they have been presented for purposes of example only, and not limitation. The description is not exhaustive and does not limit the disclosure to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practicing of the disclosure, without departing from its breadth or scope.
In the claims, the words ‘comprising’, ‘including’ and ‘having’ do not exclude the presence of other elements or steps than those listed in a claim. The terms “a” or “an,” as used herein, are defined as one or more than one. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.
While various embodiments of the present disclosure have been illustrated and described, it will be clear that the present disclosure is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the scope of the present disclosure, as described in the claims.
CLAIMS
WE CLAIM:
1. A method for managing vulnerabilities in a system, the method comprising:
creating, by a server, a first training dataset that includes:
a first plurality of vulnerabilities, each correctly labeled with a first label that is assigned to a first vulnerability type, and
a second plurality of vulnerabilities selected from an unlabeled dataset and labeled with a second label assigned to a second vulnerability type, irrespective of whether the second plurality of vulnerabilities correspond to the first vulnerability type or the second vulnerability type;
training, by the server, a machine learning model in a first stage of training, using the first training dataset;
mining, by the server, from the unlabeled dataset, using the trained machine learning model, a third plurality of vulnerabilities to be pseudo-labeled with the second label;
creating, by the server, a second training dataset that includes the labeled first plurality of vulnerabilities and the pseudo-labeled third plurality of vulnerabilities; and
re-training, by the server, for vulnerability classification, the trained machine learning model in a second stage of training, using the second training dataset.
2. The method as claimed in claim 1, wherein:
the first training dataset includes:
a description and an identifier of each of the first plurality of vulnerabilities and the second plurality of vulnerabilities, and
the second training dataset includes:
the description and the identifier of the first plurality of vulnerabilities and a description and an identifier of each of the pseudo-labeled third plurality of vulnerabilities.
3. The method as claimed in claim 2, further comprising:
providing, by the server, one of the description and the identifier of each of the pseudo-labeled third plurality of vulnerabilities as input to the re-trained machine learning model;
determining, by the server, that a vulnerability of the third plurality of vulnerabilities was incorrectly pseudo-labeled with the second label, based on an output of the trained machine learning model for the provided input;
updating, by the server, the second training dataset to change the second label of the vulnerability to the first label, based on the determination that the vulnerability was incorrectly pseudo-labeled; and
re-training, by the server, in a third stage of training using the updated second training dataset, the trained machine learning model.
4. The method as claimed in claim 1, further comprising:
receiving, by the server, a request for a change of the first label of a vulnerability included in the second training dataset;
updating, by the server, the second training dataset to change the first label of the vulnerability to the second label; and
re-training, by the server, in a third stage of training using the updated second training dataset, the re-trained machine learning model.
5. The method as claimed in claim 1, wherein the mined third plurality of vulnerabilities are pseudo-labeled with the second label based on a vulnerability score associated with each of the third plurality of vulnerabilities, wherein the vulnerability score associated with each of the third plurality of vulnerabilities is based on an output of the trained machine learning model for each of the third plurality of vulnerabilities, and is less than a threshold score.
6. The method as claimed in claim 5, wherein a vulnerability score associated with a vulnerability in the unlabeled dataset is based on an output of the trained machine learning model for the vulnerability, and wherein the vulnerability in the unlabeled dataset is excluded from being pseudo-labeled with the second label based on the vulnerability score associated with the vulnerability being greater than the threshold score.
7. The method as claimed in any of claims 1 to 6, wherein the second label is different from the first label, and wherein the second vulnerability type is different from the first vulnerability type.
8. The method as claimed in any of claims 1 to 6, wherein a severity level of the first vulnerability type is greater than or equal to a threshold severity level and a severity level of the second vulnerability type is less than the threshold severity level.
9. The method as claimed in any of claims 1 to 6, wherein an accuracy of the re-trained machine learning model in classifying vulnerabilities as the first vulnerability type or the second vulnerability type is greater than an accuracy of the trained machine learning model in classifying vulnerabilities as the first vulnerability type or the second vulnerability type.
10. A system for managing vulnerabilities, the system comprising:
a server configured to:
create a first training dataset that includes:
a first plurality of vulnerabilities, each correctly labeled with a first label that is assigned to a first vulnerability type, and
a second plurality of vulnerabilities selected from an unlabeled dataset and labeled with a second label assigned to a second vulnerability type, irrespective of whether the second plurality of vulnerabilities correspond to the first vulnerability type or the second vulnerability type;
train a machine learning model in a first stage of training, using the first training dataset;
mine, from the unlabeled dataset, using the trained machine learning model, a third plurality of vulnerabilities to be pseudo-labeled with the second label;
create a second training dataset that includes the labeled first plurality of vulnerabilities and the pseudo-labeled third plurality of vulnerabilities; and
re-train the trained machine learning model for vulnerability classification in a second stage of training, using the second training dataset.
| # | Name | Date |
|---|---|---|
| 1 | 202221045814-FORM 1 [10-08-2022(online)].pdf | 2022-08-10 |
| 2 | 202221045814-DRAWINGS [10-08-2022(online)].pdf | 2022-08-10 |
| 3 | 202221045814-COMPLETE SPECIFICATION [10-08-2022(online)].pdf | 2022-08-10 |
| 4 | 202221045814-FORM-26 [12-08-2022(online)].pdf | 2022-08-12 |
| 5 | 202221045814-FORM 3 [12-08-2022(online)].pdf | 2022-08-12 |
| 6 | 202221045814-ENDORSEMENT BY INVENTORS [12-08-2022(online)].pdf | 2022-08-12 |
| 7 | Abstract1.jpg | 2022-11-10 |