Specification
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHODS AND SYSTEMS FOR OUT-OF-DISTRIBUTION DETECTION FOR TEXT CLASSIFICATION
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202121033046, filed on July 22, 2021. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to text classification, and, more particularly, to methods and systems for out-of-distribution detection for text classification using a plug and play language model.
BACKGROUND
Text classification is an important task of natural language processing (NLP) in which an open-ended text is categorized into a set of predefined categories. In today's world, applications of text classification are varied and increasing day by day. A few examples of such applications include sentiment analysis for determining the sentiment behind a text, intent detection in which the intent behind the text is determined, spam detection for filtering spam, topic labelling for finding suitable labels for topics present in the text, language detection for detecting the language of the text, etc. Though the artificial intelligence (AI)/machine learning (ML) techniques around text classification are improving, the problems associated with the detection of outliers that hamper text classification still exist.
For instance, consider a situation where an employee asks about the salary structure with a conversation agent that is designed to answer questions on human resource (HR) policies or information technology (IT) infrastructure issues. The conversation agent generally answers queries with a fixed intent from a particular domain (e.g., HR policies or IT infrastructure) by employing NLP text classification techniques that classify user queries into one of several predefined intents. So, in this particular situation, upon encountering an outlier situation (a salary-related query), the conversation agent is supposed to provide an answer such as 'Please connect with a finance person for salary related queries' or 'I do not have an answer for this query', instead of providing some random answer that is completely irrelevant in the current context, such as 'Do you want to know about sick leave policy?'. This is only possible if out-of-distribution (OOD) detection techniques that help in recognizing outliers/anomalies are put in place for classification tasks, i.e., text/image classification tasks. Though the problem of OOD detection is well handled in the field of computer vision, fewer OOD detection models are available for NLP classification tasks, such as text classification.
Further, the available OOD detection models often ignore the impact of OOD detection on calibration errors. Additionally, issues like data leakage and the overlapping domain problem still exist, as these are not well handled by the present OOD detection models, which contributes to lower OOD detection accuracy.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for out-of-distribution detection for text classification using a plug and play language model is provided. The method comprises receiving, by an out-of-distribution detection system (OODDS) via one or more hardware processors, one or more sentence samples from one or more client devices, wherein the one or more sentence samples form an in-domain (IND) dataset; selecting, by the OODDS via the one or more hardware processors, at least two initial tokens from each sentence sample of the one or more sentence samples as an input seed for the respective sample; generating, by the OODDS via the one or more hardware processors, one or more new sentences for the one or more sentence samples, wherein each new sentence of the one or more new sentences is generated corresponding to each sentence sample of the one or more sentence samples based on the input seed selected for the respective sample using a plug and play language model (PPLM), wherein each new sentence that is generated is directed towards one or more words present in an out-of-distribution (OOD) dataset; filtering, by the OODDS via the one or more hardware processors, the one or more new sentences based on the IND dataset using embedding obtained from a Bidirectional Encoder Representations from Transformers (BERT) model to generate a PPLM OOD dataset; and training, by the OODDS via the one or more hardware processors, a classifier based, at least in part, on the PPLM OOD dataset and the IND dataset using a pre-defined loss function to obtain a trained classifier, wherein the trained classifier enables the OODDS to differentiate between an IND sample sentence and an OOD sample sentence based on a classifier output obtained for an input sample sentence.
In another aspect, there is provided an out-of-distribution detection system for text classification using a plug and play language model. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive one or more sentence samples from one or more client devices, wherein the one or more sentence samples form an in-domain (IND) dataset; select at least two initial tokens from each sentence sample of the one or more sentence samples as an input seed for the respective sample; generate one or more new sentences for the one or more sentence samples, wherein each new sentence of the one or more new sentences is generated corresponding to each sentence sample of the one or more sentence samples based on the input seed selected for the respective sample using the plug and play language model (PPLM), wherein each new sentence that is generated is directed towards one or more words present in an out-of-distribution (OOD) dataset; filter the one or more new sentences based on the IND dataset using embedding obtained from a Bidirectional Encoder Representations from Transformers (BERT) model to generate a PPLM OOD dataset; and train a classifier based, at least in part, on the PPLM OOD dataset and the IND dataset using a pre-defined loss function to obtain a trained classifier, wherein the trained classifier enables the OODDS to differentiate between an IND sample sentence and an OOD sample sentence based on a classifier output obtained for an input sample sentence.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause a method for text classification using a plug and play language model. The method comprises receiving, by an out-of-distribution detection system (OODDS) via one or more hardware processors, one or more sentence samples from one or more client devices, wherein the one or more sentence samples form an in-domain (IND) dataset; selecting, by the OODDS via the one or more hardware processors, at least two initial tokens from each sentence sample of the one or more sentence samples as an input seed for the respective sample; generating, by the OODDS via the one or more hardware processors, one or more new sentences for the one or more sentence samples, wherein each new sentence of the one or more new sentences is generated corresponding to each sentence sample of the one or more sentence samples based on the input seed selected for the respective sample using a plug and play language model (PPLM), wherein each new sentence that is generated is directed towards one or more words present in an out-of-distribution (OOD) dataset; filtering, by the OODDS via the one or more hardware processors, the one or more new sentences based on the IND dataset using embedding obtained from a Bidirectional Encoder Representations from Transformers (BERT) model to generate a PPLM OOD dataset; and training, by the OODDS via the one or more hardware processors, a classifier based, at least in part, on the PPLM OOD dataset and the IND dataset using a pre-defined loss function to obtain a trained classifier, wherein the trained classifier enables the OODDS to differentiate between an IND sample sentence and an OOD sample sentence based on a classifier output obtained for an input sample sentence.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an example representation of an environment, related to at least some example embodiments of the present disclosure.
FIG. 2 illustrates an exemplary block diagram of a system for text classification using a plug and play language model, in accordance with an embodiment of the present disclosure.
FIG. 3 illustrates a schematic block diagram representation of an out-of-distribution detection process associated with the system of FIG. 2 or the OODDS of FIG. 1 for out-of-distribution detection for text classification, in accordance with an embodiment of the present disclosure.
FIG. 4 illustrates an exemplary flow diagram of a method for text classification using the plug and play language model, in accordance with an embodiment of the present disclosure.
FIG. 5 illustrates an exemplary flow diagram of a method for filtering sentences using a Bidirectional Encoder Representations from Transformers (BERT) model, in accordance with an embodiment of the present disclosure.
FIGS. 6A through 6C are graphical representations illustrating segregation of in-domain (IND) and out-of-distribution (OOD) samples obtained by applying a plurality of OOD classification techniques, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Nowadays, text classification has become an important part of every industry as it helps in the categorization of data, which further helps in generating insights from the data that can be used to improve business processes. With the advancement of AI/ML algorithms, such as neural networks, that are making classification systems more robust, accurate OOD detection with reduced calibration error plays an important role. In the recent past, a plurality of OOD detection techniques have been introduced to optimize OOD detection. For example, in a detection technique by Hendrycks and Gimpel (MSP) (e.g., refer "A baseline for detecting misclassified and out-of-distribution examples in neural networks", International Conference on Learning Representations (ICLR), 2017), the authors proposed a method in which the hyperparameters are fine-tuned on a validation set to optimize the OOD detection. In this method, the maximum confidence score from the SoftMax output is utilized to generate a detection score that can be further utilized to classify the OOD samples. This is done because correctly classified examples tend to have higher maximum SoftMax confidence scores than OOD examples. The main challenge observed in the above technique is that a neural network trained based on the cross-entropy loss tends to be overconfident, which results in higher confidence scores for OOD samples if they share similar patterns and phrases.
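For illustration only, a minimal sketch of the MSP detection score described above is given below (PyTorch is assumed; the `logits` tensor, the number of classes, and the threshold `tau` are hypothetical placeholders rather than values from the original disclosure):

```python
import torch
import torch.nn.functional as F

def msp_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum SoftMax probability per sample; lower scores suggest OOD."""
    probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values

# Hypothetical usage: flag samples whose MSP falls below a tuned threshold.
logits = torch.randn(4, 10)          # batch of 4 samples, 10 IND classes
tau = 0.5                            # threshold tuned on a validation set
is_ood = msp_score(logits) < tau     # True -> treated as out-of-distribution
```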
In another detection technique by Shiyu Liang, Yixuan Li, and R. Srikant (e.g., refer "Enhancing the reliability of out-of-distribution image detection in neural networks"), the authors proposed a method in which temperature scaling is utilized with input perturbations, using an OOD validation dataset to tune the hyperparameters. The problem with the above method is that the OOD validation dataset selected for tuning the hyperparameters is a general dataset, and the parameters tuned with one OOD dataset are found to be invalid for other OOD validation datasets. In yet another detection technique by Ren et al. (e.g., refer "Likelihood ratios for out-of-distribution detection", Neural Information Processing Systems (NeurIPS), 2019), the authors presented an effective OOD detection method for image classifiers in which a background model is trained using perturbed IND samples employing a likelihood ratio. The likelihood ratio method effectively corrects for confounding background statistics, which further enhances OOD detection performance. The above disclosed method is effective but can be used only for image classification.
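A hedged sketch of the temperature scaling with input perturbation described above follows. The `model` callable is a hypothetical classifier returning logits, and `T` and `eps` stand in for the hyperparameters tuned on the OOD validation dataset; for text models, the perturbation would be applied to embeddings rather than raw token ids:

```python
import torch
import torch.nn.functional as F

def perturbed_temperature_score(model, x, T: float = 1000.0,
                                eps: float = 0.002) -> torch.Tensor:
    """Temperature-scaled MSP after a small input perturbation."""
    x = x.clone().requires_grad_(True)
    # Loss is the negative log of the maximum temperature-scaled probability.
    loss = -F.log_softmax(model(x) / T, dim=-1).max(dim=-1).values.sum()
    loss.backward()
    x_pert = (x - eps * x.grad.sign()).detach()  # push towards higher MSP
    with torch.no_grad():
        return F.softmax(model(x_pert) / T, dim=-1).max(dim=-1).values
```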
Further, in a few more detection techniques by Lee et al. (e.g., refer "A simple unified framework for detecting out-of-distribution samples and adversarial attacks", 2018, and "Enhancing the reliability of out-of-distribution image detection in neural networks", 2017), the authors proposed methods to generate synthetic OOD samples around a uniform distribution using generative adversarial networks (GANs). In the above proposed methods, OOD samples are forced to have a uniform distribution by reducing the Kullback-Leibler (KL) divergence between the probabilities generated on the OOD samples and the uniform distribution. Though the OOD detectors proposed by Lee et al. are effective, these detectors are mainly oriented towards visual tasks, specifically image classification.
Embodiments of the present disclosure overcome the disadvantages of the various OOD detection techniques present in the art by using a plug and play language model (PPLM) provided by Dathathri et al. (e.g., refer "Plug and play language models: A simple approach to controlled text generation", International Conference on Learning Representations (ICLR), 2020). The PPLM guides sentence generation for OOD samples, which are further utilized for training an OOD classifier to improve the OOD detection accuracy of conventional algorithms. Further, a filtering technique based on embeddings from a Bidirectional Encoder Representations from Transformers (BERT) model is used by the systems and methods of the present disclosure to filter a subset of the OOD samples generated by the PPLM so that they concentrate around the class boundary, which further improves OOD detection performance. Additionally, a post-hoc Dirichlet calibration is applied by the systems and methods to minimize the calibration errors found in conventional algorithms. This further solves the problems of data leakage and overlapping domains.
The present disclosure considers an out-domain sample/out-of-distribution (OOD) detection problem related to text classification. Currently, most conversational systems use text classification for performing intent identification, which may further be utilized for generating an accurate response for a user query, as the user query is answered based on the intent. Generally, the conversational systems employ a classification model (herein also referred to as a classifier) that is trained to perform the text classification (i.e., classification of the user queries into one of a plurality of predefined intents). The classified text may further be utilized to provide responses to the asked user queries based on an assigned intent. The classification model employed in the conversational systems, upon encountering user queries that form part of OOD samples, may provide improper responses (because of non-clarity on the intent), as the classification model tends to make overconfident predictions on OOD test samples. In the present disclosure, the user experience is improved by providing an improved OOD detection technique in the form of an out-of-distribution detection system (explained in detail with reference to FIGS. 1 and 2) that effectively handles the problem of improper responses by accurately detecting OOD samples. The conversational systems employing the out-of-distribution detection system may redirect a user to other conversational systems that are designed to handle the asked user queries, or may raise a flag for manual intervention upon encountering OOD-related queries, without requiring retraining of the classification models employed in those conversational systems.
In an embodiment, as the OOD space for any conversational system is huge, it is hypothesized that the samples that are close to a cluster boundary of the in-domain samples (hereinafter referred to as IND) in the embedding space are more effective in discriminating between IND and OOD samples. The out-of-distribution detection system, to differentiate between an in-domain sample and an out-domain sample/out-of-distribution (OOD) sample, may generate, for each sample sentence (query), a sentence that is close to the IND samples using the PPLM. In an embodiment, the PPLM may utilize an attribute model to guide generation of the sentences, as the attribute model updates the latent representation, which is further used to generate a new distribution over the vocabulary without requiring fine-tuning of a classifier employed in the corresponding conversational system. Further, the generated sentences may be utilized by the out-of-distribution detection system to form a proxy for OOD. Thereafter, the out-of-distribution detection system may use the formed proxy for performing entropy regularization, thereby improving the efficiency of the OOD detection.
Referring now to the drawings, and more particularly to FIG. 1 through 6C, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary representation of an environment 100 related to at least some example embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, generating new sentences that are directed towards an out-of-distribution (OOD) dataset, obtaining embeddings from a Bidirectional Encoder Representations from Transformers (BERT) model, etc. The environment 100 generally includes a plurality of client devices, such as client devices 102a and 102b, and an out-of-distribution detection system (hereinafter referred to as 'OODDS') 106, each coupled to, and in communication with (and/or with access to), a network 104. It should be noted that two client devices are shown for the sake of explanation; there can be a greater or smaller number of client devices.
The network 104 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts or users illustrated in FIG. 1, or any combination thereof.
Various entities in the environment 100 may connect to the network 104 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, or any combination thereof.
The client devices 102a and 102b are associated with a classifier trainer (e.g., a user or an entity such as an organization) who wants to train a classifier using the OODDS 106. Examples of the client devices 102a and 102b include, but are not limited to, a personal computer (PC), a mobile phone, a tablet device, a Personal Digital Assistant (PDA), a voice activated assistant, a smartphone, and a laptop.
The out-of-distribution detection system (OODDS) 106 includes one or more hardware processors and a memory. The OODDS 106 is configured to perform one or more of the operations described herein. The OODDS 106 is configured to receive one or more sentence samples via the network 104 from the client devices 102a and 102b associated with the user who wants to train the classifier to classify between an in-domain (IND) sample sentence and an OOD sample sentence. In general, the OODDS 106, for training the classifier using a plug and play language model (PPLM), requires some input seeds, and the sentence samples received from the client devices 102a and 102b work as the input seeds for initiating the training. In an embodiment, the received sentence samples may form an IND dataset. The OODDS 106 is further configured to select at least two initial tokens from each sentence sample and then use the at least two initial tokens as the input seed. Further, the OODDS 106 generates a new sentence corresponding to each sentence sample of the one or more sentence samples based on the input seed selected for the respective sample using the PPLM. In a more illustrative manner, the OODDS 106 facilitates generation of new sentences that are directed towards words present in an OOD dataset. In one embodiment, the OOD dataset is created using words from a plurality of domains, such as a science domain, religion domain, computer domain, politics domain, sports domain, etc. Basically, the words included in the OOD dataset are used by the OODDS 106 to control generation of the new sentences using the PPLM.
Thereafter, the OODDS 106 is configured to perform filtering of the new sentences based on the IND dataset using embedding obtained from a Bidirectional Encoder Representations from Transformers (BERT) model to generate a PPLM OOD dataset. Once the PPLM OOD dataset is available, the OODDS 106 uses the PPLM OOD dataset along with the IND dataset to train a classifier using a predefined loss function.
In an embodiment, once the classifier is trained, the classifier may provide a classifier output for each sample sentence collected from the client devices 102a and 102b. The classifier output may then be used by the OODDS 106 to differentiate between an IND sample and an OOD sample that will further improve the text classification.
The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. Additionally, or alternatively, a set of systems (e.g., one or more systems) or a set of devices (e.g., one or more devices) of the environment 100 may perform one or more functions described as being performed by another set of systems or another set of devices of the environment 100 (e.g., refer scenarios described above).
FIG. 2 illustrates an exemplary block diagram of an out-of-distribution detection system (OODDS) 200 for out-of-distribution detection for text classification using a plug and play language model (PPLM), in accordance with an embodiment of the present disclosure. In an embodiment, the out-of-distribution detection system (OODDS) may also be referred to as the system, and the terms may be used interchangeably herein. The system 200 is similar to the OODDS 106 explained with reference to FIG. 1. In some embodiments, the system 200 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture. In some embodiments, the system 200 may be implemented in a server system. In some embodiments, the system 200 may be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, and the like.
In an embodiment, the system 200 includes one or more processors 204, communication interface device(s) or input/output (I/O) interface(s) 206, and one or more data storage devices or memory 202 operatively coupled to the one or more processors 204. The one or more processors 204 may be one or more software processing modules and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory.
The I/O interface device(s) 206 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 202 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 208 can be stored in the memory 202, wherein the database 208 may comprise, but is not limited to, inputs received from one or more client devices (e.g., client nodes, target nodes, computing devices, and the like), such as samples and percentiles. In an embodiment, the memory 202 may store information pertaining to training samples, the plug and play language modeling technique, token selection criteria, and the like. The memory 202 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 202 and can be utilized in further processing and analysis.
FIG. 3, with reference to FIGS. 1-2, illustrates a schematic block diagram representation 300 of an out-of-distribution detection process associated with the system 200 of FIG. 2 or the OODDS 106 of FIG. 1 for out-of-distribution detection for text classification, in accordance with an embodiment of the present disclosure.
FIG. 4, with reference to FIGS. 1-3, illustrates an exemplary flow diagram of a method 400 for out-of-distribution detection for text classification using the OODDS 106 of FIG. 1 and the system 200 of FIG. 2, in accordance with an embodiment of the present disclosure. In an embodiment, the system 200 comprises one or more data storage devices or the memory 202 operatively coupled to the one or more hardware processors 204 and is configured to store instructions for execution of steps of the method 400 by the one or more hardware processors 204. The steps of the method 400 of the present disclosure will now be explained with reference to the components of the OODDS 106 as depicted in FIG. 1, the system 200 of FIGS. 2-3, and the flow diagram.
In an embodiment of the present disclosure, at step 402, the one or more hardware processors 204 of the system 200 receive one or more sentence samples from one or more client devices, such as the client devices 102a and 102b. The received one or more sentence samples form an in-domain (IND) dataset i.e., a set of IND data. The set of IND data may also be represented by:
$$D_{ind} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\},$$
where $x_1$ to $x_n$ represent samples drawn from the different domains that form part of the IND data, and $y_1$ to $y_n$ represent their corresponding labels.
At step 404 of the present disclosure, the one or more hardware processors 204 of the system 200 select at least two initial tokens, i.e., words, from each sentence sample of the one or more sentence samples as an input seed for the respective sample. In particular, two initial words from each sentence present in the D_ind dataset are selected. For example, an original sentence sample that is received from a client device (e.g., the client device 102a) is 'This article includes answers what options have for software Intel-based Unix systems'. In an embodiment of the present disclosure, the one or more hardware processors 204 may select 'This article' as the input seed for the received sentence. It should be noted that the OOD dataset (also referred to as the D_ood dataset) corresponding to the D_ind dataset is created (e.g., say manually by an administrator of the system 200). In an embodiment, the D_ood dataset is created using words from a plurality of domains, such as a science domain, religion domain, computer domain, etc. It is to be understood by a person having ordinary skill in the art or a person skilled in the art that the above exemplary domains shall not be construed as limiting the scope of the present disclosure. In an embodiment, the words included in the D_ood dataset may be used to control sentence generation, which is explained in detail with reference to step 406. The D_ood dataset for the original sentence can include words such as 'astronomy', 'atom', 'biology', 'cell', 'chemical', 'chemistry', 'earth', 'climate', etc.
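As a minimal illustration of this seed selection (simple whitespace tokenization is assumed here; the disclosure does not prescribe a tokenizer):

```python
def select_input_seed(sentence: str, num_tokens: int = 2) -> str:
    """Take the first few whitespace-separated words of an IND sentence
    as the input seed for PPLM generation."""
    return " ".join(sentence.split()[:num_tokens])

sample = ("This article includes answers what options have for "
          "software Intel-based Unix systems")
seed = select_input_seed(sample)   # -> 'This article'
```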
In an embodiment of the present disclosure, at step 406, the one or more hardware processors 204 of the system 200 generate one or more new sentences for the received one or more sentence samples. Each new sentence of the one or more new sentences is generated corresponding to each sentence sample of the one or more sentence samples based on the input seed selected for the respective sample using a plug and play language model (PPLM). Each new sentence that is generated is directed towards one or more words present in the D_ood dataset. In particular, the hardware processors 204 generate a new sentence corresponding to each sentence sample. The generated new sentence is directed towards words included in the D_ood dataset based on the input seed selected for the respective sentence sample using the PPLM. For example, a new sentence that is generated for the original sentence based on the input seed and the D_ood dataset can be 'This article explores a recent study on a large scale of global climate change, which finds no direct evidence of the Earth's climate warming'. As can be seen, the new sentence starts with the input seed, i.e., 'This article', and includes words, such as 'climate' and 'earth', that are present in the D_ood dataset. The one or more new sentences generated through the PPLM may form the D_ood sentences.
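PPLM itself steers generation through gradient-based updates to the language model's latent activations; a faithful implementation is beyond the scope of a short sketch. The following hedged simplification instead boosts the logits of D_ood bag-of-words tokens at each greedy decoding step, which only approximates the steering effect; the 'gpt2' checkpoint, the word list, and the `boost` value are assumptions, not details from the disclosure:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Hypothetical D_ood bag of words (science domain) used to steer generation.
ood_words = ["astronomy", "atom", "biology", "cell", "chemical", "climate", "earth"]
ood_ids = sorted({i for w in ood_words for i in tokenizer.encode(" " + w)})

@torch.no_grad()
def generate_steered(seed: str, steps: int = 25, boost: float = 4.0) -> str:
    """Greedy decoding with the logits of D_ood words boosted at every step."""
    ids = tokenizer.encode(seed, return_tensors="pt")
    for _ in range(steps):
        logits = model(ids).logits[0, -1]   # next-token distribution
        logits[ood_ids] += boost            # nudge decoding towards D_ood words
        next_id = logits.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0])

print(generate_steered("This article"))
```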
At step 408 of the present disclosure, the hardware processors 204 of the system 200 filter the one or more new sentences based on the IND dataset, i.e., the D_ind dataset, using embeddings obtained from a Bidirectional Encoder Representations from Transformers (BERT) model to generate a PPLM OOD dataset (also referred to as the PPLM D_ood dataset). The method of filtering is explained in detail with reference to FIG. 5.
In an embodiment of the present disclosure, at step 410, the one or more hardware processors 204 of the system 200 train a classifier based, at least in part, on the PPLM OOD dataset and the IND dataset using a pre-defined loss function to obtain a trained classifier. Basically, the classifier is trained based on the PPLM D_ood dataset and the D_ind dataset using the loss function represented by:
$$L = \min_{\theta}\; \mathbb{E}_{(x, y) \sim d_{ind}^{train}}\left[L_{CE}\left(y_{in}, f_{\theta}(x)\right)\right] + \alpha \cdot \mathbb{E}_{x_{OOD}^{PPLM} \sim d_{out}^{ood}}\left[L_{E}\left(f_{\theta}\left(x_{OOD}^{PPLM}\right), U(y)\right)\right],$$
where $L_{CE}$ represents the cross-entropy loss,
$f_{\theta}(x)$ represents the SoftMax output prediction for an input sample $x$,
$x_{OOD}^{PPLM}$ represents the $D_{ood}$ sentences generated using the PPLM,
$d_{out}^{ood}$ represents the PPLM $D_{ood}$ dataset obtained at step 408,
$L_{E}$ represents the entropy regularization that tries to reduce the loss between the output probability vector of an OOD sample and a uniform distribution, and
$U(y)$ represents the uniform distribution over the class labels.
The loss function can also be represented as:
$$\mathrm{Loss} = L_{\text{cross entropy}} + \alpha \, L_{\text{entropy regularization}}$$
As can be seen, the entropy regularization is employed on the PPLM D_ood dataset for training the classifier, as it is assumed that when the samples of the PPLM D_ood dataset are forced to have the highest entropy, the OOD samples move closer to a uniform distribution. In the present disclosure, the system 200 concentrates around the class boundary of the D_ind dataset using the PPLM D_ood dataset, which leads to a drastic improvement in the performance of the classifier, as the classifier is trained on a reduced dataset instead of the OOD space, which is huge. Further, once the classifier is trained, the classifier may provide a classifier output for each sample collected from the one or more client devices. Thereafter, the OODDS 200 may use the classifier output obtained for an input sample sentence to differentiate between an IND sample sentence and an OOD sample sentence. In an embodiment, when the probability vector at the classifier output is found to be low on all elements of a prediction, the received sample may be classified as an OOD sample.
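A minimal PyTorch sketch of this training objective, assuming the classifier's logits are already available for an IND batch and a PPLM D_ood batch (the tensor names and the value of `alpha`, standing in for $\alpha$, are placeholders):

```python
import torch
import torch.nn.functional as F

def ood_training_loss(logits_ind: torch.Tensor, labels_ind: torch.Tensor,
                      logits_ood: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on IND samples plus entropy regularization that pulls
    predictions on the PPLM D_ood samples towards the uniform distribution."""
    l_ce = F.cross_entropy(logits_ind, labels_ind)
    log_p_ood = F.log_softmax(logits_ood, dim=-1)
    # Cross-entropy against the uniform distribution U(y) = 1/k over k classes.
    l_e = -log_p_ood.mean(dim=-1).mean()
    return l_ce + alpha * l_e
```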
In an embodiment, the one or more hardware processors 204 of the system 200 perform a post-hoc Dirichlet calibration on the classifier output using a regularization technique, viz., an Off-Diagonal regularization (ODIR) technique known in the art. The Dirichlet calibration can be considered as a log-transformation that aligns the uncalibrated probabilities with the accuracies produced by the above-mentioned classifier. For instance, the ODIR is defined as follows:
$$\mathrm{ODIR} = \frac{1}{k \times (k-1)} \sum_{i \neq j} w_{i,j}^{2},$$
where $k = 2$, $i = 1$ to $10$, and $j = 1$ to $k$.
The application of the Dirichlet calibration in the present disclosure prevents the classifier from making overconfident decisions on OOD samples and may also improve the compatibility between the predicted probabilities and the accuracies produced by the classifier, thereby improving the detection accuracy of the OOD samples.
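As an illustrative sketch only (the linear-map-on-log-probabilities form and the fitting loss in the trailing comment are assumptions based on standard Dirichlet calibration, not details given in the disclosure):

```python
import torch
import torch.nn as nn

class DirichletCalibrator(nn.Module):
    """Post-hoc Dirichlet calibration: a linear map on log-probabilities,
    regularized by an ODIR penalty on the off-diagonal weights."""
    def __init__(self, k: int):
        super().__init__()
        self.linear = nn.Linear(k, k)

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        # Returns calibrated logits; apply softmax for calibrated probabilities.
        return self.linear(torch.log(probs.clamp_min(1e-12)))

    def odir_penalty(self) -> torch.Tensor:
        """ODIR = (1 / (k * (k - 1))) * sum_{i != j} w_{i,j}^2."""
        w = self.linear.weight
        k = w.size(0)
        off_diag = w - torch.diag(torch.diag(w))
        return off_diag.pow(2).sum() / (k * (k - 1))

# Hypothetical fitting step on held-out (probs, labels) pairs:
# loss = F.cross_entropy(calib(probs), labels) + lam * calib.odir_penalty()
```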
FIG. 5, with reference to FIGS. 1 through 4, illustrates an exemplary flow diagram of a method 500 for filtering sentences using the BERT model, in accordance with an embodiment of the present disclosure. The steps of the method 500 of the present disclosure will now be explained with reference to the components of the OODDS 106 of FIG. 1 and the system 200 of FIG. 2, and the flow diagram of FIG. 5.
In an embodiment of the present disclosure, at step 502, the one or more hardware processors 204 of the system 200 obtain an embedding for each sentence sample of the one or more sentence samples and for each new sentence of the one or more new sentences using the BERT model. The embeddings obtained for the one or more sentence samples are referred to as IND embeddings. Similarly, the embeddings obtained for each new sentence of the one or more new sentences are referred to as OOD embeddings. In particular, the hardware processors 204 obtain embeddings for each sample in the D_ind dataset and the D_ood sentences using the BERT model. The embeddings obtained for the samples in the D_ind dataset may interchangeably be referred to as the D_ind embeddings/cluster. Similarly, the embeddings obtained for the samples in the D_ood sentences may interchangeably be referred to as the D_ood embeddings/cluster. In an embodiment, the IND embeddings, i.e., the D_ind embeddings, include a plurality of points.
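A minimal sketch of this embedding step using the Hugging Face transformers library; the 'bert-base-uncased' checkpoint and mean pooling over token states are assumptions, since the disclosure does not specify a checkpoint or pooling strategy:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(sentences):
    """Return one embedding per sentence (mean-pooled last hidden states)."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = bert(**enc).last_hidden_state           # (batch, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (out * mask).sum(1) / mask.sum(1)      # (batch, hidden)
```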
At step 504 of the present disclosure, the one or more hardware processors 204 of the system 200 determine a cluster center C_ind for IND embeddings i.e., D_ind embeddings.
At step 506 of the present disclosure, the one or more hardware processors 204 of the system 200 determine a farthest point among the plurality of points of the IND embeddings i.e., D_ind embeddings.
At step 508 of the present disclosure, the one or more hardware processors 204 of the system 200 calculate a distance between the cluster center and the farthest point. Basically, a distance ‘d’ between the cluster center C_ind and the farthest point in the D_ind cluster is calculated at this step.
At step 510 of the present disclosure, the one or more hardware processors 204 of the system 200 compute a predefined percentile of the calculated distance. Basically, to set a range for selection of the outliers, the predefined percentile of the distance 'd' is computed. In an embodiment, the predefined percentile can be the 95th percentile. The computed predefined percentile may be referred to as 'p' and further be used to remove the outliers in the D_ood cluster.
At step 512 of the present disclosure, the one or more hardware processors 204 of the system 200 calculate a Euclidean distance between the cluster center C_ind and each embedding of the OOD embeddings, i.e., the D_ood embeddings.
At step 514 of the present disclosure, the one or more hardware processors 204 of the system 200 compare the Euclidean distance calculated for each embedding in the D_ood embeddings with the computed predefined percentile.
At step 516 of the present disclosure, the one or more hardware processors 204 of the system 200 select at least one embedding from the OOD embeddings based on the comparison. In particular, the one or more hardware processors 204 select at least one embedding from the D_ood embeddings based on a comparison of (i) the corresponding Euclidean distance with (ii) the computed predefined percentile. For instance, in the present disclosure, the D_ood embeddings are selected whose Euclidean distance is found to be greater than 'p' and less than 'p + 10', i.e., $p < d_{euclidean} < p + 10$.
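Putting steps 502 through 516 together, a minimal NumPy sketch of the filter might look as follows; taking the mean as the cluster center and reading the 'predefined percentile of the calculated distance' as 95% of the farthest-point distance are both assumptions, and the margin of 10 follows the 'p + 10' bound above:

```python
import numpy as np

def filter_pplm_sentences(ind_emb: np.ndarray, ood_emb: np.ndarray,
                          percentile: float = 95.0, margin: float = 10.0):
    """Return indices of PPLM-generated sentences whose embeddings lie just
    outside the IND cluster, i.e., with p < Euclidean distance < p + margin."""
    c_ind = ind_emb.mean(axis=0)                     # cluster center C_ind
    dists = np.linalg.norm(ind_emb - c_ind, axis=1)  # distances to center
    d_far = dists.max()                              # farthest IND point
    p = (percentile / 100.0) * d_far                 # 95% of 'd' (assumed)
    d_ood = np.linalg.norm(ood_emb - c_ind, axis=1)  # Euclidean distances
    keep = (d_ood > p) & (d_ood < p + margin)        # p < d < p + 10
    return np.where(keep)[0]
```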