Specification
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
SYSTEM AND METHOD FOR CALIBRATION OF SEQUENCE PREDICTION MODELS
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The disclosure herein generally relates to sequence prediction models, and, more particularly, to a system and method for calibration of sequence prediction models.
BACKGROUND
[002] Deep neural networks have yielded remarkable accuracy across a range of learning tasks, including classification, regression, and, more recently, structured prediction tasks. However, deep models suffer from a tendency to become overconfident or under-confident in their predictions. Calibration of these models is crucial for building trust in AI systems and for determining which samples need to be routed for manual inspection in a deployment pipeline. While calibration is a well-studied problem for deep classification models, and many techniques have been proposed for it, calibration for structured prediction problems is still a challenging task.
[003] In calibration of structured prediction models, calibrated confidence scores help in determining when the predictions are more likely to be correct and hence can be trusted. Previous approaches have shown that using Temperature Scaling (TS) to scale the confidence scores of the logits helps in reducing the miscalibration of sequence-to-sequence (seq2seq) prediction models. However, it has been observed that further improvement in the calibration of these models is possible, since the existing TS does not make use of the calibration error produced by the original miscalibrated model.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor-implemented method for calibration of a sequence prediction model is provided. The method includes determining, via one or more hardware processors, a calibration error generated for each token of a plurality of tokens associated with each sequence of a plurality of sequences of a sequence prediction model. The sequence prediction model is pretrained for prediction of the plurality of sequences and pre-calibrated using a loss function. Further, the method includes recalibrating, via the one or more hardware processors, the sequence prediction model using an error-induced loss function to obtain a recalibrated sequence prediction model. Recalibrating the sequence prediction model includes learning a temperature scaling parameter for each token of the sequence prediction model by optimizing the error-induced loss function comprising an objective function comprising the loss function and the calibration error associated with each token, and applying a corresponding learned temperature scaling parameter to each token of the plurality of tokens, resulting in calibrated probabilities for each token of the plurality of tokens and recalibration of the sequence prediction model.
[005] In another aspect, a system for calibration of a sequence prediction model is provided. The system includes a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to determine a calibration error generated for each token of a plurality of tokens associated with each sequence of a plurality of sequences of a sequence prediction model, the sequence prediction model pretrained for prediction of the plurality of sequences and pre-calibrated using a loss function. Further, the one or more hardware processors are configured by the instructions to recalibrate the sequence prediction model using an error-induced loss function to obtain a recalibrated sequence prediction model. To recalibrate the sequence prediction model, the one or more hardware processors are configured by the instructions to learn a temperature scaling parameter for each token of the sequence prediction model by optimizing the error-induced loss function comprising an objective function comprising the loss function and the calibration error associated with each token; and apply a corresponding learned temperature scaling parameter to each token of the plurality of tokens resulting in
calibrated probabilities for each token of the plurality of tokens and recalibration of the sequence prediction model.
[006] In yet another aspect, a non-transitory computer readable medium storing instructions for a method for calibration of a sequence prediction model is provided. The method includes determining, via one or more hardware processors, a calibration error generated for each token of a plurality of tokens associated with each sequence of a plurality of sequences of a sequence prediction model. The sequence prediction model is pretrained for prediction of the plurality of sequences and pre-calibrated using a loss function. Further, the method includes recalibrating, via the one or more hardware processors, the sequence prediction model using an error-induced loss function to obtain a recalibrated sequence prediction model. Recalibrating the sequence prediction model includes learning a temperature scaling parameter for each token of the sequence prediction model by optimizing the error-induced loss function comprising an objective function comprising the loss function and the calibration error associated with each token, and applying a corresponding learned temperature scaling parameter to each token of the plurality of tokens, resulting in calibrated probabilities for each token of the plurality of tokens and recalibration of the sequence prediction model.
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The accompanying drawings, which are incorporated in and
constitute a part of this disclosure, illustrate exemplary embodiments and, together
with the description, serve to explain the disclosed principles:
[009] FIG. 1 illustrates a block diagram of a system for calibration of
sequence prediction models according to some embodiments of the present
disclosure.
[010] FIG. 2 is a flow chart of a method for calibration of sequence
prediction models according to some embodiments of the present disclosure.
[011] FIG. 3 illustrates the trend of the loss functions, i.e., NLL and NLL + log(ECE), on the En-Fr NMT dataset, in accordance with an embodiment of the present disclosure.
[012] FIGS. 4A-4C illustrate reliability diagrams for En-Fr NMT dataset representing the calibration errors in various settings.
DETAILED DESCRIPTION OF EMBODIMENTS
[013] Calibration refers to the extent to which the estimated prediction probability from a model reflects the true underlying probability. Calibration is critical in live deployment of a model since it ensures interpretable probabilities, and can play a significant role in reducing uncertainty and prediction bias. Good confidence estimates can thus help establish trustworthiness for the user, especially for decisions of neural networks that can be difficult to interpret. Calibration can also help evaluate the robustness of a neural network to adversarial perturbations. For example, a network may produce a correct prediction for a given image, but when the image is perturbed, its verdict may change entirely.
[014] A well calibrated model is crucial for ensuring robust deployment of the model in an artificial intelligence (AI) system and routing hard samples for human inspection. For example, if a model M makes 10000 predictions with probability values of the predictions around 0.85 (called the confidence of prediction), then, in a well calibrated scenario, it is expected that 85% of these predictions will be correct. If the model M makes predictions with much higher probability, then it is overconfident about its predictions, and if it makes the predictions with lower probability, then it is under-confident in its predictions. Thus, for a perfectly calibrated model M, accuracy(M) should be equal to confidence(M).
[015] Most of the earlier methods for calibration are largely focused on calibration of classification models, for both binary as well as multiclass classification. Nearly all the approaches introduced so far are post-processing methods that are applied on a pre-trained network to calibrate it, such as Platt Scaling, Isotonic Regression, Histogram Binning, Bayesian Binning into Quantiles, and the most widely used Temperature Scaling method.
[016] For binary classification models, the calibration approaches introduced so far include (1) Histogram binning, which uses a binning-based strategy where all uncalibrated probabilities are divided into bins and the average number of positive-class samples in each bin is taken as the calibrated probability, (2) Isotonic regression, which generalizes the Histogram binning approach by jointly optimizing bin boundaries and bin predictions, and (3) Platt scaling, which is a parametric approach that uses logistic regression for probability calibration. For multiclass classification models, Matrix scaling, Vector scaling, and Temperature scaling are all generalizations of Platt scaling, where Temperature scaling is empirically shown to be the simplest yet effective technique. Attended Temperature Scaling improves temperature scaling by addressing the calibration challenge on small validation datasets, noisy-labeled samples, and both highly accurate and low-accuracy neural networks.
[017] However, very few prior works have focused on calibration of structured prediction models. Recent efforts at confidence calibration of these models have examined the miscalibration of several commonly used models. A co-reference sampling algorithm was proposed to improve the calibration of NLP models. Also, a focal loss was introduced with the claim that it allows learning well-calibrated models as opposed to the standard cross-entropy loss. Another known study focused on confidence modeling for neural semantic parsers, which are built upon seq2seq models. In the field of calibration of Neural Machine Translation (NMT) models in particular, a known work focused on finding the reasons behind the miscalibration of NMT models. Another method showed how an NMT model can be calibrated during inference, while yet another method assessed how the uncertainty in the data is captured by the model distribution and how it affects model predictions.
[018] With the increasing demand for more interpretable and trustworthy outputs from neural networks, extending calibration to other modern day networks like Graph Neural Networks (GNN), Convolutional Neural Networks (CNN), and structured prediction models is becoming increasingly important and would greatly aid the use of these models in industrial applications.
[019] Some of the conventional works present the calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. For example, one of the known studies addresses the problem of calibrating prediction confidence for output entities of interest in natural language processing (NLP) applications. Another study has illustrated the miscalibration in NMT tasks. However, none of the prior approaches directly use the calibration error to improve the calibration of the model.
[020] The task of sequence prediction models is to predict an output sequence y = {y1, y2, …, ym} ∈ Y for a given input sequence x = {x1, x2, …, xn} ∈ X. Let P(y|x) be the probability distribution when the model predicts sequence y for sequence x. For each token xi with actual output token yi, the predicted token is
yi′ = argmax P(y | xi), where y ∈ Y.
[021] Additionally, for each prediction yi′, the accuracy is 1 if yi′ = yi and 0 otherwise, with confidence score P(yi′ | y′<i, xi), where y′<i denotes the tokens predicted before position i. In Temperature Scaling, a single scalar temperature parameter T > 1 is optimized with respect to NLL on the validation set and applied to the logits vector (qi) according to the equation:
Ri = maxj σSM(qi / T)(j)          (2)
where σSM denotes the softmax function and the maximum is taken over the classes j.
[033] The higher the value of T, the 'softer' the distribution will be, i.e., the model will be less confident about its prediction. If T = 1, the model retrieves the original probability.
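By way of a non-limiting illustration only, the following sketch shows temperature scaling as in equation (2): the logits are divided by a scalar temperature T before the softmax, so that T = 1 recovers the original probabilities and a larger T yields a softer distribution. The function and variable names are assumptions for this example and are not part of the disclosure.

```python
# Illustrative sketch only of temperature scaling as in equation (2).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Return calibrated probabilities softmax(logits / T)."""
    return softmax(np.asarray(logits, dtype=float) / T)

logits = np.array([2.0, 1.0, 0.1])
print(temperature_scale(logits, T=1.0))   # original (uncalibrated) probabilities
print(temperature_scale(logits, T=2.0))   # softer, less confident distribution
```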
[034] Expected Calibration Error (ECE): ECE is a standard metric which is used to determine the gap between the expected value of predictions and the actual observed frequency of positive instances. It follows a partitioning approach in which the whole confidence space of the test samples, i.e., the interval [0, 1], is divided into a fixed number n of bins Bk, k = 1, …, n. The estimated probability for each instance lies in one of the bins. Mathematically, the calibration error over the bins is defined as follows:
ECE = Σk=1..n (|Bk| / N) |accuracy(Bk) − confidence(Bk)|          (3)
where |Bk| is the number of samples falling in bin Bk and N is the total number of samples.
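By way of a non-limiting illustration only, the expected calibration error of equation (3) may be computed as sketched below, by partitioning the predicted confidences into fixed-size bins and weighting each bin's accuracy-confidence gap by the fraction of samples it holds. The function and variable names are assumptions for this example.

```python
# Illustrative sketch only of the Expected Calibration Error in equation (3).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=20):
    """confidences: predicted probabilities in [0, 1];
    correct: 1 if the corresponding prediction was right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()          # accuracy(B_k)
            conf = confidences[in_bin].mean()     # confidence(B_k)
            ece += in_bin.mean() * abs(acc - conf)   # weight |B_k| / N
    return ece
```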
[035] Referring now to the drawings, and more particularly to FIG. 1 through 4C, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[036] FIG. 1 illustrates a block diagram of a system 100 for calibration of sequence prediction models, according to some embodiments of the present disclosure. The disclosed system 100 is configured to calibrate miscalibrated sequence prediction models.
[037] For calibration in a multi-class or binary-class setting, a function Fy is used to map the output pθ(y|x) to a calibrated confidence score for all y ∈ Y. In the structured prediction setting of sequence prediction models, the cardinality of Y is usually large, so a token-wise calibration approach is adopted for these models. In essence, token-wise calibration deals with calibrating each token of the sequence simultaneously, which ultimately results in a full model calibration. The Temperature Scaling technique is used for calibration in multi-class/binary-class settings. However, applying TS directly at the token level does not calibrate each token in the sequence properly. In Temperature Scaling for binary/multi-class classification, learning the temperature parameter (T) to obtain calibrated probability estimates involves minimizing the loss function, i.e., the NLL, because the lower the value of the loss function, the better the calculated T, thereby producing better calibrated probabilities. However, in the case of token-wise calibration, modifying the loss function used in Temperature Scaling produces further improved calibrated probabilities and thus increases the trust in the model.
[038] In accordance with various embodiments disclosed herein, the system 100 facilitates in post processing recalibration of sequence prediction models by using an Error Aware Temperature Scaling (EATS). The EATS technique involves learning of temperature parameter for each token by optimizing an objective function that consists of both the NLL as well as the calibration error generated for each token.
[039] The system 100 includes or is otherwise in communication with one or more hardware processors such as a processor 102, at least one memory such as a memory 104, and an I/O interface 106. The processor 102, memory 104, and the I/O interface 106 may be coupled by a system bus such as a system bus 108 or a similar mechanism. The I/O interface 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The interfaces 106 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a camera device, and a printer. Further, the interfaces 106 may enable the system 100 to communicate with other devices, such as web servers and external databases. The interfaces 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the interfaces 106 may include one or more ports for connecting a number of computing systems with one another or to another server computer. The I/O interface 106 may include one or more ports for connecting a number of devices to one another or to another server.
[040] The hardware processor 102 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the hardware processor 102 is configured to fetch and execute computer-readable instructions stored in the memory 104.
[041] The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 104 includes a plurality of modules 120 and a repository 140 for storing data processed, received, and generated by one or more of the modules 120. The modules 120 may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
[042] The repository 140, amongst other things, includes a system database 142 and other data 144. The other data 144 may include data generated as a result of the execution of the one or more modules 120.
[043] FIG. 2 illustrates a flowchart for a method 200 for calibration of sequence prediction models in accordance with an example embodiment of present disclosure. The method 200 depicted in the flow chart may be executed by a system,
for example, the system 100 of FIG. 1. In an example embodiment, the system 100 may be embodied in a computing device.
[044] Operations of the flowchart, and combinations of operation in the flowchart, may be implemented by various means, such as hardware, firmware, processor, circuitry and/or other device associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described in various embodiments may be embodied by computer program instructions. In an example embodiment, the computer program instructions, which embody the procedures, described in various embodiments may be stored by at least one memory device of a system and executed by at least one processor in the system. Any such computer program instructions may be loaded onto a computer or other programmable system (for example, hardware) to produce a machine, such that the resulting computer or other programmable system embody means for implementing the operations specified in the flowchart. It will be noted herein that the operations of the method 200 are described with help of system 100. However, the operations of the method 200 can be described and/or practiced by using any other system.
[045] The sequence models, for instance a sequence-to-sequence (seq2seq) prediction model, include a plurality of sequences. Each of the sequences may have a plurality of tokens. Herein, it is assumed that the sequence prediction model is pretrained. As previously described, a gap between the expected value of predictions and the actual observed frequency of positive instances may be defined in terms of calibration error, also referred to as expected calibration error (ECE). The sequence prediction model is calibrated using a loss function, for example, a negative log likelihood (NLL).
[046] At 202 of the method 200, a calibration error generated for each token of the plurality of tokens associated with each sequence of the plurality of sequences may be determined. The trained sequence prediction model is recalibrated, via the one or more hardware processors 102, by using an error-induced loss function to obtain a recalibrated sequence prediction model at 204 of
method 200. The method for recalibrating the trained sequence prediction model is described further with reference to 206-208.
[047] In accordance with various embodiments of the present disclosure, the system 100 uses an error aware temperature scaling (EATS) as post processing calibration approach for the trained sequence prediction models. EATS uses an error-induced loss function which involves minimizing the objective function.
[048] The objective function is represented as below:
min(NLL + log(ECEi)) ∀ i ∈ 1,…, n
where, negative log likelihood (NLL) is the loss function,
expected calibration error (ECE) is the calibration error associated with each
token, and
n is the maximum number of tokens in any sequence of the plurality of
sequences.
[049] At 206, the method 200 includes learning a temperature scaling parameter for each token of the sequence prediction model by optimizing the error-induced loss function that includes the objective function. Learning the temperature parameter for each token using the proposed approach results in a drop in the miscalibration of models, since the new loss function optimizes more steadily when compared to the loss function of TS, which contains the NLL only, as shown in FIG. 3. ECE cannot be used directly in the new loss function because it is discontinuous at bin boundaries and hence is not differentiable. To circumvent this issue, the present embodiment uses the logarithm of the calibration error, which gives a differentiable objective function. The NLL consists of a negative logarithm, which approaches infinity as its input approaches 0 and reaches 0 when its input is 1. Therefore, the function NLL + log(ECEi) produces a higher loss at smaller confidence values and a lower loss at larger confidence values.
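By way of a non-limiting illustration of step 206 only, a per-token temperature may, for example, be selected by a simple search over candidate temperatures that minimizes NLL + log(ECEi) on validation data for that token position. The grid search, the helper functions, and all names below are assumptions made for the illustration and are not mandated by the disclosure, which does not prescribe a particular optimizer.

```python
# Illustrative sketch only: learn a per-token temperature T_i by minimizing
# NLL + log(ECE_i) over a grid of candidate temperatures on validation data.
import numpy as np

def _softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def _ece(conf, correct, n_bins=20):
    # Expected Calibration Error with fixed-size confidence bins (equation (3)).
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def learn_token_temperature(logits, labels, candidates=np.linspace(1.0, 5.0, 41)):
    """logits: (num_val_samples, vocab_size) validation logits at one token
    position; labels: (num_val_samples,) ground-truth token ids at that position."""
    labels = np.asarray(labels)
    best_T, best_loss = 1.0, np.inf
    for T in candidates:
        probs = _softmax(np.asarray(logits, dtype=float) / T)
        nll = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
        conf = probs.max(axis=-1)
        correct = (probs.argmax(axis=-1) == labels).astype(float)
        loss = nll + np.log(_ece(conf, correct) + 1e-12)   # error-induced loss
        if loss < best_loss:
            best_T, best_loss = T, loss
    return best_T
```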
[050] At 208, the method 200 includes applying a corresponding learned temperature scaling parameter to each token of the plurality of tokens, resulting in calibrated probabilities for each token of the plurality of tokens. Particularly, during testing, for each token ti, the corresponding Ti is applied to scale the logits that are inputs to the Softmax, thereby resulting in a calibrated probability for each token.
These calibrated probabilities per token are further used to evaluate the calibration of the entire sequence. An example experimental scenario for calibration of sequence prediction model using error-aware temperature scaling is described further in the description below.
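By way of a non-limiting illustration of step 208 only, the learned per-token temperatures can be applied to the logits at the corresponding token positions before the softmax, as sketched below. The function and variable names are assumptions for the illustration.

```python
# Illustrative sketch only: at test time, apply the learned per-token
# temperature T_i to the logits q_i of token position i before the softmax,
# yielding calibrated probabilities per token.
import numpy as np

def _softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def calibrated_token_probs(token_logits, temperatures):
    """token_logits: list of per-position logit vectors q_i (shape (vocab,));
    temperatures: list of learned temperatures T_i, one per token position."""
    return [_softmax(np.asarray(q_i, dtype=float) / T_i)
            for q_i, T_i in zip(token_logits, temperatures)]
```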
Experimental scenario:
[051] In an experimental scenario, the disclosed system 100 was evaluated on various parameters using a set of five known NMT datasets. First, an experiment was performed to compare token-wise calibration using the disclosed system (comprising EATS) against calibration using TS. Secondly, the proposed token-wise calibration approach was compared against an entire-sequence calibration approach using a single temperature parameter. Finally, the proposed method was differentiated from an existing work on calibration of seq2seq prediction models by a known model.
[052] The datasets considered for the experimental scenario included:
[053] 1. IWSLT'15 English-Vietnamese (En-Vi): The state-of-the-art publicly available pre-trained Tensorflow NMT model on the WMT+IWSLT benchmark, i.e., the En-Vi NMT model, consists of multi-layered LSTMs arranged in an attention-based NMT architecture. For experimentation purposes, a setup was used comprising 133K training sentences, while tst2012 with 1553 sentences and tst2013 with 1268 sentences were used as validation and test data, respectively. A 2-layer LSTM network of 512 units with a bidirectional encoder was trained. The resultant embedding dimension is 512. Luong Attention (scale=True) was used together with a dropout keep probability of 0.8 and SGD with a learning rate of 1.0.
[054] 2. WMT German-English (De-En) and WMT English-German (En-De)
These use other publicly available state-of-the-art pre-trained Tensorflow NMT or GNMT models on the WMT+IWSLT benchmark. These also follow the attention-based NMT architecture or the GNMT architecture consisting of multi-layered LSTMs. For the De-En dataset, an experimental setup was deployed in which 4.5M sentences were used for training, whereas the validation and test sets comprise 3000 (newstest2013) and 2169 (newstest2015) sentences, respectively, and likewise for the En-De dataset. A 4-layer LSTM network of 1024 units was trained with a bidirectional encoder, and the embedding dimension was 1024.
[055] 3. WMT'14 English-French (En-Fr): This NMT dataset uses the vanilla NMT model, comprising 18K sentences for training, and the test set consists of 2000 sentences. A validation set was built by removing 2000 random sentence-pairs from the training data. The embedding dimension used for this experiment was 100.
[056] 4. Integer sequences: This dataset uses an encoder-decoder LSTM-based model. It consists of 60K training sequences where the source sequence is a series of randomly generated integer values, such as [20, 36, 40, 10, 34, 28], and the target sequence is a reversed pre-defined subset of the input sequence, such as the first 3 elements in reverse order [40, 36, 20]. The test and validation sets consist of 20,000 sequences each. To measure calibration error for each token, ECE with 20 bins, each of size 0.05, was used. A batch size of 128 was used. Other details about the optimizer and hyper-parameters are kept unchanged in the base models from the downloaded source. The full model ECE was obtained from the ECE values of each token evaluated from the proposed approach as follows:
ECE = ece1 + ece2 + … + ecen
where ecei is the ECE for token i, i = 1, …, n, and n is the number of tokens.
[057] The calibration errors produced by the token-wise recalibration method in sequence prediction tasks were compared in both cases, i.e., when using NLL as the loss function as in TS, and when using the error-induced loss function as in EATS. As shown in Table 1, for all the datasets, the ECE value is lower when using the proposed method. Moreover, for each of the datasets, the token-level ECE is also lower than the baseline ECE, thus resulting in a better calibrated model. FIGS. 4A-4C show reliability plots that give a visual representation of the calibration error obtained using different calibration strategies. Particularly, FIGS. 4A-4C illustrate reliability diagrams for the En-Fr NMT dataset representing the calibration errors in various settings: FIG. 4A when no calibration is applied, FIG. 4B when using Temperature Scaling (TS), and FIG. 4C when using Error-Aware Temperature Scaling (EATS).
Table 1
SN   Dataset            Baseline ECE   Token-level ECE using TS   Token-level ECE using EATS
1    De-En NMT          11.90%         4.40%                      3.80%
2    En-Vi NMT          12.30%         4.20%                      3.50%
3    En-Fr NMT          30.30%         9.80%                      7.50%
4    Integer Sequence   8.80%          2.50%                      2.30%
5    En-De GNMT4        14.70%         6.50%                      4.30%
6    De-En GNMT         16.80%         4.50%                      2.20%
[058] Here, the results generated on applying EATS in the token-level approach and in the sequence-level approach, for calibrating the sequences in each of the given datasets, are compared. Table 2 shows that the ECE computed using the token-level approach is much lower than that from the sequence-level approach, where only one temperature parameter is learned for each sequence. Moreover, the ECE values after applying temperature scaling are very low when compared to the baseline ECE values.
Table 2
                                Token Wise                           Full Sequence
SN   Dataset            Baseline ECE   Token-level ECE      Baseline ECE   Sequence-level ECE
1    De-En NMT          11.90%         3.80%                36.00%         5.00%
2    En-Vi NMT          12.30%         3.50%                27.40%         7.70%
3    En-Fr NMT          30.30%         7.50%                26.20%         22.00%
4    Integer Sequence   8.80%          2.30%                24.70%         18.00%
5    En-De GNMT4        14.70%         4.30%                35.70%         6.40%
6    De-En GNMT         16.80%         2.20%                29.40%         4.90%
[059] The proposed EATS method is compared with previously existing calibration techniques over the metric 'Weighted ECE', known in the art, to measure the calibration error in various datasets adopting NMT or GNMT architectures. Table 3 represents the weighted ECE values for both the approaches, thus validating that the proposed calibration approach using EATS produces a better calibrated model.
Table 3
SN   Dataset        Baseline   Temperature Scaling   Previous approach   Disclosed Method
1    De-En NMT      9.8        3.7                   3.5                 3.2
2    En-Vi NMT      3.5        2.2                   2                   1.6
3    En-De GNMT4    4.8        2.7                   2.4                 2.3
4    De-En GNMT     3.3        2.3                   2.2                 2
[060] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[061] Various embodiments herein disclose a method and system for calibration of sequence prediction models. In particular, the disclosed method is a token-wise recalibration technique for seq2seq models that involves learning temperature values for each token by optimizing a loss function that consists of the calibration error along with the NLL for each token. This helps in learning an improved temperature value for each token, thereby increasing the certainty of predictions made by seq2seq models. Results on NMT tasks demonstrate that the method is able to perform calibration with a very low degree of error when evaluated with respect to various metrics. The proposed method is task agnostic for all seq2seq prediction models.
[062] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[063] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[064] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are
appropriately performed. Alternatives (including equivalents, extensions,
variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[065] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[066] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor-implemented method (200) for calibration of sequence
prediction model, comprising:
determining (202), via one or more hardware processors, a calibration error generated for each token of a plurality of tokens associated with each sequence of a plurality of sequences of the sequence prediction model, wherein the sequence prediction model is pretrained for prediction of the plurality of sequences and pre calibrated using a loss function; and
recalibrating (204), via the one or more hardware processors, the sequence prediction model using an error-induced loss function to obtain a recalibrated sequence prediction model, wherein recalibrating the sequence prediction model comprises:
learning (206) a temperature scaling parameter for each token of the sequence prediction model by optimizing an error induced loss function comprising an objective function comprising the loss function and the calibration error associated with each token; and
applying (208) a corresponding learned temperature scaling parameter to each token of the plurality of tokens resulting in calibrated probabilities for each token of the plurality of tokens and recalibration of the sequence prediction model.
2. The processor implemented method of claim 1, wherein the sequence prediction model comprises a seq-2-seq model.
3. The processor implemented method of claim 1, wherein the objective function is represented using the following equation:
min(NLL + log(ECEi)) ∀ i ∈ 1,…, n
where, negative log likelihood (NLL) is the loss function,
expected calibration error (ECE) is the calibration error associated with each
token, and
n is the maximum number of tokens in any sequence of the plurality of sequences.
4. A system (100) for calibration of sequence prediction model, comprising:
a memory (104) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (102) coupled to the memory (104) via the one or more communication interfaces (106), wherein the one or more hardware processors (102) are configured by the instructions to:
determine a calibration error generated for each token of a plurality of tokens associated with each sequence of a plurality of sequences of a sequence prediction model, wherein the sequence prediction model is pretrained for prediction of the plurality of sequences and pre calibrated using a loss function;
recalibrate the sequence prediction model using an error-induced loss function to obtain a recalibrated sequence prediction model, wherein to recalibrate the sequence prediction model, the one or more hardware processors are configured by the instructions to:
learn a temperature scaling parameter for each token of the sequence prediction model by optimizing the error-induced loss function comprising an objective function comprising the loss function and the calibration error associated with each token; and
apply a corresponding learned temperature scaling parameter to each token of the plurality of tokens resulting in calibrated probabilities for each token of the plurality of tokens and recalibration of the sequence prediction model.
5. The system of claim 4 wherein the sequence prediction model comprises a
seq-2-seq model.
6. The system of claim 4, wherein the objective function is represented using
the following equation:
min(NLL + log(ECEi)) ∀ i ∈ 1,…, n
where, negative log likelihood (NLL) is the loss function, expected calibration error (ECE) is the calibration error associated with each token, and
n is the maximum number of tokens in any sequence of the plurality of sequences.