FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR EVALUATING CLINICAL EFFICACY OF MULTI-LABEL MULTI-CLASS COMPUTATIONAL DIAGNOSTIC MODELS
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202221052587, filed on September 14, 2022. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The present invention generally relates to the field of clinical model evaluation and, more particularly, to a method and system for evaluating clinical efficacy of multi-label multi-class computational diagnostic models.
BACKGROUND
Machine learning techniques have been applied to a wide range of diagnostic problems (for example, diagnosis of diabetic complications, arrhythmia detection and classification, etc.), which are often multi-label, i.e., one or more diagnoses are detected from a single diagnostic sample. Such computational diagnostic models are often evaluated using metrics designed for conventional machine learning models. This poses a challenge, as models evaluated on different sets of metrics cannot be compared. Further, the choice of metric can highlight key strengths of a model while ignoring its weaknesses. Different metrics also do not agree on the comparative performance of models, so the choice of the best diagnostic model can be dictated by the choice of metric.
Additionally, results reported on several metrics are not necessarily informative from a clinical perspective. A large set of scores measuring different aspects of performance does not help in determining which model is better for clinical applications. Since the metrics are borrowed from machine learning, where requirements are different, a higher score on a certain metric does not necessarily translate to better diagnostic performance, and vice versa. Unlike machine learning problems, where requirements vary, clinical practice follows some widely accepted principles. For instance, a wrong diagnosis is worse than a missed diagnosis, which is in turn worse than an over diagnosis, up to a certain extent. The standard metrics used in a multi-label setting (Hamming loss, subset accuracy, etc.) do not reflect this. There might also be scenarios where certain sets of diagnoses have similar treatment plans and outcomes, making certain types of missed diagnosis less deleterious. The principle of risk avoidance states that in a computational diagnostic model, sensitivity should be correlated to cost (or lethality), with significant ailments having markedly higher sensitivity than minor issues. However, when this comes at the cost of specificity, it might lead to alarm fatigue. Thus, a conventional multi-label metric does not align with the highly context-dependent clinical principles and practice when rating a diagnostic model, and is unable to capture the critically important features that ought to be present in a diagnostic model.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for evaluating clinical efficacy of multi-label multi-class computational diagnostic models is provided. The method comprises receiving a dataset comprising a plurality of diagnostic samples and corresponding ground truth. Further, the method comprises predicting a diagnosis corresponding to each of the plurality of diagnostic samples using a multi-label multi-class computational diagnostic model and classifying the predicted diagnosis in a class among a plurality of classes comprising: (i) a wrong diagnosis, (ii) a missed diagnosis, (iii) an over diagnosis and (iv) a right diagnosis. The predicted diagnosis comprises one or more diagnostic conditions. The method further comprises calculating a first penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on the class of the predicted diagnosis and a second penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on a contradiction matrix. The method further comprises computing a pre-score for each of the plurality of diagnostic samples based on the corresponding first penalty and second penalty. 
Furthermore, the method comprises obtaining a score corresponding to the multi-label multi-class computational diagnostic model by summing up the pre-score of each of the plurality of diagnostic samples and evaluating the multi-label multi-class computational diagnostic model with a metric that is based on (i) the score corresponding to the multi-label multi-class computational diagnostic model, (ii) a pre-computed score of a perfect multi-label multi-class computational diagnostic model whose predictions always belong to the right diagnosis class and (iii) a pre-computed score of a null multi-label multi-class computational diagnostic model which predicts null or 0 only.
In another aspect, a system for evaluating clinical efficacy of multi-label multi-class computational diagnostic models is provided. The system includes: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a dataset comprising a plurality of diagnostic samples and corresponding ground truth. Further the one or more hardware processors are configured to predict a diagnosis corresponding to each of the plurality of diagnostic samples using a multi-label multi-class computational diagnostic model and classify the predicted diagnosis in a class among a plurality of classes comprising: (i) a wrong diagnosis, (ii) a missed diagnosis, (iii) an over diagnosis and (iv) a right diagnosis. The predicted diagnosis comprises one or more diagnostic conditions. The one or more hardware processors are further configured to calculate a first penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on the class of the predicted diagnosis and a second penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on a contradiction matrix. 
The one or more hardware processors are further configured to obtain a score corresponding to the multi-label multi-class computational diagnostic model by summing up the pre-score of each of the plurality of diagnostic samples and evaluate the multi-label multi-class computational diagnostic model with a metric that is based on (i) the score corresponding to the multi-label multi-class computational diagnostic model, (ii) a pre-computed score of a perfect multi-label multi-class computational diagnostic model whose predictions always belong to the right diagnosis class and (iii) a pre-computed score of a null multi-label multi-class computational diagnostic model which predicts null or 0 only.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause a method for evaluating clinical efficacy of multi-label multi-class computational diagnostic models. The method comprises receiving a dataset comprising a plurality of diagnostic samples and corresponding ground truth. Further, the method comprises predicting a diagnosis corresponding to each of the plurality of diagnostic samples using a multi-label multi-class computational diagnostic model and classifying the predicted diagnosis in a class among a plurality of classes comprising: (i) a wrong diagnosis, (ii) a missed diagnosis, (iii) an over diagnosis and (iv) a right diagnosis. The predicted diagnosis comprises one or more diagnostic conditions. The method further comprises calculating a first penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on the class of the predicted diagnosis and a second penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on a contradiction matrix. The method further comprises computing a pre-score for each of the plurality of diagnostic samples based on the corresponding first penalty and second penalty. 
Furthermore, the method comprises obtaining a score corresponding to the multi-label multi-class computational diagnostic model by summing up the pre-score of each of the plurality of diagnostic samples and evaluating the multi-label multi-class computational diagnostic model with a metric that is based on (i) the score corresponding to the multi-label multi-class computational diagnostic model, (ii) a pre-computed score of a perfect multi-label multi-class computational diagnostic model whose predictions always belong to the right diagnosis class and (iii) a pre-computed score of a null multi-label multi-class computational diagnostic model which predicts null or 0 only.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system for evaluating clinical efficacy of multi-label multi-class computational diagnostic models, according to some embodiments of the present disclosure.
FIGS. 2A and 2B, collectively referred to as FIG. 2, are flow diagrams illustrating a method for evaluating clinical efficacy of multi-label multi-class computational diagnostic models, according to some embodiments of the present disclosure.
FIG. 3 is an alternative representation of the flow diagram of FIG. 2, according to some embodiments of the present disclosure.
FIG. 4 is a graph illustrating results of Challenge Metric.
FIG. 5 illustrates a comparison of result of an ideal metric with a plurality of conventional metrics.
FIGS. 6 and 7 illustrate comparison of values calculated by method illustrated in FIG. 2 with a plurality of metrics including accuracy, subset accuracy, hamming loss, F1 score and challenge metric for diagnosis predicted for different diagnostic samples, according to some embodiments of the present disclosure.
FIGS. 8 and 9 illustrate effect of relative prevalence of a diagnostic condition on method illustrated in FIG. 2 and a plurality of metrics including accuracy, subset accuracy, hamming loss, F1 score and challenge metric, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following embodiments described herein.
With machine learning based approaches showing promise in the multi-label multi-class paradigm, they are being widely adopted for computational diagnostic models. When evaluating these models, several factors prove to be important, such as sensitivity, specificity and risk avoidance. Existing metrics are usually borrowed from machine learning, and since each metric is usually designed to pick up on certain features, the current consensus is to report results on a large set of metrics. The choice of metrics can serve to downplay limitations of a model, and a different choice of metrics can change the ordering among several competing models. It is challenging to compare the efficacy of models which have been evaluated on different sets of metrics, and even when the same metrics are used, it is not clear how to summarize information from several metrics to choose a clinically applicable diagnostic model. From a diagnostic standpoint, the metrics themselves are far from perfect, and are often biased by the prevalence of negative samples or other statistical factors.
Often, the multi-label multi-class computational diagnostic models are classifiers implemented using machine learning techniques. To evaluate the quality of a classifier $f_\theta$ on a dataset $D$ comprising a plurality of diagnostic samples, it is sufficient to analyze a set $P=\{(\hat{x}_i, y_i) \mid \forall i \text{ such that } (z_i, y_i) \in D\}$, wherein $\hat{x}_i$ is the prediction of the classifier for a diagnostic sample $z_i$ which has a ground truth label $y_i$ in the dataset $D$. The job of a metric, given such a set $P$, is to provide a number or a score which is correlated to the performance of the classifier.
State of the art metrics that are suitable for such an evaluation are bipartition based metrics, which are broadly divided into two categories: label based (listed in Table 1) and example based (listed in Table 2). The example based metrics assign a score based on averages over certain functions of the actual and predicted label sets. Label based metrics, on the other hand, compute the prediction performance of each label in isolation and then compute averages over labels. Certain other binary metrics have been proposed in a clinical diagnostic context, like threat score (Hicks SA et al. On evaluation metrics for medical applications of artificial intelligence. Scientific Reports. 2022;5979. doi:10.1038/s41598-022-09954-8.) or Matthews Correlation Coefficient; however, they are not generally used in a multi-label context.
Table 1
Metric | Definition
Macro-precision | $\frac{1}{P}\sum_{j=1}^{P}\frac{tp_j}{tp_j+fp_j}$
Macro-recall | $\frac{1}{P}\sum_{j=1}^{P}\frac{tp_j}{tp_j+fn_j}$
Macro-F1-score | $\frac{1}{P}\sum_{j=1}^{P}F1_j$, where $F1_j=\frac{2\,p_j\,r_j}{p_j+r_j}$
Micro-precision | $\frac{\sum_{j=1}^{P}tp_j}{\sum_{j=1}^{P}tp_j+\sum_{j=1}^{P}fp_j}$
Micro-recall | $\frac{\sum_{j=1}^{P}tp_j}{\sum_{j=1}^{P}tp_j+\sum_{j=1}^{P}fn_j}$
Micro-F1-score | $\frac{2\times\text{micro-precision}\times\text{micro-recall}}{\text{micro-precision}+\text{micro-recall}}$
Table 2
Metric | Definition
Hamming loss | $\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P}\,|\hat{x}_i \,\Delta\, y_i|$
Accuracy | $\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{x}_i\cap y_i|}{|\hat{x}_i\cup y_i|}$
Precision | $\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{x}_i\cap y_i|}{|\hat{x}_i|}$
Recall | $\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{x}_i\cap y_i|}{|y_i|}$
F1-score | $\frac{1}{N}\sum_{i=1}^{N}\frac{2\times|\hat{x}_i\cap y_i|}{|\hat{x}_i|+|y_i|}$
Subset accuracy | $\frac{1}{N}\sum_{i=1}^{N}I(\hat{x}_i=y_i)$
Challenge Metric (CM) (Alday EAP et al. Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020. Physiological Measurement. 2021;41(12):124003. doi:10.1088/1361-6579/abc960.) | $a_{j,k}=\sum_{i=1}^{N}\frac{I(a_j\in\hat{x}_i \text{ and } a_k\in y_i)}{|\hat{x}_i\cup y_i|}$; $s_{unnorm}=\sum_{k=1}^{P}\sum_{j=1}^{P}a_{jk}\,w_{jk}$; $CM=\frac{s_{unnorm}-s_{inactive}}{s_{perfect}-s_{inactive}}$
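As an illustration of how the example based metrics of Table 2 operate on label sets, the following is a minimal Python sketch. The function name `example_based_metrics`, the dictionary keys, and the convention of scoring an empty prediction against an empty ground truth as a perfect match are assumptions made for this sketch and are not part of the disclosure.

```python
from typing import Dict, List, Set


def example_based_metrics(preds: List[Set[str]],
                          truths: List[Set[str]],
                          num_labels: int) -> Dict[str, float]:
    """Average a few example based metrics over all samples.

    preds[i] is the predicted label set and truths[i] the ground truth
    label set of sample i; num_labels is the number of possible labels P.
    """
    n = len(truths)
    hamming = accuracy = f1 = subset = 0.0
    for x_hat, y in zip(preds, truths):
        inter = len(x_hat & y)
        union = len(x_hat | y)
        # Symmetric difference counts labels on which the two sets disagree.
        hamming += len(x_hat ^ y) / num_labels
        # Assumption: two empty sets are treated as a perfect match.
        accuracy += inter / union if union else 1.0
        denom = len(x_hat) + len(y)
        f1 += 2 * inter / denom if denom else 1.0
        subset += 1.0 if x_hat == y else 0.0
    return {
        "hamming_loss": hamming / n,
        "accuracy": accuracy / n,
        "f1": f1 / n,
        "subset_accuracy": subset / n,
    }


scores = example_based_metrics(
    preds=[{"AF"}, {"AF", "NSR"}],
    truths=[{"AF"}, {"AF"}],
    num_labels=4,
)
```

In this toy input the second prediction adds one extra label, which lowers subset accuracy to 0.5 while accuracy drops only to 0.75, showing how the metrics weigh the same error differently.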
Label based metrics in use today take the form of micro or macro averages of binary classification metrics (given in Table 1), such as precision, recall and F1 (or the general $F_\beta$), to provide summary information of performance across several categories. Specificity is unsuited to the clinical domain due to the class imbalance usually present in diagnostic datasets, where negative examples are plentiful. A macro averaged measure is computed by first independently computing the binary metric for each class and then averaging over the classes. A micro average, on the other hand, aggregates the statistics across classes and then computes the final metric. However, both of these approaches have their own drawbacks. The micro average favors classifiers with stronger performance on predominant classes, whereas the macro average favors classifiers suited to detecting rarely occurring classes. In a clinical setting, where it is very common for certain presentations to be very rare, the micro average measures are less meaningful, as it is the rare diseases that are often of most concern and would benefit greatly from intervention. From a machine learning point of view, it is unreasonable to expect a classifier to have a high sensitivity when it is only provided a few examples. Additionally, if a diagnostic criterion is indeed extremely rare, its influence on the quality of the diagnostic system should be limited.
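The difference between the two averaging schemes can be seen in a small numeric sketch. The function `macro_micro_recall` and the per-class counts below are hypothetical and chosen only for illustration; each class is assumed to have at least one positive example.

```python
from typing import List, Tuple


def macro_micro_recall(tp: List[int], fn: List[int]) -> Tuple[float, float]:
    """Return (macro recall, micro recall) from per-class counts."""
    # Macro: compute recall per class first, then average over classes.
    per_class = [t / (t + f) for t, f in zip(tp, fn)]
    macro = sum(per_class) / len(per_class)
    # Micro: pool the counts across classes, then compute a single recall.
    micro = sum(tp) / (sum(tp) + sum(fn))
    return macro, micro


# A common class detected well (90 of 100) and a rare class detected
# poorly (1 of 10).
macro, micro = macro_micro_recall(tp=[90, 1], fn=[10, 9])
```

Here the macro recall is 0.5 while the micro recall is about 0.83, so the weak performance on the rare class is largely hidden by the micro average.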
Example based metrics (given in Table 2) are specifically designed to pick out certain key features of a multi-label classifier. It is in general inadequate to compute just one or two metrics, as each has individual properties which provide beneficial cues. One notable recent work by Alday et al. (Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020. Physiological Measurement. 2021;41(12):124003. doi:10.1088/1361-6579/abc960) set out to design a metric called Challenge Metric (CM) that takes clinical outcomes into account in a multi-class multi-label diagnostic setting. Here, initially a multi-class confusion matrix A=[a_ij] is defined according to equation 1. Next, a score t(Y,X) is computed according to equation 2, wherein w_ij is a weight matrix which assigns partial rewards to incorrect guesses. w_ii=1 and in general 0g?(z_i)?_k if a_j?y_i and a_k?y_i (16)
The predicted diagnosis is the set of all diagnostic conditions that satisfy a prediction threshold, as given in equation 17.
$\hat{x}_i = \{a_j \mid x_{ij} = 1,\ \forall j \in \{1, 2, \dots, P\}\}$, where $x_{ij} = \begin{cases} 1, & \text{if } g_\theta(z_i)_j > t_{ij} \\ 0, & \text{otherwise} \end{cases}$ (17)
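The thresholding of equation 17 can be sketched as follows. The function name `predicted_diagnosis` and the use of dictionaries keyed by diagnostic condition are assumptions of this sketch; the scores stand in for the model outputs for one sample and the thresholds for the per-condition prediction thresholds.

```python
from typing import Dict, Set


def predicted_diagnosis(scores: Dict[str, float],
                        thresholds: Dict[str, float]) -> Set[str]:
    """Keep every diagnostic condition whose score strictly exceeds
    its threshold, as in equation 17."""
    return {cond for cond, score in scores.items()
            if score > thresholds[cond]}


# Hypothetical model outputs for one diagnostic sample.
sample_scores = {"AF": 0.8, "NSR": 0.3, "RAD": 0.5}
sample_thresholds = {"AF": 0.5, "NSR": 0.5, "RAD": 0.5}
x_hat = predicted_diagnosis(sample_scores, sample_thresholds)
```

Because the comparison is strict, RAD with a score equal to its threshold is not included in the predicted diagnosis.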
In an embodiment, the predicted diagnosis may be processed before proceeding to step 206 of the method 200 by collapsing diagnostic conditions that are equivalent. For example, Premature Atrial Contraction (PAC) and Supraventricular Premature Beats (SVPB) are equivalent diagnostic conditions, so only one of them is retained in the set of diagnostic conditions in the predicted diagnosis.
At step 206 of the method 200, the predicted diagnosis is classified in a class among a plurality of classes comprising: (i) a wrong diagnosis, (ii) a missed diagnosis, (iii) an over diagnosis and (iv) a right diagnosis. The predicted diagnosis for a diagnostic sample from among the plurality of diagnostic samples is classified as (i) the wrong diagnosis if the predicted diagnosis and the ground truth corresponding to the diagnostic sample are disjoint, i.e. $\hat{x}_i \cap y_i = \emptyset$, (ii) the missed diagnosis if the predicted diagnosis is a proper subset of the ground truth corresponding to the diagnostic sample, i.e. $\hat{x}_i \subset y_i$, (iii) the over diagnosis if the ground truth corresponding to the diagnostic sample is a proper subset of the predicted diagnosis, i.e. $y_i \subset \hat{x}_i$, or (iv) the right diagnosis otherwise. Hence, for a predicted diagnosis $\hat{x}_i$ of a diagnostic sample $z_i$ having ground truth $y_i$, the sets $\hat{x}_i \cap y_i$, $y_i - \hat{x}_i$ and $\hat{x}_i - y_i$ correspond to right diagnosis, missed diagnosis and over diagnosis respectively. If $\hat{x}_i \cap y_i = \emptyset$, i.e., there are no diagnostic conditions common between the predicted diagnosis and the ground truth, then the predicted diagnosis is a wrong diagnosis.
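The collapsing and classification described for step 206 can be sketched with plain set operations. The names `EQUIVALENT`, `collapse_equivalents`, `classify_sample` and `partition_conditions` are hypothetical, and which label of an equivalent pair is kept is an arbitrary choice. The handling of an empty prediction is also an assumption: it is disjoint from the ground truth, yet Table 5 lists it as a missed diagnosis, so the sketch checks for it first.

```python
from typing import Dict, Set

# Hypothetical equivalence map: each alias points to the label that is kept.
EQUIVALENT: Dict[str, str] = {"RBBB": "CRBBB", "SVPB": "PAC"}


def collapse_equivalents(labels: Set[str]) -> Set[str]:
    """Replace every alias by its retained label."""
    return {EQUIVALENT.get(label, label) for label in labels}


def classify_sample(x_hat: Set[str], y: Set[str]) -> str:
    """Classify a predicted diagnosis against its ground truth."""
    if not x_hat:
        return "missed"      # assumption, following Table 5
    if not (x_hat & y):
        return "wrong"       # prediction and ground truth are disjoint
    if x_hat < y:
        return "missed"      # prediction is a proper subset of the truth
    if y < x_hat:
        return "over"        # truth is a proper subset of the prediction
    return "right"           # the "otherwise" branch of step 206


def partition_conditions(x_hat: Set[str], y: Set[str]) -> Dict[str, Set[str]]:
    """Condition-level view: right, missed and over diagnosed conditions."""
    return {"right": x_hat & y, "missed": y - x_hat, "over": x_hat - y}


truth = collapse_equivalents({"AF", "CRBBB", "RBBB", "RAD"})
parts = partition_conditions({"AF", "AFL", "NSR", "RAD"}, truth)
```

For the ground truth and prediction used later in the worked example, the partition yields AF and RAD as right, CRBBB as missed, and AFL and NSR as over diagnosed conditions.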
Once the predicted diagnosis is classified, at step 208 of the method 200, a first penalty (mathematically represented as $a_{k(i)}$ for a diagnostic sample $k$) is calculated for the diagnosis corresponding to each of the plurality of diagnostic samples based on the class of the predicted diagnosis. For example, the first penalty is calculated as one of: (i) $s_i/n_i$ if the predicted diagnosis is a right diagnosis, (ii) $-s_i/n_i$ if the predicted diagnosis is a missed diagnosis, (iii) $\frac{s_i}{n^*}\left[\frac{1}{|y_k|}\left(\sum_{c_j \in y_k} w_{i,j}\right) - 1\right]$ if the predicted diagnosis is an over diagnosis, and (iv) 0 if the predicted diagnosis is a wrong diagnosis, wherein $s_i$ is a pre-defined significance weight corresponding to the class of diagnostic conditions in the dataset. This reflects the fact that all diagnostic conditions might not be equally relevant, and classes which are critical have a higher value of $s_i$, so their contribution to the final score is larger. The significance weights can be set to 1 for all the diagnostic conditions if their relative importance is the same. $n_i$ is the number of occurrences of the predicted diagnosis in the dataset and is introduced in the first penalty to ensure that the prevalence of diagnostic conditions does not affect the final score. $n^*$ is calculated by equation 18. $y_k$ is the ground truth corresponding to the diagnostic sample for which the prediction is made. $w_{i,j}$ is a weight matrix comprising the cost of misclassification as in Alday et al. (Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020. Physiological Measurement. 2021;41(12):124003. doi:10.1088/1361-6579/abc960). This gives partial rewards to over diagnoses which are similar in outcome or treatment to the diagnostic conditions in the ground truth. If such a matrix is unavailable or not required, $w_{i,j}$ can be set to a constant value between 0 and 1.
By calculating the first penalty in this way, strict monotonicity of the first penalty is maintained: the highest first penalty is assigned to a wrong diagnosis, followed by a missed diagnosis and then an over diagnosis, thereby embodying the risk aversion principle of clinical diagnosis.
$n^* = \max\{n_i \mid \forall c_i \in y_k\}$ (18)
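The first penalty can be sketched per diagnostic condition, following the condition-level reading used in the worked example of equation 23 (right for conditions in both sets, missed for conditions only in the ground truth, over for conditions only in the prediction, 0 otherwise). The function name `first_penalty` and the dictionary-based data layout are assumptions of this sketch, and the ground truth is assumed to be non-empty whenever an over diagnosed condition is scored.

```python
from typing import Dict, Set


def first_penalty(cond: str,
                  x_hat: Set[str],
                  y: Set[str],
                  s: Dict[str, float],
                  n: Dict[str, int],
                  w: Dict[str, Dict[str, float]]) -> float:
    """First penalty of one diagnostic condition for one sample.

    s: significance weights, n: occurrence counts, w: weight matrix
    indexed as w[predicted condition][ground truth condition].
    """
    if cond in x_hat and cond in y:          # right diagnosis
        return s[cond] / n[cond]
    if cond in y:                            # missed diagnosis
        return -s[cond] / n[cond]
    if cond in x_hat:                        # over diagnosis
        n_star = max(n[c] for c in y)        # equation 18
        mean_w = sum(w[cond][c] for c in y) / len(y)
        return s[cond] / n_star * (mean_w - 1.0)
    return 0.0                               # condition not involved


# Values taken from Tables 3 and 4 for the conditions of the worked example.
s = {"AF": 1.0, "AFL": 1.0, "CRBBB": 1.0, "RAD": 0.25, "IAVB": 0.5}
n = {c: 1 for c in s}
w = {"AFL": {"AF": 0.5, "CRBBB": 0.4, "RAD": 0.3}}
y = {"AF", "CRBBB", "RAD"}
x_hat = {"AF", "AFL", "RAD"}
afl_penalty = first_penalty("AFL", x_hat, y, s, n, w)
```

With these values the over diagnosed condition AFL receives (0.5 + 0.4 + 0.3)/3 - 1 = -0.6, which is milder than the -1 assigned to the missed condition CRBBB.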
Once the first penalty for each diagnostic sample is calculated, at step 210 of the method 200, a second penalty for the diagnosis corresponding to each of the plurality of diagnostic samples is calculated based on a contradiction matrix which provides contradictory and non-contradictory pairs of diagnostic conditions. For example, hypotension and hypertension cannot occur together; hence, they form a contradictory pair. In an embodiment, the contradiction matrix is developed with the help of an expert in the medical field. It can be mathematically represented as $C_{i,j}$, such that $C_{i,j}=1$ if diagnostic conditions $c_i$ and $c_j$ cannot occur together. The second penalty is calculated by equation 19 if $c_i$ is a diagnostic condition in the predicted diagnosis, or is 0 otherwise, wherein $n_i$ is the number of occurrences of the predicted diagnosis in the dataset, $\hat{x}_k$ is the predicted diagnosis, and $s_j$ is the pre-defined significance weight corresponding to the class of diagnosis.
$-\frac{1}{n_i}\sum_{\forall j \text{ s.t. } c_j \in \hat{x}_k} s_j \cdot C_{ij}$ (19)
The contradiction matrix is responsible for penalizing impossible or mutually exclusive diagnostic conditions. This ensures that the predictions are not only accurate but also logically consistent. The contradiction matrix reduces the space of possible predictions to exclude impossible diagnoses. For example, atrial fibrillation is marked by the lack of a P-wave in the ECG signal whereas sinus rhythms have a P-wave, so the two preclude each other.
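A minimal sketch of the second penalty of equation 19 is given below. Representing the contradiction matrix as a set of unordered pairs, and the name `second_penalty`, are assumptions made for brevity; a square 0/1 matrix would serve equally well.

```python
from typing import Dict, FrozenSet, Set


def second_penalty(cond: str,
                   x_hat: Set[str],
                   s: Dict[str, float],
                   n: Dict[str, int],
                   contradictions: Set[FrozenSet[str]]) -> float:
    """Second penalty of one diagnostic condition for one sample.

    contradictions holds frozenset({c_i, c_j}) for every pair with C_ij = 1.
    """
    if cond not in x_hat:
        return 0.0
    # Sum the significance weights of the predicted conditions that
    # contradict cond, then scale by -1/n_i as in equation 19.
    total = sum(s[other] for other in x_hat
                if frozenset((cond, other)) in contradictions)
    return -total / n[cond]


# AF and NSR are stated to be a contradictory pair.
contradictions = {frozenset(("AF", "NSR"))}
s = {"AF": 1.0, "AFL": 1.0, "NSR": 0.25}
n = {"AF": 1, "AFL": 1, "NSR": 1}
x_hat = {"AF", "NSR"}
penalty_af = second_penalty("AF", x_hat, s, n, contradictions)
penalty_nsr = second_penalty("NSR", x_hat, s, n, contradictions)
```

Each member of the contradictory pair is penalized by the significance weight of the other, so predicting a critical condition together with a condition that excludes it is costly.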
Once the second penalty is calculated, at step 212 of the method 200, a pre-score ($t_k$) for each of the plurality of diagnostic samples is computed based on the corresponding first penalty ($a_{k(i)}$) and second penalty ($b_{k(i)}$) as given in equation 20. Further, at step 214 of the method 200, a total score corresponding to the multi-label multi-class computational diagnostic model is obtained by summing up the pre-score of each of the plurality of diagnostic samples as given in equation 21, wherein $Y=\{y_i \mid \forall i \in \{1,2,\dots,N\}\}$ and $X=\{\hat{x}_i \mid \forall i \in \{1,2,\dots,N\}\}$. In other words, $Y$ is the set of ground truth labels in the dataset and $X$ is the set of predicted diagnoses for the diagnostic samples in the dataset.
$t_k=\sum_{i=1}^{P}\left(a_{k(i)}+b_{k(i)}\right)$ (20)
$t(Y,X)=\sum_{k=1}^{N}t_k$ (21)
Once the total score corresponding to the multi-label multi-class computational diagnostic model is obtained, at step 214 of the method 200, the multi-label multi-class computational diagnostic model is evaluated with a metric that is based on (i) the score corresponding to the multi-label multi-class computational diagnostic model, (ii) a pre-computed score of a perfect multi-label multi-class computational diagnostic model whose predictions always belong to the right diagnosis class and (iii) a pre-computed score of a null multi-label multi-class computational diagnostic model which predicts null or 0 only. The metric M_CS is given by equation 22, wherein t(Y,Y) is the pre-computed score of a perfect multi-label multi-class computational diagnostic model and t(Y,Ø) is the pre-computed score of a null multi-label multi-class computational diagnostic model. In an embodiment, t(Y,Y) and t(Y,Ø) are computed using the steps 208 to 214 of the method 200 by assuming that the predicted diagnosis is always right diagnosis, and the predicted diagnosis is null respectively. This way of evaluating the multi-label multi-class computational diagnostic model using the metric (given by equation 22) ensures that a perfect model gets a maximum possible score of 1 and an inactive model that predicts nothing gets a score of 0.
$M_{CS}=\frac{t(Y,X)-t(Y,\emptyset)}{t(Y,Y)-t(Y,\emptyset)}$ (22)
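The summation of equation 21 and the normalization of equation 22 can be sketched in a few lines. The function name `clinical_metric` is hypothetical, and the scores of the perfect and null models are assumed to have been computed beforehand.

```python
from typing import Iterable


def clinical_metric(pre_scores: Iterable[float],
                    perfect_score: float,
                    null_score: float) -> float:
    """Sum per-sample pre-scores (equation 21) and normalize the total
    between the null and perfect model scores (equation 22)."""
    if perfect_score == null_score:
        raise ValueError("perfect and null scores must differ")
    total = sum(pre_scores)
    return (total - null_score) / (perfect_score - null_score)


# A perfect model maps to 1 and a null model maps to 0.
m_perfect = clinical_metric([2.25], perfect_score=2.25, null_score=-2.25)
m_null = clinical_metric([-2.25], perfect_score=2.25, null_score=-2.25)
```

Scores below that of the null model yield negative values, so a model whose predictions are worse than predicting nothing is clearly identifiable.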
USE CASE EXAMPLE AND EXPERIMENTAL RESULTS
As a use case example, the method 200 is applied on the PhysioNet 2020/21 challenge dataset, where 27 cardiovascular diseases (CVDs) are to be detected from 12-lead ECG signals. The classes of diagnostic conditions present are given in Table 3 with their corresponding significance weights. Table 4 illustrates the weight matrix. The contradiction matrix is defined with the help of domain experts. It is a square matrix over the diagnostic conditions listed in Table 3, and the entry in a cell of the matrix is 1 if the diagnostic conditions corresponding to its row and column do not occur together.
Table 3
Class of diagnostic condition Acronym Significance weight (s_i)
1st Degree AV Block IAVB 0.5 (Critical)
Atrial Fibrillation AF 1 (Super critical)
Atrial Flutter AFL 1 (Super critical)
Bradycardia Brady 0.5 (Critical)
Complete Right Bundle Branch Block CRBBB 1 (Super critical)
Incomplete Right Bundle Branch Block IRBBB 0.25
Left Anterior Fascicular Block LAnFB 0.5 (Critical)
Left Axis Deviation LAD 0.25
Left Bundle Branch Block LBBB 1 (Super critical)
Low QRS Voltage LQRSV 1 (Super critical)
Nonspecific Intraventricular Conduction Disorder NSIVCB 0.25
Pacing Rhythm PR 0.25
Premature Atrial Contraction PAC 0.25
Premature Ventricular Contractions PVC 0.25
Prolonged PR Interval LPR 0.25
Prolonged QT Interval LQT 0.5 (Critical)
Q Wave abnormal QAb 0.5 (Critical)
Right Axis Deviation RAD 0.25
Right Bundle Branch Block RBBB 1 (Super critical)
Sinus Arrhythmia SA 0.5 (Critical)
Sinus Bradycardia SB 0.5 (Critical)
Normal Sinus Rhythm NSR 0.25
Sinus Tachycardia STach 0.25
Supraventricular Premature Beats SVPB 0.25
T Wave Abnormal Tab 1 (Super critical)
T Wave Inversion TInv 0.25
Ventricular Premature Beats VPB 0.25
Table 4
IAVB AF AFL Brady CRBBB IRBBB LAnFB LAD LBBB LQRSV NSIVCB PR PAC PVC LPR LQT QAb RAD RBBB SA SB NSR STach SVPB Tab TInv VPB
IAVB 1 0.3 0.3 0.5 0.4 0.5 0.5 0.5 0.3 0.4 0.5 0.4 0.5 0.4 0.5 0.3 0.2 0.5 0.4 0.5 0.5 0.5 0.4 0.5 0.3 0.3 0.4
AF 0.3 1 0.5 0.3 0.4 0.3 0.3 0.3 0.5 0.4 0.3 0.4 0.3 0.4 0.3 0.5 0.4 0.3 0.4 0.3 0.3 0.2 0.4 0.3 0.5 0.5 0.4
AFL 0.3 0.5 1 0.3 0.4 0.3 0.3 0.3 0.5 0.4 0.3 0.4 0.3 0.4 0.3 0.5 0.4 0.3 0.4 0.3 0.3 0.2 0.4 0.3 0.5 0.5 0.4
Brady 0.5 0.3 0.3 1 0.4 0.5 0.5 0.5 0.3 0.4 0.5 0.4 0.5 0.4 0.5 0.3 0.2 0.5 0.4 0.5 0.5 0.5 0.4 0.5 0.3 0.3 0.4
CRBBB 0.4 0.4 0.4 0.4 1 0.4 0.5 0.5 0.4 0.5 0.5 0.5 0.4 0.5 0.4 0.5 0.3 0.5 1 0.4 0.4 0.3 0.5 0.4 0.4 0.4 0.5
IRBBB 0.5 0.3 0.3 0.5 0.4 1 0.5 0.5 0.3 0.4 0.5 0.4 0.5 0.4 0.5 0.3 0.2 0.5 0.4 0.5 0.5 0.5 0.4 0.5 0.3 0.3 0.4
LAnFB 0.5 0.3 0.3 0.5 0.5 0.5 1 0.5 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.4 0.2 0.5 0.5 0.5 0.5 0.4 0.5 0.5 0.3 0.3 0.5
LAD 0.5 0.3 0.3 0.5 0.5 0.5 0.5 1 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.4 0.2 0.5 0.5 0.5 0.5 0.4 0.5 0.5 0.3 0.3 0.5
LBBB 0.3 0.5 0.5 0.3 0.4 0.3 0.4 0.4 1 0.5 0.4 0.4 0.4 0.4 0.3 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.4 0.4 0.5 0.5 0.4
LQRSV 0.4 0.4 0.4 0.4 0.5 0.4 0.4 0.4 0.5 1 0.4 0.5 0.4 0.5 0.4 0.5 0.3 0.4 0.5 0.4 0.4 0.3 0.5 0.4 0.4 0.4 0.5
NSIVCB 0.5 0.3 0.3 0.5 0.5 0.5 0.5 0.5 0.4 0.4 1 0.5 0.5 0.5 0.5 0.4 0.2 0.5 0.5 0.5 0.5 0.4 0.5 0.5 0.3 0.3 0.5
PR 0.4 0.4 0.4 0.4 0.5 0.4 0.5 0.5 0.4 0.5 0.5 1 0.5 0.5 0.4 0.4 0.3 0.5 0.5 0.4 0.4 0.4 0.5 0.5 0.4 0.4 0.5
PAC 0.5 0.3 0.3 0.5 0.4 0.5 0.5 0.5 0.4 0.4 0.5 0.5 1 0.5 0.5 0.4 0.2 0.5 0.4 0.5 0.5 0.4 0.5 1 0.3 0.3 0.5
PVC 0.4 0.4 0.4 0.4 0.5 0.4 0.5 0.5 0.4 0.5 0.5 0.5 0.5 1 0.4 0.4 0.3 0.5 0.5 0.4 0.4 0.4 0.5 0.5 0.4 0.4 1
LPR 0.5 0.3 0.3 0.5 0.4 0.5 0.5 0.5 0.3 0.4 0.5 0.4 0.5 0.4 1 0.3 0.2 0.5 0.4 0.5 0.5 0.5 0.4 0.5 0.3 0.3 0.4
LQT 0.3 0.5 0.5 0.3 0.5 0.3 0.4 0.4 0.5 0.5 0.4 0.4 0.4 0.4 0.3 1 0.3 0.4 0.5 0.3 0.3 0.3 0.4 0.4 0.5 0.5 0.4
QAb 0.2 0.4 0.4 0.2 0.3 0.2 0.2 0.2 0.4 0.3 0.2 0.3 0.2 0.3 0.2 0.3 1 0.2 0.3 0.2 0.2 0.1 0.3 0.2 0.4 0.4 0.3
RAD 0.5 0.3 0.3 0.5 0.5 0.5 0.5 0.5 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.4 0.2 1 0.5 0.5 0.5 0.4 0.5 0.5 0.3 0.3 0.5
RBBB 0.4 0.4 0.4 0.4 1 0.4 0.5 0.5 0.4 0.5 0.5 0.5 0.4 0.5 0.4 0.5 0.3 0.5 1 0.4 0.4 0.3 0.5 0.4 0.4 0.4 0.5
SA 0.5 0.3 0.3 0.5 0.4 0.5 0.5 0.5 0.3 0.4 0.5 0.4 0.5 0.4 0.5 0.3 0.2 0.5 0.4 1 0.5 0.5 0.4 0.5 0.3 0.3 0.4
SB 0.5 0.3 0.3 0.5 0.4 0.5 0.5 0.5 0.3 0.4 0.5 0.4 0.5 0.4 0.5 0.3 0.2 0.5 0.4 0.5 1 0.5 0.4 0.5 0.3 0.3 0.4
NSR 0.5 0.2 0.2 0.5 0.3 0.5 0.4 0.4 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.3 0.1 0.4 0.3 0.5 0.5 1 0.4 0.4 0.2 0.2 0.4
STach 0.4 0.4 0.4 0.4 0.5 0.4 0.5 0.5 0.4 0.5 0.5 0.5 0.5 0.5 0.4 0.4 0.3 0.5 0.5 0.4 0.4 0.4 1 0.5 0.4 0.4 0.5
SVPB 0.5 0.3 0.3 0.5 0.4 0.5 0.5 0.5 0.4 0.4 0.5 0.5 1 0.5 0.5 0.4 0.2 0.5 0.4 0.5 0.5 0.4 0.5 1 0.3 0.3 0.5
Tab 0.3 0.5 0.5 0.3 0.4 0.3 0.3 0.3 0.5 0.4 0.3 0.4 0.3 0.4 0.3 0.5 0.4 0.3 0.4 0.3 0.3 0.2 0.4 0.3 1 0.5 0.4
TInv 0.3 0.5 0.5 0.3 0.4 0.3 0.3 0.3 0.5 0.4 0.3 0.4 0.3 0.4 0.3 0.5 0.4 0.3 0.4 0.3 0.3 0.2 0.4 0.3 0.5 1 0.4
VPB 0.4 0.4 0.4 0.4 0.5 0.4 0.5 0.5 0.4 0.5 0.5 0.5 0.5 1 0.4 0.4 0.3 0.5 0.5 0.4 0.4 0.4 0.5 0.5 0.4 0.4 1
Suppose a given ground truth label is [AF, CRBBB, RBBB, RAD] and the diagnosis predicted by a computational diagnostic model is [AF, AFL, NSR, RAD] for a diagnostic sample. First, CRBBB and RBBB, being equivalent, are collapsed. Then, the first penalty is calculated according to equation 23, wherein $c_i$ is a diagnostic condition in the predicted diagnosis. It is obtained from the step 208 by considering $n_i=1$ since there is only 1 occurrence of the predicted diagnosis, $y_i$=[AF, CRBBB, RBBB, RAD], and $\hat{x}_i$=[AF, AFL, NSR, RAD].
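$a_{ik} = \begin{cases} s_i & \text{if } c_i \in \{AF, RAD\} \\ -s_i & \text{if } c_i \in \{CRBBB\} \\ s_i\left[\frac{1}{|y_k|}\left(\sum_{c_j \in y_k} w_{i,j}\right) - 1\right] & \text{if } c_i \in \{AFL, NSR\} \\ 0 & \text{otherwise} \end{cases}$ (23)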
Hence, the first penalty is a1 = [0, 1, ((0.5 + 0.4 + 0.3)/3 - 1), 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, ((0.3 + 0.2 + 0.4)/3 - 1), 0, 0, 0, 0, 0, 0], which is [0, 1, -0.6, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, -0.7, 0, 0, 0, 0, 0, 0] and sums to -1.05. The second penalty is non-zero only for diagnostic conditions in the predicted diagnosis; CRBBB is not part of the predicted diagnosis, so its second penalty is 0. The second penalty calculated from the contradiction matrix is therefore b1 = [0, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.25, 0, 0, -(0.25 + 0.25 + 0.25 + 0.25), 0, 0, 0, 0, 0, 0], which sums to -3.25. Thus, the pre-score corresponding to the diagnostic sample is t = -1.05 - 3.25 = -4.3. The total score of the multi-label multi-class computational diagnostic model is the same as the pre-score because there is only one diagnostic sample. The scores of the perfect and null multi-label multi-class computational diagnostic models are 2.25 and -2.25 respectively. Hence, the metric is computed as (-4.3 - (-2.25))/(2.25 - (-2.25)) = (-2.05)/4.5 ≈ -0.456.
Table 5 provides a comparison of values calculated by the method 200 and the challenge metric for a number of predictions. FIG. 6 illustrates a comparison of values calculated by the method 200 with a plurality of metrics including accuracy, subset accuracy, hamming loss, F1 score and challenge metric for the ground truth and predicted diagnoses provided in Table 5. Similarly, FIG. 7 illustrates a comparison of metrics for a different scenario where the ground truth has two diagnostic conditions ({IAVB, Brady}). It can be observed that the method 200 penalizes a wrong diagnosis the most, followed by missed and over diagnosis, thereby satisfying the characteristics that are necessary for a good clinical metric.
Table 5
| Ground Truth | Prediction | Class of diagnosis | Metric of present disclosure | Challenge metric |
| --- | --- | --- | --- | --- |
| CRBBB, AF, QAb | LAD, STach, Tinv | Wrong diagnosis | -0.23 | 0.253 |
| CRBBB, AF, QAb | LAD, STach | Wrong diagnosis | -0.159 | 0.16 |
| CRBBB, AF, QAb | LAD | Wrong diagnosis | -0.081 | 0.048 |
| CRBBB, AF, QAb | ∅ | Missed diagnosis | 0 | -0.121 |
| CRBBB, AF, QAb | CRBBB | Missed diagnosis | 0.25 | 0.245 |
| CRBBB, AF, QAb | CRBBB, AF | Missed diagnosis | 0.75 | 0.633 |
| CRBBB, AF, QAb | CRBBB, AF, QAb, LAD, NSIVCB | Over diagnosis | 0.756 | 0.823 |
| CRBBB, AF, QAb | CRBBB, AF, QAb, LAD | Over diagnosis | 0.918 | 0.889 |
| CRBBB, AF, QAb | CRBBB, AF, QAb | Right diagnosis | 1 | 1 |
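The class assigned to each row of Table 5 follows from set relations between the ground truth and the prediction, as recited in the claims. A minimal sketch of that rule is given below. The function name and row list are hypothetical. Treating an empty prediction as a missed diagnosis (rather than a wrong one) is an assumption taken from the fourth row of Table 5, and the final branch simply mirrors the "otherwise" wording of the claimed rule.

```python
def classify_diagnosis(ground_truth, prediction):
    """Classify a predicted label set against the ground truth label set."""
    truth, pred = set(ground_truth), set(prediction)
    if pred == truth:
        return "Right diagnosis"
    if pred < truth:
        # Proper subset of the ground truth; an empty prediction lands here,
        # matching the empty-prediction row of Table 5 (assumption).
        return "Missed diagnosis"
    if truth < pred:
        # The ground truth is a proper subset of the prediction.
        return "Over diagnosis"
    if truth.isdisjoint(pred):
        return "Wrong diagnosis"
    # Partial overlap: falls under the "otherwise" branch of the claimed rule.
    return "Right diagnosis"


GROUND_TRUTH = ["CRBBB", "AF", "QAb"]

# (prediction, class listed in Table 5)
TABLE_5_ROWS = [
    (["LAD", "STach", "Tinv"], "Wrong diagnosis"),
    (["LAD", "STach"], "Wrong diagnosis"),
    (["LAD"], "Wrong diagnosis"),
    ([], "Missed diagnosis"),
    (["CRBBB"], "Missed diagnosis"),
    (["CRBBB", "AF"], "Missed diagnosis"),
    (["CRBBB", "AF", "QAb", "LAD", "NSIVCB"], "Over diagnosis"),
    (["CRBBB", "AF", "QAb", "LAD"], "Over diagnosis"),
    (["CRBBB", "AF", "QAb"], "Right diagnosis"),
]

for prediction, listed_class in TABLE_5_ROWS:
    print(prediction, "->", classify_diagnosis(GROUND_TRUTH, prediction), "| Table 5:", listed_class)
```

The order of the checks matters: the proper-subset test runs before the disjointness test because an empty set is both a subset of, and disjoint from, any ground truth.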
Table 6 illustrates examples in which contradictions between diagnostic conditions are taken into consideration while evaluating the model. In these examples, (NSR, AF) and (AF, SB) are contradictory pairs, whereas (AF, AFL) is not. Also, (PR, CRBBB), (PR, STach), (PR, SB), (PR, AF), (SB, STach), (SB, AF) and (AF, STach) are pairwise contradictory.
Table 6
| Ground Truth | Prediction | Metric Score |
| --- | --- | --- |
| AF | AF | 1 |
| AF | NSR | -0.037 |
| AF | AF, NSR | 0.462 |
| AF | AF, AFL | 0.874 |
| AF | AF, NSR, SB | -0.072 |
| AF | AF, AFL, NSR | 0.337 |
| PR | ∅ | 0 |
| PR | CRBBB | -0.262 |
| PR | PR, CRBBB | 0.237 |
| PR | PR, CRBBB, STach | -0.512 |
| PR | PR, CRBBB, STach, SB | -1.069 |
| PR | PR, CRBBB, STach, SB, AF | -2.195 |
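The way the contradiction-based second penalty grows with the number of contradictory pairs in a prediction can be sketched as follows. This is an illustration only: the function names are hypothetical, the contradiction matrix is built from the pairs listed above with unit entries, and the significance weights and occurrence counts default to 1. These are assumptions, so the sketch does not reproduce the scores in Table 6, which depend on weights not given in this passage.

```python
# Contradictory pairs named in the text preceding Table 6.
CONTRADICTORY_PAIRS = [
    ("NSR", "AF"), ("AF", "SB"),
    ("PR", "CRBBB"), ("PR", "STach"), ("PR", "SB"), ("PR", "AF"),
    ("SB", "STach"), ("SB", "AF"), ("AF", "STach"),
]


def build_contradiction_matrix(pairs):
    """Return a symmetric sparse matrix C with C[i, j] = 1 for contradictory pairs."""
    matrix = {}
    for first, second in pairs:
        matrix[(first, second)] = 1.0  # unit entry is an assumption
        matrix[(second, first)] = 1.0
    return matrix


def second_penalty(prediction, contradiction, significance=None, occurrences=None):
    """Per-condition second penalty: -(1 / n_i) * sum over predicted c_j of s_j * C_ij.

    Conditions outside the prediction receive no entry, i.e. a penalty of 0.
    """
    significance = significance or {}
    occurrences = occurrences or {}
    penalties = {}
    for c_i in prediction:
        total = sum(
            significance.get(c_j, 1.0) * contradiction.get((c_i, c_j), 0.0)
            for c_j in prediction
        )
        penalties[c_i] = -total / occurrences.get(c_i, 1)
    return penalties


C = build_contradiction_matrix(CONTRADICTORY_PAIRS)
for prediction in (["AF", "AFL"], ["AF", "NSR"], ["PR", "CRBBB", "STach", "SB", "AF"]):
    print(prediction, second_penalty(prediction, C))
```

With these assumptions, a prediction with no contradictory pair incurs no second penalty, and each additional contradictory condition lowers the penalty of every condition it contradicts, which is consistent with the falling scores in the lower rows of Table 6.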
To test prevalence independence, a dataset with only two diagnostic conditions and a hypothetical classifier is considered. The classifier detects condition A ({SA}) with 90% sensitivity (good performance) and condition B ({SB}) with 50% sensitivity (bad performance), at a fixed specificity of 100%. The relative proportions of A and B are then varied to study the behavior of each measure. Ideally, a metric should not allow poor performance of the classifier on one class to be obscured by the relative rarity of that class. However, from FIG. 8 it can be seen that only two metrics capture this fact. The experiment is repeated with 95% specificity and the results are illustrated in FIG. 9. It can be seen that only the metric of the present disclosure is agnostic to the relative prevalence of the diagnostic conditions.
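The 100% specificity case of this experiment can be approximated with a small sketch. It is not the procedure used to produce FIG. 8. It assumes single-label samples, an empty prediction whenever a condition is missed, unit significance weights, n_i taken as the number of samples whose ground truth is condition i, and per-sample contributions of +s_i/n_i for a detected condition and -s_i/n_i for a missed one. Both function names are hypothetical.

```python
def prevalence_weighted_accuracy(n_a, n_b, sens_a=0.9, sens_b=0.5):
    """Plain accuracy: fraction of all samples whose condition is detected."""
    return (sens_a * n_a + sens_b * n_b) / (n_a + n_b)


def prevalence_normalized_metric(n_a, n_b, sens_a=0.9, sens_b=0.5, s_a=1.0, s_b=1.0):
    """Metric built from per-sample contributions scaled by 1 / n_i (assumed form)."""
    score = 0.0
    for n, sens, s in ((n_a, sens_a, s_a), (n_b, sens_b, s_b)):
        detected = sens * n
        missed = n - detected
        # Each detected sample adds s / n; each missed sample subtracts s / n.
        score += detected * (s / n) - missed * (s / n)
    perfect = s_a + s_b      # every sample detected
    null = -(s_a + s_b)      # every sample missed (empty prediction)
    return (score - null) / (perfect - null)


for n_a, n_b in ((900, 100), (500, 500), (100, 900)):
    print(
        n_a, n_b,
        round(prevalence_weighted_accuracy(n_a, n_b), 3),
        round(prevalence_normalized_metric(n_a, n_b), 3),
    )
```

Under these assumptions, plain accuracy moves from about 0.86 to about 0.54 as condition B becomes more prevalent, while the 1/n_i-scaled metric stays at about 0.7 for every mix, because each condition contributes according to its sensitivity rather than its sample count.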
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and, in addition, to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server, a mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including, e.g., any kind of computer such as a server or a personal computer, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:
A processor implemented method comprising:
receiving (202), via one or more hardware processors, a dataset comprising a plurality of diagnostic samples and corresponding ground truth;
predicting (204), via the one or more hardware processors, a diagnosis corresponding to each of the plurality of diagnostic samples using a multi-label multi-class computational diagnostic model, wherein the predicted diagnosis comprises one or more diagnostic conditions;
classifying (206), via the one or more hardware processors, the predicted diagnosis in a class among a plurality of classes comprising: (i) a wrong diagnosis, (ii) a missed diagnosis, (iii) an over diagnosis and (iv) a right diagnosis;
calculating (208), via the one or more hardware processors, a first penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on the class of the predicted diagnosis;
calculating (210), via the one or more hardware processors, a second penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on a contradiction matrix;
computing (212), via the one or more hardware processors, a pre-score for each of the plurality of diagnostic samples based on the corresponding first penalty and second penalty;
obtaining (214), via the one or more hardware processors, a score corresponding to the multi-label multi-class computational diagnostic model by summing up the pre-score of each of the plurality of diagnostic samples; and
evaluating (216), via the one or more hardware processors, the multi-label multi-class computational diagnostic model with a metric that is based on (i) the score corresponding to the multi-label multi-class computational diagnostic model, (ii) a pre-computed score of a perfect multi-label multi-class computational diagnostic model whose predictions always belong to the right diagnosis class and (iii) a pre-computed score of a null multi-label multi-class computational diagnostic model which predicts null or 0 only.
The method as claimed in claim 1, wherein the predicted diagnosis for a diagnostic sample from among the plurality of diagnostic samples is classified as (i) the wrong diagnosis if the predicted diagnosis and ground truth corresponding to the diagnostic sample are disjoint, (ii) the missed diagnosis if the predicted diagnosis is a proper subset of the ground truth corresponding to the diagnostic sample, (iii) the over diagnosis if the ground truth corresponding to the diagnostic sample is a proper subset of the predicted diagnosis, or (iv) the right diagnosis otherwise.
The method as claimed in claim 1, wherein the first penalty is calculated by one of: (i) $s_i/n_i$ if the predicted diagnosis is a right diagnosis, (ii) $-s_i/n_i$ if the predicted diagnosis is a missed diagnosis, (iii) $\frac{s_i}{n^*}\left[\frac{1}{|y_k|}\left(\sum_{c_j \in y_k} w_{i,j}\right) - 1\right]$ if the predicted diagnosis is an over diagnosis, and (iv) 0 if the predicted diagnosis is a wrong diagnosis, wherein $s_i$ is pre-defined significance weight corresponding to class of diagnostic conditions in the dataset, $n_i$ is number of occurrences of the predicted diagnosis in the dataset, $n^* = \max\{n_i \mid \forall c_i \in y_k\}$, $y_k$ is ground truth corresponding to the diagnostic sample for which prediction is done, $w_{i,j}$ is a weight matrix comprising cost of misclassification, and wherein strict monotonicity of the first penalty is maintained by assigning highest first penalty to wrong diagnosis followed by missed diagnosis and over diagnosis thereby imbibing risk aversion principle of clinical diagnosis.
The method as claimed in claim 1, wherein the contradiction matrix provides contradictory and non-contradictory pairs of diagnostic conditions.
The method as claimed in claim 1, wherein the second penalty is calculated as (i) $\frac{-1}{n_i}\sum_{\forall j \text{ s.t. } c_j \in \hat{x}_k} s_j \cdot C_{ij}$ if $c_i$ is a diagnostic condition in the predicted diagnosis or (ii) 0 otherwise, wherein $n_i$ is number of occurrences of the predicted diagnosis in the dataset, $\hat{x}_k$ is the predicted diagnosis, $s_j$ is pre-defined significance weight corresponding to the class of diagnosis and $C_{ij}$ is an entry in the contradiction matrix.
A system (100), comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a dataset comprising a plurality of diagnostic samples and corresponding ground truth;
predict a diagnosis corresponding to each of the plurality of diagnostic samples using a multi-label multi-class computational diagnostic model, wherein the predicted diagnosis comprises one or more diagnostic conditions;
classify the predicted diagnosis in a class among a plurality of classes comprising: (i) a wrong diagnosis, (ii) a missed diagnosis, (iii) an over diagnosis and (iv) a right diagnosis;
calculate a first penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on the class of the predicted diagnosis;
calculate a second penalty for the diagnosis corresponding to each of the plurality of diagnostic samples based on a contradiction matrix;
compute a pre-score for each of the plurality of diagnostic samples based on the corresponding first penalty and second penalty;
obtain a score corresponding to the multi-label multi-class computational diagnostic model by summing up the pre-score of each of the plurality of diagnostic samples; and
evaluate the multi-label multi-class computational diagnostic model with a metric that is based on (i) the score corresponding to the multi-label multi-class computational diagnostic model, (ii) a pre-computed score of a perfect multi-label multi-class computational diagnostic model whose predictions always belong to the right diagnosis class and (iii) a pre-computed score of a null multi-label multi-class computational diagnostic model which predicts null or 0 only.
The system as claimed in claim 6, wherein the predicted diagnosis for a diagnostic sample from among the plurality of diagnostic samples is classified as (i) the wrong diagnosis if the predicted diagnosis and ground truth corresponding to the diagnostic sample are disjoint, (ii) the missed diagnosis if the predicted diagnosis is a proper subset of the ground truth corresponding to the diagnostic sample, (iii) the over diagnosis if the ground truth corresponding to the diagnostic sample is a proper subset of the predicted diagnosis, or (iv) the right diagnosis otherwise.
The system as claimed in claim 6, wherein the first penalty is calculated by one of: (i) $s_i/n_i$ if the predicted diagnosis is a right diagnosis, (ii) $-s_i/n_i$ if the predicted diagnosis is a missed diagnosis, (iii) $\frac{s_i}{n^*}\left[\frac{1}{|y_k|}\left(\sum_{c_j \in y_k} w_{i,j}\right) - 1\right]$ if the predicted diagnosis is an over diagnosis, and (iv) 0 if the predicted diagnosis is a wrong diagnosis, wherein $s_i$ is pre-defined significance weight corresponding to class of diagnostic conditions in the dataset, $n_i$ is number of occurrences of the predicted diagnosis in the dataset, $n^* = \max\{n_i \mid \forall c_i \in y_k\}$, $y_k$ is ground truth corresponding to the diagnostic sample for which prediction is done, $w_{i,j}$ is a weight matrix comprising cost of misclassification, and wherein strict monotonicity of the first penalty is maintained by assigning highest first penalty to wrong diagnosis followed by missed diagnosis and over diagnosis thereby imbibing risk aversion principle of clinical diagnosis.
The system as claimed in claim 6, wherein the contradiction matrix provides contradictory and non-contradictory pairs of diagnostic conditions.
The system as claimed in claim 6, wherein the second penalty is calculated as (i) $\frac{-1}{n_i}\sum_{\forall j \text{ s.t. } c_j \in \hat{x}_k} s_j \cdot C_{ij}$ if $c_i$ is a diagnostic condition in the predicted diagnosis or (ii) 0 otherwise, wherein $n_i$ is number of occurrences of the predicted diagnosis in the dataset, $\hat{x}_k$ is the predicted diagnosis, $s_j$ is pre-defined significance weight corresponding to the class of diagnosis and $C_{ij}$ is an entry in the contradiction matrix.