Suggestion: Solutions beyond "balanced accuracy" to address the prevalence shift. #1
-
Thanks for the feedback, @bastiseen! Very interesting! I'll give it a more detailed read and will get back to you shortly. In the meantime: @rrodriguezperez If I recall correctly, you raised the topic of prevalence. @jrash I think I remember you leaving some comments about this as well. I would love to hear your thoughts on this one!
-
Hi @bastiseen, thank you for sharing your paper. It is indeed relevant to the MCC metric we propose in D.2. Your paper argues that the MCC metric is not always invariant to class prevalence. We will need to read the paper more carefully as a group and decide how to address it.

While we recommend a metric for balanced performance (i.e., predictions on positives and negatives are weighted equally), I find that such metrics often have limited utility in drug discovery evaluations, because we often place higher importance on one class than the other. We provide two examples in section 3.3.1: selecting which compounds to make and which compounds not to make. In both cases you only care about performance on one class, and your performance metrics should reflect that.

You are right that if a model's performance is evaluated on a test set with a specific prevalence and the model is then applied to a data set with a very different prevalence, the evaluation performance may not accurately estimate application performance. However, the solution to this problem is to evaluate the model on a test set that is more representative of the application (i.e., the evaluation set should be sampled from the same distribution as the application set). If the evaluation is not representative of the application, you will obtain misleading metrics. I often see papers in cheminformatics use a balanced metric to sweep this problem under the rug: the prevalence problem still exists, but the reader is no longer concerned about it because the performance metric is "balanced".

The paper below gives a good explanation of how a similar issue occurs when using the AUROC curve to evaluate highly imbalanced data sets where only the positive class is of interest (e.g., virtual screening). An often-cited advantage of the AUROC curve is that it is invariant to changes in prevalence. Saito et al. show that the interpretation of AUROC changes with prevalence even though the value of the metric does not. The fact that AUROC does not change with prevalence is therefore misleading, because it does not measure the change in performance that is of interest. A similar argument can be made against other balanced metrics.

Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).
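A quick numerical sketch of that last point (my own toy example, not from Saito et al.; the Gaussian score distributions and the fixed threshold are arbitrary assumptions): the ROC curve depends only on the score distribution within each class, so diluting the positives leaves AUROC untouched while the precision you actually experience at a fixed threshold collapses.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

rng = np.random.default_rng(0)

def evaluate(n_pos, n_neg, threshold=0.0):
    # Hypothetical model scores: actives and inactives drawn from shifted Gaussians
    pos_scores = rng.normal(1.0, 1.0, n_pos)
    neg_scores = rng.normal(-1.0, 1.0, n_neg)
    y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]
    y_score = np.r_[pos_scores, neg_scores]
    y_pred = (y_score > threshold).astype(int)
    return roc_auc_score(y_true, y_score), precision_score(y_true, y_pred)

for n_neg in (1_000, 99_000):          # prevalence ~50% vs ~1%
    auroc, prec = evaluate(1_000, n_neg)
    print(f"negatives={n_neg:>6}  AUROC={auroc:.3f}  precision={prec:.3f}")
```

On a run like this, AUROC stays around 0.92 in both settings while precision drops from roughly 0.84 to about 0.05: the metric value is stable, but what it tells you about prospective hit rates is not.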
-
I am reaching out to you because I have read the above publication of yours with great interest. I am developing methods and solutions for robust machine learning model validation, so your discussion on "Evaluating Models as Binary Classifiers" resonated with my own research.
Prevalence shift in the test sets used to validate a model is another way in which misleading interpretations of validation results can arise. My publication Mind your prevalence! in the Journal of Cheminformatics discusses this issue in the context of QSAR and introduces a solution beyond balanced accuracy. In the literature, balanced accuracy is usually defined as the mean of sensitivity (recall of the positive class) and specificity (recall of the negative class, or true negative rate). In "Mind your prevalence!", balanced accuracy is instead defined as the accuracy calibrated to a balanced test set. This concept can then be applied to any metric, yielding the corresponding balanced metrics: balanced MCC, balanced precision (or balanced positive predictivity), balanced Cohen's kappa coefficient, and so on. The work also shows that MCC (and Cohen's kappa coefficient) is not independent of prevalence, contrary to what the bulk of the literature suggests. At extreme prevalence (i.e., very imbalanced test sets) and under prevalence shift, MCC and Cohen's kappa can be misleading in the same way as precision, although not to the same extent; MCC is less dependent on prevalence than precision.
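To make the calibration idea concrete, here is a minimal sketch of how I read it (the function names and the example confusion-matrix counts are mine, chosen for illustration, not taken from the paper): sensitivity and specificity do not depend on prevalence, so keep them fixed, rebuild the confusion matrix for a hypothetical test set with 50% prevalence, and compute any metric from the rebuilt counts.

```python
import math

def calibrate_to_balanced(tp, fn, tn, fp, n=1000):
    """Rebuild the confusion matrix for a hypothetical balanced test set of size n."""
    sens = tp / (tp + fn)              # recall of the positive class
    spec = tn / (tn + fp)              # true negative rate
    # Redistribute n/2 positives and n/2 negatives with the same sens/spec
    return sens * n / 2, (1 - sens) * n / 2, spec * n / 2, (1 - spec) * n / 2

def mcc(tp, fn, tn, fp):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

# Imbalanced test set (prevalence 5%): sensitivity = 0.80, specificity = 0.90
tp, fn, tn, fp = 40, 10, 855, 95
tp_b, fn_b, tn_b, fp_b = calibrate_to_balanced(tp, fn, tn, fp)

print("precision          :", tp / (tp + fp))               # ~0.30
print("balanced precision :", tp_b / (tp_b + fp_b))         # ~0.89
print("MCC                :", mcc(tp, fn, tn, fp))          # ~0.45
print("balanced MCC       :", mcc(tp_b, fn_b, tn_b, fp_b))  # ~0.70
```

The gap between MCC and balanced MCC at the same sensitivity and specificity is also a compact way to see the prevalence dependence described above.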
I would be very happy to hear your or your colleagues' thoughts on this topic. Would you consider including it in the "Evaluating Models as Binary Classifiers" discussion of your publication? If you would like to discuss further, please do not hesitate to reach out.