Suggestion: Solutions beyond "balanced accuracy" to address the prevalence shift. #1
-
Thanks for the feedback, @bastiseen! Very interesting! I'll give it a more detailed read and will get back to you shortly. In the meantime: @rrodriguezperez If I recall correctly, you raised the topic of prevalence. @jrash I think I remember you leaving some comments about this as well. I would love to hear your thoughts on this one!
-
Hi @bastiseen, thank you for sharing your paper. It is indeed relevant to the MCC metric we propose in D.2. Your paper argues that the MCC metric is not always invariant to class prevalence. We will need to read the paper more carefully as a group and decide how to address it.

While we recommend a metric for balanced performance (i.e., predictions on positives and negatives are weighted equally), I find that such metrics often have limited utility in drug discovery evaluations, because we often place higher importance on one class than the other. We provide two examples in section 3.3.1: selecting which compounds to make and which compounds not to make. In both cases you only care about performance on one class, and your performance metrics should reflect that.

You are right that if a model's performance is evaluated on a test set with a specific prevalence and the model is then applied to a data set with a very different prevalence, the evaluation performance may not accurately estimate application performance. However, the solution to this problem is to evaluate the model on a test set that is more representative of the application (i.e., the evaluation set should be sampled from the same distribution as the application set). If the evaluation is not representative of the application, you will obtain misleading metrics. I often see papers in cheminformatics use a balanced metric to sweep this problem under the rug: the prevalence problem still exists, but the reader is no longer concerned about it because the performance metric is "balanced".

The paper below gives a good explanation of how a similar issue occurs when using the AUROC curve to evaluate highly imbalanced data sets where only the positive class is of interest (e.g., virtual screening). An often-cited advantage of the AUROC curve is that it is invariant to changes in prevalence. Saito et al. show that the interpretation of AUROC changes with prevalence even though the value of the metric does not. The fact that AUROC does not change with prevalence is therefore misleading, because it does not measure the change in performance that is of interest. A similar argument can be made against other balanced metrics.

Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).
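A quick numerical sketch of that last point (my own toy example, not from Saito et al.; the Gaussian score distributions and the fixed threshold are arbitrary assumptions): the ROC curve depends only on the score distribution within each class, so diluting the positives leaves AUROC untouched while the precision you actually experience at a fixed threshold collapses.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

rng = np.random.default_rng(0)

def evaluate(n_pos, n_neg, threshold=0.0):
    # Hypothetical model scores: actives and inactives drawn from shifted Gaussians
    pos_scores = rng.normal(1.0, 1.0, n_pos)
    neg_scores = rng.normal(-1.0, 1.0, n_neg)
    y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]
    y_score = np.r_[pos_scores, neg_scores]
    y_pred = (y_score > threshold).astype(int)
    return roc_auc_score(y_true, y_score), precision_score(y_true, y_pred)

for n_neg in (1_000, 99_000):          # prevalence ~50% vs ~1%
    auroc, prec = evaluate(1_000, n_neg)
    print(f"negatives={n_neg:>6}  AUROC={auroc:.3f}  precision={prec:.3f}")
```

On a run like this, AUROC stays around 0.92 in both settings while precision drops from roughly 0.84 to about 0.05: the metric value is stable, but what it tells you about prospective hit rates is not.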
-
I am reaching out to you because I have read the above publication of yours with great interest. I am developing methods and solutions for robust machine learning model validation, so your discussion on "Evaluating Models as Binary Classifiers" resonated with my own research.
Prevalence shift in the test sets used to validate a model is another way in which misleading interpretations of validation results can arise. My publication Mind your prevalence! in the Journal of Cheminformatics discusses this issue in the context of QSAR and introduces a solution beyond balanced accuracy. In the literature, balanced accuracy is usually defined as the mean of sensitivity (recall of the positive class) and specificity (recall of the negative class, or true negative rate). In "Mind your prevalence!", balanced accuracy is instead defined as the accuracy calibrated to a balanced test set. This concept can then be applied to any metric, yielding the corresponding balanced metrics: balanced MCC, balanced precision (or balanced positive predictivity), balanced Cohen's kappa coefficient, and so on. The work also shows that MCC (and Cohen's kappa coefficient) is not independent of prevalence, contrary to what the bulk of the literature suggests. At extreme prevalence (i.e., very imbalanced test sets) and under prevalence shift, MCC and Cohen's kappa can be misleading in the same way as precision, although not to the same extent; MCC is less dependent on prevalence than precision.
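To make the calibration idea concrete, here is a minimal sketch of how I read it (the function names and the example confusion-matrix counts are mine, chosen for illustration, not taken from the paper): sensitivity and specificity do not depend on prevalence, so keep them fixed, rebuild the confusion matrix for a hypothetical test set with 50% prevalence, and compute any metric from the rebuilt counts.

```python
import math

def calibrate_to_balanced(tp, fn, tn, fp, n=1000):
    """Rebuild the confusion matrix for a hypothetical balanced test set of size n."""
    sens = tp / (tp + fn)              # recall of the positive class
    spec = tn / (tn + fp)              # true negative rate
    # Redistribute n/2 positives and n/2 negatives with the same sens/spec
    return sens * n / 2, (1 - sens) * n / 2, spec * n / 2, (1 - spec) * n / 2

def mcc(tp, fn, tn, fp):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

# Imbalanced test set (prevalence 5%): sensitivity = 0.80, specificity = 0.90
tp, fn, tn, fp = 40, 10, 855, 95
tp_b, fn_b, tn_b, fp_b = calibrate_to_balanced(tp, fn, tn, fp)

print("precision          :", tp / (tp + fp))               # ~0.30
print("balanced precision :", tp_b / (tp_b + fp_b))         # ~0.89
print("MCC                :", mcc(tp, fn, tn, fp))          # ~0.45
print("balanced MCC       :", mcc(tp_b, fn_b, tn_b, fp_b))  # ~0.70
```

The gap between MCC and balanced MCC at the same sensitivity and specificity is also a compact way to see the prevalence dependence described above.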
I would be very happy to hear your or your colleagues' thoughts on this topic. Would you consider including it in the "Evaluating Models as Binary Classifiers" discussion of your publication? If you would like to discuss further, please do not hesitate to reach out.