Clarification Needed: OOD vs. ID as Positive Class in Evaluation Metrics #289
-
Hi everyone,

We're currently working on OOD detection and have run into some confusion about the convention for defining the positive class (i.e., whether OOD or ID samples are labeled as '1') during evaluation. This directly affects the interpretation and comparability of key metrics such as FPR95 (False Positive Rate at 95% True Positive Rate). We've observed different conventions across prominent repositories: some treat ID samples as the positive class, while others treat OOD samples as positive.
This discrepancy has a significant impact on the FPR95 metric. Let's clarify its definition in this context:
FPR95 is the False Positive Rate measured at the operating point where the True Positive Rate is fixed at 95%.

- If ID is the positive class (1): the threshold is chosen so that 95% of ID samples are correctly accepted as ID, and FPR95 is the fraction of OOD samples that are wrongly accepted as ID.
- If OOD is the positive class (1): the threshold is chosen so that 95% of OOD samples are correctly detected as OOD, and FPR95 is the fraction of ID samples that are wrongly flagged as OOD.
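To make the difference concrete, here is a minimal numerical sketch (not taken from any particular repository; the Gaussian scores are made up as a stand-in for a real detector) that computes FPR@95TPR under both conventions on the same set of scores:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical detector scores: higher = "more in-distribution" (e.g., max softmax probability).
id_scores = rng.normal(loc=0.8, scale=0.1, size=10_000)
ood_scores = rng.normal(loc=0.5, scale=0.15, size=10_000)

def fpr_at_95tpr(pos_scores, neg_scores):
    """FPR at the threshold that still accepts 95% of the positive class.

    Convention: higher score => predicted positive.
    """
    thresh = np.percentile(pos_scores, 5)        # ~95% of positives score >= thresh
    return float(np.mean(neg_scores >= thresh))  # negatives wrongly predicted positive

# Convention A: ID is the positive class (score already means "ID-ness").
fpr95_id_positive = fpr_at_95tpr(id_scores, ood_scores)

# Convention B: OOD is the positive class; negate the score so higher => "OOD-ness".
fpr95_ood_positive = fpr_at_95tpr(-ood_scores, -id_scores)

print(f"FPR95 with ID  as positive: {fpr95_id_positive:.3f}")   # OOD accepted as ID
print(f"FPR95 with OOD as positive: {fpr95_ood_positive:.3f}")  # ID flagged as OOD
```

On the same synthetic scores the two conventions generally report different numbers, which is exactly the comparability problem described above.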
As you can see, the two interpretations yield entirely different numerical values and meanings for FPR95, which makes direct comparison across papers or benchmarks that use different conventions challenging and potentially misleading. Our questions are:

1. Which convention does OpenOOD adopt, and why?
2. Is there a convention you would recommend the community standardize on?
3. How should results be compared across papers that follow different conventions?
We believe clarifying this point is crucial for advancing reproducible research in OOD detection. Thank you for your time and insights!
-
Thank you for this post; this is indeed an accurate observation, and one we are aware of. Please see my answers below.
We intentionally chose to treat OOD as positive and ID as negative in OpenOOD v1.5 for conventional/historical reasons. In conventional ML (more specifically, conventional anomaly detection), it has been standard to treat the "abnormal" class as positive. This is also the practice adopted by the seminal paper for modern OOD detection on neural networks, "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks". However, some early and well-known works chose the opposite setup for reasons that are unclear; later works followed them, and momentum built up around that choice. This is why the discrepancy exists and why it can be confusing.
In line with the discussion above, we (or at least I, personally) would recommend defining OOD as the positive class.
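For reference, here is a minimal sketch of how FPR95 can be computed under this recommended convention (OOD labeled 1, ID labeled 0) using scikit-learn's roc_curve; the helper name and the assumption that higher scores mean "more OOD" are illustrative and not part of OpenOOD's API:

```python
import numpy as np
from sklearn.metrics import roc_curve

def fpr_at_95tpr_ood_positive(id_scores, ood_scores):
    """FPR95 with OOD as the positive class (label 1) and ID as negative (label 0).

    `id_scores` / `ood_scores` are detector outputs where higher means "more OOD"
    (e.g., negative max softmax probability).
    """
    labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
    scores = np.concatenate([id_scores, ood_scores])
    fpr, tpr, _ = roc_curve(labels, scores)   # positive class = 1 = OOD
    idx = np.searchsorted(tpr, 0.95)          # first operating point with TPR >= 95%
    return fpr[idx]                           # fraction of ID samples flagged as OOD
```

Reading FPR95 off the ROC curve this way also keeps it consistent with AUROC computed from the same labels and scores.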
This is exactly why we put up OpenOOD in the first place! Beyond this discrepancy in the positive-class definition, there are many other paper-to-paper discrepancies in evaluation data, experimental setup, etc., which make direct comparison of reported numbers quite difficult. We hope that this project can ultimately motivate and lead toward a universal definition and setup for OOD evaluation.