Problem
Currently, it's difficult to reliably compare evaluation results. While two evaluations might link to the same test set, the underlying data in that test set could have changed between runs.
For example, a user might run an evaluation, add new data to the test set, and then run a second evaluation. Because both runs point to the same test set, the results look comparable, when in fact they were generated from two different versions of the data. This lack of versioning can lead to inaccurate conclusions and undermine the reproducibility of experiments.
Proposed Solution
Introduce a version control system for test sets.
When an evaluation is run, it should be permanently linked to a specific, immutable version of the test set. This ensures that any comparison between evaluations is valid and based on a consistent data baseline.
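For illustration, here is a minimal sketch of what such an immutable version record and link could look like. The names (`TestSetVersion`, `Evaluation`) are hypothetical and not part of any existing API; this is just one way the proposal could be modeled.

```python
# Minimal sketch of an immutable test-set version and an evaluation pinned to it.
# All names here (TestSetVersion, Evaluation) are hypothetical, for illustration only.
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TestSetVersion:
    """A frozen snapshot of a test set; any edit must produce a new version."""
    test_set_id: str
    version: int
    items: tuple  # a tuple (not a list) so the snapshot itself cannot be mutated

    @property
    def content_hash(self) -> str:
        # Hash the items so two evaluations can be checked for the same data baseline.
        payload = json.dumps(list(self.items), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Evaluation:
    """An evaluation run permanently linked to exactly one test set version."""
    name: str
    test_set_version: TestSetVersion
```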
Desired Workflow
A user runs Evaluation A on Test Set v1.
The user then modifies the test set (e.g., adds new items), which automatically creates Test Set v2.
The user runs Evaluation B on the updated Test Set v2.
The system should clearly show that Evaluation A used v1 and Evaluation B used v2, preventing invalid comparisons. This makes the entire workflow more systematic, transparent, and scientifically sound.
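Continuing the hypothetical sketch above, the desired workflow could then be expressed roughly like this, with a simple guard that flags comparisons across different versions:

```python
# Hypothetical usage, continuing the sketch above (illustrative names and data only).
v1 = TestSetVersion(test_set_id="qa-suite", version=1,
                    items=({"input": "2+2", "expected": "4"},))
eval_a = Evaluation(name="Evaluation A", test_set_version=v1)

# Modifying the test set never mutates v1; it yields a new immutable version.
v2 = TestSetVersion(test_set_id="qa-suite", version=2,
                    items=v1.items + ({"input": "3+3", "expected": "6"},))
eval_b = Evaluation(name="Evaluation B", test_set_version=v2)


def comparable(a: Evaluation, b: Evaluation) -> bool:
    """Results are only comparable when both runs used the same data baseline."""
    return a.test_set_version.content_hash == b.test_set_version.content_hash


print(comparable(eval_a, eval_b))  # False: Evaluation A used v1, Evaluation B used v2
```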