Problem
Currently, it's difficult to reliably compare evaluation results. While two evaluations might link to the same test set, the underlying data in that test set could have changed between runs.
For example, a user might run an evaluation, add new data to the test set, and then run a second evaluation. Because both runs point to the same test set, the results look comparable, when in fact they were generated from two different versions of the data. This lack of versioning can lead to inaccurate conclusions and undermine the reproducibility of experiments.
Proposed Solution
Introduce a version control system for test sets.
When an evaluation is run, it should be permanently linked to a specific, immutable version of the test set. This ensures that any comparison between evaluations is valid and based on a consistent data baseline.
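For illustration, here is a minimal sketch of what such an immutable version record and link could look like. The names (`TestSetVersion`, `Evaluation`) are hypothetical and not part of any existing API; this is just one way the proposal could be modeled.

```python
# Minimal sketch of an immutable test-set version and an evaluation pinned to it.
# All names here (TestSetVersion, Evaluation) are hypothetical, for illustration only.
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TestSetVersion:
    """A frozen snapshot of a test set; any edit must produce a new version."""
    test_set_id: str
    version: int
    items: tuple  # a tuple (not a list) so the snapshot itself cannot be mutated

    @property
    def content_hash(self) -> str:
        # Hash the items so two evaluations can be checked for the same data baseline.
        payload = json.dumps(list(self.items), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Evaluation:
    """An evaluation run permanently linked to exactly one test set version."""
    name: str
    test_set_version: TestSetVersion
```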
Desired Workflow
A user runs Evaluation A on Test Set v1.
The user then modifies the test set (e.g., adds new items), which automatically creates Test Set v2.
The user runs Evaluation B on the updated Test Set v2.
The system should clearly show that Evaluation A used v1 and Evaluation B used v2, preventing invalid comparisons. This makes the entire workflow more systematic, transparent, and scientifically sound.
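Continuing the hypothetical sketch above, the desired workflow could then be expressed roughly like this, with a simple guard that flags comparisons across different versions:

```python
# Hypothetical usage, continuing the sketch above (illustrative names and data only).
v1 = TestSetVersion(test_set_id="qa-suite", version=1,
                    items=({"input": "2+2", "expected": "4"},))
eval_a = Evaluation(name="Evaluation A", test_set_version=v1)

# Modifying the test set never mutates v1; it yields a new immutable version.
v2 = TestSetVersion(test_set_id="qa-suite", version=2,
                    items=v1.items + ({"input": "3+3", "expected": "6"},))
eval_b = Evaluation(name="Evaluation B", test_set_version=v2)


def comparable(a: Evaluation, b: Evaluation) -> bool:
    """Results are only comparable when both runs used the same data baseline."""
    return a.test_set_version.content_hash == b.test_set_version.content_hash


print(comparable(eval_a, eval_b))  # False: Evaluation A used v1, Evaluation B used v2
```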