Hello,
I have a concern about how the judge score is calculated. Currently, the score is the same whether `coherency_win` is `tie` or `original`. But `tie` means the model produced a coherent conversation (just not a better one than the original), while `original` means the model produced an incoherent conversation (in most cases).
Because you are going to rely only on the judge score and stop doing a separate coherency check, this needs to be improved. Imagine two models from miners. Both have the same `realism_win` and `entertainment_win` scores, but one always gets `tie` for `coherency_win` and the other always gets `original`. With the current implementation, both models would get the same total score, which is wrong: the model that always gets `original` for `coherency_win` is clearly incoherent.
One option: when you calculate the score, give two points when the `generated` conversation wins, one point when it's a `tie`, and zero points when the `original` conversation wins. For example:
```python
from typing import List


def collect_judge_scores(scores: List):
    try:
        # Initialize tally structure
        tally = {
            "realism": {"original": 0, "generated": 0, "tie": 0},
            "entertainment": {"original": 0, "generated": 0, "tie": 0},
            "coherency": {"original": 0, "generated": 0, "tie": 0},
        }
        valid = 0
        # Process each score entry
        for item in scores:
            # Skip corrupted/incomplete data
            if not all(key in item for key in ["realism_win", "entertainment_win", "coherency_win"]):
                continue
            valid += 1
            # Tally wins for each category
            tally["realism"][item["realism_win"]] += 1
            tally["entertainment"][item["entertainment_win"]] += 1
            tally["coherency"][item["coherency_win"]] += 1
        # Calculate individual totals
        total_original = sum(cat["original"] for cat in tally.values())
        total_generated = sum(cat["generated"] for cat in tally.values())
        total_ties = sum(cat["tie"] for cat in tally.values())
        # Calculate win rate: 2 points per generated win, 1 per tie, 0 per original win,
        # normalized by the maximum (3 categories * 2 points per valid entry)
        win_rate = (total_generated * 2 + total_ties) / (valid * 3 * 2) if valid > 0 else 0
        # Combine into totals dict
        totals = {
            "total_original": total_original,
            "total_generated": total_generated,
            "total_ties": total_ties,
            "by_category": tally,
            "valid": valid,
            "win_rate": win_rate,
        }
        return totals
    except Exception as e:
        print(f"Error parsing file: {str(e)}")
        return None
```
So the only change is in this formula:

```python
win_rate = (total_generated * 2 + total_ties) / (valid * 3 * 2) if valid > 0 else 0
```
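
To make the two-miner scenario concrete, here is a minimal sketch comparing the two scoring rules. Note the assumptions: `current_win_rate` is my guess at the existing behavior (only `generated` verdicts count, so `tie` and `original` both contribute zero), and the helper names `proposed_win_rate` and `make_scores` as well as the sample data are hypothetical, made up only for this illustration.

```python
from typing import Dict, List

CATEGORIES = ("realism_win", "entertainment_win", "coherency_win")


def current_win_rate(scores: List[Dict[str, str]]) -> float:
    """ASSUMED current rule: only "generated" verdicts earn credit."""
    if not scores:
        return 0.0
    wins = sum(item[key] == "generated" for item in scores for key in CATEGORIES)
    return wins / (len(scores) * len(CATEGORIES))


def proposed_win_rate(scores: List[Dict[str, str]]) -> float:
    """Proposed rule: 2 points for generated, 1 for tie, 0 for original."""
    if not scores:
        return 0.0
    points = {"generated": 2, "tie": 1, "original": 0}
    total = sum(points[item[key]] for item in scores for key in CATEGORIES)
    # Maximum possible is 2 points in each category for every entry
    return total / (len(scores) * len(CATEGORIES) * 2)


def make_scores(coherency: str, n: int = 10) -> List[Dict[str, str]]:
    """Hypothetical sample data: same realism/entertainment, fixed coherency verdict."""
    return [
        {
            "realism_win": "generated",
            "entertainment_win": "generated",
            "coherency_win": coherency,
        }
        for _ in range(n)
    ]


miner_a = make_scores("tie")       # always coherent, never better than the original
miner_b = make_scores("original")  # always loses on coherency

print("current: ", current_win_rate(miner_a), current_win_rate(miner_b))
print("proposed:", proposed_win_rate(miner_a), proposed_win_rate(miner_b))
```

Under the assumed current rule both miners score 2/3, so they are indistinguishable. Under the proposed rule miner A scores 5/6 and miner B scores 2/3, so the miner that always loses on coherency ranks lower.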