Required changes to collect_judge_scores function #160

Hello,

I have a concern about how the judge score is calculated. Currently, a sample contributes the same amount to the score whether `coherency_win` is `tie` or `original`. These two outcomes mean different things, though: `tie` means the model produced a coherent conversation (just not better than the original one), whereas `original` means, in most cases, that the model produced an incoherent conversation.
Since you plan to rely on the judge score alone and drop the separate coherency check, this needs to be fixed. Imagine two miners' models with identical `realism_win` and `entertainment_win` results, where one always gets `tie` for `coherency_win` and the other always gets `original`. With the current implementation, both models receive the same total score, which is wrong: the model that always gets `original` for `coherency_win` is clearly the incoherent one.

One option is to award two points when the `generated` conversation wins, one point for a `tie`, and zero points when the `original` conversation wins. For example:

```python
from typing import List


def collect_judge_scores(scores: List):
    try:
        # Initialize tally structure
        tally = {
            "realism": {"original": 0, "generated": 0, "tie": 0},
            "entertainment": {"original": 0, "generated": 0, "tie": 0},
            "coherency": {"original": 0, "generated": 0, "tie": 0},
        }

        valid = 0

        # Process each score entry
        for item in scores:
            # Skip corrupted/incomplete data
            if not all(key in item for key in ["realism_win", "entertainment_win", "coherency_win"]):
                continue

            valid += 1

            # Tally wins for each category
            tally["realism"][item["realism_win"]] += 1
            tally["entertainment"][item["entertainment_win"]] += 1
            tally["coherency"][item["coherency_win"]] += 1

        # Calculate individual totals
        total_original = sum(cat["original"] for cat in tally.values())
        total_generated = sum(cat["generated"] for cat in tally.values())
        total_ties = sum(cat["tie"] for cat in tally.values())

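        # Calculate win rate: 2 points per generated win, 1 per tie, 0 per original win,
        # normalized by the maximum of 2 points in each of the 3 categories
        win_rate = (total_generated * 2 + total_ties) / (valid * 3 * 2) if valid > 0 else 0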

        # Combine into totals dict
        totals = {
            "total_original": total_original,
            "total_generated": total_generated,
            "total_ties": total_ties,
            "by_category": tally,
            "valid": valid,
            "win_rate": win_rate,
        }

        return totals
    except Exception as e:
        print(f"Error collecting judge scores: {e}")
        return None
```

So the change is in this formula:

```python
win_rate = (total_generated * 2 + total_ties) / (valid * 3 * 2) if valid > 0 else 0
```
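To make the difference concrete, here is a small sketch of the two-miner scenario described above. The numbers are made up for illustration, and `generated_only_win_rate` is only my assumption of how the current scoring behaves (counting `generated` wins and nothing else); I have not copied it from the repository. `weighted_win_rate` is the proposed formula pulled out into a hypothetical helper.

```python
def weighted_win_rate(total_generated: int, total_ties: int, valid: int) -> float:
    """Proposed formula: 2 points per generated win, 1 per tie, 0 per original win."""
    max_points = valid * 3 * 2  # 3 categories, at most 2 points each
    return (total_generated * 2 + total_ties) / max_points if valid > 0 else 0.0


def generated_only_win_rate(total_generated: int, valid: int) -> float:
    """ASSUMPTION: stand-in for the current behaviour, where only generated wins count."""
    return total_generated / (valid * 3) if valid > 0 else 0.0


# Hypothetical data: 10 valid judgements per miner. Both miners win realism
# and entertainment 5 times each, so both have 10 generated wins in total.
valid = 10
generated_wins = 10

# Miner A always gets "tie" on coherency; miner B always gets "original".
ties_a, ties_b = 10, 0

old_a = generated_only_win_rate(generated_wins, valid)
old_b = generated_only_win_rate(generated_wins, valid)
new_a = weighted_win_rate(generated_wins, ties_a, valid)
new_b = weighted_win_rate(generated_wins, ties_b, valid)

print(f"generated-only: A={old_a:.3f} B={old_b:.3f}")  # identical scores
print(f"weighted:       A={new_a:.3f} B={new_b:.3f}")  # A ranks above B
```

With the weighted formula, the coherent miner (A) scores higher than the incoherent one (B), while a formula that ignores ties cannot tell them apart.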
