
Required changes to collect_judge_scores function #160


Description

@itorgov

Hello,

I have a concern about how you calculate the judge score. Currently, the score comes out the same whether coherency_win is tie or original. But in fact, when it’s tie, the model produced a coherent conversation (just not one better than the original), and when it’s original, the model produced an incoherent conversation (in most cases).

Because you are going to rely on the judge score alone and stop doing a separate coherency check, this has to be improved. Imagine two models from miners. Both have the same realism_win and entertainment_win scores, but one always gets tie on the coherency_win metric while the other always gets original. With your current implementation, both models would get the same total score, which is clearly wrong: the model that always gets original on the coherency_win metric is incoherent.
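To make the failure mode concrete, here is a minimal sketch. The current formula is not shown in this issue, so the sketch assumes the existing implementation only credits outright generated wins (treating tie and original alike); current_score is a hypothetical stand-in, not the real code:

from typing import Dict, List

def current_score(scores: List[Dict[str, str]]) -> float:
    # Hypothetical stand-in for the current scoring: only outright
    # "generated" wins earn credit, so "tie" and "original" collapse
    # into the same outcome.
    wins = sum(
        item[key] == "generated"
        for item in scores
        for key in ("realism_win", "entertainment_win", "coherency_win")
    )
    return wins / (len(scores) * 3)

# Two models: identical realism/entertainment results, opposite coherency outcomes.
model_a = [{"realism_win": "generated", "entertainment_win": "generated", "coherency_win": "tie"}] * 10
model_b = [{"realism_win": "generated", "entertainment_win": "generated", "coherency_win": "original"}] * 10

print(current_score(model_a))  # 0.666...
print(current_score(model_b))  # 0.666... -- same score, despite incoherent output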

Let’s say, when you calculate the score, you give two points when the generated conversation wins, one point when it’s a tie, and zero points when the original conversation wins. For example:

from typing import List


def collect_judge_scores(scores: List):
    try:
        # Initialize tally structure
        tally = {
            "realism": {"original": 0, "generated": 0, "tie": 0},
            "entertainment": {"original": 0, "generated": 0, "tie": 0},
            "coherency": {"original": 0, "generated": 0, "tie": 0},
        }

        valid = 0

        # Process each score entry
        for item in scores:
            # Skip corrupted/incomplete data
            if not all(key in item for key in ["realism_win", "entertainment_win", "coherency_win"]):
                continue

            valid += 1

            # Tally wins for each category
            tally["realism"][item["realism_win"]] += 1
            tally["entertainment"][item["entertainment_win"]] += 1
            tally["coherency"][item["coherency_win"]] += 1
        # Calculate individual totals
        total_original = sum(cat["original"] for cat in tally.values())
        total_generated = sum(cat["generated"] for cat in tally.values())
        total_ties = sum(cat["tie"] for cat in tally.values())

        # Calculate win rate
        win_rate = (total_generated * 2 + total_ties) / (valid * 3 * 2) if valid > 0 else 0

        # Combine into totals dict
        totals = {
            "total_original": total_original,
            "total_generated": total_generated,
            "total_ties": total_ties,
            "by_category": tally,
            "valid": valid,
            "win_rate": win_rate,
        }

        return totals
    except Exception as e:
        print(f"Error collecting judge scores: {str(e)}")
        return None
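
For instance, fed the two hypothetical models from above (identical realism and entertainment results, but always tie versus always original on coherency), this version tells them apart:

scores_always_tie = [
    {"realism_win": "generated", "entertainment_win": "generated", "coherency_win": "tie"}
    for _ in range(10)
]
scores_always_original = [
    {"realism_win": "generated", "entertainment_win": "generated", "coherency_win": "original"}
    for _ in range(10)
]

print(collect_judge_scores(scores_always_tie)["win_rate"])       # (20*2 + 10) / 60 = 0.8333...
print(collect_judge_scores(scores_always_original)["win_rate"])  # (20*2 + 0) / 60 = 0.6666...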

So the change is in this formula:

win_rate = (total_generated * 2 + total_ties) / (valid * 3 * 2) if valid > 0 else 0
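
Each valid entry contributes three category judgments, and each judgment is worth at most two points, so the denominator valid * 3 * 2 is the maximum attainable score and win_rate always falls in [0, 1]. As a made-up example: with 25 valid entries where the generated conversation wins 40 judgments, ties 20, and loses 15 (40 + 20 + 15 = 75 = 25 * 3), win_rate = (40 * 2 + 20) / (25 * 3 * 2) = 100 / 150 ≈ 0.667.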
