Judgemark V2 is a benchmark that evaluates how well a language model can judge creative writing. Instead of relying on simple pairwise preferences, Judgemark V2 prompts the judge model to assign numeric scores for multiple literary criteria (e.g., “Nuanced Characters,” “Overwrought,” “Emotionally Engaging”). It then aggregates those scores, measures how consistent and discriminative they are, and derives a final numeric rating of the judge model’s performance.
The Judgemark leaderboard can be found here: https://eqbench.com/judgemark-v2.html
- Complex Numeric Scoring: Requires the judge model to provide 0–10 scores for dozens of criteria, highlighting any shortcomings in following complex instructions.
- Raw & Calibrated Scores: The system calculates a “raw” Judgemark score from the judge’s out-of-the-box distribution, and a “calibrated” score after normalizing the distribution for fairer cross-model comparisons.
- Stability & Separability Metrics: Goes beyond correlation to measure how stable the judge’s rankings are across repeated runs, and how well it separates strong from weak creative outputs.
- Threaded Execution: Supports multi-threaded item processing, drastically reducing the time required to score multiple creative samples.
- Clone the repository:

  ```bash
  git clone https://github.com/EQ-bench/Judgemark-v2.git
  cd Judgemark-v2
  ```

- Install Python dependencies (make sure you’re on Python 3.9+):

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables to include your judge model’s API credentials. For example, if you’re using OpenAI-compatible endpoints:

  ```bash
  # (in .env or system env)
  export OPENAI_API_KEY="sk-..."
  export OPENAI_API_URL="https://openrouter.ai/api/v1/chat/completions"
  ```
- Run the benchmark via the main script `judgemark_v2.py`. For instance:

  ```bash
  python judgemark_v2.py \
      --judge-model "openai/gpt-4o-mini" \
      --samples-file data/judgemark_v2.1_samples.json \
      --prompts-file data/judge_prompts.json \
      --runs-file my_judgemark_runs.json \
      --threads 20 \
      --num-runs 1 \
      --save-raw-judge-output
  ```
- `--judge-model` (required): The model identifier (e.g. `openai/gpt-4`, `anthropic/claude-v1`).
- `--samples-file`: Path to the JSON with creative-writing samples to be judged. Default: `data/judgemark_v2.1_samples.json`.
- `--prompts-file`: Path to the JSON with partial prompts for the judge. Default: `data/judge_prompts.json`.
- `--runs-file`: The output JSON to store final run results. Default: `judgemark_v2_runs.json`.
- `--run-id`: A custom run ID for continuing or naming a run (optional).
- `--threads`: Number of threads for parallel scoring. Default: `6`.
- `--verbosity`: Log verbosity: one of `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
- `--num-runs`: Number of times to repeat the entire benchmark. Default: `1`.
- `--save-raw-judge-output`: Store the raw text responses from the judge into the results JSON.
Here’s how a benchmark run works:

- Reading In Samples: The script loads `samples_file`, which contains completions to creative writing prompts from multiple “writer models.”
- Generating Judge Prompts: For each completion, we load a judge prompt from `prompts_file`. This typically includes instructions like:

  ```
  Please assign numeric scores (0-10) for these criteria:
  - Nuanced Characters
  - Overwrought
  - ...
  [TEST MODEL RESPONSE]
  ...
  ```

- Sending Requests to the Judge Model: Each completion + prompt is sent to the `--judge-model` via the functions in `utils/api.py`. We specify a moderate temperature (often `0.5`) and top-k for variability.
- Parsing the Judge Output: The script captures lines like `Nuanced Characters: 8` or `Weak Dialogue: 3`, extracts the numeric scores, and aggregates them into a single raw score. Negative criteria (like “Weak Dialogue”), for which 10 = worst, are inverted before aggregation (see the parsing sketch after this list).
- Storing & Re-Trying: Results are saved in your designated `runs-file`. If an item fails or returns incomplete scores, the script can retry it in subsequent runs without overwriting previous data.
- Final Judgemark Scores: Once all samples are scored:
  - A raw Judgemark score is computed from the distribution of assigned scores.
  - A calibrated score is computed after normalizing each judge’s “score spread” to a standard distribution anchored at the mean, the 25th & 75th percentiles, and the upper & lower range. Calibration linearly transforms the distribution from these anchor points onto an ideal distribution with a 0-10 range, a mean of 5, and fixed 25th & 75th percentiles (a calibration sketch follows this list).
  - Additional metrics quantify how consistent (stable) and discriminative the judge is.
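To make the parsing step concrete, here is a minimal sketch of how score lines could be extracted and negative criteria inverted. The criteria set, the regular expression, and the simple mean used as the per-item aggregate are illustrative assumptions, not the repository’s exact implementation.

```python
import re

# Illustrative criteria set; the real lists come from the judge prompts.
NEGATIVE_CRITERIA = {"Overwrought", "Weak Dialogue"}

# Matches lines like "Nuanced Characters: 8" or "Weak Dialogue: 3".
SCORE_LINE = re.compile(r"^\s*(?P<criterion>[A-Za-z][A-Za-z '\-]*?)\s*:\s*(?P<score>\d+(?:\.\d+)?)\s*$")

def parse_judge_scores(judge_text: str) -> dict:
    """Extract per-criterion scores and flip negatively phrased criteria so that
    higher always means better before aggregation."""
    scores = {}
    for line in judge_text.splitlines():
        match = SCORE_LINE.match(line)
        if not match:
            continue
        criterion = match.group("criterion").strip()
        value = float(match.group("score"))
        if criterion in NEGATIVE_CRITERIA:
            value = 10.0 - value  # a raw 10 on these criteria means "worst"
        scores[criterion] = value
    return scores

judge_text = "Nuanced Characters: 8\nOverwrought: 3\nWeak Dialogue: 2"
parsed = parse_judge_scores(judge_text)
raw_item_score = sum(parsed.values()) / len(parsed)  # simple mean as an illustrative aggregate
```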
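The calibration step can be pictured as a piecewise-linear remap between anchor points of the observed score distribution and an ideal one. In the sketch below, the target 25th/75th percentiles of 2.5 and 7.5 are assumed purely for illustration; the actual anchor targets and how the calibrated distribution feeds into the final score are defined by the benchmark code.

```python
import numpy as np

def calibrate(raw_scores, q25_target=2.5, q75_target=7.5):
    """Piecewise-linear remap of a judge's raw scores onto a 0-10 range with a mean of 5.
    The percentile targets are assumptions for illustration only."""
    raw = np.asarray(raw_scores, dtype=float)
    # Observed anchors: lower range, 25th percentile, mean, 75th percentile, upper range.
    # np.interp expects increasing anchor points, which holds for typical score spreads.
    anchors = [raw.min(), np.percentile(raw, 25), raw.mean(), np.percentile(raw, 75), raw.max()]
    targets = [0.0, q25_target, 5.0, q75_target, 10.0]
    return np.interp(raw, anchors, targets)

# A narrow, high-centred spread gets stretched out towards the full 0-10 range.
print(calibrate([4.5, 5.0, 5.5, 6.0, 6.5, 7.0]))
```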
The output JSON in your `--runs-file` will contain many details, including per-model breakdowns, iteration-level stats, and final composite scores:

- `final_judgemark_score`: The primary benchmark result (based on the calibrated distribution). A higher value suggests better correlation with reference preferences, stronger separation between good and weak writing, and higher consistency.
- `final_judgemark_score_raw`: A non-calibrated version that shows how well the judge performs “out of the box.”
- Per-model details: Found under `results[MODEL_NAME]`, including each snippet’s aggregated raw score and partial criterion scores.
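As a quick sanity check once a run finishes, you can pull the headline numbers back out of the runs file. The exact nesting of the JSON may differ from this sketch (for example, results may be keyed by run ID), so it simply walks the file looking for the score keys described above:

```python
import json

def find_scores(node, wanted=("final_judgemark_score", "final_judgemark_score_raw")):
    """Recursively collect the headline score keys wherever they appear in the runs JSON."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                found[key] = value
            else:
                found.update(find_scores(value, wanted))
    elif isinstance(node, list):
        for item in node:
            found.update(find_scores(item, wanted))
    return found

with open("my_judgemark_runs.json") as f:  # the file passed via --runs-file
    print(find_scores(json.load(f)))
```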
You can also enable visualization: the code in `utils/visualization.py` produces bar charts, heatmaps, and scatter plots illustrating how the judge assigned scores across models.
Contributions and bug reports are welcome! If you’d like to add new features—such as custom scoring criteria, improved calibration, or alternative reference sets—feel free to open a PR or file an issue.
This project is licensed under the MIT License. See the `LICENSE` file for more details.
- LMSys Chatbot Arena -- the source for the rankings used in the benchmark for human preference correlation.
Happy Judging! If you have any questions, reach out via GitHub Issues or contact the maintainers.