This is the codebase for J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
- 📚 Paper: arXiv
- 🔬 ReasoningJudgeBench: Huggingface Dataset
ReasoningJudgeBench is released as CC-BY-NC-4.0.
To keep pace with the rapid development of large language models (LLMs), model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel at evaluating relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, respectively, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.
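For intuition only, here is a rough, non-authoritative sketch of the idea the abstract describes: a pairwise judging prompt and its response-order-swapped counterpart are treated as the same ("equivalent") initial state, so rollouts from both orderings share one group when computing group-relative advantages. All names below are hypothetical and the details are assumptions, not the paper's implementation.

```python
# Hypothetical sketch (NOT the official EIS-GRPO implementation): pool rewards from
# order-swapped prompts into one group before GRPO-style advantage normalization.
import numpy as np

def render_pairwise_prompt(question: str, resp_a: str, resp_b: str) -> str:
    # Hypothetical prompt renderer; the real templates live in the repo's prompt files.
    return (f"Question:\n{question}\n\nResponse A:\n{resp_a}\n\n"
            f"Response B:\n{resp_b}\n\nWhich response is better?")

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    # Standard GRPO-style normalization: advantage = (reward - group mean) / group std.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Two order-swapped prompts for the same underlying comparison...
prompt_ab = render_pairwise_prompt("2+2=?", "4", "5")
prompt_ba = render_pairwise_prompt("2+2=?", "5", "4")

# ...with per-rollout rewards (e.g., 1 if the judgment is correct, else 0).
rewards_ab = np.array([1.0, 0.0, 1.0, 1.0])  # rollouts from the A-first ordering
rewards_ba = np.array([0.0, 0.0, 1.0, 0.0])  # rollouts from the B-first ordering

# Pooling both orderings into a single group means rollouts whose verdict flips with
# response position receive lower relative advantages, which is one way to read the
# positional-bias robustness described in the abstract.
pooled = np.concatenate([rewards_ab, rewards_ba])
print(group_relative_advantages(pooled))
```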
This code was tested with Python 3.10.16 and PyTorch 2.5.1.
```bash
conda create --name rjb python=3.10.16 -y
conda activate rjb
pip install uv
uv pip install torch==2.5.1
uv pip install -r requirements.txt
```
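Optionally, a quick sanity check that the pinned PyTorch build and a GPU are visible before proceeding:

```python
# Optional environment check: verify the PyTorch version installed above and GPU visibility.
import torch

print(torch.__version__)          # expected: 2.5.1, as tested above
print(torch.cuda.is_available())  # serving with vLLM requires a visible GPU
```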
The `aime_pairwise` split requires 24K model context. All other splits were evaluated with 16K context in the paper.
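To inspect the benchmark and its splits locally, you can load it with the `datasets` library. The dataset id below is a placeholder; use the id from the Hugging Face link above.

```python
# Minimal sketch of listing the benchmark splits with the `datasets` library.
# The dataset id is a placeholder, not the actual Hugging Face repo name.
from datasets import load_dataset

rjb = load_dataset("<HF_DATASET_ID_FOR_REASONINGJUDGEBENCH>")
for split_name, split in rjb.items():
    print(split_name, len(split))
```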
ReasoningJudgeBench uses vLLM for inference, which is installed as part of `requirements.txt`. First, serve your model with vLLM. An example script is provided in `scripts/serve_model.sh`.
```bash
export HF_TOKEN=<YOUR_HF_TOKEN_HERE>
judge_model_name=<FULL_JUDGE_PATH_OR_HF_NAME>

CUDA_VISIBLE_DEVICES=0 vllm serve ${judge_model_name} \
    --tensor_parallel_size=1 \
    --max_model_len=16384 \
    --gpu_memory_utilization=0.8 \
    --port 8001 \
    --disable-log-stats \
    --disable-log-requests
```
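Before running inference, you can optionally confirm the server is reachable through the OpenAI-compatible API that `vllm serve` exposes (the port below matches the script above):

```python
# Optional sanity check that the vLLM server is up before running main.py.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")  # vLLM ignores the key
print([m.id for m in client.models.list().data])  # should list the served judge model
```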
Next, run inference with `main.py`. An example script is provided in `scripts/run_eval.sh`.
```bash
# --output_dir defaults to ./results
# --judge_model is used for output filepath naming purposes only
# --base_url must use the same port as the model served above
# --num_proc sets the number of parallel requests
python main.py \
    --output_dir /path/to/output/dir \
    --judge_model <FULL_JUDGE_PATH_OR_HF_NAME> \
    --evaluation_protocol pairwise \
    --dataset_name reasoningjudgebench \
    --base_url http://localhost:${port}/v1 \
    --num_proc=64
```
A few tips:
- After you have run evaluation, you can aggregate results using `aggregate.py`, which computes split-level and overall results. Example usage is in `scripts/run_eval.sh` (a conceptual sketch of this aggregation appears after this list).
- You can also run inference with Together.ai by setting `--base_url https://api.together.xyz/v1/` and passing your Together AI API key via `--api_key` in `main.py`.
- You can modify sampling parameters (temperature, top_p, etc.) by passing them to `main.py`. See the args for examples.
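For intuition, the snippet below illustrates the kind of aggregation described in the first tip, per-split accuracy plus an overall average. It is not `aggregate.py` itself, and the record format shown is hypothetical.

```python
# Conceptual illustration only (not aggregate.py): split-level accuracy and an
# overall average over hypothetical per-example records {"split": ..., "correct": ...}.
from collections import defaultdict

records = [
    {"split": "aime_pairwise", "correct": True},
    {"split": "aime_pairwise", "correct": False},
    {"split": "another_split", "correct": True},
]

per_split = defaultdict(list)
for r in records:
    per_split[r["split"]].append(r["correct"])

split_acc = {s: sum(v) / len(v) for s, v in per_split.items()}
overall = sum(split_acc.values()) / len(split_acc)  # macro-average over splits
print(split_acc, overall)
```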
Adding your own judge takes 2 steps:
1. Add a corresponding judge prompt in a separate `.py` file in `reasoning_judge_inf/prompts/`, containing the judge prompt, rendering functions, and judgment parsing code. See `reasoning_judge_inf/prompts/default_prompts.py` for examples. This step is only necessary if your judge has its own prompts / parsing code; otherwise, you can proceed to the next step.
2. Add the (your judge name, prompt file name) pair to `model_to_file` in `reasoning_judge_inf/prompts/__init__.py`.
Once done, you should be able to run your judge with the above inference code, changing judge model names appropriately.
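As a starting point, here is a hedged sketch of what such a prompt file might contain. The file, function, and variable names below are hypothetical; mirror the actual interface in `reasoning_judge_inf/prompts/default_prompts.py` when adding a real judge.

```python
# reasoning_judge_inf/prompts/my_judge_prompts.py -- hypothetical layout and names;
# follow the structure of default_prompts.py for the real interface.
JUDGE_PROMPT = (
    "You are a judge. Question:\n{question}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Answer with [[A]] or [[B]]."
)

def render_prompt(question: str, response_a: str, response_b: str) -> str:
    # Fill the pairwise template for one example.
    return JUDGE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )

def parse_judgment(judge_output: str) -> str | None:
    # Extract the verdict from the judge's generation; None if unparseable.
    if "[[A]]" in judge_output:
        return "A"
    if "[[B]]" in judge_output:
        return "B"
    return None
```

Then map your judge name to the file by adding an entry such as `"my-judge-7b": "my_judge_prompts"` (hypothetical names) to `model_to_file` in `reasoning_judge_inf/prompts/__init__.py`.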
Model responses were generated using GPT-4o and GPT-4.1 and should not be used to develop models that compete with OpenAI.
This release is for research purposes only in support of an academic paper. Our datasets and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before model deployment. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
If you find our project helpful, please consider citing our paper:
```bibtex
@article{xu2025j4r,
  title={J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization},
  author={Xu, Austin and Zhou, Yilun and Nguyen, Xuan-Phi and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.13346},
  year={2025}
}
```