
GPQA eval score of 7B's official checkpoint is much lower than in the paper #53

@luppx

Hi, thanks for your release. I'm evaluating the official checkpoints of Light-R1-7B-DS. The evaluation scores on AIME24 and AIME25 are similar to those in your paper, but the GPQA score is much lower than reported. The scores are shown in the table below. Could you tell me where the problem is? Thanks.

| Model | AIME24 | AIME25 | GPQA |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-7B | 55.21 | 40.94 | 35.73 (49.1 reported in your paper) |
| Light-R1-7B-DS | 56.98 | 45.63 | 25.36 (49.4 reported in your paper) |

Here's my evaluation process, following the Light-R1 Evaluation Usage:

1. Create the environment:

```bash
# Install a Python 3.10 environment.
conda create -n deepscaler python=3.10 -y
conda activate deepscaler

cd deepscaler
pip install -e ./verl
pip install -e .
```

2. Run the evaluation script (see the sanity-check sketch after these steps):

```bash
./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets aime aime25 gpqa --output-dir [OUTPUT_DIR]
```
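
In case it helps with debugging: since GPQA is multiple-choice, a common cause of an unexpectedly low score is the answer-extraction step rather than generation itself. Below is a minimal sketch I'd use to recompute accuracy directly from the raw outputs. The file name `results.jsonl`, the fields `response`/`answer`, and the `\boxed{...}` extraction pattern are all assumptions about what the eval script writes, not its actual schema; adjust them to the real output format.

```python
# Minimal sanity check: recompute GPQA accuracy from raw eval outputs.
# NOTE: "results.jsonl" and the "response"/"answer" fields are assumed,
# hypothetical names; adapt them to whatever eval_model.sh actually writes.
import json
import re

def extract_choice(response: str) -> str | None:
    """Return the last boxed A/B/C/D letter in the response, e.g. from '\\boxed{C}'."""
    matches = re.findall(r"\\boxed\{([ABCD])\}", response)
    return matches[-1] if matches else None

correct = total = 0
with open("results.jsonl") as f:  # hypothetical output file from the eval run
    for line in f:
        record = json.loads(line)
        total += 1
        # Count a hit only when the extracted letter matches the gold answer.
        correct += extract_choice(record["response"]) == record["answer"]

print(f"GPQA accuracy: {correct}/{total} = {correct / max(total, 1):.2%}")
```

If the accuracy recomputed this way is noticeably higher than what the script reports, the gap likely comes from answer parsing (e.g. the model stating the letter outside `\boxed{}`) rather than from the checkpoint.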
