
GPQA eval score of 7B's official checkpoint is much lower than in the paper #53

@luppx

Hi, thanks for your release. I'm evaluating the official checkpoints of Light-R1-7B-DS. The evaluation scores on AIME24 and AIME25 are similar to those in your paper, but the GPQA score is much lower than reported. The scores are shown in the table below. Could you tell me where the problem is? Thanks.

| Model | AIME24 | AIME25 | GPQA |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-7B | 55.21 | 40.94 | 35.73 (49.1 reported in your paper) |
| Light-R1-7B-DS | 56.98 | 45.63 | 25.36 (49.4 reported in your paper) |

Here's my evaluation process, following the Light-R1 Evaluation Usage:

1. Create the environment:

```bash
# Install a Python 3.10 environment.
conda create -n deepscaler python=3.10 -y
conda activate deepscaler

cd deepscaler
pip install -e ./verl
pip install -e .
```

2. Run the evaluation script (see the sanity-check sketch after these steps):

```bash
./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets aime aime25 gpqa --output-dir [OUTPUT_DIR]
```
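
In case it helps with debugging: since GPQA is multiple-choice, a common cause of an unexpectedly low score is the answer-extraction step rather than generation itself. Below is a minimal sketch I'd use to recompute accuracy directly from the raw outputs. The file name `results.jsonl`, the fields `response`/`answer`, and the `\boxed{...}` extraction pattern are all assumptions about what the eval script writes, not its actual schema; adjust them to the real output format.

```python
# Minimal sanity check: recompute GPQA accuracy from raw eval outputs.
# NOTE: "results.jsonl" and the "response"/"answer" fields are assumed,
# hypothetical names; adapt them to whatever eval_model.sh actually writes.
import json
import re

def extract_choice(response: str) -> str | None:
    """Return the last boxed A/B/C/D letter in the response, e.g. from '\\boxed{C}'."""
    matches = re.findall(r"\\boxed\{([ABCD])\}", response)
    return matches[-1] if matches else None

correct = total = 0
with open("results.jsonl") as f:  # hypothetical output file from the eval run
    for line in f:
        record = json.loads(line)
        total += 1
        # Count a hit only when the extracted letter matches the gold answer.
        correct += extract_choice(record["response"]) == record["answer"]

print(f"GPQA accuracy: {correct}/{total} = {correct / max(total, 1):.2%}")
```

If the accuracy recomputed this way is noticeably higher than what the script reports, the gap likely comes from answer parsing (e.g. the model stating the letter outside `\boxed{}`) rather than from the checkpoint.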
