Hi, thanks for the release. I'm evaluating the official Light-R1-7B-DS checkpoint. My AIME24 and AIME25 scores are close to those in your paper, but my GPQA score is much lower than reported. The scores are shown in the table below. Could you tell me where the problem might be? Thanks.
| Model | AIME24 | AIME25 | GPQA |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 55.21 | 40.94 | 35.73 (49.1 reported in your paper) |
| Light-R1-7B-DS | 56.98 | 45.63 | 25.36 (49.4 reported in your paper) |
Here is my evaluation process, following the Light-R1 Evaluation Usage instructions:
- Create the environment:

```bash
# Install a Python 3.10 environment.
conda create -n deepscaler python=3.10 -y
conda activate deepscaler
cd deepscaler
pip install -e ./verl
pip install -e .
```
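Before running anything, I sanity-check that both editable installs are importable (a minimal sketch; it assumes the two top-level packages are named `verl` and `deepscaler`, which may not match the actual package names):

```bash
# Quick sanity check of the environment; package names are assumptions.
python --version                                  # expect Python 3.10.x
python -c "import verl; print('verl OK')"         # installed via pip install -e ./verl
python -c "import deepscaler; print('deepscaler OK')"  # installed via pip install -e .
```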
- Run the evaluation script:

```bash
./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets aime aime25 gpqa --output-dir [OUTPUT_DIR]
```
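For concreteness, this is the shape of the command I run; the paths below are hypothetical placeholders for illustration, not my actual directories:

```bash
# Hypothetical invocation; substitute your own checkpoint and output paths.
./scripts/eval/eval_model.sh \
  --model ./checkpoints/Light-R1-7B-DS \
  --datasets aime aime25 gpqa \
  --output-dir ./eval_results/light-r1-7b-ds
```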