1.5B accuracy 0 after SFT

Hello,
I ran into a problem while experimenting with the 1.5B model. I used the following script for training:
```bash
uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-1.5B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=1 # -> batch_size will be 16 if 16 gpus
gradient_accumulation_steps=8 # requires more GPU memory, default is 1
max_steps=-1
gpu_count=2 #$(nvidia-smi -L | wc -l)
push_to_hub=false

torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
    train/sft.py \
    --block_size=12000 \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --num_train_epochs=${epochs} \
    --train_file_path="simplescaling/s1K-1.1_tokenized" \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="no" \
    --logging_steps=1 \
    --save_strategy="no" \
    --lr_scheduler_type="cosine" \
    --learning_rate=${lr} \
    --weight_decay=${weight_decay} \
    --adam_beta1=0.9 \
    --adam_beta2=0.95 \
    --output_dir="ckpts/s1-$(basename "$base_model")-${uid}" \
    --push_to_hub=${push_to_hub} \
    --save_only_model=True
```
Since I only have two A800 GPUs, I set `gradient_accumulation_steps` to 8 so that the effective batch size stays at 16 (1 per device × 2 GPUs × 8 accumulation steps). A block size of 20k caused an OOM, so I reduced it to 12k.
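As a sanity check on the batch arithmetic, here is a minimal sketch that compares my setup against the 16-GPU setup implied by the script comment. The `ref_*` values are my assumption (16 GPUs, micro batch 1, no accumulation), not something confirmed by the authors.

```bash
# Values taken from the training script above.
micro_batch_size=1
gpu_count=2
gradient_accumulation_steps=8

# Assumed reference setup, inferred from the comment
# "batch_size will be 16 if 16 gpus" (hypothetical, not confirmed).
ref_micro_batch_size=1
ref_gpu_count=16
ref_gradient_accumulation_steps=1

# Effective batch size = per-device batch * number of GPUs * accumulation steps.
effective_batch_size=$(( micro_batch_size * gpu_count * gradient_accumulation_steps ))
ref_effective_batch_size=$(( ref_micro_batch_size * ref_gpu_count * ref_gradient_accumulation_steps ))

echo "my effective batch size:        ${effective_batch_size}"
echo "reference effective batch size: ${ref_effective_batch_size}"

if [ "${effective_batch_size}" -eq "${ref_effective_batch_size}" ]; then
    echo "batch sizes match"
else
    echo "batch sizes differ"
fi
```

This only checks the number of sequences per optimizer step. It does not account for the shorter block size (12k instead of 20k), which reduces the number of tokens per step.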

However, no matter which of the following evaluation commands I use:
```bash
CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=bfloat16,tensor_parallel_size=1 --tasks aime24_figures,aime24_nofigures --batch_size auto --output_path dummy --log_samples --gen_kwargs "max_gen_toks=12000"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore1wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime25_nofigures,aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore2wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=Wait"
```
The accuracy on AIME is 0.

|     Tasks      |Version|Filter|n-shot|     Metric      |   |Value|   |Stderr|
|----------------|------:|------|-----:|-----------------|---|----:|---|------|
|aime24_figures  |      1|none  |     0|exact_match      |↑  |    0|±  |   N/A|
|                |       |none  |     0|extracted_answers|↑  |   -1|±  |   N/A|
|aime24_nofigures|      1|none  |     0|exact_match      |↑  |    0|±  |   N/A|
|                |       |none  |     0|extracted_answers|↑  |   -1|±  |   N/A|

Do you have any suggestions for what might be causing this?

1.5B accuracy 0 after SFT #115

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Tasks	Version	Filter	Metric		Value		Stderr
aime24_figures	1	none	exact_match	↑	0	±	N/A
		none	extracted_answers	↑	-1	±	N/A
aime24_nofigures	1	none	exact_match	↑	0	±	N/A
		none	extracted_answers	↑	-1	±	N/A

1.5B accuracy 0 after SFT #115

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions