1.5B accuracy 0 after SFT #115

@alvinshao0313

Dear author, hello,
I ran into a problem while experimenting with the 1.5B model. I used the following script for training:

uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-1.5B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=1 # -> batch_size will be 16 if 16 gpus
gradient_accumulation_steps=8 # requires more GPU memory, default is 1
max_steps=-1
gpu_count=2 #$(nvidia-smi -L | wc -l)
push_to_hub=false

torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
    train/sft.py \
    --block_size=12000 \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --num_train_epochs=${epochs} \
    --train_file_path="simplescaling/s1K-1.1_tokenized" \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="no" \
    --logging_steps=1 \
    --save_strategy="no" \
    --lr_scheduler_type="cosine" \
    --learning_rate=${lr} \
    --weight_decay=${weight_decay} \
    --adam_beta1=0.9 \
    --adam_beta2=0.95 \
    --output_dir="ckpts/s1-$(basename "$base_model")-${uid}" \
    --push_to_hub=${push_to_hub} \
    --save_only_model=True

I only have two A800s, so I set gradient accumulation to 8 to keep the effective batch size at 16. A block size of 20k caused an OOM, so I reduced it to 12k.
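As a quick sanity check on the batch-size reasoning, the sketch below recomputes the effective batch size from the values in the script. It assumes the usual data-parallel convention that the global batch is the product of GPU count, per-device batch size, and gradient accumulation steps; the variable `effective_batch_size` is a name introduced here for illustration and is not part of the training script.

```shell
# Values copied from the training script above.
gpu_count=2
micro_batch_size=1
gradient_accumulation_steps=8

# Assumption: global batch = GPUs x per-device batch x accumulation steps.
effective_batch_size=$(( gpu_count * micro_batch_size * gradient_accumulation_steps ))

echo "effective_batch_size=${effective_batch_size}"
```

With these values the product is 16, which matches the "batch_size will be 16 if 16 gpus" comment in the script.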

However, no matter which of the following evaluation commands I use:

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=bfloat16,tensor_parallel_size=1 --tasks aime24_figures,aime24_nofigures --batch_size auto --output_path dummy --log_samples --gen_kwargs "max_gen_toks=12000"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore1wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime25_nofigures,aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore2wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=Wait"

The accuracy on AIME is 0.

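| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| aime24_figures | 1 | none | 0 | exact_match | 0 | ± N/A |
| | | none | 0 | extracted_answers | -1 | ± N/A |
| aime24_nofigures | 1 | none | 0 | exact_match | 0 | ± N/A |
| | | none | 0 | extracted_answers | -1 | ± N/A |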

Do you have any suggestions for resolving this issue?
