Dear author, hello,
I encountered a problem while experimenting with the 1.5B model; I used the following script for training:
uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-1.5B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=1 # -> batch_size will be 16 if 16 gpus
gradient_accumulation_steps=8 # requires more GPU memory, default is 1
max_steps=-1
gpu_count=2 # $(nvidia-smi -L | wc -l)
push_to_hub=false

torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
    train/sft.py \
    --block_size=12000 \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --num_train_epochs=${epochs} \
    --train_file_path="simplescaling/s1K-1.1_tokenized" \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="no" \
    --logging_steps=1 \
    --save_strategy="no" \
    --lr_scheduler_type="cosine" \
    --learning_rate=${lr} \
    --weight_decay=${weight_decay} \
    --adam_beta1=0.9 \
    --adam_beta2=0.95 \
    --output_dir="ckpts/s1-$(basename "$base_model")-${uid}" \
    --push_to_hub=${push_to_hub} \
    --save_only_model=True
Since I only have two A800s, I set gradient_accumulation_steps to 8 so that the effective batch size stays consistent with the original setup. A block size of 20k caused an OOM, so I reduced it to 12k.
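For reference, this is the back-of-the-envelope check behind that choice (just arithmetic, not part of the repo's scripts):

```bash
# Effective batch size per optimizer step should match the original 16-GPU run (16).
gpu_count=2
micro_batch_size=1
gradient_accumulation_steps=8
echo $(( gpu_count * micro_batch_size * gradient_accumulation_steps ))  # prints 16
```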
However, no matter which of the following evaluation commands I use:
CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=bfloat16,tensor_parallel_size=1 --tasks aime24_figures,aime24_nofigures --batch_size auto --output_path dummy --log_samples --gen_kwargs "max_gen_toks=12000"
CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore1wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait"
CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime25_nofigures,aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore2wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=Wait"
The accuracy on AIME is 0.
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| aime24_figures | 1 | none | 0 | exact_match | ↑ | 0 | ± | N/A |
| | | none | 0 | extracted_answers | ↑ | -1 | ± | N/A |
| aime24_nofigures | 1 | none | 0 | exact_match | ↑ | 0 | ± | N/A |
| | | none | 0 | extracted_answers | ↑ | -1 | ± | N/A |
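For what it's worth, here is the rough check I am using to see whether the generations contain anything the answer filter could extract. The directory layout and filenames are assumptions on my side; lm_eval with --log_samples writes per-sample JSONL files somewhere under the --output_path directory:

```bash
# Peek at the start of the logged samples for the AIME tasks (paths are guesses).
find dummy s1.1forcingignore1wait s1.1forcingignore2wait -name '*aime24*.jsonl' 2>/dev/null |
while read -r f; do
  echo "== $f =="
  head -c 2000 "$f"   # beginning of the first logged record: prompt + raw generation
  echo
done
```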
I was wondering if you could offer any advice on this issue.
leocnj