Description
Reminder
- I have read the README and searched the existing issues.
System Info
OS: Ubuntu 20.04
Environment:
deepspeed==0.15.2
transformers==4.46.1
python==3.10
ring-flash-attn==0.1.3
flash_attn==2.7.4.post1
torch==2.5.1+cu124
Hardware:
8 × A800 (80 GB)
Error message
[INFO|trainer.py:2313] 2025-03-22 16:47:06,100 >> ***** Running training *****
[INFO|trainer.py:2314] 2025-03-22 16:47:06,100 >> Num examples = 64,048
[INFO|trainer.py:2315] 2025-03-22 16:47:06,100 >> Num Epochs = 3
[INFO|trainer.py:2316] 2025-03-22 16:47:06,100 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2319] 2025-03-22 16:47:06,100 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2320] 2025-03-22 16:47:06,100 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2321] 2025-03-22 16:47:06,100 >> Total optimization steps = 6,003
[INFO|trainer.py:2322] 2025-03-22 16:47:06,102 >> Number of trainable parameters = 7,615,616,512
0%| | 0/6003 [00:00<?, ?it/s]
[rank5]: Traceback (most recent call last):
[rank5]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank5]: launch() [rank5]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch [rank5]: run_exp()
[rank5]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank5]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank5]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank5]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank5]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank5]: return inner_training_loop(
[rank5]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank5]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank5]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 97, in training_step
[rank5]: return super().training_step(model, inputs, *args, **kwargs)
[rank5]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in training_step
[rank5]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank5]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 119, in compute_loss
[rank5]: logits, labels = outputs["logits"] if isinstance(outputs, dict) else outputs[1], inputs["labels"]
[rank5]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/utils/generic.py", line 431, in getitem
[rank5]: return inner_dict[k]
[rank5]: KeyError: 'logits'
[rank4]: Traceback (most recent call last):
[rank4]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank4]: launch()
[rank4]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank4]: run_exp()
[rank4]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank4]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank4]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank4]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank4]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank4]: return inner_training_loop(
[rank4]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank4]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]: File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 97, in training_step [rank4]: return super().training_step(model, inputs, *args, **kwargs)
[rank4]: File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in training_step
Reproduction
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2_full_sft_ds.yaml
qwen2_full_sft_ds.yaml:
model_name_or_path: Qwen2.5-7B-Instruct-1M
sequence_parallel_size: 4
stage: sft
do_train: true
finetuning_type: full
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json
dataset: xxxx
template: qwen
cutoff_len: 65536
overwrite_cache: true
preprocessing_num_workers: 96
output_dir: xxxx
logging_steps: 10
save_steps: 36983
overwrite_output_dir: true
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 3.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
flash_attn: fa2
use_unsloth_gc: true
enable_liger_kernel: true
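As an isolation step (my assumption, not a confirmed fix), I would rerun with only the fused kernel toggled off, to check whether it is what drops logits from the model outputs; the other options look orthogonal to the error:

```yaml
# Hypothetical A/B test in qwen2_full_sft_ds.yaml: keep everything else
# identical and only disable the kernel for one run.
enable_liger_kernel: false
```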
Expected behavior
Please provide a concrete solution, such as compatible dependency versions or another workaround.
Others
No response