
SP does NOT work with liger kernel #36

@XD-BDIV-NLP

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

OS: Ubuntu 20.04

Environment:
deepspeed==0.15.2
transformers==4.46.1
python==3.10
ring-flash-attn==0.1.3
flash_attn==2.7.4.post1
torch==2.5.1+cu124

Hardware:
8 × A800 80GB

Error message
[INFO|trainer.py:2313] 2025-03-22 16:47:06,100 >> ***** Running training *****
[INFO|trainer.py:2314] 2025-03-22 16:47:06,100 >> Num examples = 64,048
[INFO|trainer.py:2315] 2025-03-22 16:47:06,100 >> Num Epochs = 3
[INFO|trainer.py:2316] 2025-03-22 16:47:06,100 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2319] 2025-03-22 16:47:06,100 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2320] 2025-03-22 16:47:06,100 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2321] 2025-03-22 16:47:06,100 >> Total optimization steps = 6,003
[INFO|trainer.py:2322] 2025-03-22 16:47:06,102 >> Number of trainable parameters = 7,615,616,512
0%| | 0/6003 [00:00<?, ?it/s]
[rank5]: Traceback (most recent call last):
[rank5]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank5]:     launch()
[rank5]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank5]:     run_exp()
[rank5]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank5]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank5]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank5]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank5]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank5]:     return inner_training_loop(
[rank5]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank5]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank5]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 97, in training_step
[rank5]:     return super().training_step(model, inputs, *args, **kwargs)
[rank5]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in training_step
[rank5]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank5]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 119, in compute_loss
[rank5]:     logits, labels = outputs["logits"] if isinstance(outputs, dict) else outputs[1], inputs["labels"]
[rank5]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/utils/generic.py", line 431, in __getitem__
[rank5]:     return inner_dict[k]
[rank5]: KeyError: 'logits'
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank4]:     launch()
[rank4]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank4]:     run_exp()
[rank4]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank4]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank4]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank4]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank4]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank4]:     return inner_training_loop(
[rank4]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank4]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]:   File "/home/xygxzs28/exam/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 97, in training_step
[rank4]:     return super().training_step(model, inputs, *args, **kwargs)
[rank4]:   File "/home/xygxzs28/anaconda3/envs/360lf/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in training_step

Reproduction

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2_full_sft_ds.yaml

qwen2_full_sft_ds.yaml:

model_name_or_path: Qwen2.5-7B-Instruct-1M
sequence_parallel_size: 4
stage: sft
do_train: true
finetuning_type: full
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json
dataset: xxxx
template: qwen
cutoff_len: 65536
overwrite_cache: true
preprocessing_num_workers: 96
output_dir: xxxx
logging_steps: 10
save_steps: 36983
overwrite_output_dir: true
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 3.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
flash_attn: fa2
use_unsloth_gc: true
enable_liger_kernel: true
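
For reference, a standalone probe of the suspected interaction (an untested sketch; it assumes liger-kernel's apply_liger_kernel_to_qwen2 patch and its fused_linear_cross_entropy flag, and reuses the model path from the config above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from liger_kernel.transformers import apply_liger_kernel_to_qwen2

# Patch Qwen2's modeling code before the model is instantiated,
# enabling the fused linear cross-entropy path.
apply_liger_kernel_to_qwen2(fused_linear_cross_entropy=True)

tok = AutoTokenizer.from_pretrained("Qwen2.5-7B-Instruct-1M")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen2.5-7B-Instruct-1M", torch_dtype=torch.bfloat16, device_map="cuda:0"
)
batch = tok("hello world", return_tensors="pt").to(model.device)
out = model(**batch, labels=batch["input_ids"])
# If the fused loss path skips materializing logits, "logits" should be
# missing here, matching the KeyError in the traceback above.
print(list(out.keys()))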

Expected behavior

Please suggest a concrete fix, e.g. compatible dependency versions or another workaround.

Others

No response
