Error during the PPO training stage when using DeepSpeed #7989


Closed
1 task done
shenxiaolong57 opened this issue May 8, 2025 · 2 comments
Labels
solved This problem has been already solved

Comments

@shenxiaolong57

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:816] 2025-05-08 19:10:43,588 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:816] 2025-05-08 19:10:43,589 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
0%| | 0/2 [01:49<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 73, in _training_function
[rank0]: run_ppo(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/workflow.py", line 73, in run_ppo
[rank0]: ppo_trainer.ppo_train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 384, in ppo_train
[rank0]: mini_batch_queries, mini_batch_responses = self.get_inputs(mini_batch)
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 481, in get_inputs
[rank0]: with unwrap_model_for_generation(self.model, self.accelerator) as unwrapped_model:
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/contextlib.py", line 142, in exit
[rank0]: next(self.gen)
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 165, in unwrap_model_for_generation
[rank0]: add_hooks(model)
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 148, in add_hooks
[rank0]: optimizer_offload._register_hooks_recursively(optimizer_offload.module)
[rank0]: AttributeError: 'DeepSpeedZeRoOffload' object has no attribute '_register_hooks_recursively'
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in
[rank1]: launch()
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 73, in _training_function
[rank1]: run_ppo(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/workflow.py", line 73, in run_ppo
[rank1]: ppo_trainer.ppo_train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 384, in ppo_train
[rank1]: mini_batch_queries, mini_batch_responses = self.get_inputs(mini_batch)
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 481, in get_inputs
[rank1]: with unwrap_model_for_generation(self.model, self.accelerator) as unwrapped_model:
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/contextlib.py", line 142, in exit
[rank1]: next(self.gen)
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 165, in unwrap_model_for_generation
[rank1]: add_hooks(model)
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 148, in add_hooks
[rank1]: optimizer_offload._register_hooks_recursively(optimizer_offload.module)
[rank1]: AttributeError: 'DeepSpeedZeRoOffload' object has no attribute '_register_hooks_recursively'
W0508 19:12:35.072421 2938175 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2938242 closing signal SIGTERM
E0508 19:12:35.187296 2938175 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 2938243) of binary: /home/aa/miniconda3/envs/py310sxl2/bin/python3.10
Traceback (most recent call last):
File "/home/aa/miniconda3/envs/py310sxl2/bin/torchrun", line 8, in
sys.exit(main())
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-05-08_19:12:35
host : aa-Super-Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2938243)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Reproduction

Put your message here.

model

model_name_or_path: /home/aa/Desktop/SXL/Qwen/Qwen2___5-7B-Instruct
trust_remote_code: true

method

stage: ppo
do_train: true
finetuning_type: lora
lora_target: all # or specific modules
deepspeed: examples/deepspeed/ds_z3_config.json

dataset

dataset: train_dataset_## # your PPO dataset name
template: qwen
cutoff_len: 4096 # can start with a smaller value
max_samples: 300 # set very small for testing

output

output_dir: saves/qwen2.5/lora/ppo_test
logging_steps: 1
overwrite_output_dir: true
report_to: none

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 0.01 # or max_steps: 5
gradient_checkpointing: true
fp16: true

PPO (tried as top-level parameters)

reward_model_type: api
reward_model: "http://localhost:1234/placeholder"

ppo_epochs: 1 # the default value is fine to start with

ppo_score_norm: true

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
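
For context (my note, not part of the original report): the failing add_hooks() call in the traceback is only taken on the ZeRO stage 3 path of trl's unwrap_model_for_generation, so the "stage": 3 setting above is what triggers it. A minimal sketch to confirm that the config referenced in the YAML really enables ZeRO-3, assuming the file lives at the path given in the YAML relative to the working directory:

```python
# Minimal check (illustration only): confirm the DeepSpeed config enables
# ZeRO stage 3, the code path on which trl's add_hooks() is invoked.
import json

with open("examples/deepspeed/ds_z3_config.json") as f:  # path from the YAML above
    ds_config = json.load(f)

stage = ds_config.get("zero_optimization", {}).get("stage")
print(f"ZeRO stage: {stage}")  # expected to print 3 for this reproduction
```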

Others

No response

shenxiaolong57 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 8, 2025
@shenxiaolong57
Author

I found a way to fix the problem: downgrade deepspeed to 0.16.3.
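
As a side note (my addition, not from the original comment): the downgrade can be done with pip install deepspeed==0.16.3, and a small pre-flight check like the sketch below can flag an incompatible version before launching training. The 0.16.4 cutoff reflects the follow-up comment below:

```python
# Pre-flight check (illustration only): warn if the installed deepspeed is
# 0.16.4 or newer, which the reporter found incompatible with this PPO workflow.
from packaging import version
import deepspeed

installed = version.parse(deepspeed.__version__)
if installed >= version.parse("0.16.4"):
    print(f"deepspeed {installed}: trl's add_hooks() may raise AttributeError; "
          "consider downgrading to 0.16.3")
else:
    print(f"deepspeed {installed}: expected to work with this PPO setup")
```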

@shenxiaolong57
Author

I found a way to fix the problem: downgrade deepspeed to 0.16.3.

The cause is that deepspeed versions from 0.16.4 onward renamed some of these methods; I hope an update can address this issue.
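
For reference, a hedged sketch of how the call site in trl/models/utils.py could branch on the deepspeed version instead of requiring a downgrade. The replacement method name _register_deepspeed_module is an assumption about the renamed API in deepspeed >= 0.16.4 and should be verified against the installed deepspeed source before use:

```python
# Hypothetical compatibility shim; _register_deepspeed_module is an assumed
# name for the post-0.16.4 replacement of _register_hooks_recursively.
from packaging import version
import deepspeed

def reattach_zero3_hooks(optimizer_offload):
    if version.parse(deepspeed.__version__) >= version.parse("0.16.4"):
        # assumed post-rename API (verify against your deepspeed install)
        optimizer_offload._register_deepspeed_module(optimizer_offload.module)
    else:
        # pre-0.16.4 API, the call that fails in the traceback above
        optimizer_offload._register_hooks_recursively(optimizer_offload.module)
```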

hiyouga added the solved (This problem has been already solved) label and removed the bug and pending labels on May 9, 2025