Reminder
I have read the above rules and searched the existing issues.
System Info
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:816] 2025-05-08 19:10:43,588 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:816] 2025-05-08 19:10:43,589 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
0%| | 0/2 [01:49<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 73, in _training_function
[rank0]: run_ppo(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/workflow.py", line 73, in run_ppo
[rank0]: ppo_trainer.ppo_train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 384, in ppo_train
[rank0]: mini_batch_queries, mini_batch_responses = self.get_inputs(mini_batch)
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 481, in get_inputs
[rank0]: with unwrap_model_for_generation(self.model, self.accelerator) as unwrapped_model:
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/contextlib.py", line 142, in __exit__
[rank0]: next(self.gen)
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 165, in unwrap_model_for_generation
[rank0]: add_hooks(model)
[rank0]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 148, in add_hooks
[rank0]: optimizer_offload._register_hooks_recursively(optimizer_offload.module)
[rank0]: AttributeError: 'DeepSpeedZeRoOffload' object has no attribute '_register_hooks_recursively'
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 73, in _training_function
[rank1]: run_ppo(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/workflow.py", line 73, in run_ppo
[rank1]: ppo_trainer.ppo_train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 384, in ppo_train
[rank1]: mini_batch_queries, mini_batch_responses = self.get_inputs(mini_batch)
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/train/ppo/trainer.py", line 481, in get_inputs
[rank1]: with unwrap_model_for_generation(self.model, self.accelerator) as unwrapped_model:
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/contextlib.py", line 142, in __exit__
[rank1]: next(self.gen)
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 165, in unwrap_model_for_generation
[rank1]: add_hooks(model)
[rank1]: File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/trl/models/utils.py", line 148, in add_hooks
[rank1]: optimizer_offload._register_hooks_recursively(optimizer_offload.module)
[rank1]: AttributeError: 'DeepSpeedZeRoOffload' object has no attribute '_register_hooks_recursively'
W0508 19:12:35.072421 2938175 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2938242 closing signal SIGTERM
E0508 19:12:35.187296 2938175 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 2938243) of binary: /home/aa/miniconda3/envs/py310sxl2/bin/python3.10
Traceback (most recent call last):
File "/home/aa/miniconda3/envs/py310sxl2/bin/torchrun", line 8, in
sys.exit(main())
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/aa/miniconda3/envs/py310sxl2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/aa/Desktop/SXL/LLaMA-Factory-main/src/llamafactory/launcher.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-05-08_19:12:35
host : aa-Super-Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2938243)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
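For context on the failure: add_hooks() in trl/models/utils.py (line 148 above) calls the private DeepSpeed method DeepSpeedZeRoOffload._register_hooks_recursively, and the installed DeepSpeed build no longer provides it; newer DeepSpeed releases renamed this internal hook-registration API, so the installed trl and deepspeed versions disagree on it. Below is a minimal probe that surfaces the mismatch before training starts; this is a sketch, the version bound in the message is an assumption, and the usual remedy is pinning an older deepspeed or upgrading trl to a release that knows the new name.

# Sketch: fail fast on the trl/deepspeed internal-API mismatch.
# trl's add_hooks() expects the private method
# DeepSpeedZeRoOffload._register_hooks_recursively, which this
# DeepSpeed build no longer exposes (hence the AttributeError above).
from deepspeed.runtime.zero.parameter_offload import DeepSpeedZeRoOffload

if not hasattr(DeepSpeedZeRoOffload, "_register_hooks_recursively"):
    raise RuntimeError(
        "deepspeed no longer provides DeepSpeedZeRoOffload"
        "._register_hooks_recursively; pin an older release "
        "(e.g. pip install 'deepspeed<0.16', a version bound that is an "
        "assumption here) or upgrade trl to one that handles the rename."
    )

Run in the py310sxl2 environment before launching; it turns the mid-training crash into an immediate, actionable error.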
Reproduction
### model
model_name_or_path: /home/aa/Desktop/SXL/Qwen/Qwen2___5-7B-Instruct
trust_remote_code: true
### method
stage: ppo
do_train: true
finetuning_type: lora
lora_target: all # or specific modules
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: train_dataset_## # your PPO dataset name
template: qwen
cutoff_len: 4096 # can be set smaller at first
max_samples: 300 # set very small for testing
### output
output_dir: saves/qwen2.5/lora/ppo_test
logging_steps: 1
overwrite_output_dir: true
report_to: none
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 0.01 # or max_steps: 5
gradient_checkpointing: true
fp16: true
### ppo (tried as top-level parameters)
reward_model_type: api
reward_model: "http://localhost:1234/placeholder"
ppo_epochs: 1 # the default value can be used at first
ppo_score_norm: true
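The examples/deepspeed/ds_z3_config.json referenced under ### method above (ZeRO stage 3):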
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": false,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
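For reference, the "auto" placeholders in this file are resolved at runtime by transformers' DeepSpeed integration, which substitutes the matching TrainingArguments values (batch sizes, gradient accumulation, fp16/bf16 flags) before the engine is built. A quick sanity check that the file parses and targets ZeRO stage 3 (a sketch, assuming the relative path above):

import json

# Load the ZeRO-3 config exactly as it is handed to the Trainer; "auto"
# entries stay as placeholder strings until transformers fills them in.
with open("examples/deepspeed/ds_z3_config.json") as f:
    ds_config = json.load(f)

assert ds_config["zero_optimization"]["stage"] == 3
print(ds_config["train_micro_batch_size_per_gpu"])  # prints: auto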
Others
No response
hiyouga added the solved label and removed the bug and pending labels on May 9, 2025.