-
Notifications
You must be signed in to change notification settings - Fork 40
拉起7B 8k训练报错:assert num_micro_batches <= len(seq_len_effective)报错 #38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
我好像没见过这个错误;可以提供一下你的config么? |
@hejujie 好的,我采用的都是main分支的默认代码,没有修改,下面是我的启动脚本: |
没发现太多异常,可能需要有更多信息 / Debug Print一下东西出来,你可以看看如下的操作是否有帮助:
|
@hejujie 好的谢谢,我试一试 |
@hejujie 你好,我在训练脚本里关闭了dynamic batch size,跑完step3后不再报之前的错误,但报了新的错,请问有什么解决思路吗?
|
另外还观察到,在训练的时候会在grader中报一些错但不影响继续训练,这个现象是否正常? |
使用默认脚本拉起8卡训练任务,跑完3个step后报错如下,看上去貌似是seq packing时出错,想问下如何解决?
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Skywork-OR1/verl/trainer/main_ppo.py", line 147, in
main()
File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/Skywork-OR1/verl/trainer/main_ppo.py", line 26, in main
run_ppo(config, code_compute_score)
File "/Skywork-OR1/verl/trainer/main_ppo.py", line 46, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 930, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ^[[36mray::main_task()^[[39m (pid=42437, ip=172.17.0.2)
File "/Skywork-OR1/verl/trainer/main_ppo.py", line 143, in main_task
trainer.fit()
File "/Skywork-OR1/verl/trainer/ppo/ray_trainer.py", line 1101, in fit
old_log_prob = self.actor_rollout_wg.compute_log_prob(batch)
File "/Skywork-OR1/verl/single_controller/ray/base.py", line 41, in func
if blocking: output = ray.get(output)
ray.exceptions.RayTaskError(AssertionError): ^[[36mray::WorkerDict.actor_rollout_compute_log_prob()^[[39m (pid=43189, ip=172.17.0.2, actor_id=f1eed443cb5fb73601366e3601000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc6c8a81060>)
File "/Skywork-OR1/verl/single_controller/ray/base.py", line 398, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/Skywork-OR1/verl/single_controller/base/decorator.py", line 404, in inner
return func(*args, **kwargs)
File "/Skywork-OR1/verl/workers/fsdp_workers.py", line 499, in compute_log_prob
output = self.actor.compute_log_prob(data=data)
File "/Skywork-OR1/verl/workers/actor/dp_actor.py", line 184, in compute_log_prob
micro_batches, indices = rearrange_micro_batches(batch=batch, max_token_len=max_token_len)
File "/Skywork-OR1/verl/utils/seqlen_balancing.py", line 242, in rearrange_micro_batches
assert num_micro_batches <= len(seq_len_effective)
AssertionError
The text was updated successfully, but these errors were encountered: