I tried to run the script "distil_logits.py" and modified some code, but I still can't get it running because of a "no DeviceMesh from dtensor" error.
What I changed:
# 1. Fix the undefined keyword argument "num_items_in_batch" in compute_loss
def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
# 2. Change the function sharegpt_format so that it matches my dataset
# 3. Add device map so that the model is loaded to GPU
teacher_model_kwargs["device_map"] = "auto"
student_model_kwargs["device_map"] = "auto"
# 4. Change datasets and tokenizers in the config (a rough sketch of changes 1-3 is below).
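For context, this is roughly how changes 1-3 look in my local copy. The class name LogitsTrainer, the *_model_kwargs names, and the bf16 dtype are from my edit and may differ from the upstream script; the distillation-loss body itself is unchanged:

import torch
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

# Change 1: compute_loss now accepts num_items_in_batch, which newer
# transformers versions pass in; only the signature changed.
class LogitsTrainer(SFTTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        ...  # distillation loss unchanged from the original script

# Change 3: device_map added so both models are placed on GPU at load time.
teacher_model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
student_model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B", **teacher_model_kwargs)
student_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", **student_model_kwargs)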
My launch command:
accelerate launch distil_logits.py
accelerate config:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero3_save_16bit_model: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
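The YAML above is the default config that `accelerate config` saved; the same run can also point at it explicitly (the path here is just an example):

accelerate launch --config_file /path/to/default_config.yaml distil_logits.py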
[rank3]: Traceback (most recent call last):
[rank3]: File "/workspaces/projects/train/distill/DistillKit/distil_logits.py", line 230, in <module>
[rank3]: trainer.train(resume_from_checkpoint=config["training"]["resume_from_checkpoint"])
[rank3]: File "/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py", line 451, in train
[rank3]: output = super().train(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2245, in train
[rank3]: return inner_training_loop(
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank3]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py", line 1392, in prepare
[rank3]: result = self._prepare_deepspeed(*args)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py", line 1953, in _prepare_deepspeed
[rank3]: engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/__init__.py", line 193, in initialize
[rank3]: engine = DeepSpeedEngine(args=args,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 273, in __init__
[rank3]: self._configure_distributed_model(model)
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 1287, in _configure_distributed_model
[rank3]: self._broadcast_model()
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 1205, in _broadcast_model
[rank3]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank3]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank3]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank3]: work = group.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
[rank3]: return disable_fn(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank3]: return fn(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_api.py", line 346, in __torch_dispatch__
[rank3]: return DTensor._op_dispatcher.dispatch(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 167, in dispatch
[rank3]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 400, in unwrap_to_op_info
[rank3]: assert mesh is not None, f"found no DeviceMesh from dtensor args for {op_call}!"
[rank3]: ^^^^^^^^^^^^^^^^
[rank3]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
Other ranks report the same error.
torch version: 2.6.0+cu124
Python version: 3.12.3
Number of GPUs: 8
Teacher model: Qwen2.5-32B
Student model: Qwen2.5-7B