
no DeviceMesh from dtensor #28

@AtoshDustosh

Description


I tried to run the script "distil_logits.py" and modified some code, but I still can't get it running because of a "no DeviceMesh from dtensor" error.

What I changed:

# 1. Fix the undefined keyword argument "num_items_in_batch"
def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):

# 2. Change the function sharegpt_format so that it matches my dataset

# 3. Add a device map so that the models are loaded onto the GPUs (see the sketch after this list)
teacher_model_kwargs["device_map"] = "auto"
student_model_kwargs["device_map"] = "auto"

# 4. Change the datasets and tokenizers in the config.
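
Concretely, the device_map change ends up roughly here (paraphrased sketch of my modified copy; the exact variable and config key names may differ from upstream DistillKit):

from transformers import AutoModelForCausalLM

# Paraphrased sketch, not the upstream script verbatim: both kwargs dicts now
# carry device_map="auto" so the checkpoints are dispatched onto the GPUs.
teacher_model_kwargs = {"device_map": "auto"}
student_model_kwargs = {"device_map": "auto"}

teacher_model = AutoModelForCausalLM.from_pretrained(config["models"]["teacher"], **teacher_model_kwargs)
student_model = AutoModelForCausalLM.from_pretrained(config["models"]["student"], **student_model_kwargs)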

Launch command:

 accelerate launch distil_logits.py

accelerate config:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero3_save_16bit_model: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
[rank3]: Traceback (most recent call last):
[rank3]:   File "/workspaces/projects/train/distill/DistillKit/distil_logits.py", line 230, in <module>
[rank3]:     trainer.train(resume_from_checkpoint=config["training"]["resume_from_checkpoint"])
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py", line 451, in train
[rank3]:     output = super().train(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2245, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank3]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py", line 1392, in prepare
[rank3]:     result = self._prepare_deepspeed(*args)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py", line 1953, in _prepare_deepspeed
[rank3]:     engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
[rank3]:                                          ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/__init__.py", line 193, in initialize
[rank3]:     engine = DeepSpeedEngine(args=args,
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 273, in __init__
[rank3]:     self._configure_distributed_model(model)
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 1287, in _configure_distributed_model
[rank3]:     self._broadcast_model()
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 1205, in _broadcast_model
[rank3]:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank3]:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank3]:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank3]:     work = group.broadcast([tensor], opts)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
[rank3]:     return disable_fn(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_api.py", line 346, in __torch_dispatch__
[rank3]:     return DTensor._op_dispatcher.dispatch(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 167, in dispatch
[rank3]:     op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 400, in unwrap_to_op_info
[rank3]:     assert mesh is not None, f"found no DeviceMesh from dtensor args for {op_call}!"
[rank3]:            ^^^^^^^^^^^^^^^^
[rank3]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!

Other ranks report the same error.
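
For reference, the assertion is raised from DTensor dispatch during the c10d broadcast, so it looks like some parameter is already a DTensor by the time DeepSpeed broadcasts the model. A quick check I could add before trainer.train() (hypothetical snippet, not part of the original script):

from torch.distributed.tensor import DTensor

# Hypothetical check: list any parameters that are already DTensors before the
# model is handed to DeepSpeed; a plain model should return an empty list.
def dtensor_params(model):
    return [name for name, p in model.named_parameters() if isinstance(p, DTensor)]

print(dtensor_params(student_model))  # student_model: the model passed to the trainer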

torch version: 2.6.0+cu124
Python version: 3.12.3
Number of GPUs: 8

Teacher model: Qwen2.5-32B
Student model: Qwen2.5-7B
