I tried to run the script "distil_logits.py" and modified some code, but I still can't get it running because of a "no DeviceMesh from dtensor" error.
What I changed:
# 1. Fix the undefined keyword argument "num_items_in_batch" in compute_loss
def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
# 2. Change the function sharegpt_format so that it matches my dataset
# 3. Add device map so that the model is loaded to GPU
teacher_model_kwargs["device_map"] = "auto"
student_model_kwargs["device_map"] = "auto"
# 4. Change datasets and tokenizers in the config (a rough sketch of changes 1-3 is below).
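For context, this is roughly how changes 1-3 look in my local copy. The class name LogitsTrainer, the *_model_kwargs names, and the bf16 dtype are from my edit and may differ from the upstream script; the distillation-loss body itself is unchanged:

import torch
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

# Change 1: compute_loss now accepts num_items_in_batch, which newer
# transformers versions pass in; only the signature changed.
class LogitsTrainer(SFTTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        ...  # distillation loss unchanged from the original script

# Change 3: device_map added so both models are placed on GPU at load time.
teacher_model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
student_model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B", **teacher_model_kwargs)
student_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", **student_model_kwargs)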
My launch command:
accelerate launch distil_logits.py
accelerate config:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero3_save_16bit_model: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
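The YAML above is the default config that `accelerate config` saved; the same run can also point at it explicitly (the path here is just an example):

accelerate launch --config_file /path/to/default_config.yaml distil_logits.py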
[rank3]: Traceback (most recent call last):
[rank3]: File "/workspaces/projects/train/distill/DistillKit/distil_logits.py", line 230, in <module>
[rank3]: trainer.train(resume_from_checkpoint=config["training"]["resume_from_checkpoint"])
[rank3]: File "/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py", line 451, in train
[rank3]: output = super().train(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2245, in train
[rank3]: return inner_training_loop(
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank3]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py", line 1392, in prepare
[rank3]: result = self._prepare_deepspeed(*args)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py", line 1953, in _prepare_deepspeed
[rank3]: engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/__init__.py", line 193, in initialize
[rank3]: engine = DeepSpeedEngine(args=args,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 273, in __init__
[rank3]: self._configure_distributed_model(model)
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 1287, in _configure_distributed_model
[rank3]: self._broadcast_model()
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/engine.py", line 1205, in _broadcast_model
[rank3]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank3]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank3]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank3]: work = group.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
[rank3]: return disable_fn(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank3]: return fn(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_api.py", line 346, in __torch_dispatch__
[rank3]: return DTensor._op_dispatcher.dispatch(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 167, in dispatch
[rank3]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 400, in unwrap_to_op_info
[rank3]: assert mesh is not None, f"found no DeviceMesh from dtensor args for {op_call}!"
[rank3]: ^^^^^^^^^^^^^^^^
[rank3]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
Other ranks report the same error.
torch version: 2.6.0+cu124
Python version: 3.12.3
Number of GPUs: 8
Teacher model: Qwen2.5-32B
Student model: Qwen2.5-7B