-
Hi, thanks for the great work. We are running the sample script:

We run the above script with gcloud:

When printing out the batch on each host/process, we get the following logs:

This seems to imply that we are getting the global batch (8 samples) on each host, even though we expected each host to see only its own shard of the batch.
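For reference, the way we print the batch is roughly this (a simplified sketch, not our exact script; `train_device_loader` stands in for the loader built by the sample script):

```python
import torch_xla.runtime as xr

for step, batch in enumerate(train_device_loader):
    # Log each host's view of the first batch
    # (assuming an input_ids-style dict batch).
    print(f"host={xr.process_index()} step={step} "
          f"input_ids.shape={tuple(batch['input_ids'].shape)}")
    break
```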
-
Hi @weirayao, thanks for checking out our code base. I think what you're seeing is consistent with the distributed execution model of torchprime, although the behavior is not intuitive. Here's what's happening:

- `DataLoader` outputs local tensors stored on the `cpu`.
- `MpDeviceLoader` outputs sharded global tensors stored on the TPU. If we print its output, that would give the appearance of duplication.
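To make that concrete, here's a minimal standalone sketch of the two loaders (illustrative only, not torchprime's actual training code; the dataset and mesh here are made up):

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

xr.use_spmd()  # opt in to the SPMD execution model

dataset = TensorDataset(torch.randn(64, 16))  # made-up dataset
loader = DataLoader(dataset, batch_size=8)    # plain PyTorch DataLoader

# 1D "data" mesh spanning every TPU chip in the job (all hosts).
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

device = xm.xla_device()
device_loader = pl.MpDeviceLoader(
    loader,
    device,
    # Shard dim 0 (the batch dim) of every input across the 'data' axis.
    input_sharding=xs.ShardingSpec(mesh, ('data', None)),
)

(cpu_batch,) = next(iter(loader))         # local tensor, device == cpu
(tpu_batch,) = next(iter(device_loader))  # sharded global tensor on the TPU
print(cpu_batch.device, tuple(cpu_batch.shape))
print(tpu_batch.device, tuple(tpu_batch.shape))  # global shape, even though
                                                 # each chip holds only a shard
```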
We can get a better understanding of this process by instrumenting the data loaders. I've uploaded an example at `torchprime/torch_xla_models/train.py` (lines 178 to 219 at `21633ae`). With that branch checked out, when I run
to train on two 4-chip TPU VMs, I see the following logs:
This shows that the `DataLoader` on each host yields its own local CPU tensor, while the `MpDeviceLoader` presents the same global tensor on every host. When you print that tensor, you see the global shape, which looks like duplication. A lot of this is hidden by the GSPMD execution model (https://docs.pytorch.org/xla/master/perf/spmd_basic.html). In this model, a tensor's data may be backed by several TPU chips including those from other hosts, but a layer of abstraction presents you the global shape as if the tensor lives locally. This is the same reason why the loss values printed by different hosts are the same: when we print the loss, all the hosts will transfer the same global scalar from the device and log an identical value. LMK if this answer makes sense.
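P.S. If you want to verify the sharding yourself, here is a rough sketch of how you might inspect a batch and the loss on every host (the `describe` helper is hypothetical and is not what's in `train.py`; `tpu_batch` and `loss` stand in for whatever your training loop produces):

```python
import torch_xla
import torch_xla.runtime as xr

def describe(tag, tensor):
    # Hypothetical helper: each host logs its own view of a tensor.
    host = xr.process_index()
    print(f"[host {host}] {tag}: shape={tuple(tensor.shape)} device={tensor.device}")
    if tensor.device.type == "xla":
        # Internal torch_xla call returning the HLO sharding annotation,
        # e.g. "{devices=[8,1]0,1,2,3,4,5,6,7}" for a batch sharded 8 ways.
        print(f"[host {host}] sharding={torch_xla._XLAC._get_xla_sharding_spec(tensor)}")

# `tpu_batch` and `loss` come from your training loop.
describe("batch", tpu_batch)  # global shape on every host, sharded across chips
# Printing the loss transfers the same global scalar to every host,
# which is why all hosts log identical numbers.
print(f"[host {xr.process_index()}] loss = {loss.item():.4f}")
```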
-
Hi @weirayao, I turned this question into a discussion thread. Feel free to ask more questions here and/or mark my answer as accepted.