Understanding the generated MLIR in multi-devices context #30271

yuanfz98 · 2025-07-17T05:35:47Z

Hello Community,

I have a custom PJRT plugin that compiles the MLIR generated for this test:

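import jax
import jax.numpy as jnp
import numpy as np
import pytest
from jax.sharding import Mesh, NamedSharding, PartitionSpec


@pytest.mark.parametrize("shape", [(512, 512)])
def test_shared_elementwise_compute(shape):
    # Arrange the first four devices as a 2x2 mesh with axes "x" and "y".
    devices = np.array(jax.devices()[:4]).reshape((2, 2))
    mesh = Mesh(devices, ("x", "y"))
    sharding = NamedSharding(mesh, PartitionSpec("x", "y"))

    # Build the input on the host CPU, then shard it across the mesh.
    with jax.default_device(jax.devices("cpu")[0]):
        x = jnp.ones(shape)
    x = jax.device_put(x, sharding)
    x = 2 * jnp.sin(x)

    # _to_cpu is a helper from my test suite (not shown here).
    with jax.default_device(jax.devices("cpu")[0]):
        x_cpu = _to_cpu(x)
        print("result x on CPU:", x_cpu)
        # Compare against the same elementwise computation done on the CPU.
        assert jnp.allclose(x_cpu, 2 * jnp.sin(jnp.ones(shape)))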

The first generated MLIR Module is:

module @jit__multi_slice attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func public @main(%arg0: tensor<512x512xf32>) -> (tensor<256x256xf32> {jax.result_info = "result[0]"}, tensor<256x256xf32> {jax.result_info = "result[1]"}, tensor<256x256xf32> {jax.result_info = "result[2]"}, tensor<256x256xf32> {jax.result_info = "result[3]"}) {
    %0 = stablehlo.slice %arg0 [0:256, 0:256] : (tensor<512x512xf32>) -> tensor<256x256xf32>
    %1 = stablehlo.slice %arg0 [0:256, 256:512] : (tensor<512x512xf32>) -> tensor<256x256xf32>
    %2 = stablehlo.slice %arg0 [256:512, 0:256] : (tensor<512x512xf32>) -> tensor<256x256xf32>
    %3 = stablehlo.slice %arg0 [256:512, 256:512] : (tensor<512x512xf32>) -> tensor<256x256xf32>
    return %0, %1, %2, %3 : tensor<256x256xf32>, tensor<256x256xf32>, tensor<256x256xf32>, tensor<256x256xf32>
  }
}

I understand this one: XLA is partitioning the original tensor.

But the second MLIR module is:

module @jit_sin attributes {mhlo.num_partitions = 4 : i32, mhlo.num_replicas = 1 : i32} {
  func.func public @main(%arg0: tensor<512x512xf32> {mhlo.sharding = "{devices=[2,2]<=[4]}"}) -> (tensor<512x512xf32> {jax.result_info = "result"}) {
    %0 = stablehlo.sine %arg0 : tensor<512x512xf32>
    return %0 : tensor<512x512xf32>
  }
}

This confuses me, because each device holds only a 256x256xf32 tensor. I expected something like:

module @jit_sin {
  func.func public @main(%arg0: tensor<256x256xf32>) -> (tensor<256x256xf32> {jax.result_info = "result"}) {
    %0 = stablehlo.sine %arg0 : tensor<256x256xf32>
    // ... maybe a collective op to transfer %0 to device 0 ...
    return %0 : tensor<256x256xf32>
  }
}

Any help will be appreciated, thanks.

Answered by guy-singer


yuanfz98 · 2025-07-17T10:03:49Z

I see that pjrt_c_api_client.cc does not lower the module in CompileAndLoad the way cpu_client.cc does, which suggests that SPMD partitioning may not be run.
If so, XLA does not provide a backend-neutral SPMD partitioner: the partitioners only run for CPUs and GPUs, because they are added to the pass pipelines in cpu_compiler.cc and gpu_compiler.cc.

1 reply

yuanfz98 Jul 17, 2025
Author

I wonder whether this is a drawback of the PJRT mechanism. Is there an existing PJRT implementation that works in a distributed context?

guy-singer · 2025-07-17T11:29:37Z

You're observing the difference between logical and physical tensor representations in JAX's compilation pipeline.

The second MLIR module is correct.

The second module shows the logical view where:

The function signature uses the full tensor shape tensor<512x512xf32>
The sharding annotation {devices=[2,2]<=[4]} tells the compiler how to physically distribute this logical tensor
Each device will only store and compute on its 256x256 shard
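
To make the annotation concrete, here is a minimal sketch of how the plain iota form `{devices=[2,2]<=[4]}` can be read. It is not XLA's parser: the function name `decode_iota_sharding` is hypothetical, and it assumes the simplest form only (no transpose, no replicated dimension). The ids it returns are logical partition ids, which the runtime maps to physical devices.

```python
import re
from itertools import product


def decode_iota_sharding(annotation, global_shape):
    """Decode `{devices=[d0,d1,...]<=[n]}` into (device_grid, shard_shape).

    Hypothetical helper: handles only the plain iota form, where ids
    0..n-1 are laid out in row-major order over the tile grid.
    """
    match = re.fullmatch(r"\{devices=\[([\d,]+)\]<=\[(\d+)\]\}", annotation)
    if match is None:
        raise ValueError(f"unsupported sharding annotation: {annotation}")
    tiles = [int(t) for t in match.group(1).split(",")]
    num_devices = int(match.group(2))

    total_tiles = 1
    for t in tiles:
        total_tiles *= t
    if total_tiles != num_devices:
        raise ValueError("tile grid does not match the device count")
    if len(tiles) != len(global_shape):
        raise ValueError("one tile count per tensor dimension is expected")
    if any(dim % t for dim, t in zip(global_shape, tiles)):
        raise ValueError("this sketch assumes evenly divisible dimensions")

    # Row-major enumeration: the last tile coordinate varies fastest.
    coords = product(*[range(t) for t in tiles])
    device_grid = {coord: i for i, coord in enumerate(coords)}
    shard_shape = tuple(dim // t for dim, t in zip(global_shape, tiles))
    return device_grid, shard_shape


grid, shard_shape = decode_iota_sharding("{devices=[2,2]<=[4]}", (512, 512))
print(shard_shape)  # per-device shape
print(grid)         # tile coordinate -> logical device id
```

For the module in the question, this gives a 2x2 tile grid over the 512x512 tensor, so each partition owns one 256x256 tile.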

The logical view is used because the compiler can perform cross-shard optimizations when it sees the full computation graph. It inserts collective operations automatically where the sharding annotations require them, and functions stay readable without explicit per-device logic.

At runtime with your sharding:

Device 0: processes arg0[0:256, 0:256]
Device 1: processes arg0[0:256, 256:512]
Device 2: processes arg0[256:512, 0:256]
Device 3: processes arg0[256:512, 256:512]

Each device computes sine on its local 256x256 shard, but the MLIR represents this as a logical operation on the full tensor.
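
The equivalence between the two views can be checked with a small sketch in plain Python. It is an illustration only, not what XLA executes: `shard_slices` and `take` are hypothetical helpers, the tile order is assumed to be row-major as in the device list above, and an 8x8 matrix stands in for the 512x512 tensor to keep it quick.

```python
import math
from itertools import product


def shard_slices(global_shape, tiles):
    """Yield (device_id, slices) in row-major order over the tile grid."""
    shard = [dim // t for dim, t in zip(global_shape, tiles)]
    coords = product(*[range(t) for t in tiles])
    for device_id, coord in enumerate(coords):
        yield device_id, tuple(
            slice(c * s, (c + 1) * s) for c, s in zip(coord, shard)
        )


def take(matrix, rows, cols):
    """Return the sub-matrix selected by a row slice and a column slice."""
    return [row[cols] for row in matrix[rows]]


# Index ranges per device for the 512x512 tensor on a 2x2 grid.
ranges_512 = {
    device_id: [(s.start, s.stop) for s in slices]
    for device_id, slices in shard_slices((512, 512), (2, 2))
}

# Logical view: sine over the whole (small stand-in) tensor.
n = 8
x = [[float(i * n + j) for j in range(n)] for i in range(n)]
logical = [[math.sin(v) for v in row] for row in x]

# Physical view: each device applies sine to its own shard only.
shards_match = True
for device_id, (rows, cols) in shard_slices((n, n), (2, 2)):
    local_out = [[math.sin(v) for v in row] for row in take(x, rows, cols)]
    shards_match = shards_match and local_out == take(logical, rows, cols)

print(ranges_512)
print("per-shard results match the logical result:", shards_match)
```

Because sine is elementwise, no communication between shards is needed here; an operation that mixes elements across shards, such as a reduction, is where collectives would come in.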

The first module (jit__multi_slice) explicitly slices the tensor because it comes from a different compilation path: most likely the utility JAX runs for jax.device_put(x, sharding) to split the array into the individual per-device shards.

Your expected MLIR with explicit 256x256 signatures is roughly what the program looks like after SPMD partitioning, a later compiler pass that rewrites the annotated module into a per-device program and inserts any communication it needs. JAX emits the module before that pass, so it carries the logical shapes plus sharding annotations and leaves distribution to the compiler.
