Dear Community,

I'm writing an astrophysical magnetohydrodynamics code (https://github.com/leo1200/jf1uids/) and am currently working on scaling it to multiple GPUs. As a test bench, I've written a simpler fluid code (https://github.com/leo1200/tinyfluids).

Communication is necessary for spatial finite differencing as well as for interface calculations, which require data from the cells to the left and right of each interface. To get the left and right states along the different spatial dimensions, I use slicing.
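A minimal sketch of the pattern I mean (simplified and illustrative; `left_right_states` is a hypothetical helper, not the actual tinyfluids function):

```python
from jax import lax

def left_right_states(q, axis):
    # q holds the cell-centred states; the interface between cells i and
    # i + 1 takes cell i as its left state and cell i + 1 as its right state
    n = q.shape[axis]
    left = lax.slice_in_dim(q, 0, n - 1, axis=axis)
    right = lax.slice_in_dim(q, 1, n, axis=axis)
    return left, right
```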
(See https://github.com/leo1200/tinyfluids/blob/main/tinyfluids/jax_tinyfluids/fluid.py#L385 for the actual implementation; for finite differencing alone, one might also use an appropriate convolution.)

Sharding the input onto multiple GPUs and relying on jit alone, however, yields very poor scaling, especially compared to my custom solution based on shard_map with halo exchange at the shard interfaces, as demonstrated below (running on four H100s); "speedup" refers to using just jit with sharded data, "shard mapped" to using a shard_map with halo exchange. From tests on other hardware as well, it seems that sharding the data + jit incurs a pretty significant memory-exchange overhead in this case, which limits scaling.

My question now is: is there a way to obtain better scaling without using shard_map or other custom solutions? Upstreaming shard mapping and halo exchange to jf1uids would make our MHD code more complex and less user-friendly, and at the end of the day our goal is to enable a broad community of astrophysicists to contribute to the physics.

I'm very much looking forward to suggestions, and it would be super cool if someone is interested in playing around with the codes a bit. Animating those (astro-)fluid flows can be quite satisfying (they're super beautiful), especially if the scaling is nice and the simulation itself is fast :-)

Best wishes
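PS: For context, the halo-exchange pattern I mean looks roughly like this (a simplified sketch for 1D sharding with periodic wrap-around, not the actual tinyfluids implementation):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

n_shards = jax.device_count()
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

def _exchange_halos(q):
    # q is the local shard, sharded along axis 1; fetch one ghost cell
    # from each neighbouring shard (periodic) and pad the shard with it
    from_left = jax.lax.ppermute(
        q[:, -1:], axis_name="x",
        perm=[(i, (i + 1) % n_shards) for i in range(n_shards)])
    from_right = jax.lax.ppermute(
        q[:, :1], axis_name="x",
        perm=[(i, (i - 1) % n_shards) for i in range(n_shards)])
    return jnp.concatenate([from_left, q, from_right], axis=1)

exchange_halos = shard_map(_exchange_halos, mesh=mesh,
                           in_specs=P(None, "x"), out_specs=P(None, "x"))
```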
Replies: 1 comment
It looks like using jnp.roll instead of the slicing gets me to pretty much perfect scaling without any custom code.
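In case it's useful to others, the change is essentially the following (an illustrative sketch, not the exact tinyfluids code; the wrapped values at the domain boundary still need the usual boundary treatment):

```python
import jax.numpy as jnp

def left_right_states(q, axis):
    # cell i is the left state of interface i + 1/2; jnp.roll shifts
    # cell i + 1 into place as the right state (with periodic wrap-around)
    left = q
    right = jnp.roll(q, shift=-1, axis=axis)
    return left, right
```

Presumably XLA can lower the roll on sharded data to a small neighbour-to-neighbour transfer of the edge cells, rather than the larger data movement the sliced version triggered, which would explain the near-perfect scaling.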