Replies: 1 comment 5 replies
Yes! That is correct: since each device gets a unique piece of the data, you shard the batch across all axes of the mesh. But I would recommend using
TL;DR: What is the recommended way to use `shard_map` with host-local data loading in multi-host training?

I'm trying to migrate from `pmap` to the `shard_map` API. First, we create a mesh, and we need to define the partition specs for the model and the data. I'm using simple data parallelism, so the model sharding is easy:
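Schematically, something like this (the concrete axis names and the 2D hosts × local-devices layout here are placeholders, not the exact code):

```python
import jax
import numpy as np
from jax.sharding import Mesh, PartitionSpec as P

# Placeholder axis names; below they are referred to as HOST_AXIS and DEVICE_AXIS.
HOST_AXIS = "hosts"
DEVICE_AXIS = "devices"

# One mesh row per host process, one column per local device.
devices = np.asarray(jax.devices()).reshape(
    jax.process_count(), jax.local_device_count()
)
mesh = Mesh(devices, axis_names=(HOST_AXIS, DEVICE_AXIS))

# Pure data parallelism: the model is replicated on every device.
model_spec = P()
```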
I'm confused about how to partition the host-local data. Since each host loads its own data, I simply want each host to shard along the local device axis:
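That is, roughly the following (continuing the placeholder names from the sketch above; the loss computation is just a stand-in):

```python
from functools import partial

import jax.numpy as jnp
from jax import lax
from jax.experimental.shard_map import shard_map

# Only the local-device axis appears in the data spec; each host passes in
# the batch it loaded itself.
data_spec = P(DEVICE_AXIS)

@partial(shard_map, mesh=mesh, in_specs=(model_spec, data_spec), out_specs=P())
def mean_loss(params, batch):
    # Each program instance sees its per-device slice of `batch`; reduce over
    # both mesh axes so the result is fully replicated, matching out_specs=P().
    local = jnp.mean(batch)
    return lax.pmean(local, (HOST_AXIS, DEVICE_AXIS))
```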
Surprisingly, this works in my setup (the loss matches the `pmap` case). However, I'm curious whether I'm relying on undefined or poorly defined behavior: given a global mesh, `P(DEVICE_AXIS)` seems to imply that the data should be sharded across the device axis and then replicated across hosts, which is not what I want with host-local data loading. Should I instead use `host_local_array_to_global_array` to collect the host-local data along the batch dimension, and then partition with `P((HOST_AXIS, DEVICE_AXIS))` to shard the aggregated batch dimension across devices (roughly as sketched below)?

Thanks in advance.
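Concretely, the alternative I'm asking about would look something like this (same placeholder names as above; `local_batch` stands in for whatever this host's dataloader produced):

```python
from jax.experimental import multihost_utils

local_batch = np.zeros((8, 128), dtype=np.float32)  # stand-in host-local batch
global_spec = P((HOST_AXIS, DEVICE_AXIS))

# Stitch the per-host batches into one global jax.Array whose leading (batch)
# dimension is sharded over both mesh axes.
global_batch = multihost_utils.host_local_array_to_global_array(
    local_batch, mesh, global_spec
)

@partial(shard_map, mesh=mesh, in_specs=(model_spec, global_spec), out_specs=P())
def mean_loss_global(params, batch):
    local = jnp.mean(batch)
    return lax.pmean(local, (HOST_AXIS, DEVICE_AXIS))
```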