Pmap always slower than sharding? #18188

IrishWhiskey · 2023-10-19T15:12:24Z

IrishWhiskey
Oct 19, 2023

I used pmap multiple times to parallelize JAX models but it looks like it makes the code slower. I also noticed that sharding always makes it faster. What's the reason? I would expect the sharding to use pmap somehow.

I created an example to show this phenomena.
Consider the following piece of code:

import jax
jax.config.update("jax_enable_x64", True)
import jax.numpy as jnp

from jax import random
import numpy as np

def generate_vectors(shapes: dict, seed = 42):
    vectors = dict()
    
    np.random.seed(seed)
    for name, shape in shapes.items():
        vectors[name] = jax.device_put(np.random.rand(*shape))
        
    return vectors

@jax.jit
def loss(params: dict, data: dict):
    return jnp.exp(jnp.mean(
        data["X_a"] @ params["w_a"] +
        data["X_b"] @ params["w_b"]
    ))

param_shapes = dict(
    w_a=(21,),
    w_b=(15,)
)

data_shapes = dict(
    X_a=(10000000, 21),
    X_b=(10000000, 15)
)

params = generate_vectors(param_shapes)
data = generate_vectors(data_shapes)

After running this, I measured the execution time of loss(params, data) and I got ~5ms.

I tried to parallelize the model by using pmap and I found the time to be ~40ms:

num_devices = 4
pmap_data = dict()
for k in data.keys():
    pmap_data[k] = data[k].reshape((num_devices, data[k].shape[0] // num_devices,) + data[k].shape[1:])
    
pmap_loss = jax.pmap(loss, in_axes=(
    dict(w_a=None, w_b=None),
    dict(X_a=0, X_b=0)
))

I then tried to shard the input data and I got execution time of ~2ms.

from jax.experimental import mesh_utils
from jax.sharding import PositionalSharding

def distribute_data(dataset: dict, num_devices: int = len(jax.local_devices())) -> dict:
    sharding = PositionalSharding(mesh_utils.create_device_mesh((num_devices,)))
    dataset_sharded = dict()

    for k, v in dataset.items():
        sharding_shape = (num_devices,) + (1,) * (len(v.shape) - 1)
        dataset_sharded[k] = jax.device_put(v, sharding.reshape(*sharding_shape))
            
    return dataset_sharded

data = distribute_data(data, num_devices=4)

Does my code contain a bug or is pmap really slower than the sharding? Moreover, in this case the pmapped function seems to be much slower than the original one.

I ran these tests on an AWS SageMaker Notebook instance of type ml.p3.8xlarge. I used the tensorflow2_p310 kernel and installed jax by running pip install -U "jax[cuda]==v0.4.18" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html.
I attach the jupyter notebook I used.

yashk2810 · 2023-10-19T19:33:49Z

yashk2810
Oct 19, 2023
Collaborator

I would recommend using jit for parallelizing and sharding your computation instead of pmap: https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html

If you want to write manual collectives, then you can use shard_map: https://jax.readthedocs.io/en/latest/jep/14273-shard-map.html

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pmap always slower than sharding? #18188

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Pmap always slower than sharding? #18188

Uh oh!

IrishWhiskey Oct 19, 2023

Replies: 1 comment

Uh oh!

yashk2810 Oct 19, 2023 Collaborator

IrishWhiskey
Oct 19, 2023

yashk2810
Oct 19, 2023
Collaborator