Using jax.pmap/vmap and jax.lax.switch to execute a simulation in parallel. Is it an anti-pattern?
#20916
-
My question: is this some form of an anti-pattern?

Toy example code:

```python
from typing import Any, Protocol, Tuple

import jax
import jax.numpy as jnp

PyTree = Any


class Generator(Protocol):
    def __call__(self, key: jax.Array) -> PyTree:
        ...


def _build_batch_matrix(batchsizes: list[int]) -> jax.Array:
    # Map each sample index to the generator it belongs to,
    # e.g. [2, 1] -> [0, 0, 1].
    arr = []
    for i, l in enumerate(batchsizes):
        arr += [i] * l
    return jnp.array(arr)


def _distribute_batchsize(batchsize: int) -> Tuple[int, int]:
    """Split the total batchsize into (pmap_size, vmap_size)."""
    vmap_size_min = 8
    if batchsize <= vmap_size_min:
        return 1, batchsize
    else:
        n_devices = jax.local_device_count()
        assert (
            batchsize % n_devices
        ) == 0, f"Your GPU count of {n_devices} does not split batchsize {batchsize}"
        vmap_size = int(batchsize / n_devices)
        return int(batchsize / vmap_size), vmap_size


def _merge_batchsize(tree: PyTree, pmap_size: int, vmap_size: int) -> PyTree:
    # Flatten the (pmap, vmap) leading axes back into a single batch axis.
    return jax.tree_util.tree_map(
        lambda arr: arr.reshape((pmap_size * vmap_size,) + arr.shape[2:]), tree
    )


def batch_generators_lazy(
    generators: list[Generator],
    batchsizes: list[int],
) -> Generator:
    """Create a large generator by stacking multiple generators lazily."""
    assert len(generators) == len(batchsizes)

    batch_arr = _build_batch_matrix(batchsizes)
    bs_total = len(batch_arr)
    pmap_size, vmap_size = _distribute_batchsize(bs_total)
    batch_arr = batch_arr.reshape((pmap_size, vmap_size))

    @jax.pmap
    @jax.vmap
    def _generator(key, which_gen: int):
        return jax.lax.switch(which_gen, generators, key)

    def generator(key):
        pmap_vmap_keys = jax.random.split(key, bs_total).reshape(
            (pmap_size, vmap_size, 2)
        )
        data = _generator(pmap_vmap_keys, batch_arr)
        data = _merge_batchsize(data, pmap_size, vmap_size)
        return data

    return generator


def generator_factory(hyperparams) -> Generator:
    def generator(key):
        # expensive simulation
        return dict(X=jnp.array(0.0) + hyperparams)

    return generator


M = 4
N_m = 16
generators = [generator_factory(hyperparam) for hyperparam in range(M)]
batchsizes = M * [N_m]

batched_generator = batch_generators_lazy(generators, batchsizes)
batched_generator(jax.random.PRNGKey(1))
```
-
As nobody else has answered yet I'll just suggest a comment, although I cannot test it without a reproducible example. I believe this is due to `jax.lax.switch` evaluating every branch when the branch index is a traced (e.g. vmapped) value. My solution was to replace `return jax.lax.switch(which_gen, generators, key)` with plain list indexing, `return generators[which_gen](key)`. This avoids the evaluation of all branches during the switch statement.
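A small sketch (my own, not from the original post) illustrating the cost being described: when the index of a `jax.lax.switch` is batched under `vmap`, JAX cannot pick a single branch at trace time, so every branch ends up in the traced computation, as the jaxpr shows:

```python
import jax
import jax.numpy as jnp

branches = [jnp.sin, jnp.cos]


def f(which, x):
    return jax.lax.switch(which, branches, x)


# With a batched `which`, all branches are traced (and evaluated,
# then selected per element), so both sin and cos appear in the jaxpr.
jaxpr = jax.make_jaxpr(jax.vmap(f))(jnp.array([0, 1]), jnp.array([0.1, 0.2]))
print(jaxpr)
```

Note that the suggested plain list indexing, `generators[which_gen](key)`, only works when `which_gen` is a concrete Python integer, i.e. when it is not itself a traced/vmapped value.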
-
Did you manage to resolve this issue?
No, I never found a better/faster solution to this. I ended up just optimizing other parts to avoid having to call this logic too often, and accepted that it takes a couple of hours to generate the data.
If I were to do it again, I would probably not implement this part in JAX. If you need lots of branching (e.g. via `jax.lax.switch`), then JAX might not be the best option.
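For completeness, one way to sidestep the branching entirely (my own sketch, not what the thread settled on): since each sample's generator index is known before tracing, the keys can be grouped per generator on the host, each generator vmapped only over its own slice, and the results concatenated. This trades the `pmap` sharding for simple per-generator vmaps:

```python
from typing import Any, Callable

import jax
import jax.numpy as jnp


def batch_generators_eager(
    generators: list[Callable[[jax.Array], Any]],
    batchsizes: list[int],
    key: jax.Array,
):
    # One PRNG key per sample, assigned contiguously per generator.
    keys = jax.random.split(key, sum(batchsizes))
    outputs, start = [], 0
    for gen, n in zip(generators, batchsizes):
        # vmap each generator only over its own keys -- no lax.switch needed.
        outputs.append(jax.vmap(gen)(keys[start:start + n]))
        start += n
    # Concatenate the per-generator pytrees along the batch axis.
    return jax.tree_util.tree_map(lambda *xs: jnp.concatenate(xs), *outputs)
```

Because each `vmap` sees a single generator, no branch is ever traced more than once; device parallelism would have to be reintroduced separately (e.g. via `jax.jit` with sharding), which is beyond this sketch.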