Understanding asynchronous dispatch and running multiple functions in parallel #26312
Unanswered
markus7800 asked this question in Q&A
-
On the GPU backend we have to wait for the completion of every loop iteration in order to copy the predicate back to the host and decide whether to execute the next loop iteration, so effectively only computations after the loop body are "asynchronous". Copying the predicate back basically forces XLA to sync the CUDA stream with the host. We do have an internal bug (b/382117736, sorry, link for Googlers only) that fixes this by launching PJRT/XLA operations in a dedicated thread pool, but I am not sure when it will be fixed.
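A hypothetical sketch (not from the thread) of the kind of loop this answer is about: a jitted `lax.while_loop` whose condition is a traced predicate. On GPU, that predicate must be copied back to the host before the next iteration can be launched, which is the sync the answer describes.

```python
import jax
from jax import lax

# Hypothetical illustration: a jitted while_loop. On GPU, each iteration's
# predicate has to be copied back to the host before the next iteration can
# be enqueued, so the loop runs in lock-step with the host even though the
# call itself returns an unmaterialized (asynchronously dispatched) array.

@jax.jit
def count_down(x):
    def cond(state):
        i, _ = state
        return i > 0  # this predicate is what forces the stream sync

    def body(state):
        i, acc = state
        return i - 1, acc + i

    return lax.while_loop(cond, body, (x, 0))

_, total = count_down(10)  # sums 10 + 9 + ... + 1
print(int(total))          # blocks until the loop has finished -> 55
```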
-
Hello,
As explained in https://jax.readthedocs.io/en/latest/async_dispatch.html, JAX does not wait for an operation to complete before returning control to the Python program.
I tested this with a simple test function, and for test1(10_000) I got the expected output: the first print statement is executed immediately because of asynchronous dispatch.

Now consider a second test setup, where I hoped to asynchronously dispatch the scan operation. Calling it with a computationally expensive step and a small number of iterations, test2(10_000, 2), versus an inexpensive step and a large number of iterations, test2(100, 10_000), gives different behaviour: seemingly, the scan operation was asynchronously dispatched only in the first case and executed sequentially in the second.

I know that I can vectorise the computation, and executing test3(100, 10_000) confirms that with vmap my GPU is able to perform both scan operations in parallel in the same time.

My question is: how would I parallelise two scan operations that use different step functions, where vmap in this form is not available?

This question is related to #673, #25630, #20916, and #23306.
In the answers, a switch operation is often recommended. But vmap over switch turns it into a select where all branches are executed. In my case, the step functions are expensive and I want to apply each of them to only a subset of the data.
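To make the select behaviour concrete, here is a hypothetical sketch (branch functions and values are my own illustration): `lax.switch` picks one branch for a scalar index, but under `vmap` every branch is evaluated for every batch element and the results are merged with a select.

```python
import jax
import jax.numpy as jnp
from jax import lax

# Hypothetical sketch: lax.switch on a scalar index runs one branch, but
# its vmapped version lowers to a select, so both branches below execute
# for every element of the batch (costly when the branches are expensive).

def apply_branch(index, x):
    branches = [lambda v: v + 1.0, lambda v: v * 10.0]
    return lax.switch(index, branches, x)

indices = jnp.array([0, 1, 0])
xs = jnp.array([1.0, 2.0, 3.0])
out = jax.vmap(apply_branch)(indices, xs)
print(out.tolist())  # [2.0, 20.0, 4.0]
```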
On CPU, pmap would probably work to distribute the step functions to different cores, but I want to run the operations on one GPU, where pmap (or shard_map) does not help, I think.

I would appreciate any advice for my situation, or a confirmation that what I want to achieve is not possible with JAX.

Thanks!