Data dependency heuristics for in-place operations are too conservative #19165
-
Consider the following MWE.

```python
import time
from functools import partial

import jax.numpy as jnp
from jax import Array, jit, random


@partial(jit, donate_argnums=0)
def f(x: Array) -> tuple[Array, Array]:
    x = x.at[0, 0].add(1)
    y = x[0, 0]
    return x, y


if __name__ == "__main__":
    n = 10**4
    rng = random.key(0)
    rng, subkey = random.split(rng)
    x = random.uniform(subkey, shape=(n, n))
    f(jnp.copy(x))[0].block_until_ready()  # warm-up call to compile
    start = time.time()
    x = f(x)[0].block_until_ready()
    print(f"{time.time() - start:.3e}")
```

As expected, the update happens in place and runs quickly. However, if the assignment of `y` is moved before the update,

```python
@partial(jit, donate_argnums=0)
def f(x: Array) -> tuple[Array, Array]:
    y = x[0, 0]
    x = x.at[0, 0].add(1)
    return x, y
```

the update happens out of place and is very slow. My intuition for what's happening is that in the first example the only read of `x[0, 0]` comes after the update, so the donated buffer can be reused directly, whereas in the second example `y` depends on the *original* value of `x[0, 0]`, and XLA conservatively copies the whole array rather than recognizing that only a single scalar needs to be preserved.

Is there any way to declare that the requisite data from the original array has already been read, so that the update can still happen in place? (Based on what I've read so far I suspect the answer is "no", but I'd be happy to be proven wrong!)

If the MWE seems contrived, the exact code I'm trying to optimize is as follows.

```python
@partial(jit, donate_argnums=(0, 1))
def f(x: Array, y: Array, i: int, k: int) -> tuple[Array, Array]:
    v = x[k, i]
    x = x.at[k, i].set(0.0)
    x = x.at[:, i].add(-(x @ x[k]))
    x = x.at[:, i].divide(jnp.sqrt(v + x[k, i]))
    y = y.at[:].add(-jnp.square(x[:, i]))
    return x, y
```

Ideally all of these updates to `x` and `y` would happen in place.

Possibly related to #17845, #17640, and #10197 but much simpler (scan and autograd are not involved).
Replies: 1 comment
-
This MWE, at least, is fixed by the XLA flag `--xla_cpu_copy_insertion_use_region_analysis=true`. See #25399.
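For anyone trying this, a minimal sketch of how such a flag is typically applied (assuming the standard `XLA_FLAGS` environment-variable mechanism): it must be set before JAX is first imported, since XLA reads it at initialization.

```python
import os

# Append the flag to any existing XLA_FLAGS rather than clobbering them.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_cpu_copy_insertion_use_region_analysis=true"
).strip()

import jax  # noqa: E402  -- must come *after* setting XLA_FLAGS
```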