Jax 0.4.1 requires much more memory than jax 0.2.17 for this code #14537

tomdemeyere · 2023-02-16T21:22:49Z

Hello,

Could someone explain why this code takes much more memory (50GB+) on jax/jaxlib 0.4.1 than on jax 0.2.17/jaxlib 0.1.68 (~2GB)?

import numpy as np
import jax
import jax.numpy as jnp

T = np.random.rand(5000, 566, 3)
@jax.jit
def jit_bar(Y):
   # All index pairs (u, v) with u < v
   u, v = jnp.triu_indices(Y.shape[0], 1)
   return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
msd = jit_bar(T)

Thank you

jakevdp · 2023-02-17T17:53:26Z

jakevdp
Feb 17, 2023
Maintainer

Hi - I answered this question here. Answer copied below for posterity:

If Y is of shape (10000, 566, 3), then triu_indices(10000, 1) returns arrays of length (10000 * 9999) / 2 = 49995000, and so Y[u] and Y[v] are each of shape (49995000, 566, 3). If they are float32 values, then that size is about 316 GB each. I would not expect this code to run well anywhere!
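
The size estimate can be checked with a few lines of arithmetic. This is only a sketch: `gathered_bytes` is a hypothetical helper name, and it assumes float32 data and the trailing shape (566, 3) used in this thread. It counts the n * (n - 1) / 2 index pairs that `triu_indices(n, 1)` returns and multiplies by the bytes per gathered row.

```python
import numpy as np

def gathered_bytes(n, trailing_shape=(566, 3), dtype=np.float32):
    """Bytes needed to materialize Y[u] when u, v = triu_indices(n, 1)."""
    n_pairs = n * (n - 1) // 2  # strictly upper-triangular entries (k=1)
    row_elems = int(np.prod(trailing_shape))
    return n_pairs * row_elems * np.dtype(dtype).itemsize

# 5000 is the size in the question, 10000 the size used in the answer.
for n in (5000, 10000):
    print(n, f"{gathered_bytes(n) / 2**30:.1f} GiB per gathered array")
```

Both Y[u] and Y[v] need this much memory, so the total for the intermediate arrays is at least twice the figure printed.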

I suspect that older JAX versions may have had some additional optimization that was removed in later versions; given the form of your computation, the only thing that could have been is a factorization of the square difference to avoid instantiating the full matrix sum, which I vaguely recall was previously an XLA optimization but was removed because it's numerically unstable.

But you can do such an optimization manually if you wish. Here's an approach that seems to work; the largest intermediate array it generates for the original inputs is of shape [10000, 10000], about 380 MB in float32:

@jax.jit
def jit_bar2(Y):
   u, v = jnp.triu_indices(Y.shape[0], 1)
   Y = Y.reshape(Y.shape[0], -1)
   Y2m = (Y ** 2).mean(-1)
   YYTm = (Y @ Y.T) / Y.shape[1]
   return jnp.sqrt(3 * (Y2m[u] + Y2m[v] - 2 * YYTm[u, v]))

T = np.random.rand(50, 6, 3)  # test with a smaller input
np.testing.assert_allclose(jit_bar(T), jit_bar2(T), atol=1E-5)
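
For reference, the identity that jit_bar2 relies on can be written out. The notation here is my own, not from the original answer: y_u is row u of Y after flattening, and D = 566 * 3 is the number of elements per row.

```latex
\frac{1}{D}\sum_{k=1}^{D} \left(y_{uk} - y_{vk}\right)^2
  = \underbrace{\frac{1}{D}\sum_{k=1}^{D} y_{uk}^2}_{\texttt{Y2m[u]}}
  + \underbrace{\frac{1}{D}\sum_{k=1}^{D} y_{vk}^2}_{\texttt{Y2m[v]}}
  - 2\,\underbrace{\frac{1}{D}\sum_{k=1}^{D} y_{uk}\,y_{vk}}_{\texttt{YYTm[u, v]}}
```

The right-hand side needs only the per-row means and the n-by-n product matrix, so the (pairs, 566, 3) arrays are never built. The price is a subtraction of terms that can be nearly equal when two rows are close, which is the kind of numerical instability mentioned above.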

4 replies

tomdemeyere Feb 17, 2023
Author

Thank you for your answer. I must admit I don't have this kind of thinking yet. I will probably use JAX more intensively in the future. I understand that experience probably plays a big role, but would you have any material that would help? (I don't do any ML, I just need to compute simple stuff on big arrays most of the time.)

Also, it seems that the newer version is still slower, even with your function. Excluding compilation time, a run with T = np.random.rand(2000, 566, 3) gives:

jax 0.2.17/jaxlib 0.1.68: 0.0001995563507080078 sec
jax/jaxlib 0.4.1: 0.0947587490081787 sec

That is roughly 500 times slower. On big arrays I also get a warning:

2023-02-17 20:51:04.938968: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 1s:

add.34 (displaying the full instruction incurs a runtime overhead. Raise your logging level to 4 or above).

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-02-17 20:51:04.962233: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 1.023467129s
Constant folding an instruction is taking > 1s:

add.34 (displaying the full instruction incurs a runtime overhead. Raise your logging level to 4 or above).

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-02-17 20:51:12.914402: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 2s:

concatenate.47 (displaying the full instruction incurs a runtime overhead. Raise your logging level to 4 or above).

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-02-17 20:51:12.966697: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 2.052669146s
Constant folding an instruction is taking > 2s:

concatenate.47 (displaying the full instruction incurs a runtime overhead. Raise your logging level to 4 or above).

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.

Should I stick to the old version?

jakevdp Feb 17, 2023
Maintainer

It's hard to say without more information. Can you share your code, including the method you're using to run the benchmark?

But in general, jax 0.2.17 is quite old and you should move to a more recent version if possible.
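
On the constant-folding warnings quoted above, one thing that may be worth trying is to build the index arrays outside the jitted function and pass them in as arguments. This is a sketch under an assumption that is not confirmed anywhere in this thread: that the constants XLA is folding are the index arrays created by `jnp.triu_indices` inside the jitted function. `jit_bar2_idx` is a hypothetical variant of jit_bar2 written for this illustration.

```python
import numpy as np
import jax
import jax.numpy as jnp

@jax.jit
def jit_bar2_idx(Y, u, v):
    # Hypothetical variant of jit_bar2: u and v arrive as ordinary
    # arguments instead of being created inside the jitted function.
    Y = Y.reshape(Y.shape[0], -1)
    Y2m = (Y ** 2).mean(-1)
    YYTm = (Y @ Y.T) / Y.shape[1]
    return jnp.sqrt(3 * (Y2m[u] + Y2m[v] - 2 * YYTm[u, v]))

T = np.random.rand(50, 6, 3)  # test with a smaller input
u, v = np.triu_indices(T.shape[0], 1)  # computed once with NumPy, outside jit
out = jit_bar2_idx(T, u, v)
```

Whether this removes the warning depends on the JAX and XLA versions in use, so it is worth checking against the real input size.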

tomdemeyere Feb 20, 2023
Author

My benchmark simply consists of running the function twice and timing the second call, so that compilation is excluded.

Adding block_until_ready() seems to show that indeed the newest version is faster (by a factor of two).

Thank you for your precious advice and time.

jakevdp Feb 20, 2023
Maintainer

OK, thanks! The reason I ask is that there are several ways you can easily go wrong in running benchmark comparisons of straightforward JAX code. See https://jax.readthedocs.io/en/latest/faq.html#benchmarking-jax-code for details.
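
A minimal sketch of the timing pattern discussed in this exchange: one warm-up call to trigger compilation, then a timed call that ends with `block_until_ready()`, since JAX dispatches work asynchronously and the call can otherwise return before the computation finishes. The function name `pairwise_rms` and the small input shape are assumptions made for this illustration; the body is the same computation as jit_bar2.

```python
import time
import numpy as np
import jax
import jax.numpy as jnp

@jax.jit
def pairwise_rms(Y):
    # Same factorized computation as jit_bar2 in this thread.
    u, v = jnp.triu_indices(Y.shape[0], 1)
    Y = Y.reshape(Y.shape[0], -1)
    Y2m = (Y ** 2).mean(-1)
    YYTm = (Y @ Y.T) / Y.shape[1]
    return jnp.sqrt(3 * (Y2m[u] + Y2m[v] - 2 * YYTm[u, v]))

T = np.random.rand(200, 6, 3)  # small input so the sketch runs quickly

# Warm-up call: compiles the function for this input shape.
pairwise_rms(T).block_until_ready()

start = time.perf_counter()
# block_until_ready() waits for the asynchronous computation to finish.
out = pairwise_rms(T).block_until_ready()
elapsed = time.perf_counter() - start
print(f"{elapsed:.6f} s for {out.shape[0]} pairs")
```

Without the final `block_until_ready()`, the timer may measure little more than the dispatch overhead. Note that passing a NumPy array also includes the host-to-device transfer in the measured time.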
