jnp.dot vs explicit dot product followed by jnp.sum results in drastically different memory uses #17924
-
```python
import jax
import jax.numpy as jnp
import numpy as np

n = 100000
d = 50000

pos = np.random.random((n, 3)).astype(np.float32) - 0.5
HKL = np.random.random((d, 3)).astype(np.float32) - 0.5
pos = jnp.array(pos)
HKL = jnp.array(HKL)

def explicit(pos, HKL):
    phase = HKL[:, 0] * pos[0] + HKL[:, 1] * pos[1] + HKL[:, 2] * pos[2]
    return jnp.cos(phase)

def dot(pos, HKL):
    phase = jnp.dot(HKL, pos)
    return jnp.cos(phase)

def f1(pos, HKL):
    return jnp.sum(jax.vmap(explicit, in_axes=(0, None))(pos, HKL), axis=0)

def f2(pos, HKL):
    return jnp.sum(jax.vmap(dot, in_axes=(0, None))(pos, HKL), axis=0)

print(jax.jit(f1)(pos, HKL))
print(jax.jit(f2)(pos, HKL))
```

Why does the `dot` version use so much more memory than the `explicit` one?
It seems that in the second case the full result of the vmap is materialized before the sum is taken, instead of the sum being accumulated as the vmap proceeds.
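One way to check this hypothesis is to inspect the jaxpr that JAX traces for each function. The sketch below (with smaller array sizes than the original, chosen arbitrarily here for speed) prints the trace of the `dot`-based version; the `dot_general` primitive it contains produces the full n-by-d intermediate before the reduction:

```python
import jax
import jax.numpy as jnp
import numpy as np

# Smaller sizes than the original example, just for inspection.
pos = jnp.array(np.random.random((100, 3)).astype(np.float32) - 0.5)
HKL = jnp.array(np.random.random((50, 3)).astype(np.float32) - 0.5)

def f2(pos, HKL):
    return jnp.sum(
        jax.vmap(lambda p, H: jnp.cos(jnp.dot(H, p)), in_axes=(0, None))(pos, HKL),
        axis=0,
    )

# The printed jaxpr contains a dot_general whose output covers the full
# batch, i.e. the n x d intermediate exists before the sum reduces it.
print(jax.make_jaxpr(f2)(pos, HKL))
```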
-
Hi - thanks for the question! I think in some sense this is expected, in that different ways of expressing computations will in general lead to different usage of computational and memory resources. In a perfect world the XLA compiler would be able to recognize that these two sequences of operations are equivalent and choose the best approach given the resources available, but no perfect compiler exists. I'll address one of your points directly:
As far as I know, the XLA computational model will never "do the sum as the vmap proceeds". XLA is designed with vectorized computation in mind, where by "vectorized" I mean that the entire array is stored in memory, and a kernel is called that computes and stores the results. The compiler can fuse operations in many cases, but definitely not all cases. This computational model built on vectorized kernels is well-suited to accelerators like GPU and TPU, where the chip architecture allows such computations to complete very quickly.

If you instead want a computation to be done serially or in batches, it's up to you to express your computation in that manner. Your "explicit" function is essentially that. Does that answer your question?
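If you do want the sum accumulated as the loop proceeds, one way to express that explicitly is `jax.lax.scan`, which carries a running accumulator so that only a d-sized intermediate is live at each step. This is a minimal sketch (with smaller sizes than your example, and assuming the quantity you want is the same summed cosines that `f1`/`f2` compute):

```python
import jax
import jax.numpy as jnp
import numpy as np

n, d = 1000, 500  # smaller than the original, just for the sketch
pos = jnp.array(np.random.random((n, 3)).astype(np.float32) - 0.5)
HKL = jnp.array(np.random.random((d, 3)).astype(np.float32) - 0.5)

def f_scan(pos, HKL):
    # Accumulate the sum one row of `pos` at a time: the carry is the
    # running (d,)-shaped total, so the (n, d) intermediate never exists.
    def body(acc, p):
        return acc + jnp.cos(jnp.dot(HKL, p)), None
    total, _ = jax.lax.scan(body, jnp.zeros(HKL.shape[0]), pos)
    return total

result = jax.jit(f_scan)(pos, HKL)
```

The trade-off is serialization: on an accelerator the fully vectorized version will usually be faster when it fits in memory, and scanning over modest-sized chunks of `pos` (rather than single rows) is a common middle ground.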