vectorization and parallelization of code #9017
-
from jax import make_jaxpr, vmap
import jax.numpy as jnp
def f(x):
return x ** 2
x = jnp.arange(10)
# implicit vectorization via numpy-style broadcasting
print(make_jaxpr(f)(x))
# { lambda ; a:i32[10]. let b:i32[10] = integer_pow[y=2] a in (b,) }
# explicit vectorization via vmap
print(make_jaxpr(vmap(f))(x))
# { lambda ; a:i32[10]. let b:i32[10] = integer_pow[y=2] a in (b,) } As you can see, the two versions of the function result in the exact same computation being sent to the XLA compiler. I suspect this is similar to what's happening when you wrap your function in
As for taking advantage of in-device parallelism, XLA already does this to some extent without any explicit action on the user's part; for example, this is one reason computations can run so much faster on GPU and TPU. I'm not sure to what extent XLA analogously takes advantage of CPU threads; someone else may be able to answer that.
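If it helps to see the same point at runtime, here is a minimal timing sketch (not part of the original reply; the array size is just an illustrative choice): since both versions lower to the same computation, their execution times should be essentially identical once compilation time is excluded.

import timeit
import jax
import jax.numpy as jnp

def f(x):
    return x ** 2

x = jnp.arange(10_000_000)

# Compile both the implicitly broadcast and the explicitly vmapped versions.
f_implicit = jax.jit(f)
f_vmapped = jax.jit(jax.vmap(f))

# Warm up once so compilation is excluded, and block on results when timing.
f_implicit(x).block_until_ready()
f_vmapped(x).block_until_ready()

print(timeit.timeit(lambda: f_implicit(x).block_until_ready(), number=100))
print(timeit.timeit(lambda: f_vmapped(x).block_until_ready(), number=100))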
-
I have a two-dimensional array of size 2621440x4 that I want to vectorize. This translates to an object of shape [2621440, 4], where the 4 is the fastest-varying dimension ("C"-style, row-major layout).
I am applying some computation to each of the elements. The computation is pointwise and embarrassingly parallel. Hence, I apply vmap over the second dimension; however, I do not see any difference in execution time. I try with and without vmap and get the same execution time, and I jit-compile both cases.
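For reference, here is a rough sketch of the setup described above; the pointwise function g is a hypothetical stand-in, since the actual computation is not shown in the post.

import jax
import jax.numpy as jnp

def g(x):
    # Hypothetical stand-in for the actual pointwise computation.
    return jnp.sin(x) * jnp.cos(x) + x ** 2

data = jnp.zeros((2621440, 4))  # shape [2621440, 4]

# Without vmap: rely on numpy-style broadcasting over the whole array.
g_jit = jax.jit(g)

# With vmap over the second dimension, as described above.
g_vmap_jit = jax.jit(jax.vmap(g, in_axes=1, out_axes=1))

# Both lower to the same element-wise computation, which is consistent with
# the two versions showing the same execution time.
r1 = g_jit(data).block_until_ready()
r2 = g_vmap_jit(data).block_until_ready()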
Also, if I want to parallelize this over the threads of a CPU or across CPU sockets, will pmap help with that?
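For concreteness, the kind of CPU-level data parallelism being asked about might look like the sketch below. The XLA_FLAGS setting used to expose multiple CPU devices is an assumption added for illustration, not something taken from this thread, and it must be set before JAX is imported.

# Assumption for illustration: expose 8 CPU "devices" to XLA so that pmap
# has something to map over. This must happen before importing jax.
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
import jax.numpy as jnp

def g(x):
    return x ** 2  # hypothetical stand-in for the pointwise computation

print(jax.devices())  # should now list 8 CPU devices

data = jnp.zeros((2621440, 4))
# pmap maps over the leading axis, which must equal the device count, so the
# rows are first grouped into [num_devices, rows_per_device, 4].
sharded = data.reshape(8, -1, 4)
result = jax.pmap(g)(sharded).reshape(2621440, 4)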
Thanks.