Why does jit make this function 600 times slower? (bug!) #16934

RezaRob · 2023-08-02T20:58:56Z

RezaRob
Aug 2, 2023

This code shows a simple function that becomes 600 times slower when jit'ed. Note that the function is "static," meaning that it returns a constant value and has no inputs.

This happens after compilation.

If, instead of output, you input such a list into the function (a more typical pattern), accessing it is still very slow.

Often "call overhead" is blamed for this sort of thing, but I cannot imagine how any call overhead could possibly account for 600X slowdown factor? What explains that?

Placing the compute on CPU doesn't solve the issue.

I wanted to report this as a bug, but I want to understand the situation better.

Thanks.

import jax
import time

# jax.config.update('jax_platform_name', 'cpu')

def f():
        a=[None] * 10000
        for i in range(10000):
                a[i] = i
        return a

def test_f():
        t = time.time()
        f()
        t = time.time() - t
        print(t)

test_f()

f=jax.jit(f)

#warmup
f()
f()

test_f()

Answered by jakevdp

Aug 3, 2023

It's not clear why jit needs to be like ~600X slower when using deep Pytrees and accessing python structures. We're comparing it to raw Python (already the slowest language on the planet!) and this is compiled jit making things even slower in some situations.

The compiled JIT execution only happens once all values are placed on the device, and this device placement happens in Python, the slowest language on the planet 😁

The non-jit version here is "construct a Python list of 10000 Python integers". The JIT version is "construct a Python list of 10000 Python integers, and then allocate a space on the XLA device for each of them & copy the bits over". There's no possible way the second pr…

View full answer

jakevdp · 2023-08-02T21:16:53Z

jakevdp
Aug 2, 2023
Maintainer

Hi - thanks for the question! There are a couple things going on here:

JIT and for loops: when JAX's JIT encounters a Python for loop, it effectively unrolls the loop, creating a very large linear program representing the computations encoded by the loop. Such large programs can often lead to long compile times. In general, the best way to address this is using structured control flow primitives in place of Python control flow.
Device transfer: when a JAX function returns a python collection like a list, tuple, or dict, each individual value in that collection is a JAX array object. In your case, you have a list of 10000 Python integers, each of which must be individually pushed to device and wrapped in a jax.Array object within the JIT computation. This is the main issue causing long runtimes in your program; you could see this more directly by benchmarking something like this:

values = [jax.device_put(i) for i in range(10000)]

With these things in mind, I'd consider your example program working as expected. If I were trying to optimize the actual function being used here, I'd write it like this instead:

def f():
  return jnp.arange(10000)

This returns a single array rather than a list of scalars, and avoids both problems (1) and (2) above. I suspect your real use-case is more complicated than this toy function, but hopefully the ideas mentioned here can help you figure out how to more effectively implement your own use-case.

Hope that helps!

0 replies

RezaRob · 2023-08-03T01:23:53Z

RezaRob
Aug 3, 2023
Author

Thanks for replying, but this doesn't answer the question.

Regarding (1) and (2):
(1) I specified in my question that this happens after compilation.

(2) I specified in my question that setting device to CPU doesn't solve this issue. jax.config.update('jax_platform_name', 'cpu').
A 600X slowdown factor cannot even be explained by mere GPU transfers.

return arange(10000) isn't an answer to this either.

This code is testing jit Pytree access efficiency. I just want to know why that's very slow.

1 reply

jakevdp Aug 3, 2023
Maintainer

Hi, thanks for the response! Sorry my answer was unclear. I pointed out the compilation time issue because it's relevant to code that does large for loops – I realize your question separated runtime from compilation time, so that may have been confusing.

Still, I think the main issue here is my point 2: your code is slow because of the overhead of creating 10,000 individual scalar objects on the XLA device (even if that device is a CPU rather than GPU or TPU). When you JIT-compile a function, the returned values are not python scalars, but rather on-device XLA scalars, and those scalar objects must be created when you run the function.

You can see this cost by creating those objects directly (note that this is run on CPU):

%timeit [jax.device_put(i) for i in range(10000)]
# 954 ms ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Similarly, when you run a jit-compiled function that returns 10000 scalars, those scalars must be placed on the device:

@jax.jit
def f():
  return list(range(10000))

_ = f()  # compile...
%timeit f()
# 311 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You see that this is 3x faster than manually creating these on-device scalars – this is because the team has put in a lot of work to make the JIT dispatch path as fast as possible. So, your real question should have been "why is my JIT function so fast compared to manually placing all these Python scalars on the device?" 😁 (side-note: there is work underway this week to optimize device_put so it can more often take advantage of the fast path that jit uses).

That said, your code does show performance issues: how do you address this slowdown? Mainly, the answer is to avoid writing JIT-compiled functions that result in creating 10,000 on-device scalars. That's why I suggested jnp.arange(10000) rather than list(range(10000)): using large python collections in JAX will generally lead to slow execution, because this is not a domain that JAX has been optimized for; using large arrays in JAX will generally lead to fast execution, because this is a domain that JAX has been optimized for.

Does that make sense?

RezaRob · 2023-08-03T21:12:16Z

RezaRob
Aug 3, 2023
Author

Thank you very much for your reply.

That's really interesting! Actually, that's comparing apples and oranges. Those two expressions behave very differently. Jit performs really well on the first one [device_put()] and really poorly on the second list(range()). Here's the code and the results:

import jax
import time

jax.config.update('jax_platform_name', 'cpu')

def test_f():
        global f
        f()
        t = time.time()
        f()
        t = time.time() - t
        print(t)
        
        f = jax.jit(f)
        
        f()
        t = time.time()
        f()
        t = time.time() - t
        print(t)


def f():
        [jax.device_put(i) for i in range(10000)]
print("device_put non-jit/jit")
test_f()
print()

def f():
        return list(range(10000))
print("list(range()) non-jit/jit")
test_f()

Results:

1.2685680389404297
0.0004496574401855469

list(range()) non-jit/jit
0.00018787384033203125
0.5302731990814209

Here's my guess as to what's happening (I'm not sure how to disassemble/view the generated code; so, I'll take a guess here)...

In the 2nd expression list(range()), there is just "raw" python expressions, so the iterator(range) is fed one-by-one into the list.append() function. This is very slow under jit probably because jit is extremely slow in communicating with python functions.

device_put() is really interesting though, and jit is very fast here. I'm guessing that somehow, jit iterates and produces all the arrays "internally" without communicating with Python. Then all the sudden it makes one single call to convert this to a Python list.

The question is why is jit very slow in accessing Python functions(it seems)?

I think we both agree that this is also an issue when the list is input into (rather than output from) the function; which is the more typical use-case for jax. Let me stress again: the differences in performance are extremely large by percentage, not negligible.

You're absolutely right, this can be mitigated by batching (large arrays) and small Pytree sizes. So, let me say why I'm a little concerned about this. We have very deep ResNet models; over 100 layers. There is even a report of a 1000+ layer ResNet trained on CIFAR!
(See "Wider or Deeper: Revisiting the ResNet Model for Visual Recognition," Zifeng Wu).

(I know, that sounds really weird!). But people experiment with all sorts of things. Even for "shallow" ResNets, those convolutions are deep and complex enough that jax/lax offer specialized ops just for generalized convs. Numpy doesn't even have such an op.

Consequently, it's very conceivable that without such special ops, jax code(Pytrees) might get more deep and complex, and the overhead could be non-negligible.

It's not clear why jit needs to be like ~600X slower when using deep Pytrees and accessing python structures. We're comparing it to raw Python (already the slowest language on the planet!) and this is compiled jit making things even slower in some situations.

Is there a huge cost to removing this overhead? What's causing it?

2 replies

jakevdp Aug 3, 2023
Maintainer

I suspect the reason it differs is that in the device_put version, all the scalars are put to device at trace/compile time, while in the list version, all the scalars are put to device at runtime.

jakevdp Aug 3, 2023
Maintainer

It's not clear why jit needs to be like ~600X slower when using deep Pytrees and accessing python structures. We're comparing it to raw Python (already the slowest language on the planet!) and this is compiled jit making things even slower in some situations.

The compiled JIT execution only happens once all values are placed on the device, and this device placement happens in Python, the slowest language on the planet 😁

The non-jit version here is "construct a Python list of 10000 Python integers". The JIT version is "construct a Python list of 10000 Python integers, and then allocate a space on the XLA device for each of them & copy the bits over". There's no possible way the second program could be faster than the first, because the second program contains the first!

Answer selected by RezaRob

RezaRob · 2023-08-03T23:18:13Z

RezaRob
Aug 3, 2023
Author

Jake, I really appreciate your help with this very much. The code below confirms that you're apparently right. In summary, device array creation has significant overhead relative to the case of just accessing existing device arrays, and significantly higher than raw Python code. However, this is a very Micky Mouse test intended to stress test just the overhead and in practice large arrays on device will hide (much of) this overhead. I believe you also said that the folks are working on making this array creation more efficient. I'm still a little surprised about the amount of difference between raw Python and device arrays, but I understand that memory block alignments and all can make a difference and my test is unrealistically tiny.

This code inputs the arrays into the function after they're created (for device_put), resolving that confusion.

Also note, this code is crashing on my CPU. Works fine on GPU. Not sure why.

Thank you so much for Jax, Jit, and for your help.

import jax
import time
from jax import block_until_ready

# jax.config.update('jax_platform_name', 'cpu')

def test_f():
        global f
        _ = f(l)
        _ = block_until_ready(_)
        t = time.time()
        _ = f(l)
        _ = block_until_ready(_)
        t = time.time() - t
        print(t)
        
        g = jax.jit(f)
        
        _ = g(l)
        _ = block_until_ready(_)
        t = time.time()
        _ = g(l)
        _ = block_until_ready(_)
        t = time.time() - t
        print(t)

def f(l):
        x = 0
        for i in l:
                x = x + l[i]
        return x

l = {str(i):jax.device_put(i) for i in range(10000)}
l = block_until_ready(l)
print("jax device arrays, non-jit/jit")
_ = test_f()
_ = block_until_ready(_)
print()



# redefining f to be sure 
del f

def f(l):
        x = 0
        for i in l:
                x = x + l[i]
        return x

l = {str(i):i for i in range(10000)}
print("python scalars, non-jit/jit")
_ = test_f()

0 replies

Why does jit make this function 600 times slower? (bug!) #16934

Uh oh!

RezaRob Aug 2, 2023

Replies: 4 comments · 3 replies

Uh oh!

Uh oh!

jakevdp Aug 2, 2023 Maintainer

Uh oh!

RezaRob Aug 3, 2023 Author

Uh oh!

Uh oh!

jakevdp Aug 3, 2023 Maintainer

Uh oh!

RezaRob Aug 3, 2023 Author

Uh oh!

Uh oh!

jakevdp Aug 3, 2023 Maintainer

Uh oh!

Uh oh!

jakevdp Aug 3, 2023 Maintainer

Uh oh!

RezaRob Aug 3, 2023 Author

RezaRob
Aug 2, 2023

Replies: 4 comments 3 replies

jakevdp
Aug 2, 2023
Maintainer

RezaRob
Aug 3, 2023
Author

jakevdp Aug 3, 2023
Maintainer

RezaRob
Aug 3, 2023
Author

jakevdp Aug 3, 2023
Maintainer

jakevdp Aug 3, 2023
Maintainer

RezaRob
Aug 3, 2023
Author