Performance of `jax.grad` when taking the derivative with respect to many inputs #26562

EmilianoG-byte · 2025-02-17T16:01:12Z

EmilianoG-byte
Feb 17, 2025

Hi,

I am currently using jax.grad to take the derivative of a cost function $f(\mathbb{C}^{d_a}, \mathbb{C}^{d_b}, \mathbb{C}^{d_c}) \rightarrow \mathbb{R}$. With $d_a > d_b > d_c $

Initially, I was taking the derivative with respect to each input by calling jax.grad once for each input. Now, instead I am passing argnum = (0,1,2). A simple performance check with%timeit showed that the latter approach takes only approximately the same as the most expensive individual gradient (corresponding to $d_a$).

Now, my question is: why does this happen? Is there an underlying principle of automatic differentiation that explains this or is it mostly hyper-optimization of the jax internal functions that (for instance) reuses previously computed graphs or something else?.

Thank you for your help!

Note: for the comparisons I did not JIT any function to try and make it "fair".

Answered by jakevdp

Feb 18, 2025

The reason for this is likely that the gradient with respect to a single element requires just about as much computation as the gradient with respect to all three, because intermediate computations will be reused in the latter case. We can show this with a simplified example by looking at the jaxpr of the computation:

import jax

def f(x, y):
  return jnp.sin(x) * jnp.exp(y)

x, y = 2.0, 3.0

print(jax.make_jaxpr(jax.grad(f, argnums=0))(x, y))
# { lambda ; a:f64[] b:f64[]. let
#     c:f64[] = sin a
#     d:f64[] = cos a
#     e:f64[] = exp b
#     _:f64[] = mul c e
#     f:f64[] = mul 1.0 e
#     g:f64[] = mul f d
#   in (g,) }

print(jax.make_jaxpr(jax.grad(f, argnums=(0, 1)))(x, y))
# {…

View full answer

jakevdp · 2025-02-18T18:50:56Z

jakevdp
Feb 18, 2025
Maintainer

The reason for this is likely that the gradient with respect to a single element requires just about as much computation as the gradient with respect to all three, because intermediate computations will be reused in the latter case. We can show this with a simplified example by looking at the jaxpr of the computation:

import jax

def f(x, y):
  return jnp.sin(x) * jnp.exp(y)

x, y = 2.0, 3.0

print(jax.make_jaxpr(jax.grad(f, argnums=0))(x, y))
# { lambda ; a:f64[] b:f64[]. let
#     c:f64[] = sin a
#     d:f64[] = cos a
#     e:f64[] = exp b
#     _:f64[] = mul c e
#     f:f64[] = mul 1.0 e
#     g:f64[] = mul f d
#   in (g,) }

print(jax.make_jaxpr(jax.grad(f, argnums=(0, 1)))(x, y))
# { lambda ; a:f64[] b:f64[]. let
#     c:f64[] = sin a
#     d:f64[] = cos a
#     e:f64[] = exp b
#     _:f64[] = mul c e
#     f:f64[] = mul c 1.0
#     g:f64[] = mul 1.0 e
#     h:f64[] = mul f e
#     i:f64[] = mul g d
#   in (i, h) }

for argnums=0, we compute three expensive terms (sin, cos, exp) and three inexpensive multiplies. For argnums=(0, 1), we also require three expensive terms, and five multiplies. The cost of the two additional multiplications is dwarfed by the cost of the trig and exponential functions, and so we wouldn't expect the second approach to be all that more expensive than the first (basically the cost of a couple multiplies, plus a couple dispatches if not under JIT).

I suspect your case is similar: the bulk of the expensive computation applies to all three outputs, so computing all three gradients at once is not all that more expensive than computing a single gradient.

2 replies

EmilianoG-byte Feb 20, 2025
Author

Thank you so much for the prompt reply!

I see, this makes sense. If I understand correctly, it seems like one can argue that this reuse of the intermediate computations comes naturally from the way back propagation computes the derivatives, right?

jakevdp Feb 20, 2025
Maintainer

Yeah, exactly

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance of `jax.grad` when taking the derivative with respect to many inputs #26562

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Performance of jax.grad when taking the derivative with respect to many inputs #26562

Uh oh!

EmilianoG-byte Feb 17, 2025

Replies: 1 comment · 2 replies

Uh oh!

Uh oh!

jakevdp Feb 18, 2025 Maintainer

Uh oh!

EmilianoG-byte Feb 20, 2025 Author

Uh oh!

jakevdp Feb 20, 2025 Maintainer

Performance of `jax.grad` when taking the derivative with respect to many inputs #26562

EmilianoG-byte
Feb 17, 2025

Replies: 1 comment 2 replies

jakevdp
Feb 18, 2025
Maintainer

EmilianoG-byte Feb 20, 2025
Author

jakevdp Feb 20, 2025
Maintainer