Understanding behavior of reduce_axes in jax.grad #16847

vroulet · 2023-07-26T10:12:40Z

Hello!

Thank you very much for developing JAX, it's quite beautiful!
I wanted to use the reduce_axes option of jax.grad (also present in jax.value_and_grad and jax.vjp).
According to the documentation:

reduce_axes (Sequence[AxisName]) – Optional, tuple of axis names. If an axis is listed here, and fun implicitly broadcasts a value over that axis, the backward pass will perform a psum of the corresponding gradient. Otherwise, the gradient will be per-example over named axes. For example, if 'batch' is a named batch axis, grad(f, reduce_axes=('batch',)) will create a function that computes the total gradient while grad(f) will create one that computes the per-example gradient.

However, I still obtain per-example gradients even when passing reduce_axes. Here is a minimal example:

import os

# Set XLA_FLAGS before importing jax so the host platform exposes 8 devices.
os.environ['XLA_FLAGS'] = "--xla_force_host_platform_device_count=8"

import jax
from jax import numpy as jnp

def fun(x, a):
  return a*x

reduced_grad_fun = jax.grad(fun, reduce_axes=('batch',))

def manual_reduced_grad_fun(x, a):
  return jax.lax.psum(jax.grad(fun)(x, a), axis_name='batch')

x = jnp.asarray(1.)
a = jnp.arange(8)

g1 = jax.pmap(reduced_grad_fun, axis_name='batch', in_axes=(None, 0))(x, a)
g2 = jax.pmap(manual_reduced_grad_fun, axis_name='batch', in_axes=(None, 0))(x, a)

print(g1)
# Returns per-example gradients:
# [0. 1. 2. 3. 4. 5. 6. 7.] 
print(g2)
# Returns total gradient:
# [28. 28. 28. 28. 28. 28. 28. 28.]

I would have thought that the two implementations should match.
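As a cross-check that does not depend on having 8 devices, here is a minimal sketch of the total gradient I expected. It assumes jax.vmap with a named axis can stand in for jax.pmap for this purpose; the names total_grad and g_vmap are just illustrative.

```python
import jax
from jax import numpy as jnp

def fun(x, a):
  return a * x

x = jnp.asarray(1.)
a = jnp.arange(8)

# Reference 1: sum the per-example outputs, then differentiate w.r.t. the shared x.
total_grad = jax.grad(lambda x, a: jnp.sum(fun(x, a)))(x, a)

# Reference 2: per-example gradient followed by an explicit psum over the
# named axis, run under vmap instead of pmap so a single device suffices.
def manual_reduced_grad_fun(x, a):
  return jax.lax.psum(jax.grad(fun)(x, a), axis_name='batch')

g_vmap = jax.vmap(manual_reduced_grad_fun, axis_name='batch', in_axes=(None, 0))(x, a)

print(total_grad)  # 28.0
print(g_vmap)      # every entry is 28, matching g2 above
```

Both references give 28, which is the value I expected from reduce_axes=('batch',) as well.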

Any help would be greatly appreciated!

Replies: 0 comments

Select a reply

Uh oh!

Understanding behavior of reduce_axes in jax.grad #16847

Uh oh!

vroulet Jul 26, 2023

Replies: 0 comments

vroulet
Jul 26, 2023