-
Following the extending-jax repo and the JAX tutorial on custom primitives, I created a custom CUDA kernel that calculates the sum of a cumulative product. I have uploaded it here: https://github.com/frskplis/sumcumprod_jax. I am having trouble defining the batching rule for my primitive. The original extending-jax repo had a kernel that performed an element-wise operation, so its batching rule was simply the original primitive applied again. In my case, my CUDA kernel performs a calculation equivalent to the simplified code sketched below (see the repo for the exact version):
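(A minimal sketch, assuming — as the name suggests — that the operation reduces a 1D array to the scalar `np.sum(np.cumprod(x))`:)

```python
import numpy as np

def sumcumprod_reference(x):
    # Sum of the running (cumulative) product of a 1D array:
    # x[0] + x[0]*x[1] + x[0]*x[1]*x[2] + ...
    total = 0.0
    running = 1.0
    for value in x:
        running *= value
        total += running
    return total

# Equivalent one-liner with NumPy:
assert sumcumprod_reference(np.array([1.0, 2.0, 3.0])) == np.sum(np.cumprod([1.0, 2.0, 3.0]))
```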
My actual kernel is exposed to Python as the `sumcumprod` function in the repo, and I have verified that the two implementations give the same results on 1D inputs. I would like to use `vmap` to map this kernel over the first axis of a 2D array, so that each row is reduced independently, like this:
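(Here `sumcumprod` is the Python binding from my repo; the expected values below are computed with stock JAX ops, under the assumption that `sumcumprod` reduces a 1D array to a scalar:)

```python
import jax
import jax.numpy as jnp

x = jnp.arange(1.0, 7.0).reshape(2, 3)  # rows [1, 2, 3] and [4, 5, 6]

# Expected: one sum-of-cumulative-products per row.
# row 0: 1 + 1*2 + 1*2*3 = 9
# row 1: 4 + 4*5 + 4*5*6 = 144
expected = jnp.array([jnp.sum(jnp.cumprod(row)) for row in x])
print(expected)  # [  9. 144.]

# Desired usage with the custom primitive:
# jax.vmap(sumcumprod)(x)  # should match `expected`
```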
But for my kernel the `vmap`ped result is totally wrong. This is because I have an incorrect `_sumcumprod_batch` in the `sumcumprod_jax.py` source, so effectively there is no batching in my case. I tried to look at the batching rules in the JAX source code, but to no avail. Please help me: what changes should I make in order for this to work correctly? cc: @dfm
-
The semantics of your batching rule should be something like this:

```python
def _sumcumprod_batch(args, axes):
    x, = args
    bd, = axes
    # Move the batch dimension to the front, apply the primitive to each
    # slice along it, and report the output batch axis as 0.
    x = jnp.moveaxis(x, bd, 0)
    x_slices = [x[i] for i in range(x.shape[0])]
    result_slices = [sumcumprod(x_slice) for x_slice in x_slices]
    return jnp.stack(result_slices), 0
```
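To make JAX use it, the rule gets registered with the batching interpreter; a minimal sketch, assuming your primitive object is named `_sumcumprod_prim` in the style of the extending-jax template (adjust the name to whatever your repo uses):

```python
from jax.interpreters import batching

# `_sumcumprod_prim` is a placeholder for your jax.core.Primitive instance.
batching.primitive_batchers[_sumcumprod_prim] = _sumcumprod_batch
```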
I think if you plug this in, it will work for your case, but unfortunately it's not very efficient due to the `for` loop within the list comprehension. But unless you generalize your primitive to be closed under batching, it's hard to do much better than this.

What do I mean by "closed under batching"? Consider the example of computing a vector product: in JAX, a simple vector product is lowered to the `dot_general` primitive:

```python
x = jnp.ones(10)
y = jnp.ones(10)
jax.make_jaxpr(jnp.dot)(x, y)
# { lambda ; a:f32[10] b:f32[10]. let
# c:f32[] = dot_general[dimension_numbers=(([0], [0]), ([], []))] a b
# in (c,) }
```

The parameters to this call specify the dimensions to be contracted: `dimension_numbers=(([0], [0]), ([], []))` means that axis 0 of each operand is contracted, and there are no batch dimensions. Now what happens if we `vmap` this computation?

```python
x_batched = jnp.ones((100, 10))
y_batched = jnp.ones((100, 10))
jax.make_jaxpr(jax.vmap(jnp.dot))(x_batched, y_batched)
# { lambda ; a:f32[100,10] b:f32[100,10]. let
# c:f32[100] = dot_general[dimension_numbers=(([1], [1]), ([0], [0]))] a b
# in (c,) }
```

Again it's a single call to `dot_general`, just with different parameters: now axis 1 of each operand is contracted, and axis 0 is treated as a batch dimension. Thus `dot_general` is closed under batching: the batched version of the operation is expressible as a single call to the same primitive with adjusted parameters, with no loop over the batch.

Now back to your primitive: it does not look like your primitive is closed under batching, and this is probably something it inherits from how your CUDA kernel is implemented. So in this case there's not really any obvious efficient way to express the batched operation. So your options are:

1. Keep the batching rule above and accept the Python-level loop over the batch dimension.
2. Express the operation using existing JAX primitives that are already closed under batching (e.g. `jnp.cumprod` followed by a sum), at the cost of bypassing your custom kernel.
3. Generalize your primitive and the underlying CUDA kernel to accept a batch dimension, so that the batched operation can be expressed as a single call back into the primitive.
Option 3 is probably the best, but unfortunately might also take much more work.
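For concreteness, a rough sketch of what the batching rule could become under option 3, assuming a hypothetical generalized binding `sumcumprod2d` whose kernel reduces each row of a 2D input in one launch:

```python
def _sumcumprod_batch_closed(args, axes):
    x, = args
    bd, = axes
    # With a kernel that natively handles a leading batch dimension, the
    # batched primitive is a single dispatch: no Python-level loop remains.
    x = jnp.moveaxis(x, bd, 0)
    return sumcumprod2d(x), 0  # `sumcumprod2d` is hypothetical
```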