Switching between local and global views when using xmap #7647

bloops · 2021-08-17T16:53:27Z

bloops
Aug 17, 2021

xmap provides a really nice API for distributed computing with automatic partitioning, while also expressing computations in a way that is agnostic to the sharding semantics. xmap along with XLA SPMD already feels like a superpower for distributed computation!

However, in some use cases it would be useful to have more control over the individual chunks present in each device. It would still be great to use xmap for the actual computations, and take advantage of the partitioning logic and XLA optimizations behind the scenes.

The primary use case I have in mind is a sequence of (very cheap) convolution and slicing operations, say around 10 of them. I know the required halo exchange widths in advance, so I would like to perform a single halo exchange upfront, manually, rather than one exchange per operation.
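To make the "one halo exchange upfront" idea concrete, here is a minimal single-process sketch in plain NumPy, not xmap. It assumes a 1-D array, periodic boundaries, odd-length kernels, and a total halo that is positive and no wider than one chunk; `halo_exchange`, `apply_kernels_locally`, and `apply_kernels_globally` are hypothetical helper names for illustration. Each "device" pads its chunk once with the summed halo width, then runs every convolution in 'valid' mode with no further communication.

```python
import numpy as np

def halo_exchange(chunks, halo):
    """Pad each chunk once with `halo` elements from its periodic neighbors.

    Assumes 0 < halo <= len(chunk), so one neighbor on each side suffices.
    """
    n = len(chunks)
    return [
        np.concatenate([chunks[(p - 1) % n][-halo:],
                        chunks[p],
                        chunks[(p + 1) % n][:halo]])
        for p in range(n)
    ]

def apply_kernels_locally(padded_chunks, kernels):
    """Run every kernel in 'valid' mode; each step consumes part of the halo."""
    out = padded_chunks
    for k in kernels:
        out = [np.convolve(c, k, mode="valid") for c in out]
    return out

def apply_kernels_globally(x, kernels):
    """Reference: periodic 'same' convolution on the unchunked array."""
    for k in kernels:
        h = (len(k) - 1) // 2
        x = np.convolve(np.concatenate([x[-h:], x, x[:h]]), k, mode="valid")
    return x

x = np.arange(16)
kernels = [np.array([1, 2, 1]), np.array([1, 0, -1])]
total_halo = sum((len(k) - 1) // 2 for k in kernels)  # 1 + 1 = 2

chunks = np.split(x, 4)  # 4 "devices", 4 elements each
local = apply_kernels_locally(halo_exchange(chunks, total_halo), kernels)
reference = apply_kernels_globally(x, kernels)
assert np.array_equal(np.concatenate(local), reference)
```

With non-periodic (for example zero-padded) boundaries, the chunks at the ends of the domain would need extra care, because intermediate results outside the domain are not zero.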

Another use case would be the one-shuffle scheme for distributed matrix multiplication, described in this paper: Lu et al., Large-Scale Discrete Fourier Transform on TPUs (2021). Briefly, the one-shuffle scheme cycles the row chunks of the RHS matrix across devices and does the matmul one chunk at a time to produce the output row chunk. This approach avoids an all-gather for the matmul.

To express such computations, one possible approach is to reinterpret a Sharded Device Array, such as one returned by xmap, in a way that is local to each core. This reinterpretation should not change the contents of the buffers on each device. Once the per-chunk xmap computations are done, we can switch back to the global view.

Here is the rough idea of what I have in mind. Of course the details need to be worked through and the new API should probably be more ergonomic and consistent.

input = jnp.arange(8 * 8, dtype=jnp.float32).reshape([8, 8])
mesh_devices = np.array(jax.devices()[:4]).reshape([2, 2])
with mesh(mesh_devices, ('x', 'y')):
  # work with 'global' view, axes 'i' and 'j' are each split into 2 chunks of size 4.
  output = xmap(global_func,
       in_axes=['i', 'j', ...],
       out_axes=['i', 'j', ...],
       axis_resources={'i': 'x', 'j': 'y'})(input)
  
  # switch to 'local' view. It has global shape [2, 2, 4, 4] and
  # per-device shape [1, 1, 4, 4]. No data movement takes place!
  output_local = reinterpret_as_local(output,
       in_axes=['i', 'j', ...],
       out_axes=['chunk_i', 'chunk_j', 'i', 'j', ...],
       axis_mapping={'i': 'chunk_i', 'j': 'chunk_j'},
       axis_resources={'i': 'x', 'j': 'y'})
 
  # work with local view. E.g. we can do pshuffle(x, 'chunk_i', p) to
  # exchange entire chunks (or slices of them) between replicas.
  output2 = xmap(halo_exchange,
       in_axes=['chunk_i', 'chunk_j', ...],
       out_axes=['chunk_i', 'chunk_j', ...],
       axis_resources={'chunk_i': 'x', 'chunk_j': 'y'})(output_local)
 
  # switch back to global view. It is back to having global shape [8, 8] 
  # and per-device shape [4, 4]. Again, no data movement takes place!
  output_global = reinterpret_as_global(output2,
       in_axes=['chunk_i', 'chunk_j', 'i', 'j', ...],
       out_axes=['i', 'j', ...],
       axis_mapping={'i': 'chunk_i', 'j': 'chunk_j'},
       axis_resources={'i': 'x', 'j': 'y'})
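The shape bookkeeping in the comments above can be checked with ordinary NumPy reshapes. This is only a sketch of the index relationship between the two views: `to_local` and `to_global` are hypothetical names, and unlike the proposed `reinterpret_as_local` / `reinterpret_as_global`, NumPy copies data here rather than relabeling device buffers.

```python
import numpy as np

def to_local(x, mesh_shape):
    """[I, J] -> [mx, my, I // mx, J // my]; block [a, b] is device (a, b)'s chunk."""
    mx, my = mesh_shape
    rows, cols = x.shape
    return x.reshape(mx, rows // mx, my, cols // my).transpose(0, 2, 1, 3)

def to_global(x_local):
    """Inverse of to_local: [mx, my, ci, cj] -> [mx * ci, my * cj]."""
    mx, my, ci, cj = x_local.shape
    return x_local.transpose(0, 2, 1, 3).reshape(mx * ci, my * cj)

x = np.arange(8 * 8, dtype=np.float32).reshape(8, 8)
x_local = to_local(x, mesh_shape=(2, 2))

assert x_local.shape == (2, 2, 4, 4)
# Device (1, 0) holds rows 4..7 and columns 0..3 of the global array.
assert np.array_equal(x_local[1, 0], x[4:8, 0:4])
# Switching back recovers the global view exactly.
assert np.array_equal(to_global(x_local), x)
```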

I think it is possible to achieve at least some of what this API provides by rebuilding the Sharded Device Array using the partitioning information that is already part of its bookkeeping.

But it would be great to have an official, fleshed-out version of this API. Does it make sense to add support for such local and global reinterpretations to JAX?

bloops · 2021-08-17T17:49:29Z

bloops
Aug 17, 2021
Author

Also tagging @shoyer and @jekbradbury who might be interested in this topic.

0 replies

apaszke · 2021-09-08T16:47:08Z

apaszke
Sep 8, 2021
Collaborator

This is a very good question, and in fact it has been raised multiple times in the past. While I am somewhat sympathetic to exposing this, I am also a bit afraid of the unintended consequences it would have on xmap. For example, what do you do when axis_resources also mentions sequential loops? Then your local chunk sizes get smaller, but more importantly not all local chunks can be resident in memory at the same time, so you can't make them into an array in the way you've outlined! I'd like to think about it more, but at this point I have no clue what the right solution is for that use case 😕

1 reply

bloops Sep 8, 2021
Author

I agree that this is a hard problem. I'll also think a bit more about the sequential loop resource axis.

To be more concrete (and avoid the X-Y problem), here is a use case that I am focusing on. It's the way of doing matmul that's described in the paper I linked in the OP. The result of the computation is the same no matter what the chunks are, which already aligns with the xmap specification. But of course, the actual computation performed depends on the chunking.

# Multiply a square matrix with a 'portrait' matrix (both chunked along rows).
# E.g. with 4 chunks along the 'j' axis for 'ij,jk->ik'.
#
# [X00 X01 X02 X03]    [Y0]   (device 0)
# [X10 X11 X12 X13] @  [Y1]   (device 1)
# [X20 X21 X22 X23]    [Y2]   (device 2)
# [X30 X31 X32 X33]    [Y3]   (device 3)
#
# We compute the product X @ Y in `num_chunks` phases. The first phase computes
# X00 @ Y0 on device 0, X11 @ Y1 on device 1, ... (Xpp @ Yp on device p). The second
# phase shuffles each Yi to the 'previous' device along the chunked axis, so the first
# device now computes X01 @ Y1, the second device X12 @ Y2, and so on, and each adds
# its product to its local result. In general, the `q`th phase (counting from 0) computes
# Xp(p+q) @ Y(p+q) on device `p`, with p+q taken modulo `num_chunks`, and accumulates
# the result locally. The final output of X @ Y is chunked in the same way across the 4 devices.

import numpy as np
from jax import lax

# xmap arguments: in_axes=({0: 'ci', 1: 'i', 3: 'j'}, ['ci', 'j', ...]),
# out_axes=['ci', 'i', ...], axis_resources={'ci': 'x'}
def matmul(a, b):
  num_chunks = lax.psum(1, 'ci')
  chunk_id = lax.axis_index('ci')
  result = None
  for q in range(num_chunks):
    addend = lax.pdot(a[(q + chunk_id) % num_chunks], b, axis_name='j')
    result = result + addend if result is not None else addend
    b = lax.pshuffle(b, 'ci', np.roll(np.arange(num_chunks), -1))
  return result

(Here's a colab which preps the matrix and calls the above function: https://colab.research.google.com/drive/1VbduSsnmof0CNWCpOSnmkyChTbFSfVzi?usp=sharing)
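For readers without a multi-device setup, the phase structure described in the comments above can be simulated in a single process with plain NumPy. This is a sketch under stated assumptions, not the xmap implementation: `one_shuffle_matmul` is a hypothetical name, a Python list stands in for the devices, and rotating that list stands in for `lax.pshuffle`.

```python
import numpy as np

def one_shuffle_matmul(x, y, num_chunks):
    """Simulate the one-shuffle scheme; x is [n, n], y is [n, k], n % num_chunks == 0."""
    n = x.shape[0]
    c = n // num_chunks
    x_rows = np.split(x, num_chunks, axis=0)    # "device" p holds row block X[p, :]
    y_chunks = np.split(y, num_chunks, axis=0)  # "device" p starts with Y[p]
    out_dtype = np.result_type(x, y)
    results = [np.zeros((c, y.shape[1]), dtype=out_dtype) for _ in range(num_chunks)]

    for q in range(num_chunks):
        for p in range(num_chunks):
            col = (p + q) % num_chunks
            # In phase q, device p holds Y[(p + q) % num_chunks] and
            # multiplies it with the matching column block of its rows.
            results[p] += x_rows[p][:, col * c:(col + 1) * c] @ y_chunks[p]
        # Shuffle: device p receives the chunk previously held by device p + 1.
        y_chunks = y_chunks[1:] + y_chunks[:1]

    return np.concatenate(results, axis=0)

x = np.arange(8 * 8).reshape(8, 8)
y = np.arange(8 * 3).reshape(8, 3)
assert np.array_equal(one_shuffle_matmul(x, y, num_chunks=4), x @ y)
```

Every device only ever holds one chunk of `y` at a time, which is the property that lets the scheme avoid an all-gather.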

Another (simpler) use case, which I think can be considered as an instance of this issue, is the 'prolongation' operator in algebraic multigrid. This is extending an array along a dimension by a factor of 2, while interpolating the new values from neighboring elements.

We can use jnp.repeat as a proxy for the essential aspect. Here, again, the final result is independent of the actual chunking. The operation can also be implemented by repeating every chunk locally and reassembling the 2x longer array with the same chunking strategy.

x = jnp.arange(4)        # [0,1,2,3]. Perhaps chunked across 2 devices as [0, 1] [2, 3]
print(jnp.repeat(x, 2))  # [0,0,1,1,2,2,3,3].  Should now be chunked across devices as [0, 0, 1, 1] [2, 2, 3, 3]
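The claim that repeating each chunk locally gives the same result as a global repeat is easy to check in a single process. A small NumPy sketch, with `chunked_repeat` as a hypothetical helper name and a list of arrays standing in for the per-device chunks:

```python
import numpy as np

def chunked_repeat(x, num_chunks, repeats):
    """Repeat each chunk locally; the chunks stay on their original 'devices'."""
    return [np.repeat(chunk, repeats) for chunk in np.split(x, num_chunks)]

x = np.arange(4)
local = chunked_repeat(x, num_chunks=2, repeats=2)

# Chunk 0 becomes [0, 0, 1, 1] and chunk 1 becomes [2, 2, 3, 3].
assert [c.tolist() for c in local] == [[0, 0, 1, 1], [2, 2, 3, 3]]
# Concatenating the local results matches the global repeat.
assert np.array_equal(np.concatenate(local), np.repeat(x, 2))
```

A real prolongation operator would also need values from neighboring chunks to interpolate at chunk boundaries, which this proxy does not cover.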
