Comment: Having a separate team ... On a related note, how will GPU mappings be defined? Will they be determined implicitly by the runtime, explicitly by the user, or something else?
Goals
The big questions
Background
NVIDIA, AMD, and Intel have each provided a SHMEM implementation for GPUs.
Documentation:
nvSHMEM: NVSHMEM
rocSHMEM: rocm/rocSHMEM
Intel SHMEM: Intel SHMEM
Discussion
nvSHMEM works with CUDA environments on NVIDIA GPUs. rocSHMEM works with ROCm environments on AMD GPUs. Intel SHMEM works with SYCL environments on Intel Data Center GPU Max devices. Each uses a single symmetric heap in GPU memory and supports GPU-initiated operations.
Requirements
These are notional requirements for a spaces proposal.
Pressing Needs
GPU Symmetric Heap. GPU algorithms need to use GPU memory for performance. GPU environments permit memory to be copied back and forth between host and device, and may permit memory-mapped access from host code to device memory and from device code to host memory, but such operations and accesses have relatively high latency. To distribute algorithms across multiple GPUs, we should support RMA, AMO, and collective operations targeting GPU memory. The easiest way to do that is to put a symmetric heap in GPU memory, as sketched below.
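As a hedged illustration of what this requirement enables (the SHMEM_MALLOC_GPU_MEM hint below is a placeholder, not existing or proposed spec text): a buffer allocated from a GPU symmetric heap can be targeted directly by ordinary RMA, with no staging copy through host memory.

```c
/* Sketch only: SHMEM_MALLOC_GPU_MEM is a hypothetical allocation hint standing
 * in for "allocate from the GPU symmetric heap", however the proposal
 * ultimately spells that. */
#include <shmem.h>

void exchange_block(size_t nbytes, int partner_pe)
{
    /* Symmetric, collective allocation that (hypothetically) lands in GPU
     * memory on every PE. */
    void *gpu_buf = shmem_malloc_with_hints(nbytes, SHMEM_MALLOC_GPU_MEM);

    /* The RMA targets the partner's GPU-resident symmetric object directly;
     * no bounce through a host-side staging buffer is required. */
    shmem_putmem(gpu_buf, gpu_buf, nbytes, partner_pe);
    shmem_quiet();

    shmem_free(gpu_buf);
}
```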
User Cited Needs
Data structures in remotely accessible memory
Non-uniform memory nodes
Other Ideas
Support for fabric attached memory (FAM)
Support for CXL memory
For these, memory may be remotely accessible (RMA-able) but not symmetric, and only a subset of PEs may have RMA access.
The Proposal
This follows from the slide presentation given at the working group meeting March 27, 2025.
March 27, 2025 Presentation
As much as possible, we'd like to use existing SHMEM mechanisms, such as teams and contexts, in ways that are consistent with current practice. We shouldn't assign new semantics to an old concept.
Key Points of the proposal
The host symmetric heap will be predefined, as now, and associated with SHMEM_TEAM_WORLD and with the default host context.
A new GPU symmetric heap will be predefined, similar to but distinct from the host symmetric heap, and associated with a new SHMEM_TEAM_GPU.
Add predefined teams SHMEM_TEAM_GPU and SHMEM_TEAM_GPU_SHARED.
SHMEM_TEAM_GPU refers to all PEs in the job that are using GPUs, whether or not those GPUs have load/store access to each other's memory. SHMEM_TEAM_GPU_SHARED is analogous to SHMEM_TEAM_SHARED and refers to PEs whose GPUs have load/store access to each other's memory via an inter-GPU fabric such as NVLink, Xe Link, or Infinity Fabric. There may be more than one SHMEM_TEAM_GPU_SHARED in the same job, covering disjoint subsets of SHMEM_TEAM_GPU.
The proposal associates a particular heap with each team. A context specifies an atomicity domain for the heap associated with the team from which the context derives. Contexts provide for independent progress, but that is incidental here.
GPU code that does not specify a context will default to using SHMEM_TEAM_GPU_SHARED. In that case, the code can safely assume that (a) addresses are in the GPU symmetric heap and are accessible via load/store, and (b) GPU atomic instructions work. These assumptions allow for short code paths and peak performance in the common case.
GPU code that wishes to reach PEs in SHMEM_TEAM_GPU but outside SHMEM_TEAM_GPU_SHARED must do so with an explicit context (see the sketch after this list). This is more expensive, but it makes clear that communication may need host or NIC assistance, that the atomicity domain is different, and that GPU atomics cannot safely be used. Access to the GPU heap outside SHMEM_TEAM_GPU_SHARED will have to use the context versions of the APIs, and access to the host symmetric heap from the GPU will likewise have to use the context versions of the APIs.
Host operations that wish to target GPU memory would have to use the context versions of the APIs, but see below.
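Here is a minimal sketch of how that split might look to GPU-resident code, assuming the proposed SHMEM_TEAM_GPU and SHMEM_TEAM_GPU_SHARED teams and assuming the standard team-based context routine shmem_team_create_ctx carries over to device code; the function names are illustrative, and none of this is settled specification text.

```c
/* Sketch only. SHMEM_TEAM_GPU and SHMEM_TEAM_GPU_SHARED are the teams proposed
 * above; they do not exist in any released OpenSHMEM. In a real GPU build these
 * would be device-resident functions (e.g. __device__ in CUDA); plain C is used
 * here only to show the intended call shapes. */
#include <shmem.h>

/* Common case: the peer is inside SHMEM_TEAM_GPU_SHARED, so the context-less
 * API defaults to the GPU shared team. Under the proposal, the runtime may
 * lower these calls to plain stores over the GPU fabric and use native GPU
 * atomics. */
void fast_path(long *gpu_sym, long val, int shared_peer)
{
    shmem_long_p(gpu_sym, val, shared_peer);
    shmem_long_atomic_add(gpu_sym, 1L, shared_peer);
}

/* Reaching a GPU PE outside the shared team: an explicit context derived from
 * SHMEM_TEAM_GPU is required, signalling that the operation may need host or
 * NIC assistance and that GPU-native atomics are not safe for this target.
 * (Real code would create the context once and reuse it.) */
void slow_path(long *gpu_sym, long val, int remote_gpu_pe)
{
    shmem_ctx_t ctx;
    if (shmem_team_create_ctx(SHMEM_TEAM_GPU, 0, &ctx) != 0)
        return; /* context creation failed */

    shmem_ctx_long_p(ctx, gpu_sym, val, remote_gpu_pe);
    shmem_ctx_long_atomic_add(ctx, gpu_sym, 1L, remote_gpu_pe);
    shmem_ctx_quiet(ctx);
    shmem_ctx_destroy(ctx);
}
```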
Host operations
In nvSHMEM, rocSHMEM, and Intel SHMEM, I believe that host operations can target GPU memory or host memory freely, and the runtime can tell which is which by looking at the pointer: the virtual address ranges are guaranteed not to overlap. I wish to argue that OpenSHMEM should not support this model, for the following reasons:
Atomicity. I think the host generally has no mechanism to do atomic operations on memory in SHMEM_TEAM_GPU_SHARED. In addition, since the atomicity domain for a context derived from SHMEM_TEAM_WORLD may differ from the atomicity domain for a context derived from SHMEM_TEAM_GPU, you cannot refer to both of them using the AMOs that do not specify a context.
Collectives. The team-based collectives say that you can only have one collective underway at a time in a particular team. I think most implementations implement collectives using psync structures in the team data structures, and that there is generally no way to synchronize a GPU-initiated collective with a host-initiated collective, so the two may be running simultaneously. Consequently they should use different teams. In fact, this suggests that the host version of SHMEM_TEAM_GPU should be distinct from the GPU version, so that their collectives can be independent.
RMA. I think that in the present implementations, RMA to host symmetric heap addresses and RMA to GPU symmetric heap addresses could both work using the RMA operations that do not specify a context. I argue this is undesirable, though, because (a) requiring a context doesn't cost much for host threads, and (b) it reminds users that host memory and GPU memory are semantically different. It would be odd to require a context for AMO but not for RMA, for example.
Could a host operation reference the host symmetric heap on a remote PE and also reference local GPU memory? Yes! The present spec makes clear that the "local" address in an RMA can be any local address, whether global, heap, stack, or symmetric heap. If the local GPU memory is accessible to the host, then the host can use it as the source of an RMA or as the local address in a fetching AMO. This should work fine; see the sketch below.
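A hedged sketch of the host side under these rules, assuming the proposed SHMEM_TEAM_GPU team and reusing the existing context APIs; gpu_sym, host_stage, and local_gpu_buf are illustrative names, not spec text.

```c
/* Sketch only: host-side code under the proposed rules. gpu_sym is assumed to
 * be an address in the proposed GPU symmetric heap, host_stage an address in
 * the ordinary host symmetric heap, and local_gpu_buf a host-accessible local
 * GPU buffer. */
#include <shmem.h>

void host_side(double *gpu_sym, double *host_stage, double *local_gpu_buf,
               size_t n, int target_pe)
{
    /* Host access to the GPU symmetric heap goes through an explicit context
     * derived from (the host's view of) SHMEM_TEAM_GPU, making the different
     * atomicity domain and any NIC/host assistance explicit. */
    shmem_ctx_t gpu_ctx;
    if (shmem_team_create_ctx(SHMEM_TEAM_GPU, 0, &gpu_ctx) == 0) {
        shmem_ctx_double_put(gpu_ctx, gpu_sym, host_stage, n, target_pe);
        shmem_ctx_quiet(gpu_ctx);
        shmem_ctx_destroy(gpu_ctx);
    }

    /* The context-less API still targets the host symmetric heap, and the
     * *local* (source) buffer may be host-accessible GPU memory: the spec
     * only constrains the remote address, not the local one. */
    shmem_double_put(host_stage, local_gpu_buf, n, target_pe);
    shmem_quiet();
}
```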
Defer additional specification changes
Big Questions section
Allocation
Should we add a set of teams-based allocation calls or can we just use hints?
Larry thinks the hints mechanism was intended to hint at the intended use of the memory, such as for atomics, rather than to specify which memory to use, such as the GPU heap. But maybe that is the same thing?
Having teams-based allocation permits future spaces that are local to a team, such as host-based fabric-attached memory or whatever. Having the team makes it clear which PEs have to participate in collective allocation for such a space; you can't do that with hints. Both options are sketched below.
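For illustration, here is what the two options might look like; shmem_team_malloc and the SHMEM_MALLOC_GPU_MEM hint are both hypothetical spellings, not existing API.

```c
/* Sketch only: two hypothetical spellings of "allocate from the GPU space".
 * Neither shmem_team_malloc nor SHMEM_MALLOC_GPU_MEM exists today; they stand
 * in for the two options discussed above. */
#include <shmem.h>

void *alloc_gpu_space_hint(size_t nbytes)
{
    /* Option A: hints. Reuses the existing shmem_malloc_with_hints entry
     * point, but the hint now selects *which* memory rather than how it will
     * be used, and every PE in the job must participate. */
    return shmem_malloc_with_hints(nbytes, SHMEM_MALLOC_GPU_MEM);
}

void *alloc_gpu_space_team(shmem_team_t team, size_t nbytes)
{
    /* Option B: team-based allocation. The team names exactly the set of PEs
     * that must participate in the collective allocation, which generalizes
     * to future team-local spaces such as fabric-attached memory. */
    return shmem_team_malloc(team, nbytes);   /* hypothetical API */
}
```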
Design Considerations
Code path length for GPU initiated operations
GPU threads are slower than host threads. Extra function arguments are expensive as they add register pressure.
These considerations argue for setting the defaults for GPU code to the most common situation: GPU-initiated operations to the GPU symmetric heap in SHMEM_TEAM_GPU_SHARED. Such functions can translate OpenSHMEM RMA and AMO routines that do not have a context argument into load/store operations over direct GPU-to-GPU fabrics, as sketched below.
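As a rough illustration only: a context-less device-side put inside SHMEM_TEAM_GPU_SHARED could lower to little more than an address translation and a store. The base-address table below is a made-up implementation detail, not part of the proposal.

```c
/* Sketch only: an illustrative lowering of a context-less put when the target
 * PE is in SHMEM_TEAM_GPU_SHARED. The per-peer base-address table is a
 * hypothetical implementation detail, standing in for whatever mechanism a
 * runtime actually uses to map a symmetric offset to a fabric-visible pointer.
 * In practice this would be a device function. */
#include <stddef.h>

extern char *gpu_heap_base[];    /* hypothetical: mapped GPU heap base of each shared peer */
extern char *my_gpu_heap_base;   /* hypothetical: this PE's GPU heap base */

static inline void gpu_shared_put_long(long *dest, long value, int shared_peer)
{
    /* Translate the symmetric address into the peer's mapped address ... */
    ptrdiff_t offset = (char *)dest - my_gpu_heap_base;
    long *remote = (long *)(gpu_heap_base[shared_peer] + offset);

    /* ... and the "put" is just a store over the GPU-to-GPU fabric: no
     * context argument, no protocol selection, minimal register pressure. */
    *remote = value;
}
```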
Issues
Previous Work
Spaces proposal
FAM Ideas
Spaces proposal