Comment: Having a separate team ... On a related note, how will GPU mappings be defined? Will they be determined implicitly by the runtime, explicitly by the user, or something else?
Goals
The big questions
Background
NVIDIA, AMD, and Intel have each provided a SHMEM implementation for GPUs.
Documentation:
nvSHMEM: NVSHMEM
rocSHMEM: rocm/rocSHMEM
Intel SHMEM: Intel SHMEM
Discussion
nvSHMEM works with CUDA environments on NVIDIA GPUs. rocSHMEM works with ROCm environments on AMD GPUs. Intel SHMEM works with SYCL environments on Intel Data Center GPU Max devices. Each uses a single symmetric heap in GPU memory and supports GPU-initiated operations.
Requirements
These are notional requirements for a spaces proposal.
Pressing Needs
GPU Symmetric Heap. GPU algorithms need to use GPU memory for performance. GPU environments permit memory to be copied back and forth between host and device, and may permit memory-mapped access from host code to device memory and from device code to host memory, but such operations and accesses have relatively high latency. To distribute algorithms across multiple GPUs, we should support RMA, AMO, and collective operations targeting GPU memory. The easiest way to do that is to put a symmetric heap in GPU memory, as sketched below.
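As a hedged illustration of what this requirement enables (the SHMEM_MALLOC_GPU_MEM hint below is a placeholder, not existing or proposed spec text): a buffer allocated from a GPU symmetric heap can be targeted directly by ordinary RMA, with no staging copy through host memory.

```c
/* Sketch only: SHMEM_MALLOC_GPU_MEM is a hypothetical allocation hint standing
 * in for "allocate from the GPU symmetric heap", however the proposal
 * ultimately spells that. */
#include <shmem.h>

void exchange_block(size_t nbytes, int partner_pe)
{
    /* Symmetric, collective allocation that (hypothetically) lands in GPU
     * memory on every PE. */
    void *gpu_buf = shmem_malloc_with_hints(nbytes, SHMEM_MALLOC_GPU_MEM);

    /* The RMA targets the partner's GPU-resident symmetric object directly;
     * no bounce through a host-side staging buffer is required. */
    shmem_putmem(gpu_buf, gpu_buf, nbytes, partner_pe);
    shmem_quiet();

    shmem_free(gpu_buf);
}
```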
User Cited Needs
Data structures in remotely accessible memory
Non-uniform memory nodes
Other Ideas
Support for fabric attached memory (FAM)
Support for CXL memory
For these, memory may be remotely accessible (RMA-able) but not symmetric, and only a subset of PEs may have RMA access.
The Proposal
This follows from the slide presentation given at the working group meeting March 27, 2025.
March 27, 2025 Presentation
As much as possible, we'd like to use existing SHMEM mechanisms, such as teams and contexts, in ways that are consistent with current practice. We shouldn't assign new semantics to an old concept.
Key Points of the proposal
The host symmetric heap will be predefined, as now, and associated with SHMEM_TEAM_WORLD and with the default host context.
A new GPU symmetric heap will be predefined, similar to but distinct from the host symmetric heap, and associated with a new SHMEM_TEAM_GPU.
Add predefined teams SHMEM_TEAM_GPU and SHMEM_TEAM_GPU_SHARED.
SHMEM_TEAM_GPU refers to all PEs in the job that are using GPUs, whether or not those GPUs have load/store access to each other's memory. SHMEM_TEAM_GPU_SHARED is analogous to SHMEM_TEAM_SHARED and refers to PEs whose GPUs have load/store access to each other's memory via an inter-GPU fabric such as NVLink, Xe Link, or Infinity Fabric. There may be more than one SHMEM_TEAM_GPU_SHARED in the same job, covering disjoint subsets of SHMEM_TEAM_GPU.
The proposal associates a particular heap with each team. A context specifies an atomicity domain for the heap associated with the team from which the context derives. Contexts provide for independent progress, but that is incidental here.
GPU code that does not specify a context will default to using SHMEM_TEAM_GPU_SHARED. In that case, the code can safely assume that (a) addresses are in the GPU symmetric heap and are accessible via load/store, and (b) GPU atomic instructions work. These assumptions allow for short code paths and peak performance in the common case.
GPU code that wishes to reach PEs in SHMEM_TEAM_GPU but outside SHMEM_TEAM_GPU_SHARED must do so with an explicit context (see the sketch after this list). This is more expensive, but it makes clear that communication may need host or NIC assistance, that the atomicity domain is different, and that GPU atomics cannot safely be used. Access to the GPU heap outside SHMEM_TEAM_GPU_SHARED will have to use the context versions of the APIs, and access to the host symmetric heap from the GPU will likewise have to use the context versions of the APIs.
Host operations that wish to target GPU memory would have to use the context versions of the APIs, but see below.
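Here is a minimal sketch of how that split might look to GPU-resident code, assuming the proposed SHMEM_TEAM_GPU and SHMEM_TEAM_GPU_SHARED teams and assuming the standard team-based context routine shmem_team_create_ctx carries over to device code; the function names are illustrative, and none of this is settled specification text.

```c
/* Sketch only. SHMEM_TEAM_GPU and SHMEM_TEAM_GPU_SHARED are the teams proposed
 * above; they do not exist in any released OpenSHMEM. In a real GPU build these
 * would be device-resident functions (e.g. __device__ in CUDA); plain C is used
 * here only to show the intended call shapes. */
#include <shmem.h>

/* Common case: the peer is inside SHMEM_TEAM_GPU_SHARED, so the context-less
 * API defaults to the GPU shared team. Under the proposal, the runtime may
 * lower these calls to plain stores over the GPU fabric and use native GPU
 * atomics. */
void fast_path(long *gpu_sym, long val, int shared_peer)
{
    shmem_long_p(gpu_sym, val, shared_peer);
    shmem_long_atomic_add(gpu_sym, 1L, shared_peer);
}

/* Reaching a GPU PE outside the shared team: an explicit context derived from
 * SHMEM_TEAM_GPU is required, signalling that the operation may need host or
 * NIC assistance and that GPU-native atomics are not safe for this target.
 * (Real code would create the context once and reuse it.) */
void slow_path(long *gpu_sym, long val, int remote_gpu_pe)
{
    shmem_ctx_t ctx;
    if (shmem_team_create_ctx(SHMEM_TEAM_GPU, 0, &ctx) != 0)
        return; /* context creation failed */

    shmem_ctx_long_p(ctx, gpu_sym, val, remote_gpu_pe);
    shmem_ctx_long_atomic_add(ctx, gpu_sym, 1L, remote_gpu_pe);
    shmem_ctx_quiet(ctx);
    shmem_ctx_destroy(ctx);
}
```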
Host operations
In nvSHMEM, rocSHMEM, and Intel SHMEM, I believe that host operations can target GPU memory or host memory freely, and the runtime can tell which is which by looking at the pointer: the virtual address ranges are guaranteed not to overlap. I wish to argue that OpenSHMEM should not support this model, for the following reasons:
Atomicity. I think the host generally has no mechanism to do atomic operations on memory in SHMEM_TEAM_GPU_SHARED. In addition, since the atomicity domain for a context derived from SHMEM_TEAM_WORLD may differ from the atomicity domain for a context derived from SHMEM_TEAM_GPU, you cannot refer to both of them using the AMOs that do not specify a context.
Collectives. The team-based collectives say that you can only have one collective underway at a time in a particular team. I think most implementations implement collectives using psync structures in the team data structures, and that there is generally no way to synchronize a GPU-initiated collective with a host-initiated collective, so the two may be running simultaneously. Consequently they should use different teams. In fact, this suggests that the host version of SHMEM_TEAM_GPU should be distinct from the GPU version, so that their collectives can be independent.
RMA. I think that in the present implementations, RMA to host symmetric heap addresses and RMA to GPU symmetric heap addresses could both work using the RMA operations that do not specify a context. I argue this is undesirable, though, because (a) requiring a context doesn't cost much for host threads, and (b) it reminds users that host memory and GPU memory are semantically different. It would be odd to require a context for AMO but not for RMA, for example.
Could a host operation reference the host symmetric heap on a remote PE and also reference local GPU memory? Yes! The present spec makes clear that the "local" address in an RMA can be any local address, whether global, heap, stack, or symmetric heap. If the local GPU memory is accessible to the host, then the host can use it as the source of an RMA or as the local address in a fetching AMO. This should work fine; see the sketch below.
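A hedged sketch of the host side under these rules, assuming the proposed SHMEM_TEAM_GPU team and reusing the existing context APIs; gpu_sym, host_stage, and local_gpu_buf are illustrative names, not spec text.

```c
/* Sketch only: host-side code under the proposed rules. gpu_sym is assumed to
 * be an address in the proposed GPU symmetric heap, host_stage an address in
 * the ordinary host symmetric heap, and local_gpu_buf a host-accessible local
 * GPU buffer. */
#include <shmem.h>

void host_side(double *gpu_sym, double *host_stage, double *local_gpu_buf,
               size_t n, int target_pe)
{
    /* Host access to the GPU symmetric heap goes through an explicit context
     * derived from (the host's view of) SHMEM_TEAM_GPU, making the different
     * atomicity domain and any NIC/host assistance explicit. */
    shmem_ctx_t gpu_ctx;
    if (shmem_team_create_ctx(SHMEM_TEAM_GPU, 0, &gpu_ctx) == 0) {
        shmem_ctx_double_put(gpu_ctx, gpu_sym, host_stage, n, target_pe);
        shmem_ctx_quiet(gpu_ctx);
        shmem_ctx_destroy(gpu_ctx);
    }

    /* The context-less API still targets the host symmetric heap, and the
     * *local* (source) buffer may be host-accessible GPU memory: the spec
     * only constrains the remote address, not the local one. */
    shmem_double_put(host_stage, local_gpu_buf, n, target_pe);
    shmem_quiet();
}
```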
Defer additional specification changes
Big Questions section
Allocation
Should we add a set of teams-based allocation calls or can we just use hints?
Larry thinks the hints mechanism was intended to hint at the intended use of the memory, such as for atomics, rather than to specify which memory to use, such as the GPU heap. But maybe that is the same thing?
Having teams-based allocation permits future spaces that are local to a team, such as host-based fabric-attached memory or whatever. Having the team makes it clear which PEs have to participate in collective allocation for such a space; you can't do that with hints. Both options are sketched below.
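For illustration, here is what the two options might look like; shmem_team_malloc and the SHMEM_MALLOC_GPU_MEM hint are both hypothetical spellings, not existing API.

```c
/* Sketch only: two hypothetical spellings of "allocate from the GPU space".
 * Neither shmem_team_malloc nor SHMEM_MALLOC_GPU_MEM exists today; they stand
 * in for the two options discussed above. */
#include <shmem.h>

void *alloc_gpu_space_hint(size_t nbytes)
{
    /* Option A: hints. Reuses the existing shmem_malloc_with_hints entry
     * point, but the hint now selects *which* memory rather than how it will
     * be used, and every PE in the job must participate. */
    return shmem_malloc_with_hints(nbytes, SHMEM_MALLOC_GPU_MEM);
}

void *alloc_gpu_space_team(shmem_team_t team, size_t nbytes)
{
    /* Option B: team-based allocation. The team names exactly the set of PEs
     * that must participate in the collective allocation, which generalizes
     * to future team-local spaces such as fabric-attached memory. */
    return shmem_team_malloc(team, nbytes);   /* hypothetical API */
}
```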
Design Considerations
Code path length for GPU initiated operations
GPU threads are slower than host threads. Extra function arguments are expensive as they add register pressure.
These considerations argue for setting the defaults for GPU code to the most common situation: GPU-initiated operations to the GPU symmetric heap in SHMEM_TEAM_GPU_SHARED. Such functions can translate OpenSHMEM RMA and AMO routines that do not have a context argument into load/store operations over direct GPU-to-GPU fabrics, as sketched below.
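As a rough illustration only: a context-less device-side put inside SHMEM_TEAM_GPU_SHARED could lower to little more than an address translation and a store. The base-address table below is a made-up implementation detail, not part of the proposal.

```c
/* Sketch only: an illustrative lowering of a context-less put when the target
 * PE is in SHMEM_TEAM_GPU_SHARED. The per-peer base-address table is a
 * hypothetical implementation detail, standing in for whatever mechanism a
 * runtime actually uses to map a symmetric offset to a fabric-visible pointer.
 * In practice this would be a device function. */
#include <stddef.h>

extern char *gpu_heap_base[];    /* hypothetical: mapped GPU heap base of each shared peer */
extern char *my_gpu_heap_base;   /* hypothetical: this PE's GPU heap base */

static inline void gpu_shared_put_long(long *dest, long value, int shared_peer)
{
    /* Translate the symmetric address into the peer's mapped address ... */
    ptrdiff_t offset = (char *)dest - my_gpu_heap_base;
    long *remote = (long *)(gpu_heap_base[shared_peer] + offset);

    /* ... and the "put" is just a store over the GPU-to-GPU fabric: no
     * context argument, no protocol selection, minimal register pressure. */
    *remote = value;
}
```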
Issues
Previous Work
Spaces proposal
FAM Ideas
Spaces proposal