
Use CUDA events instead of CUDA device/stream synchronization #225


Merged: 56 commits into rapidsai:branch-25.06, May 5, 2025

Conversation

@pentschev (Member) commented Apr 24, 2025:

The changes contained here resolve #216, removing the need to synchronize CUDA devices and streams and instead relying on CUDA events to determine whether Buffer operations have completed. At a high level, the changes:

  1. Add CUDA event-based tracking for Buffer device operations, implementing a new is_ready() method that ensures the CUDA allocation and/or copy has completed on the stream (see the sketch after this list).
  2. Introduce a new shared Chunk::Event used to verify that all chunks inserted into the Shuffler have reached the event, ensuring the data is ready to be consumed.
  3. Use the new is_ready() method to validate all Buffers before they are consumed by both send() and recv() operations.
  4. Update the contract of the Communicator's send()/recv() operations: the caller is now responsible for checking Buffer::is_ready() before making the call.
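As a rough sketch of the mechanism (illustrative only: member names and error handling are simplified relative to the actual Buffer implementation):

#include <cuda_runtime.h>

class Buffer {
  public:
    // Record the completion point of an async allocation/copy on `stream`.
    void record_event(cudaStream_t stream) {
        if (cuda_event_ == nullptr) {
            // cudaEventDisableTiming keeps the event as lightweight as possible.
            cudaEventCreateWithFlags(&cuda_event_, cudaEventDisableTiming);
        }
        cudaEventRecord(cuda_event_, stream);
    }

    // True once all work recorded on the event has completed, or if no
    // stream-ordered work was ever recorded (e.g. host-only buffers).
    [[nodiscard]] bool is_ready() const {
        if (cuda_event_ == nullptr) {
            return true;
        }
        // cudaErrorNotReady means the work is still in flight; a real
        // implementation should surface any other error rather than hide it.
        return cudaEventQuery(cuda_event_) == cudaSuccess;
    }

  private:
    cudaEvent_t cuda_event_{nullptr};
};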

@pentschev added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels on Apr 24, 2025
@pentschev pentschev requested a review from a team as a code owner April 24, 2025 12:01
@wence- (Contributor) left a comment:

I am not very sold on this approach, I have to say. Can we make the comms routines stream-ordered by passing streams in, recording an event and then waiting on the event before going into the library?

@@ -208,6 +225,8 @@ class Buffer {
/// @brief The underlying storage host memory or device memory buffer (where
/// applicable).
StorageT storage_;
/// @brief CUDA event used to track copy operations
cudaEvent_t cuda_event_;
Contributor:

I'm not sure I really like this design. This produces one event per buffer that we create whereas really we only need one event per stream that we see.

Very minimally, we should definitely create the events with cudaEventDisableTiming so they are as lightweight as possible.
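For reference, that amounts to creating the event as (using the project's RAPIDSMPF_CUDA_TRY error-checking macro):

cudaEvent_t event;
RAPIDSMPF_CUDA_TRY(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));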

@pentschev (Member Author):

I'm not sure I really like this design. This produces one event per buffer that we create whereas really we only need one event per stream that we see.

That would only shift what we need to track, wouldn't it? If we had one event per stream, how would we know if a buffer was created before or after the event? From the check's perspective it could have happened either before or after, and we would go back to having potentially invalid memory accesses.

Very minimally, we should definitely create the events with cudaEventDisableTiming so they are as lightweight as possible.

Thanks for the suggestion, I'll do that.

Contributor:

cudaEventRecord records the state of a stream (i.e. any outstanding work on the stream is "noted") and then cudaEventSynchronize waits for completion of all that work. So:

do_some_allocation(stream)

cudaEventRecord(event, stream)

....
cudaEventSynchronize(event)

// allocation guaranteed to have completed

no?

@pentschev (Member Author):

Yes, and this is why we have one event per buffer. If we had one event per stream we could end up with:

buf1 = do_some_allocation(stream)

cudaEventRecord(event, stream)

buf2 = do_some_allocation(stream)

cudaEventSynchronize(event)

// buf1 is guaranteed to have completed, but buf2 isn't

That is what I don't think will work, or are you suggesting something different and I misunderstood the suggestion?

@pentschev (Member Author):

Very minimally, we should definitely create the events with cudaEventDisableTiming so they are as lightweight as possible.

Thanks for the suggestion, I'll do that.

This is now done in 1180b8e.

Contributor:

buf1 = do_some_allocation(stream)

cudaEventRecord(event, stream)

buf2 = do_some_allocation(stream)

cudaEventRecord(event, stream)
cudaEventSynchronize(event)

// now buf1 and buf2 are guaranteed completed

@pentschev (Member Author):

But in that case aren't we just blocking until all buffers are completed? This would potentially prevent us from progressing buf1 until buf2 has completed too, which could increase memory pressure, so I'm not really sure the benefits outweigh the costs.

@pentschev (Member Author):

An additional complication is that you would also need to track the streams and events globally; otherwise, given an arbitrary Buffer, we couldn't know which stream it is being allocated/copied on, nor where that stream's event lives. IOW, we would need some sort of manager, visible everywhere, to track streams and events.
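For illustration, such a manager would look roughly like the following (hypothetical: this is the design being argued against, not anything implemented in this PR):

#include <cuda_runtime.h>
#include <mutex>
#include <unordered_map>

// Hypothetical one-event-per-stream tracker that would need to be globally
// visible to every producer and consumer of Buffers.
class StreamEventTracker {
  public:
    // (Re-)record the tracking event for `stream` after enqueuing new work.
    void record(cudaStream_t stream) {
        std::lock_guard<std::mutex> lock(mutex_);
        cudaEvent_t& event = events_[stream];
        if (event == nullptr) {
            cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
        }
        cudaEventRecord(event, stream);
    }

    // True once everything recorded so far on `stream` has completed.
    bool is_ready(cudaStream_t stream) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = events_.find(stream);
        return it == events_.end() || cudaEventQuery(it->second) == cudaSuccess;
    }

  private:
    std::mutex mutex_;
    std::unordered_map<cudaStream_t, cudaEvent_t> events_;
};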

Comment on lines 127 to 133
/**
* @brief Check if the last copy operation has completed.
*
* @return true if the copy operation has completed or no copy operation
* was performed, false if it is still in progress.
*/
[[nodiscard]] bool is_copy_complete() const;
Contributor:

nit: "copy" is, I think, the wrong phrasing. I think you mean "has any stream-ordered work to allocate this buffer completed".

@pentschev (Member Author):

Well, this only applies to Buffer::copy at the moment; making it more general would, IMO, imply that it is also relevant for the constructor that just allocates without copying.

} else if (status == cudaErrorNotReady) {
return false;
} else {
RAPIDSMPF_CUDA_TRY_ALLOC(status);
Contributor:

Do we not also have a RAPIDSMPF_CUDA_TRY?

If not, we should consider introducing one. Or something like RAPIDSMPF_EXPECTS(status == cudaSuccess || status == cudaErrorNotReady, "Unexpected status") and then return status == cudaSuccess
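Spelled out, that alternative would read roughly:

cudaError_t const status = cudaEventQuery(cuda_event_);
RAPIDSMPF_EXPECTS(
    status == cudaSuccess || status == cudaErrorNotReady, "Unexpected status"
);
return status == cudaSuccess;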

@pentschev (Member Author):

We do, that was my mistake, will fix this too.

@pentschev (Member Author):

Fixed in 23673ab.

Comment on lines 371 to 373
auto future = shuffler_.comm_->recv(
src, gpu_data_tag, incoming.buffer_with_event->release()
);
Contributor:

I think it makes more sense (probably) to make the communication routines stream-ordered, and push the sync into there.

This refactoring ensures things are correctly allocated and ready to go by not enqueuing the allocated buffer receive until it is ready (basically we take things out and then put them back in). But this is still a delicate interface and requires everyone to remember to do that.

@pentschev (Member Author):

I think it makes more sense (probably) to make the communication routines stream-ordered, and push the sync into there.

See #225 (comment): this would mean we block the Shuffler progress thread.

This refactoring ensures things are correctly allocated and ready to go by not enqueuing the allocated buffer receive until it is ready (basically we take things out and then put them back in). But this is still a delicate interface and requires everyone to remember to do that.

Yes, fortunately this is only the internal implementation, but it is required to make sure we don't block the progress thread either. We could do a larger refactoring that avoids the extract/reinsert, but I'm not really sure that would make everything much less delicate.

@pentschev (Member Author):

I am not very sold on this approach, I have to say. Can we make the comms routines stream-ordered by passing streams in, recording an event and then waiting on the event before going into the library?

Doing so means that comms become blocking, which goes against how the Shuffler progress is expected to work: just progressing asynchronous work.

@nirandaperera (Contributor):

@pentschev I am still reviewing this PR. When reading the description, I felt like we are missing something.
Synchronization guarantees are not just related to the copy, right? Even buffer creation is async. So, during receive, we can have a scenario where BufferResource::allocate

std::unique_ptr<Buffer> allocate(

is async, but the data pointer is not ready. This is because rmm::device_buffer construction is async and stream-ordered.
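A sketch of the hazard being described (illustrative; assumes an asynchronous RMM memory resource `mr` and an rmm::cuda_stream_view `stream`):

// The allocation below is merely enqueued on `stream`; the device memory is
// not guaranteed to be usable yet when the constructor returns.
rmm::device_buffer dev_buf{size, stream, mr};

// Tracking its readiness would require recording an event right after it:
cudaEvent_t event;
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
cudaEventRecord(event, stream.value());
// dev_buf.data() is only safe to hand off once
// cudaEventQuery(event) == cudaSuccess.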

@pentschev (Member Author):

@pentschev I am still reviewing this PR. When reading the description, I felt like we are missing something. Synchronization guarantees are not just related to the copy, right? Even buffer creation is async. So, during receive, we can have a scenario where BufferResource::allocate

std::unique_ptr<Buffer> allocate(

is async, but the data pointer is not ready. This is because rmm::device_buffer construction is async and stream-ordered.

@nirandaperera please read item 4 in the description as well as #227 (also mentioned in the description).

@nirandaperera (Contributor):

@pentschev Why can't we use the Buffer::cuda_event_ event for allocation, rather than a separate BufferWithEvent class?

@pentschev (Member Author):

@pentschev Why can't we use the Buffer::cuda_event_ event for allocation, rather than a separate BufferWithEvent class?

Maybe we could, that's what #227 is about. It will take more work to get that done and maybe requires some discussion before doing that, so we should leave that for a follow-up PR.

@nirandaperera (Contributor):

Let's finish #227 discussion then. For me, construction and copying are not fundamentally different. So, I propose that we handle both in the same PR.

@pentschev (Member Author):

Let's finish #227 discussion then. For me, construction and copying are not fundamentally different. So, I propose that we handle both in the same PR.

The proposal from #227 that I find most suitable is item 2 of its description:

Write a mechanism similar to that added in #225 for Buffer::copy but for allocations as well. The difficulty here is that the user will always be required to specify the same rmm::cuda_stream_view to both the rmm::device_buffer and Buffer constructors; this requires careful analysis of how the rmm::device_buffer is being created, whether using the default CUDA stream, PTDS, or an explicit asynchronous CUDA stream.

This will have to touch a lot of the codebase, and potentially the Python layer as well, so I think making them separate PRs is the better option. Also note this is a critical issue that is hard to "get right" and requires testing on H100 after every iteration to guarantee nothing broke, so making changes in incremental steps is much more manageable.

@nirandaperera (Contributor) left a comment:

I made some comments. I think it's best if we schedule a call and finalize the design.

@@ -116,6 +116,7 @@ std::unique_ptr<Communicator::Future> MPI::send(
std::unique_ptr<Communicator::Future> MPI::send(
std::unique_ptr<Buffer> msg, Rank rank, Tag tag
) {
RAPIDSMPF_EXPECTS(msg->is_copy_complete(), "buffer copy has not completed yet");
Contributor:

msg is already moved into send. Now, if we throw here, we are losing data, right? 🤔

@pentschev (Member Author):

Unfortunately, you're right. And it is unfortunate because it means there's no good way to push this check into the communicator; we have to force the caller to ensure it instead. I'll revert those changes and update the docstrings to make the contract clear: the caller needs to ensure the allocation and data are ready.

@pentschev (Member Author):

Done in eded688.

Comment on lines 27 to 35
/**
* @brief Combination of a Buffer with associated CUDA event.
*
* Combining a CUDA event with a buffer allows us to track the completion of
* the asynchronous allocation of device memory. It allows disabling the event
* if it is not needed, for example, when allocating an empty buffer or if it
 * is known that the allocation has already completed.
*/
class BufferWithEvent {
Contributor:

A CUDA event is already a class member of Buffer, right? Why do we need a separate class here?

@pentschev (Member Author):

The (valid) event only exists for Buffer::copy; here there's no Buffer::copy being called, only the Buffer constructor, in which case the event does not get created. This again boils down to the #227 discussion.

Comment on lines 110 to 121
[[nodiscard]] MemoryType mem_type() const {
return buffer_->mem_type();
}

/**
* @brief Get the size of the buffer.
*
* @return The size of the buffer in bytes.
*/
[[nodiscard]] std::size_t size() const {
return buffer_->size;
}
Contributor:

I think these can be constexpr

@pentschev (Member Author):

I don't think so; Buffer::size and Buffer::mem_type() aren't constexpr.

@@ -34,21 +138,27 @@ namespace {
* @param size The size of the buffer in bytes.
* @param stream CUDA stream to use for device allocations.
* @param br Buffer resource used for the reservation and allocation.
* @param log Logger to warn if object is destroyed before event is ready.
* @param enable_event Whether to track CUDA events for this buffer.
Contributor:

I feel like this is redundant. If this is a device buffer, enable_event will always be true, isn't it?

@pentschev (Member Author):

It isn't, see the docstrings. In practice this is used for empty allocations; otherwise we immediately get warnings in the destructor, because the event will never have completed. Further discussion of why empty allocations are different is in #226.
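Roughly, the intended call-site shape is (illustrative; allocate_buffer stands in for the actual helper whose parameters are documented above):

// Empty device allocations enqueue no stream-ordered work, so tracking an
// event for them would only trigger spurious destructor warnings about an
// event that never completes.
bool const enable_event = size > 0;
auto buffer = allocate_buffer(size, stream, br, log, enable_event);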

Contributor:

This doc comment needs to be updated.

@pentschev (Member Author):

Done in 640c2bf.

} catch (std::logic_error& e) {
RAPIDSMPF_EXPECTS(
outgoing_chunks_
.insert({ready_for_data_msg.cid, std::move(chunk)})
Contributor:

this chunk doesn't have any gpu_data now, does it?

I personally dislike the blanket exception catch here; it catches non-synchronization exceptions like these as well.
I suggest using a return flag, like pair<bool, unique_ptr<Buffer>>, as the output of comm::send. That would allow us to throw and exit, as well as return with failure gracefully (and revisit later).

@pentschev (Member Author):

Good point. However, that return type would be very clunky to manage; in fact the current return type is a Future, which would make your suggestion even clunkier. We would need something like pair<bool, unique_ptr<Future>>, and the user would be responsible for checking the bool and, if the send failed, extracting the Buffer from the Future to continue. At that point it is much simpler for both comm::send and the caller to just check Buffer::is_copy_complete() before calling comm::send.
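That keeps the caller-side pattern simple, as a sketch (using the eventual is_ready() naming):

// Hand the buffer to the communicator only once its stream-ordered work has
// completed; otherwise leave it queued and retry on the next progress iteration.
if (buffer->is_ready()) {
    auto future = comm->send(std::move(buffer), rank, tag);
    // ... track `future` until completion ...
}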

@pentschev (Member Author):

As per #225 (comment), I'll revert those changes and update the docstring to clarify the contract.


Comment on lines 375 to 378
struct IncomingChunk {
detail::Chunk chunk;
std::unique_ptr<Buffer> buffer;
};
Contributor:

It's better to address this. The idea, AFAIU, is that chunks should be moved from the incoming_chunks_ map to the in-transit map ONLY when the receive buffers are allocated AND they are ready.

...
if (chunk.gpu_data_size > 0) {
    // chunk has data
    if (!chunk.gpu_data) {
        // buffer has not been allocated; allocate one
        chunk.gpu_data = ...;
    }

    // chunk.gpu_data is valid
    if (!chunk.gpu_data->is_ready()) {
        // data buffer is not ready to receive data; move on and come back later
        it++;
        continue;
    }

    // data buffer is allocated and ready; schedule the recv
    ...
    // move the chunk to in-transit chunks and the future to in-transit futures
    ...
} else {
    // chunk has no data, just a control message; nothing else to receive.
    // Create an empty buffer with a null event and pass it to the outbox.
    ...
}

@madsbk (Member) left a comment:

Nice work @pentschev.

I think we should merge this PR and then work on a follow-up PR that implements and tests multi-stream support using ideas from #242, which might include having the user create events explicitly.

Comment on lines 447 to 450
* @warning The caller is responsible to ensure the underlying `Buffer` allocation
* is already valid before calling, for example, when a CUDA allocation
* and/or copy are done asynchronously. Specifically, the caller should ensure
* `Buffer::is_ready()` returns true before calling this function.
Member:

Let's throw an exception if Buffer::is_ready() == false?

@pentschev (Member Author):

No, unfortunately that doesn't work, as @nirandaperera noted previously in #225 (comment). Doing that means we lose the std::unique_ptr<Buffer>.

@madsbk (Member) commented May 5, 2025:

Still, I think we should check Buffer::is_ready() == false. It might be unrecoverable, but that's still better than a segfault. Let's also make it clear in the docs that the buffer has been moved and freed!

@pentschev (Member Author):

How about a warning instead? Note that the exception would be raised in the shuffler's progress thread, so we'd probably need to handle it in some way. I know a warning is not a solution for the potential segfault, but I think it will more clearly inform the user about what happened than an exception that may not be handled correctly.

Member:

@pentschev (Member Author):

That works too. I would suggest we do this now and add a new ABORT log-level that both logs an error and immediately terminates. WDYT?

Member:

Sounds good!

@pentschev (Member Author):

Done in cb0b159 and opened #246 to track this as well.

Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
@@ -87,6 +86,10 @@ std::unique_ptr<cudf::table> Chunk::unpack(rmm::cuda_stream_view stream) const {
return unpack_and_concat(std::move(packed_vec), stream, br->device_mr());
}

bool Chunk::is_ready() const {
return !gpu_data || gpu_data->is_ready();
@nirandaperera (Contributor) commented May 5, 2025:

There is a small nuance here. Chunk serves two purposes:

  1. Carrying messages.
  2. Flagging end-of-messages (by setting a non-zero value for expected_num_chunks).

I think this should have the following logic, IINM. @madsbk WDYT?

if (expected_num_chunks > 0) {
    // always ready; doesn't have any data buffers
    return true;
} else {
    // chunk has data, and it's not ready until the data buffer is ready;
    // gpu_data needs to be set at some point
    if (gpu_data) {
        return gpu_data->is_ready();
    } else {
        return false;
    }
}

This could reduce to:

return (expected_num_chunks > 0) || (gpu_data && gpu_data->is_ready());

Contributor:

But, I don't want to block the PR on this.

Member:

return (expected_num_chunks > 0) || (gpu_data && gpu_data->is_ready());

Agree, this will make the intention more clear.

@pentschev (Member Author):

I see this now, but it wasn't immediately obvious. I think it makes sense to remove expected_num_chunks from the public Chunk constructor that takes gpu_data. I've applied both your suggestion and mine in de3db36; let me know if there's any reason not to make the constructor change.

@pentschev (Member Author):

Note that the private constructor remains there because it's used in Chunk::from_metadata_message.

@pentschev (Member Author):

@nirandaperera pointed out that the condition from the commit mentioned above was wrong; it's now fixed in 294d5b9.

Contributor:

+1 on the constructors. I think it's much clearer now. LGTM

@nirandaperera (Contributor) left a comment:

Just made a small comment on Chunk::is_ready impl. Other than that, all good.
@pentschev thanks for entertaining my laundry list of comments 🙂

@nirandaperera (Contributor):

@pentschev all LGTM. Let's merge this on green CI

@pentschev (Member Author):

It seems the A100 queues are currently long; I'll nevertheless trigger automerge. Thanks all for the reviews!

@pentschev (Member Author):

/merge

@pentschev pentschev dismissed wence-’s stale review May 5, 2025 21:40

As per our discussion last Friday, we were OK following Mads' preference here, which was to get this PR merged for the near term.

@rapids-bot rapids-bot bot merged commit 0317f38 into rapidsai:branch-25.06 May 5, 2025
21 checks passed
@pentschev pentschev deleted the buffer-cuda-event branch May 6, 2025 08:32