-
Notifications
You must be signed in to change notification settings - Fork 15
Ensure data are ready for comms libraries with events #242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-25.06
Are you sure you want to change the base?
Conversation
An alternative approach (borrowing a bunch of the same ideas) to #225. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signposts/thinking
} | ||
|
||
Event::~Event() { | ||
cudaEventDestroy(event_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thread safety of the dtor is handled by the shared_ptr
storage of events.
if (!done_.load(std::memory_order_relaxed)) { | ||
auto result = cudaEventQuery(event_); | ||
if (result == cudaSuccess) { | ||
done_.store(true, std::memory_order_relaxed); | ||
return true; | ||
} else if (result != cudaErrorNotReady) { | ||
RAPIDSMPF_CUDA_TRY(result); | ||
} | ||
} | ||
return done_.load(std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do some stuff to ensure that cudaEventQuery
is really only called once, but I think this is OK.
cpp/src/cuda_event.cpp
Outdated
void Event::add_as_dependency_to(rmm::cuda_stream_view stream) const { | ||
if (!query()) { | ||
RAPIDSMPF_CUDA_TRY( | ||
cudaStreamWaitEvent(stream.value(), event_, cudaEventWaitDefault) | ||
); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is preparing for if we start using an internal stream inside the shuffler that is distinct from the user stream. In that case, when we copy a user buffer on the internal stream we must add a dependency on the work on the user's stream.
@@ -51,6 +51,37 @@ std::vector<Chunk> PostBox<KeyType>::extract_all() { | |||
return ret; | |||
} | |||
|
|||
template <typename KeyType> | |||
std::vector<Chunk> PostBox<KeyType>::extract_all_ready() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken straight from #225.
@@ -175,7 +176,7 @@ class Shuffler::Progress { | |||
|
|||
// Check for new chunks in the inbox and send off their metadata. | |||
auto const t0_send_metadata = Clock::now(); | |||
for (auto&& chunk : shuffler_.outgoing_chunks_.extract_all()) { | |||
for (auto&& chunk : shuffler_.outgoing_chunks_.extract_all_ready()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only process ready chunks.
cpp/src/buffer/buffer.cpp
Outdated
[&](const DeviceStorageT& storage) -> std::unique_ptr<Buffer> { | ||
add_as_dependency_to(stream); | ||
auto event = std::make_shared<Event>(); | ||
auto buf = std::make_unique<rmm::device_buffer>( | ||
storage->data(), storage->size(), stream, br->device_mr() | ||
); | ||
event->record(stream); | ||
return std::unique_ptr<Buffer>( | ||
new Buffer{std::move(buf), br, std::move(event)} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This buffer may have work queued on a stream that is not stream
, so add a dependency (call cudaStreamWaitEvent
) on that work before doing the copy, and then record a new event for the copy for the return value.
cpp/src/buffer/buffer.cpp
Outdated
// Have to ensure that any async work is complete. | ||
synchronize(); | ||
auto buf = std::make_unique<rmm::device_buffer>( | ||
storage->data(), storage->size(), stream, br->device_mr() | ||
); | ||
event->record(stream); | ||
|
||
return std::unique_ptr<Buffer>( | ||
new Buffer{std::move(buf), br, std::move(event)} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Host to device copy must first synchronize
to ensure that the host data is valid, and then record the memcpy for the return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this one can add a stream dependency as well...
cpp/src/buffer/buffer.cpp
Outdated
add_as_dependency_to(stream); | ||
auto ret = std::make_unique<std::vector<uint8_t>>(storage->size()); | ||
RAPIDSMPF_CUDA_TRY_ALLOC(cudaMemcpyAsync( | ||
ret->data(), | ||
storage->data(), | ||
storage->size(), | ||
cudaMemcpyDeviceToHost, | ||
stream | ||
stream.value() | ||
)); | ||
return std::unique_ptr<Buffer>(new Buffer{std::move(ret), br}); | ||
event->record(stream); | ||
return std::unique_ptr<Buffer>( | ||
new Buffer{std::move(ret), br, std::move(event)} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Device to host copy needs to add a dependency on the buffer's event on stream
and then record the memcpy for the return value.
std::unique_ptr<Buffer> BufferResource::move( | ||
std::unique_ptr<rmm::device_buffer> data, std::shared_ptr<Event> event | ||
) { | ||
return std::unique_ptr<Buffer>(new Buffer{std::move(data), this, event}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking ownership of a device_buffer
now needs and event (and the ctor will ensure this is non-null appropriately).
@@ -105,11 +112,13 @@ std::unique_ptr<Buffer> BufferResource::allocate( | |||
} | |||
|
|||
std::unique_ptr<Buffer> BufferResource::move(std::unique_ptr<std::vector<uint8_t>> data) { | |||
return std::make_unique<Buffer>(Buffer{std::move(data), this}); | |||
return std::unique_ptr<Buffer>(new Buffer{std::move(data), this, nullptr}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking ownership of host data never needs an event: the caller promises to synchronise if necessary first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind going with either #225 or #242, but to be honest I don't quite see what is the big difference. The only differences I see from here from #225 is that you created a new file for Event
and added a few more methods, like synchronize()
(which I personally feel it's a bad idea to even have most of the time), and add_as_dependency_to
which after a high-level look I still fail to understand how it adds any important information that cannot be capture by recording the event.
/** | ||
* @brief ensure all stream-ordered work to populate the buffer is completed. | ||
*/ | ||
void synchronize() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but how do we prevent abuse then? In my mind, providing an interface like this is recipe to "go do the easy thing".
cpp/src/buffer/buffer.cpp
Outdated
// Have to ensure that any async work is complete. | ||
synchronize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one was wrong, so I fixed it with a stream dependency.
cpp/src/buffer/buffer.cpp
Outdated
// Have to ensure that any async work is complete. | ||
synchronize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you have to ensure async work is complete? Your source is host memory, seems like anything that has been done with it should have been synchronized before, it should never be up to the consumer to magically figure out how the host buffer has been processed, how do you know it is even the right stream then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose you spill a device side buffer to Buffer
as host memory. You did that with cudaMemcpyAsync
. Now you have a Buffer
with host memory, but it might not be ready
. So if you then copy it to a new Buffer
with host memory you can either fail if it's not is_ready
or you synchronize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, ok, but when would that happen in practice? I'd argue the consumer needs to checking the Buffer
(or whatever other object) that delivered the host memory should be checked if is_ready()
, otherwise, to use your words, we are now putting a "big hammer" (although slightly smaller than device sync) here. What is the argument for whoever called Buffer::copy()
not to have ensured that in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't do the synchronisation then every time you do a buffer host-to-host copy you must check that that is what you're doing and then sync
. Or else check is_ready
and then if you're not ready, push it back into a copy queue?
That seems like a very difficult to manage API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why we care about host-to-host copy, why wouldn't they be sync? Sure, if the source buffer you're consuming has been copied into from the device, then yes, but having to make that assumption for every host buffer just seems wrong. Can you point to a practical scenario where this happens today?
Perhaps I'm missing something, but it doesn't feel like the right implementation to me that we have to add an implicit synchronization before consuming a host buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In almost all cases the source host buffer will have no event and thus the sync is a noop. But sometimes it might
@pentschev I think this turns out very similar to your #225 PR. Some differences:
|
Suppose the application stream is When we take ownership of the application buffer we record the state of Then suppose we spill that data, but we're using |
Fair, but to me this is beyond the scope of this change, although since it's already here it's fine, otherwise I would have argued it should be part of a follow-up PR where |
I can certainly rip it out, but we still need to think about what to do about host-to-host copies (given that the source buffer in that scenario might have stream-queued work). |
This will be used to ensure that when passing to a library that is not stream-aware all work used to populate the Buffer is complete.
Now that buffers track their state with events, we can extract only those buffers which are ready to use with a library that is not stream-aware.
Added a |
Rather than using device synchronisation, use events to track work on streams for buffers and only hand over the buffer data pointer to the comms library when the event is ready.