
[WIP] [Performance Improvement] Fine-granularity locking in stream_ordered_memory_resource #1912


Draft · JigaoLuo wants to merge 2 commits into branch-25.06 from fine-granularity-locking-pool

Conversation

JigaoLuo (Contributor):

This PR optimizes stream_ordered_memory_resource with fine-grained locking: it replaces the monolithic mutex with dedicated fine-granularity locks and differentiates hot and cold paths to reduce synchronization overhead. This fixes severe CPU contention that was delaying GPU kernels waiting for memory.

Description

Attempts to close #1887, which documents my performance observations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@JigaoLuo JigaoLuo requested a review from a team as a code owner May 11, 2025 20:05
@JigaoLuo JigaoLuo requested review from rongou and bdice May 11, 2025 20:05

copy-pr-bot bot commented May 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@JigaoLuo JigaoLuo marked this pull request as draft May 11, 2025 20:20
JigaoLuo (Contributor Author) commented May 11, 2025

To the reviewers: Thank you for your feedback! As an external contributor, I may have overlooked some details, and this is still a draft. That said, I've already distinguished between hot and cold paths in the design, implemented thread safety for that part, and passed the unit tests locally. The remaining work consists of small details and TODOs on my side, including: 1) protecting log_summary_trace() with a mutex, and 2) reconciling the existing get_mutex function with the previous global mutex.
Thanks again for your time! Let me know if you spot anything else that needs adjustment.


A detailed description

  • I introduce fine-grained locking to stream_ordered_memory_resource by assigning dedicated mutexes to the free_list map, the event map, and the individual free_lists. The design differentiates between hot and cold paths to minimize synchronization overhead on the most frequent operations (the hot path); a minimal sketch follows this list.
  • Previously, a single mutex serialized all streams and free_lists, causing severe CPU-side contention that delayed GPU kernels waiting for memory allocation. This change eliminates that bottleneck by isolating contention points and reducing lock scope.
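
A minimal sketch of the locking scheme in standard C++17 (illustrative only: the aliases read_lock_guard/write_lock_guard and free_list::get_mutex() mirror the diff reviewed below, while the surrounding types are simplified placeholders rather than RMM's real ones):

#include <map>
#include <mutex>
#include <shared_mutex>

using read_lock_guard  = std::shared_lock<std::shared_mutex>;  // many concurrent readers
using write_lock_guard = std::unique_lock<std::shared_mutex>;  // one exclusive writer
using lock_guard       = std::lock_guard<std::mutex>;

struct free_list {                                   // simplified placeholder
  std::mutex& get_mutex() { return mtx_; }
  void insert(void* block) { /* coalescing logic elided */ }
 private:
  std::mutex mtx_;                                   // dedicated per-list mutex
};

std::shared_mutex stream_free_blocks_mtx_;           // guards only the map structure
std::map<int, free_list> stream_free_blocks_;        // one free list per stream event

void return_block_sketch(int stream_event, void* block)
{
  {
    read_lock_guard rlock(stream_free_blocks_mtx_);      // hot path: shared lock
    auto iter = stream_free_blocks_.find(stream_event);
    if (iter != stream_free_blocks_.end()) {
      lock_guard list_lock(iter->second.get_mutex());    // lock only this stream's list
      iter->second.insert(block);
      return;
    }
  }
  // Cold path: creating the map entry mutates the map structure,
  // so take the exclusive lock before inserting.
  write_lock_guard wlock(stream_free_blocks_mtx_);
  stream_free_blocks_[stream_event].insert(block);
}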

Although this PR is still WIP, I've built it with PTDS and run local tests. For unit testing, I increased the thread count from 4 to 16 in the MultiThreadResourceTests. The code passes the unit tests, so I proceeded to benchmark its performance on cudf's Parquet reading, my primary use case for this optimization. Profiling results are in the next message.

I welcome feedback on the current implementation. Most of my effort has gone into understanding the existing codebase, profiling bottlenecks, and debugging the solution. As I'm still learning the RMM codebase, thorough code reviews would be particularly valuable. I can also convert this back to a regular PR to run the CI tests 😄


Comments on a lock-free map

Regarding a lock-free map (as discussed with @vyasr in #1887), I think such an implementation is unnecessary, for several reasons:

  • Low write contention: As noted in the issue, the map sees minimal concurrent write operations (inserts/deletes/updates). The majority of accesses are read-only lookups that retrieve free lists.
  • Architecture considerations: Lock-free structures often rely on x86's strong memory model, which may not translate well to ARM architectures. ❓
  • Performance: Even without a lock-free map, this PR is already efficient under the current read-write lock scheme. More performance details follow below.

…t has its own mutex

Signed-off-by: Jigao Luo <jigao.luo@outlook.com>
@JigaoLuo JigaoLuo force-pushed the fine-granularity-locking-pool branch from d056d6f to e0633c4 Compare May 11, 2025 20:38
JigaoLuo (Contributor Author) commented May 11, 2025

Performance (in libcudf) with this PR

I benchmarked this PR in libcudf, reading Parquet files with 32 concurrent streams, mirroring the setup in #1887 (comment). The RMM memory allocator (the "librmm" row) now shows negligible overhead (single-digit microseconds per call, a huge improvement 🚀), and the mutex contention from rmm (the "OS runtime lib" row) has mostly been eliminated. However, the end-to-end speedup is modest due to unexpectedly increased CUDA synchronization costs (the "CUDA API" row).

[Figure: profiler summary comparing the two runs, with rows for "librmm", "OS runtime lib", and "CUDA API"]

But why more cudaSync time? 🤔

The profiling figure above reveals cudaSynchronize overhead in the "CUDA API" row, which directly constrains the end-to-end speedup of libcudf Parquet reading. The root cause, as I understand it and as I hinted in #1887 (comment), is the following:

  • [Before this PR]: Sequential kernel launches follow a pattern (sketched after this list): after asynchronously launching the first kernel, the host thread acquires the mutex to allocate memory for the next kernel. This CPU-side mutex wait overlaps with the ongoing asynchronous GPU kernel execution.
  • [After this PR]: The reduced mutex-contention time is effectively "transferred" into CUDA synchronization time, making cuda-sync operations appear dominant in the profile even though the original single-mutex bottleneck has been eliminated.
  • This overlap between CPU mutex waits and GPU kernel execution creates the illusion of increased synchronization cost; in reality it reflects more efficient exposure of the underlying GPU work in libcudf, which was previously masked by CPU contention in rmm.
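
To make the first bullet concrete, here is a hedged sketch of such a launch loop (not libcudf source; it uses RMM's classic device_memory_resource interface and a trivial placeholder kernel):

#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

__global__ void work(int* data) { *data += 1; }  // placeholder kernel

void launch_loop(rmm::mr::device_memory_resource* mr, rmm::cuda_stream_view stream, int iters)
{
  void* buf = mr->allocate(sizeof(int), stream);
  for (int i = 0; i < iters; ++i) {
    // Asynchronous launch: control returns to the host immediately.
    work<<<1, 1, 0, stream.value()>>>(static_cast<int*>(buf));
    // The host-side wait inside this allocation overlaps the running kernel.
    // Before this PR, that wait sat in the global mutex; with fine-grained
    // locking it surfaces as CUDA synchronization in the profile instead.
    void* next = mr->allocate(sizeof(int), stream);
    mr->deallocate(buf, sizeof(int), stream);
    buf = next;
  }
  mr->deallocate(buf, sizeof(int), stream);
}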

Short conclusion

While this PR resolves a critical bottleneck around the RMM mutex, the end-to-end speedup of libcudf Parquet reading is likely capped by deeper dependencies between asynchronous kernels and synchronous memory allocation.

I now realize that a substantial speedup in libcudf will be harder to achieve than I initially anticipated, since I misidentified the critical bottleneck: the primary constraint lies not in the rmm mutex but in libcudf's kernels.

Comment on lines 258 to 271
    log_summary_trace();
    read_lock_guard rlock(stream_free_blocks_mtx_);
    // Try to find a satisfactory block in free list for the same stream (no sync required)
    auto iter = stream_free_blocks_.find(stream_event);
    if (iter != stream_free_blocks_.end()) {
      // Hot path
      lock_guard free_list_lock(iter->second.get_mutex());
      iter->second.insert(block);
    } else {
      rlock.unlock();
      // Cold path
      write_lock_guard wlock(stream_free_blocks_mtx_);
      stream_free_blocks_[stream_event].insert(block);  // TODO(jigao): is it thread-safe?
    }
wence- (Contributor) commented:

There is a potential race here, I think.

Suppose two threads are using the same stream. They both reach line 259 and search for the stream_event in stream_free_blocks_; that key doesn't exist, so they both take the cold path at line 266.

Both unlock the mutex and try to grab the write lock. Only one thread can win; call it thread-A. Meanwhile, thread-B waits to acquire the lock.

Now thread-A inserts its free_block, and continues, unlocking the mutex.

Now thread-B comes in to insert its block, but the key already exists, and so stream_free_blocks_[stream_event].insert(block) does nothing.

So we drop thread-B's reference to block without returning it to the pool, breaking the contract of free_block:

  /**
   * @brief Finds, frees and returns the block associated with pointer `ptr`.
   *
   * @param ptr The pointer to the memory to free.
   * @param size The size of the memory to free. Must be equal to the original allocation size.
   * @return The (now freed) block associated with `p`. The caller is expected to return the block
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   * to the pool.
   */
  block_type free_block(void* ptr, std::size_t size) noexcept

JigaoLuo (Contributor Author) commented May 14, 2025:

@wence- Thanks for reviewing.

, and so stream_free_blocks_[stream_event].insert(block) does nothing.

Could you explain why thread B does nothing in this case? I had the same scenario in mind and shared the same concern, but here is what I expect to happen:

  • (Thread A and thread B must hold different blocks from different pointers.)
  • After thread B acquires the write lock, it must see thread A's memory writes to stream_free_blocks_; the map now contains the entry inserted by thread A.
  • This implies that stream_free_blocks_[stream_event] in thread B returns a valid reference to the free_list created by thread A.
  • Finally, thread B inserts its block into this list.

My conclusion: once the write lock on the map is held, operating on the map and its contained free_lists is thread-safe. Is there a flaw in this reasoning?


If the following code is preferable for line 270, I'd be happy to replace it:

    // Look up the stream's free list; operator[] creates it if it doesn't exist yet.
    auto iter = stream_free_blocks_.find(stream_event);
    free_list& blocks =
      (iter != stream_free_blocks_.end()) ? iter->second : stream_free_blocks_[stream_event];
    // Lock only this free list, then insert the block.
    lock_guard free_list_lock(blocks.get_mutex());
    blocks.insert(block);

wence- (Contributor) commented:

Ah sorry, I thought this insert was calling the insert method on the std::map that is stream_free_blocks_. However, you're right, it's calling insert on the free_list that you get back from the map lookup. So I think this implementation is thread safe.

JigaoLuo (Contributor Author) commented:

@wence- Thanks for the confirmation! I will still replace line 270 as discussed, for better code style. The write locks on the free_lists are acquired in a consistent order, so there should be no deadlock.

I'll replace line 270 with the code section above and add clarifying comments. I've already run the unit tests with 64 threads on my end, and the results are stable 🚀

JigaoLuo (Contributor Author) commented:

@wence- I've refactored do_deallocate for better readability using early returns, while keeping the same logic we discussed. I also added more comments explaining the hot/cold-path idea. My commit passes the local unit tests with 64 threads.

@JigaoLuo JigaoLuo changed the title [WIP] Fine-granularity locking in stream_ordered_memory_resource [WIP] [Performance Improvement]Fine-granularity locking in stream_ordered_memory_resource May 15, 2025
@JigaoLuo JigaoLuo changed the title [WIP] [Performance Improvement]Fine-granularity locking in stream_ordered_memory_resource [WIP] [Performance Improvement] Fine-granularity locking in stream_ordered_memory_resource May 15, 2025
Signed-off-by: Jigao Luo <jigao.luo@outlook.com>
@JigaoLuo JigaoLuo requested a review from wence- May 15, 2025 12:40
@@ -253,9 +256,60 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
     // streams allows stealing from deleted streams.
     RMM_ASSERT_CUDA_SUCCESS(cudaEventRecord(stream_event.event, stream.value()));

-    stream_free_blocks_[stream_event].insert(block);
+    read_lock_guard rlock(stream_free_blocks_mtx_);
harrism (Member) commented:

This adds a lot of complexity to do_deallocate that I think should live in the free list implementation. I originally designed this to be portable to other free list implementations, which is why this function was originally so simple: it more or less just called insert on the stream's free list.

This allocator is already quite fast, and I think in your exploration you ultimately found that it's not the actual bottleneck? Is it worth adding so much complexity? If the complexity can be pushed into the free list, that might be better.
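
For concreteness, a hypothetical sketch of that direction (an illustration, not RMM code): the locking lives inside the free list, so do_deallocate can keep its original one-line shape.

#include <mutex>
#include <set>

// Hypothetical: a free list that locks itself on every operation, so the
// resource's do_deallocate stays as simple as stream_free_blocks_[e].insert(b).
template <typename BlockType>  // BlockType must be ordered, as in a coalescing list
class locked_free_list {
 public:
  void insert(BlockType block)
  {
    std::lock_guard<std::mutex> lock(mtx_);  // per-list lock, not a global one
    blocks_.insert(block);
  }

  bool is_empty() const
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return blocks_.empty();
  }

 private:
  mutable std::mutex mtx_;
  std::set<BlockType> blocks_;
};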

@harrism harrism added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels May 19, 2025
vyasr (Contributor) commented May 20, 2025

@JigaoLuo apologies for the delay. I haven't quite found time to look at this PR yet, but I promise that I will get into it soon!

(A comment from @JigaoLuo was marked as off-topic.)

JigaoLuo (Contributor Author) commented May 27, 2025

I've conducted performance benchmarks using MULTI_STREAM_ALLOCATIONS_BENCH to compare my PR draft (finerlocking) with the current upstream branch-25.08. I increased the numbers of streams and kernels to evaluate the performance differences:

static void benchmark_range(benchmark::internal::Benchmark* bench)
{
  bench  //
    ->RangeMultiplier(2)
    ->Ranges({{1, 32}, {4, 32}, {false, true}})  // num_streams, num_kernels, do_prewarm
    ->Unit(benchmark::kMicrosecond);
}

The benchmark shows significant speedup (up to ~15x) in high-concurrency scenarios (times in microseconds, per the kMicrosecond unit above):


$ python compare.py benchmarks ~/upstream.bench ~/finerlocking.bench 
Comparing /home/jluo/upstream.bench to /home/jluo/finerlocking.bench
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0                  +0.0003         +0.0005          2854          2854          2853          2854
BM_MultiStreamAllocations/pool_mr/2/4/0                  +0.0032         +0.0032          1435          1439          1435          1439
BM_MultiStreamAllocations/pool_mr/4/4/0                  +0.0269         +0.0269           728           747           728           747
BM_MultiStreamAllocations/pool_mr/8/4/0                  +0.0204         +0.0204           727           742           727           742
BM_MultiStreamAllocations/pool_mr/16/4/0                 -0.4815         -0.4815          1433           743          1433           743
BM_MultiStreamAllocations/pool_mr/32/4/0                 -0.6550         -0.6550          2146           740          2146           740
BM_MultiStreamAllocations/pool_mr/1/8/0                  +0.0003         +0.0003          5699          5701          5699          5700
BM_MultiStreamAllocations/pool_mr/2/8/0                  -0.4973         -0.4973          5701          2866          5701          2866
BM_MultiStreamAllocations/pool_mr/4/8/0                  -0.6594         -0.6595          4280          1458          4280          1457
BM_MultiStreamAllocations/pool_mr/8/8/0                  -0.8621         -0.8621          5709           787          5708           787
BM_MultiStreamAllocations/pool_mr/16/8/0                 -0.8598         -0.8598          5708           800          5708           800
BM_MultiStreamAllocations/pool_mr/32/8/0                 -0.8176         -0.8176          4281           781          4281           781
BM_MultiStreamAllocations/pool_mr/1/16/0                 +0.0014         +0.0014         11389         11405         11389         11405
BM_MultiStreamAllocations/pool_mr/2/16/0                 +0.0022         +0.0022          5702          5715          5702          5714
BM_MultiStreamAllocations/pool_mr/4/16/0                 -0.4934         -0.4934          5701          2888          5700          2888
BM_MultiStreamAllocations/pool_mr/8/16/0                 -0.7861         -0.7861          7101          1519          7100          1519
BM_MultiStreamAllocations/pool_mr/16/16/0                -0.9126         -0.9125          9930           868          9929           868
BM_MultiStreamAllocations/pool_mr/32/16/0                -0.9055         -0.9055          9248           874          9247           874
BM_MultiStreamAllocations/pool_mr/1/32/0                 -0.0002         -0.0002         22779         22775         22774         22769
BM_MultiStreamAllocations/pool_mr/2/32/0                 +0.0020         +0.0020         11397         11420         11396         11419
BM_MultiStreamAllocations/pool_mr/4/32/0                 -0.7481         -0.7481         22818          5748         22814          5747
BM_MultiStreamAllocations/pool_mr/8/32/0                 -0.8267         -0.8267         17092          2962         17091          2961
BM_MultiStreamAllocations/pool_mr/16/32/0                -0.9294         -0.9294         22832          1611         22831          1611
BM_MultiStreamAllocations/pool_mr/32/32/0                -0.9391         -0.9391         16879          1028         16879          1028


BM_MultiStreamAllocations/pool_mr/1/4/1                  +0.0015         +0.0016          2853          2858          2853          2858
BM_MultiStreamAllocations/pool_mr/2/4/1                  +0.0109         +0.0110          1435          1450          1434          1450
BM_MultiStreamAllocations/pool_mr/4/4/1                  +0.0503         +0.0504           728           765           728           765
BM_MultiStreamAllocations/pool_mr/8/4/1                  +0.0485         +0.0487           728           763           728           763
BM_MultiStreamAllocations/pool_mr/16/4/1                 +0.0490         +0.0490           728           764           728           764
BM_MultiStreamAllocations/pool_mr/32/4/1                 +0.0193         +0.0192           727           741           727           741
BM_MultiStreamAllocations/pool_mr/1/8/1                  +0.0003         +0.0004          5699          5701          5698          5701
BM_MultiStreamAllocations/pool_mr/2/8/1                  +0.0032         +0.0032          2858          2867          2858          2867
BM_MultiStreamAllocations/pool_mr/4/8/1                  +0.0151         +0.0150          1438          1460          1438          1460
BM_MultiStreamAllocations/pool_mr/8/8/1                  +0.0661         +0.0661           743           792           743           792
BM_MultiStreamAllocations/pool_mr/16/8/1                 +0.0552         +0.0553           743           785           743           785
BM_MultiStreamAllocations/pool_mr/32/8/1                 +0.0481         +0.0481           743           779           743           779
BM_MultiStreamAllocations/pool_mr/1/16/1                 +0.0003         +0.0004         11389         11392         11388         11392
BM_MultiStreamAllocations/pool_mr/2/16/1                 +0.0016         +0.0016          5704          5713          5703          5712
BM_MultiStreamAllocations/pool_mr/4/16/1                 +0.0079         +0.0079          2861          2884          2861          2883
BM_MultiStreamAllocations/pool_mr/8/16/1                 +0.0402         +0.0402          1455          1514          1455          1514
BM_MultiStreamAllocations/pool_mr/16/16/1                +0.1617         +0.1619           774           899           773           899
BM_MultiStreamAllocations/pool_mr/32/16/1                +0.1351         +0.1350           774           878           774           878
BM_MultiStreamAllocations/pool_mr/1/32/1                 +0.0006         +0.0005         22771         22784         22770         22782
BM_MultiStreamAllocations/pool_mr/2/32/1                 +0.0011         +0.0011         11396         11409         11396         11408
BM_MultiStreamAllocations/pool_mr/4/32/1                 +0.0035         +0.0035          5713          5733          5713          5733
BM_MultiStreamAllocations/pool_mr/8/32/1                 +0.0251         +0.0251          2881          2954          2881          2953
BM_MultiStreamAllocations/pool_mr/16/32/1                +0.1012         +0.1011          1486          1636          1486          1636
BM_MultiStreamAllocations/pool_mr/32/32/1                +0.2416         +0.2416           830          1030           830          1030
OVERALL_GEOMEAN                                          -0.3001         -0.3001             0             0             0             0

(While these RMM changes may not directly or easily boost libcudf's end-to-end performance, they isolate the synchronization gains and hint at broader optimization potential. This side study in RMM is part of my story issue in libcudf, rapidsai/cudf#18892; major speedups in libcudf may require deeper integration.)

harrism (Member) commented May 28, 2025

What is do_prewarm? Because with it enabled there seems to be no or little speedup?

JigaoLuo (Contributor Author) commented May 28, 2025

What is do_prewarm? Because with it enabled there seems to be no or little speedup?

Hi @harrism,
The pre_warm concept is already implemented in the RMM benchmark:

static void run_prewarm(rmm::cuda_stream_pool& stream_pool, rmm::device_async_resource_ref mr)
{
  auto buffers = std::vector<rmm::device_uvector<int64_t>>();
  for (int32_t i = 0; i < stream_pool.get_pool_size(); i++) {
    auto stream = stream_pool.get_stream(i);
    buffers.emplace_back(rmm::device_uvector<int64_t>(1, stream, mr));
  }
}

Before the benchmark timer starts, the pre-warm pass pre-allocates the needed amount of memory into the stream_free_blocks_ free list within each thread's PTDS context. Thus, during benchmarking, each thread allocates directly from its thread-local stream_free_blocks_ entry through a unique slot in the map: a fast operation with minimal contention, since it avoids moving memory between lists.

With pre-warm and PTDS, my PR draft and the upstream code show comparable performance, because the critical section is narrow and there is little contention, though this idealized setup may not fully reflect real-world use cases. Critically, the pre-warm case is not the focus of this PR; it mainly serves as evidence that my draft introduces at most modest overhead in scenarios it does not target.

Successfully merging this pull request may close these issues.

[QST] How to reduce the overhead & bottlenecks of do_allocate and do_deallocate with multistreams?