
Use peak memory usage as a better proxy for ctest parallelism #18603


Open
wants to merge 8 commits into base: branch-25.06

Conversation

@bdice bdice (Contributor) commented Apr 30, 2025

Description

Data from #18599 (comment) shows that we need to mark a few tests as requiring more of the GPU to avoid OOM errors. Conversely, several tests are currently marked as requiring more than the default (15% of the GPU) even though they should be safe to run in parallel (their earlier failures were probably caused by other large-memory tests).

Currently we run our tests in CI with -j20, but all tests default to requiring 15% of the GPU, so at most 6 tests (⌊100 / 15⌋) can actually run in parallel. Our lowest-VRAM GPU in CI is an L4 with 24 GB. We should be able to safely run any tests that require <1 GB of memory in parallel (20 tests × 1 GB = 20 GB at most). Only 4 tests use more than that (shown below).

This PR proposes to run ORC_TEST, COPYING_TEST, BINARYOP_TEST, and PARQUET_TEST in isolation (100% of the GPU), while allowing all other tests to run 14-way parallel (each requires 7% of the GPU, and ⌊100 / 7⌋ = 14). This should still allow tests to pass on a 16 GB GPU, as long as the 14 parallel tests each consume less than 1 GB of GPU memory. Going forward, any new test that consumes more than 1 GB of memory should be set to run with 100% of the GPU, to keep the bookkeeping simpler.

Memory usage by test (test name, peak GPU memory)
ORC_TEST,6.24 GB
COPYING_TEST,6.00 GB
BINARYOP_TEST,5.03 GB
PARQUET_TEST,1.93 GB
JOIN_TEST,0.96 GB
MULTIBYTE_SPLIT_TEST,0.81 GB
STREAM_IO_MULTIBYTE_SPLIT_TEST,0.56 GB
COMPRESSION_TEST,0.48 GB
DATA_CHUNK_SOURCE_TEST,0.47 GB
JSON_TEST,0.35 GB
QUANTILES_TEST,0.07 GB
GROUPBY_TEST,0.05 GB
REPLACE_TEST,0.03 GB
ITERATOR_TEST,0.02 GB
REDUCTIONS_TEST,0.01 GB
TEXT_TEST,0.01 GB
NESTED_JSON_TEST,0.01 GB
STREAM_COMPACTION_TEST,0.01 GB
UTILITIES_TEST,0.01 GB
STREAM_TEXT_TEST,0.01 GB
STRINGS_TEST,0.00 GB
STREAM_STRINGS_TEST,0.00 GB
INTEROP_TEST,0.00 GB
ROLLING_TEST,0.00 GB
FST_TEST,0.00 GB
LOGICAL_STACK_TEST,0.00 GB
TRANSPOSE_TEST,0.00 GB
MERGE_TEST,0.00 GB
STREAM_MERGE_TEST,0.00 GB
REPLACE_NULLS_TEST,0.00 GB
SORT_TEST,0.00 GB
STREAM_COPYING_TEST,0.00 GB
CLAMP_TEST,0.00 GB
PARTITIONING_TEST,0.00 GB
UNARY_TEST,0.00 GB
JSON_TYPE_CAST_TEST,0.00 GB
TRANSFORM_TEST,0.00 GB
DEVICE_ATOMICS_TEST,0.00 GB
STREAM_IO_PARQUET_TEST,0.00 GB
ROUND_TEST,0.00 GB
AST_TEST,0.00 GB
COLUMN_TEST,0.00 GB
BITMASK_TEST,0.00 GB
FILLING_TEST,0.00 GB
LABEL_BINS_TEST,0.00 GB
STREAM_IO_ORC_TEST,0.00 GB
JSON_WRITER_TEST,0.00 GB
FACTORIES_TEST,0.00 GB
SEARCH_TEST,0.00 GB
FIXED_POINT_TEST,0.00 GB
CSV_TEST,0.00 GB
STRUCTS_TEST,0.00 GB
LISTS_TEST,0.00 GB
STREAM_TRANSPOSE_TEST,0.00 GB
STREAM_IO_CSV_TEST,0.00 GB
HASHING_TEST,0.00 GB
RESHAPE_TEST,0.00 GB
STREAM_IO_JSON_TEST,0.00 GB
JSON_PATH_TEST,0.00 GB
STREAM_TRANSFORM_TEST,0.00 GB
IS_SORTED_TEST,0.00 GB
STREAM_JOIN_TEST,0.00 GB
DICTIONARY_TEST,0.00 GB
SCALAR_TEST,0.00 GB
DATETIME_OPS_TEST,0.00 GB
STREAM_SORTING_TEST,0.00 GB
TABLE_TEST,0.00 GB
STREAM_GROUPBY_TEST,0.00 GB
STREAM_STREAM_COMPACTION_TEST,0.00 GB
ENCODE_TEST,0.00 GB
STREAM_LISTS_TEST,0.00 GB
REPLACE_NANS_TEST,0.00 GB
TIMESTAMPS_TEST,0.00 GB
STREAM_DICTIONARY_TEST,0.00 GB
STREAM_REPLACE_TEST,0.00 GB
STREAM_ROLLING_TEST,0.00 GB
STREAM_PARTITIONING_TEST,0.00 GB
STREAM_QUANTILE_TEST,0.00 GB
STREAM_RESHAPE_TEST,0.00 GB
SPAN_TEST,0.00 GB
STREAM_REDUCTION_TEST,0.00 GB
STREAM_FILLING_TEST,0.00 GB
STREAM_HASHING_TEST,0.00 GB
STREAM_LABELING_BINS_TEST,0.00 GB
STREAM_CONCATENATE_TEST,0.00 GB
STREAM_SEARCH_TEST,0.00 GB
NORMALIZE_REPLACE_TEST,0.00 GB
STREAM_NULL_MASK_TEST,0.00 GB
TYPE_INFERENCE_TEST,0.00 GB
STREAM_BINARYOP_TEST,0.00 GB
STREAM_DATETIME_TEST,0.00 GB
STREAM_ROUND_TEST,0.00 GB
STREAM_UNARY_TEST,0.00 GB
STREAM_SCALAR_TEST,0.00 GB
DISPATCHER_TEST,0.00 GB
JIT_PARSER_TEST,0.00 GB
LARGE_STRINGS_TEST,0.00 GB
ROW_SELECTION_TEST,0.00 GB
STREAM_COLUMN_VIEW_TEST,0.00 GB
STREAM_IDENTIFICATION_TEST,0.00 GB
STREAM_POOL_TEST,0.00 GB
TRAITS_TEST,0.00 GB

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested a review from a team as a code owner April 30, 2025 15:50
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels Apr 30, 2025
@bdice bdice added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Apr 30, 2025
@davidwendt davidwendt (Contributor) commented

Could you include LARGE_STRINGS_TEST in the 100% GPU case as well?
It handles the memory resource independently and so did not show up in the statistics in #18599, but by definition I would expect it to require a large amount of device memory.
I will look at fixing this in #18599.

@davidwendt davidwendt (Contributor) commented

You may want to be careful not to overfit to the current CI infrastructure, and also to allow room for future tests to grow a bit.
Unless you are trying to encourage keeping test memory constraints low, in which case you have my full support.

@bdice bdice (Contributor, Author) commented Apr 30, 2025

You may want to be careful to not overfit to the current CI infrastructure. And also allow room for future tests to grow a bit. Unless you are trying to encourage keeping memory constraints low in tests in which case you have my full support.

Agreed -- I am mostly experimenting right now. I am trying to fit into what I think would work on a 16 GB GPU, although our smallest GPU in CI is 24 GB. I suspect our peak memory usage numbers for each test may not be accurate, since running 10 tests in parallel instead of 6 fails even on a 24 GB GPU. Perhaps there is overhead from loading libcudf in each process? I'm not sure.

Also, I'm a little sad because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at. Previously we allowed some tests to run in parallel with the larger tests (some requested 30% or 70% of the GPU). Perhaps we should accept that increase in runtime in exchange for greater stability? Or perhaps we should mark the larger tests as requiring 70% instead of 100% of the GPU, thereby allowing two other "small" tests to run at the same time? I'm not sure.

@davidwendt davidwendt (Contributor) commented

Also, I'm a little sad, because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at. Previously we allowed some tests to run in parallel with the larger tests (we had some that requested 30% or 70% of the GPU). Perhaps we should accept that increase in runtime in exchange for greater stability?

I also took a quick look at runtime, and I would totally accept an increase in runtime to make this easier to maintain.
I would rather have something simple that can be easily adjusted over time, so we don't have to rework this too often.

@bdice bdice (Contributor, Author) commented May 1, 2025

Oooh, I wonder if there's something going on here with "pool" being the default memory resource. We should probably use "async" so the processes can share a driver-managed pool. Running multiple tests in parallel is probably allocating many large pools, each of which takes half of what's left of free memory until the last test gets 1/64 of the GPU memory or something like that. (I don't think the tests are sharing a single pool resource.)
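For illustration, a minimal sketch of what "async" means in RMM terms (this is not the actual libcudf gtest main, and its setup code may differ): the cudaMallocAsync-backed resource draws from a single driver-managed pool, so parallel test processes are not each reserving a large private pool up front.

    // Illustrative sketch only: make the cudaMallocAsync-backed resource the default
    // RMM resource so all subsequent device allocations go through the driver-managed pool.
    #include <rmm/mr/device/cuda_async_memory_resource.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>

    int main()
    {
      rmm::mr::cuda_async_memory_resource async_mr{};   // backed by cudaMallocAsync
      rmm::mr::set_current_device_resource(&async_mr);  // RMM allocations now use it
      // ... run the tests ...
      return 0;
    }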

@bdice bdice requested a review from a team as a code owner May 1, 2025 03:17
@bdice bdice requested review from karthikeyann and vuule May 1, 2025 03:17
@davidwendt davidwendt (Contributor) commented

Oooh, I wonder if there's something going on here with "pool" being the default memory resource. We should probably use "async" so the processes can share a driver-managed pool. Running multiple tests in parallel is probably allocating many large pools, each of which takes half of what's left of free memory until the last test gets 1/64 of the GPU memory or something like that. (I don't think the tests are sharing a single pool resource.)

Ok. The default rmm-mode can be controlled with the GTEST_CUDF_RMM_MODE environment variable. Perhaps this should be set in the CI script? Or do you think we should change the default for everybody?
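A hypothetical sketch of what such an environment-driven switch could look like (the real libcudf gtest main parses GTEST_CUDF_RMM_MODE itself; the mode names and the "pool" default assumed here are illustrative):

    #include <rmm/cuda_device.hpp>
    #include <rmm/mr/device/cuda_async_memory_resource.hpp>
    #include <rmm/mr/device/cuda_memory_resource.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>
    #include <rmm/mr/device/pool_memory_resource.hpp>

    #include <cstdlib>
    #include <string>

    int main()
    {
      char const* env        = std::getenv("GTEST_CUDF_RMM_MODE");
      std::string const mode = env ? env : "pool";  // assumed default for this sketch

      if (mode == "async") {
        static rmm::mr::cuda_async_memory_resource async_mr{};
        rmm::mr::set_current_device_resource(&async_mr);
      } else if (mode == "cuda") {
        static rmm::mr::cuda_memory_resource cuda_mr{};
        rmm::mr::set_current_device_resource(&cuda_mr);
      } else {  // "pool": mirrors the half-of-free-memory behavior discussed above
        static rmm::mr::cuda_memory_resource upstream{};
        static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
          &upstream, rmm::percent_of_free_device_memory(50)};
        rmm::mr::set_current_device_resource(&pool_mr);
      }
      // ... run the tests ...
      return 0;
    }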

@bdice bdice (Contributor, Author) commented May 1, 2025

The default should be async and not pool, imo. We have been migrating lots of workloads to async over pool because the performance is not significantly different and async is much easier to use with multiple applications.

@bdice bdice (Contributor, Author) commented May 1, 2025

Looks like the failures are now "cudf_identify_stream_usage found unexpected stream!". I'll try to take a look at this, but it may be late next week. @davidwendt If you're interested in looking sooner, that would be welcome!

@vuule vuule (Contributor) commented May 1, 2025

Also, I'm a little sad, because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at.

Could it be because the pool is not used?

@davidwendt davidwendt (Contributor) commented

I spent some time on this today. It appears the async memory resource is incompatible with the stream-adaptor/checker. The first problem is that when setting up the memory resource in the gtest main, the cuda_async_memory_resource constructor does a do_allocate() and do_deallocate() without an explicit stream:
https://github.com/rapidsai/rmm/blob/0c9fe21266680973c390721e86454a885b444869/cpp/include/rmm/mr/device/cuda_async_memory_resource.hpp#L145-L149

    // Allocate and immediately deallocate the initial_pool_size to prime the pool with the
    // specified size
    auto const pool_size = initial_pool_size.value_or(free / 2);
    auto* ptr            = do_allocate(pool_size, cuda_stream_default);
    do_deallocate(ptr, pool_size, cuda_stream_default);

These calls cause the stream-checker to throw an error. Because the stream-adaptor is loaded via LD_PRELOAD, there is no simple way to delay the check, and the gtest fails inside main() before any tests begin.
I tried setting environment variables and commenting out pieces of the adaptor code to let the memory resource be created, but the gtests still seem to fail later for some reason -- it was not immediately obvious why, and it looks like it may require more significant investigation to make async work with the stream-checker.
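For context, a deliberately simplified sketch of the LD_PRELOAD interception pattern the checker relies on (this is not the actual cudf_identify_stream_usage source, which hooks many more entry points): the preload library overrides a CUDA runtime symbol, flags calls on a default stream, and forwards to the real implementation.

    // Simplified illustration of an LD_PRELOAD stream checker (built as a shared library
    // and injected with LD_PRELOAD): flag default-stream usage, then forward the call to
    // the real CUDA runtime symbol resolved via dlsym(RTLD_NEXT, ...). Because the hook is
    // active from load time, a default-stream call made while constructing the memory
    // resource in main() trips the check before any test runs.
    #include <cuda_runtime_api.h>
    #include <dlfcn.h>

    #include <cstdio>
    #include <cstdlib>

    extern "C" cudaError_t cudaMemcpyAsync(
      void* dst, void const* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream)
    {
      if (stream == nullptr || stream == cudaStreamLegacy || stream == cudaStreamPerThread) {
        std::fprintf(stderr, "found unexpected stream!\n");
        std::abort();
      }
      using fn_t          = cudaError_t (*)(void*, void const*, size_t, cudaMemcpyKind, cudaStream_t);
      static auto real_fn = reinterpret_cast<fn_t>(dlsym(RTLD_NEXT, "cudaMemcpyAsync"));
      return real_fn(dst, src, count, kind, stream);
    }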

This brings into question the value of the stream-checker as well. Perhaps it needs some rework or perhaps we could turn it off by default for ctest and use it only with the cuda or pool resource. There have been recent issues with running it on ARM systems and with PTDS enabled. We should probably debate this in a separate issue.

Another thought: we could possibly figure out a way to set the pool size for those 15% cases, etc., so they allocate less initial memory.
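One possible shape of that idea, as a hedged sketch (illustrative only, not what this PR implements; it assumes RMM's percent_of_free_device_memory helper): size each process's pool from the test's declared GPU share instead of the default of half the remaining free memory.

    #include <rmm/cuda_device.hpp>
    #include <rmm/mr/device/cuda_memory_resource.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>
    #include <rmm/mr/device/pool_memory_resource.hpp>

    int main()
    {
      // Size the pool to roughly the test's declared share (15% here, as an example)
      // instead of letting each parallel process grab half of the remaining free memory.
      static rmm::mr::cuda_memory_resource upstream{};
      static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
        &upstream, rmm::percent_of_free_device_memory(15)};
      rmm::mr::set_current_device_resource(&pool_mr);
      // ... run the tests ...
      return 0;
    }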

@bdice bdice requested a review from a team as a code owner May 6, 2025 15:31
@bdice bdice requested a review from AyodeAwe May 6, 2025 15:31
@vyasr vyasr (Contributor) commented May 7, 2025

We can certainly look into reworking the stream testing utility or reassessing how we use it. I'm skeptical that we are confident enough yet in our stream hygiene to remove its usage entirely, but maybe we could run it on a more limited basis, like we run the compute-sanitizer tests. Making the stream checker work with more cases (such as the async pool issue described above) is probably possible but nontrivial.
