
Use peak memory usage as a better proxy for ctest parallelism #18603


Open
wants to merge 8 commits into base: branch-25.06

Conversation

@bdice bdice (Contributor) commented Apr 30, 2025

Description

Data from #18599 (comment) shows that we need to mark a few tests as requiring more of the GPU to avoid OOM errors. Conversely, several tests are currently marked as requiring more than the default (15% of the GPU) even though they should be safe to run in parallel (their earlier failures were probably caused by other large-memory tests).

Currently we run our tests in CI with -j20, but all tests default to requiring 15% of the GPU, so at most 6 tests (⌊100 / 15⌋) can actually run in parallel. Our lowest-VRAM GPU in CI is an L4 with 24 GB. We should be able to safely run any tests that require <1 GB of memory in parallel (20 tests × 1 GB = 20 GB at most). Only 4 tests use more than that (shown below).

This PR proposes to run ORC_TEST, COPYING_TEST, BINARYOP_TEST, and PARQUET_TEST in isolation (100% of the GPU), while allowing all other tests to run 14-way parallel (each requires 7% of the GPU, and ⌊100 / 7⌋ = 14). This should still allow tests to pass on a 16 GB GPU, as long as the 14 parallel tests each consume less than 1 GB of GPU memory. Going forward, any new test that consumes more than 1 GB of memory should be set to run with 100% of the GPU, to keep the bookkeeping simpler.

Memory usage by test (test name, peak GPU memory)
ORC_TEST,6.24 GB
COPYING_TEST,6.00 GB
BINARYOP_TEST,5.03 GB
PARQUET_TEST,1.93 GB
JOIN_TEST,0.96 GB
MULTIBYTE_SPLIT_TEST,0.81 GB
STREAM_IO_MULTIBYTE_SPLIT_TEST,0.56 GB
COMPRESSION_TEST,0.48 GB
DATA_CHUNK_SOURCE_TEST,0.47 GB
JSON_TEST,0.35 GB
QUANTILES_TEST,0.07 GB
GROUPBY_TEST,0.05 GB
REPLACE_TEST,0.03 GB
ITERATOR_TEST,0.02 GB
REDUCTIONS_TEST,0.01 GB
TEXT_TEST,0.01 GB
NESTED_JSON_TEST,0.01 GB
STREAM_COMPACTION_TEST,0.01 GB
UTILITIES_TEST,0.01 GB
STREAM_TEXT_TEST,0.01 GB
STRINGS_TEST,0.00 GB
STREAM_STRINGS_TEST,0.00 GB
INTEROP_TEST,0.00 GB
ROLLING_TEST,0.00 GB
FST_TEST,0.00 GB
LOGICAL_STACK_TEST,0.00 GB
TRANSPOSE_TEST,0.00 GB
MERGE_TEST,0.00 GB
STREAM_MERGE_TEST,0.00 GB
REPLACE_NULLS_TEST,0.00 GB
SORT_TEST,0.00 GB
STREAM_COPYING_TEST,0.00 GB
CLAMP_TEST,0.00 GB
PARTITIONING_TEST,0.00 GB
UNARY_TEST,0.00 GB
JSON_TYPE_CAST_TEST,0.00 GB
TRANSFORM_TEST,0.00 GB
DEVICE_ATOMICS_TEST,0.00 GB
STREAM_IO_PARQUET_TEST,0.00 GB
ROUND_TEST,0.00 GB
AST_TEST,0.00 GB
COLUMN_TEST,0.00 GB
BITMASK_TEST,0.00 GB
FILLING_TEST,0.00 GB
LABEL_BINS_TEST,0.00 GB
STREAM_IO_ORC_TEST,0.00 GB
JSON_WRITER_TEST,0.00 GB
FACTORIES_TEST,0.00 GB
SEARCH_TEST,0.00 GB
FIXED_POINT_TEST,0.00 GB
CSV_TEST,0.00 GB
STRUCTS_TEST,0.00 GB
LISTS_TEST,0.00 GB
STREAM_TRANSPOSE_TEST,0.00 GB
STREAM_IO_CSV_TEST,0.00 GB
HASHING_TEST,0.00 GB
RESHAPE_TEST,0.00 GB
STREAM_IO_JSON_TEST,0.00 GB
JSON_PATH_TEST,0.00 GB
STREAM_TRANSFORM_TEST,0.00 GB
IS_SORTED_TEST,0.00 GB
STREAM_JOIN_TEST,0.00 GB
DICTIONARY_TEST,0.00 GB
SCALAR_TEST,0.00 GB
DATETIME_OPS_TEST,0.00 GB
STREAM_SORTING_TEST,0.00 GB
TABLE_TEST,0.00 GB
STREAM_GROUPBY_TEST,0.00 GB
STREAM_STREAM_COMPACTION_TEST,0.00 GB
ENCODE_TEST,0.00 GB
STREAM_LISTS_TEST,0.00 GB
REPLACE_NANS_TEST,0.00 GB
TIMESTAMPS_TEST,0.00 GB
STREAM_DICTIONARY_TEST,0.00 GB
STREAM_REPLACE_TEST,0.00 GB
STREAM_ROLLING_TEST,0.00 GB
STREAM_PARTITIONING_TEST,0.00 GB
STREAM_QUANTILE_TEST,0.00 GB
STREAM_RESHAPE_TEST,0.00 GB
SPAN_TEST,0.00 GB
STREAM_REDUCTION_TEST,0.00 GB
STREAM_FILLING_TEST,0.00 GB
STREAM_HASHING_TEST,0.00 GB
STREAM_LABELING_BINS_TEST,0.00 GB
STREAM_CONCATENATE_TEST,0.00 GB
STREAM_SEARCH_TEST,0.00 GB
NORMALIZE_REPLACE_TEST,0.00 GB
STREAM_NULL_MASK_TEST,0.00 GB
TYPE_INFERENCE_TEST,0.00 GB
STREAM_BINARYOP_TEST,0.00 GB
STREAM_DATETIME_TEST,0.00 GB
STREAM_ROUND_TEST,0.00 GB
STREAM_UNARY_TEST,0.00 GB
STREAM_SCALAR_TEST,0.00 GB
DISPATCHER_TEST,0.00 GB
JIT_PARSER_TEST,0.00 GB
LARGE_STRINGS_TEST,0.00 GB
ROW_SELECTION_TEST,0.00 GB
STREAM_COLUMN_VIEW_TEST,0.00 GB
STREAM_IDENTIFICATION_TEST,0.00 GB
STREAM_POOL_TEST,0.00 GB
TRAITS_TEST,0.00 GB

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested a review from a team as a code owner April 30, 2025 15:50
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels Apr 30, 2025
@bdice bdice added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Apr 30, 2025
@davidwendt davidwendt (Contributor) commented

Could you include LARGE_STRINGS_TEST in the 100% GPU case as well?
It handles the memory resource independently and so did not show up in the statistics in #18599, but by definition I would expect it to require a large amount of device memory.
I will look at fixing this in #18599.

@davidwendt davidwendt (Contributor) commented

You may want to be careful not to overfit to the current CI infrastructure, and also to allow room for future tests to grow a bit.
Unless you are trying to encourage keeping test memory constraints low, in which case you have my full support.

@bdice bdice (Contributor, Author) commented Apr 30, 2025

You may want to be careful to not overfit to the current CI infrastructure. And also allow room for future tests to grow a bit. Unless you are trying to encourage keeping memory constraints low in tests in which case you have my full support.

Agreed -- I am mostly experimenting right now. I am trying to fit into what I think would work on a 16 GB GPU, although our smallest GPU in CI is 24 GB. I suspect our peak memory usage numbers for each test may not be accurate, since running 10 tests in parallel instead of 6 fails even on a 24 GB GPU. Perhaps there is overhead from loading libcudf in each process? I'm not sure.

Also, I'm a little sad because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at. Previously we allowed some tests to run in parallel with the larger tests (some requested 30% or 70% of the GPU). Perhaps we should accept that increase in runtime in exchange for greater stability? Or perhaps we should mark the larger tests as requiring 70% instead of 100% of the GPU, thereby allowing two other "small" tests to run at the same time? I'm not sure.

@davidwendt davidwendt (Contributor) commented

Also, I'm a little sad, because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at. Previously we allowed some tests to run in parallel with the larger tests (we had some that requested 30% or 70% of the GPU). Perhaps we should accept that increase in runtime in exchange for greater stability?

I also took a quick look at runtime, and I would totally accept an increase in runtime to make this easier to maintain.
I would rather have something simple that can be easily adjusted over time, so we don't have to rework this too often.

@bdice bdice (Contributor, Author) commented May 1, 2025

Oooh, I wonder if there's something going on here with "pool" being the default memory resource. We should probably use "async" so the processes can share a driver-managed pool. Running multiple tests in parallel is probably allocating many large pools, each of which takes half of what's left of free memory until the last test gets 1/64 of the GPU memory or something like that. (I don't think the tests are sharing a single pool resource.)
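For illustration, a minimal sketch of what "async" means in RMM terms (this is not the actual libcudf gtest main, and its setup code may differ): the cudaMallocAsync-backed resource draws from a single driver-managed pool, so parallel test processes are not each reserving a large private pool up front.

    // Illustrative sketch only: make the cudaMallocAsync-backed resource the default
    // RMM resource so all subsequent device allocations go through the driver-managed pool.
    #include <rmm/mr/device/cuda_async_memory_resource.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>

    int main()
    {
      rmm::mr::cuda_async_memory_resource async_mr{};   // backed by cudaMallocAsync
      rmm::mr::set_current_device_resource(&async_mr);  // RMM allocations now use it
      // ... run the tests ...
      return 0;
    }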

@bdice bdice requested a review from a team as a code owner May 1, 2025 03:17
@bdice bdice requested review from karthikeyann and vuule May 1, 2025 03:17
@davidwendt davidwendt (Contributor) commented

Oooh, I wonder if there's something going on here with "pool" being the default memory resource. We should probably use "async" so the processes can share a driver-managed pool. Running multiple tests in parallel is probably allocating many large pools, each of which takes half of what's left of free memory until the last test gets 1/64 of the GPU memory or something like that. (I don't think the tests are sharing a single pool resource.)

Ok. The default rmm-mode can be controlled with the GTEST_CUDF_RMM_MODE environment variable. Perhaps this should be set in the CI script? Or do you think we should change the default for everybody?
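A hypothetical sketch of what such an environment-driven switch could look like (the real libcudf gtest main parses GTEST_CUDF_RMM_MODE itself; the mode names and the "pool" default assumed here are illustrative):

    #include <rmm/cuda_device.hpp>
    #include <rmm/mr/device/cuda_async_memory_resource.hpp>
    #include <rmm/mr/device/cuda_memory_resource.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>
    #include <rmm/mr/device/pool_memory_resource.hpp>

    #include <cstdlib>
    #include <string>

    int main()
    {
      char const* env        = std::getenv("GTEST_CUDF_RMM_MODE");
      std::string const mode = env ? env : "pool";  // assumed default for this sketch

      if (mode == "async") {
        static rmm::mr::cuda_async_memory_resource async_mr{};
        rmm::mr::set_current_device_resource(&async_mr);
      } else if (mode == "cuda") {
        static rmm::mr::cuda_memory_resource cuda_mr{};
        rmm::mr::set_current_device_resource(&cuda_mr);
      } else {  // "pool": mirrors the half-of-free-memory behavior discussed above
        static rmm::mr::cuda_memory_resource upstream{};
        static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
          &upstream, rmm::percent_of_free_device_memory(50)};
        rmm::mr::set_current_device_resource(&pool_mr);
      }
      // ... run the tests ...
      return 0;
    }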

@bdice bdice (Contributor, Author) commented May 1, 2025

The default should be async and not pool, imo. We have been migrating lots of workloads to async over pool because the performance is not significantly different and async is much easier to use with multiple applications.

@bdice bdice (Contributor, Author) commented May 1, 2025

Looks like the failures are now "cudf_identify_stream_usage found unexpected stream!". I'll try to take a look at this, but it may be late next week. @davidwendt If you're interested in looking sooner, that would be welcome!

@vuule vuule (Contributor) commented May 1, 2025

Also, I'm a little sad, because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at.

Could it be because the pool is not used?

@davidwendt davidwendt (Contributor) commented

I spent some time on this today. It appears the async memory resource is incompatible with the stream-adaptor/checker. The first problem is that when setting up the memory resource in the gtest main, the cuda_async_memory_resource constructor does a do_allocate() and do_deallocate() without an explicit stream:
https://github.com/rapidsai/rmm/blob/0c9fe21266680973c390721e86454a885b444869/cpp/include/rmm/mr/device/cuda_async_memory_resource.hpp#L145-L149

    // Allocate and immediately deallocate the initial_pool_size to prime the pool with the
    // specified size
    auto const pool_size = initial_pool_size.value_or(free / 2);
    auto* ptr            = do_allocate(pool_size, cuda_stream_default);
    do_deallocate(ptr, pool_size, cuda_stream_default);

These calls cause the stream-checker to throw an error. Because the stream-adaptor is loaded via LD_PRELOAD, there is no simple way to delay the check, and the gtest fails inside main() before any tests begin.
I tried setting environment variables and commenting out pieces of the adaptor code to let the memory resource be created, but the gtests still seem to fail later for some reason -- it was not immediately obvious why, and it looks like it may require more significant investigation to make async work with the stream-checker.
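For context, a deliberately simplified sketch of the LD_PRELOAD interception pattern the checker relies on (this is not the actual cudf_identify_stream_usage source, which hooks many more entry points): the preload library overrides a CUDA runtime symbol, flags calls on a default stream, and forwards to the real implementation.

    // Simplified illustration of an LD_PRELOAD stream checker (built as a shared library
    // and injected with LD_PRELOAD): flag default-stream usage, then forward the call to
    // the real CUDA runtime symbol resolved via dlsym(RTLD_NEXT, ...). Because the hook is
    // active from load time, a default-stream call made while constructing the memory
    // resource in main() trips the check before any test runs.
    #include <cuda_runtime_api.h>
    #include <dlfcn.h>

    #include <cstdio>
    #include <cstdlib>

    extern "C" cudaError_t cudaMemcpyAsync(
      void* dst, void const* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream)
    {
      if (stream == nullptr || stream == cudaStreamLegacy || stream == cudaStreamPerThread) {
        std::fprintf(stderr, "found unexpected stream!\n");
        std::abort();
      }
      using fn_t          = cudaError_t (*)(void*, void const*, size_t, cudaMemcpyKind, cudaStream_t);
      static auto real_fn = reinterpret_cast<fn_t>(dlsym(RTLD_NEXT, "cudaMemcpyAsync"));
      return real_fn(dst, src, count, kind, stream);
    }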

This brings into question the value of the stream-checker as well. Perhaps it needs some rework or perhaps we could turn it off by default for ctest and use it only with the cuda or pool resource. There have been recent issues with running it on ARM systems and with PTDS enabled. We should probably debate this in a separate issue.

Another thought: we could possibly figure out a way to set the pool size for those 15% cases, etc., so they allocate less initial memory.
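One possible shape of that idea, as a hedged sketch (illustrative only, not what this PR implements; it assumes RMM's percent_of_free_device_memory helper): size each process's pool from the test's declared GPU share instead of the default of half the remaining free memory.

    #include <rmm/cuda_device.hpp>
    #include <rmm/mr/device/cuda_memory_resource.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>
    #include <rmm/mr/device/pool_memory_resource.hpp>

    int main()
    {
      // Size the pool to roughly the test's declared share (15% here, as an example)
      // instead of letting each parallel process grab half of the remaining free memory.
      static rmm::mr::cuda_memory_resource upstream{};
      static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
        &upstream, rmm::percent_of_free_device_memory(15)};
      rmm::mr::set_current_device_resource(&pool_mr);
      // ... run the tests ...
      return 0;
    }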

@bdice bdice requested a review from a team as a code owner May 6, 2025 15:31
@bdice bdice requested a review from AyodeAwe May 6, 2025 15:31
@vyasr vyasr (Contributor) commented May 7, 2025

We can certainly look into reworking the stream testing utility or reassessing how we use it. I'm skeptical that we are confident enough yet in our stream hygiene to remove its usage entirely, but maybe we could run it on a more limited basis, like we run the compute-sanitizer tests. Making the stream checker work with more cases (such as the async pool issue described above) is probably possible but nontrivial.
