Use peak memory usage as a better proxy for ctest parallelism #18603
base: branch-25.06
Conversation
Could you include
You may want to be careful not to overfit to the current CI infrastructure, and also to allow room for future tests to grow a bit.
Agreed -- I am mostly experimenting right now. I am trying to fit into what I think would work on a 16 GB GPU, although our smallest GPU in CI is 24 GB. I suspect our peak memory usage numbers for each test may not be accurate, since running 10 tests in parallel instead of 6 fails even on a 24 GB GPU. Perhaps there is overhead from loading libcudf in each process? I'm not sure. Also, I'm a little sad because the runtimes for the test suite seem to have gone up a bit with this PR compared to another PR I looked at. Previously we allowed some tests to run in parallel with the larger tests (we had some that requested 30% or 70% of the GPU). Perhaps we should accept that increase in runtime in exchange for greater stability? Or perhaps we should mark the larger tests as requiring 70% instead of 100% of the GPU, thereby allowing two other "small" tests to run at the same time? I'm not sure.
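As context for the peak-memory question above, here is a minimal sketch, not cudf's actual test harness, of how per-process peak device memory can be observed with RMM's `statistics_resource_adaptor`; the 256 MiB allocation is purely illustrative. Note that the adaptor only tracks allocations routed through RMM, so per-process CUDA context and library overhead would not appear in these numbers, which may be part of why summed per-test peaks underestimate real GPU usage.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/statistics_resource_adaptor.hpp>

#include <iostream>

int main()
{
  // Wrap the plain CUDA resource with a statistics adaptor so every RMM
  // allocation in this process is counted.
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::statistics_resource_adaptor<rmm::mr::cuda_memory_resource> stats_mr{&cuda_mr};
  rmm::mr::set_current_device_resource(&stats_mr);

  {
    // Stand-in for a test body: allocate ~256 MiB through the tracked resource.
    rmm::device_buffer buf{256u * 1024u * 1024u, rmm::cuda_stream_default};
  }

  // Peak bytes seen by this resource over the process lifetime.
  std::cout << "peak bytes: " << stats_mr.get_bytes_counter().peak << "\n";
  return 0;
}
```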
I also took a quick look at runtime and I would totally accept an increase in runtime to make this easier to maintain.
Oooh, I wonder if there's something going on here with
Ok. The default rmm-mode can be controlled with
The default should be async and not pool, imo. We have been migrating lots of workloads from pool to async because the performance is not significantly different and async is much easier to use with multiple applications.
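For reference, a minimal sketch, assuming current RMM headers (constructor arguments vary somewhat across releases), of the two resource choices being discussed; the 1 GiB pool size is illustrative only:

```cpp
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  // "async": a cudaMallocAsync-backed pool managed by the driver. It grows on
  // demand, so multiple processes can share a GPU without reserving memory up front.
  rmm::mr::cuda_async_memory_resource async_mr{};
  rmm::mr::set_current_device_resource(&async_mr);

  // "pool": RMM's suballocator over cudaMalloc. It takes an initial size and holds
  // on to that memory, which makes concurrent test processes more likely to collide.
  // rmm::mr::cuda_memory_resource cuda_mr;
  // rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
  //   &cuda_mr, 1024u * 1024u * 1024u};  // 1 GiB initial pool
  // rmm::mr::set_current_device_resource(&pool_mr);

  return 0;
}
```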
Looks like the failures are now
Could it be because the pool is not used?
I spent some time on this today. It appears the resource initialization primes the pool with the following lines:

```cpp
// Allocate and immediately deallocate the initial_pool_size to prime the pool with the
// specified size
auto const pool_size = initial_pool_size.value_or(free / 2);
auto* ptr            = do_allocate(pool_size, cuda_stream_default);
do_deallocate(ptr, pool_size, cuda_stream_default);
```

These cause the stream-checker to throw an error because of how the stream-adaptor is loaded. This brings into question the value of the stream-checker as well. Perhaps it needs some rework, or perhaps we could turn it off by default for ctest and use it only with the cuda or pool resource. There have been recent issues with running it on ARM systems and with PTDS enabled. We should probably debate this in a separate issue. Another thought is whether we could figure out a way to set the pool size for those 15% cases, etc., so they allocate less initial memory; see the sketch below.
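On that last point, here is a minimal sketch of what setting the pool size explicitly could look like, assuming the async resource accepts an initial pool size; the 512 MiB figure is illustrative, not a value from this PR:

```cpp
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <cstddef>

int main()
{
  // Small, explicit initial size: the allocate/deallocate priming shown above then
  // touches ~512 MiB instead of defaulting to half of the GPU's free memory.
  constexpr std::size_t initial_pool_size = 512u * 1024u * 1024u;

  rmm::mr::cuda_async_memory_resource mr{initial_pool_size};
  rmm::mr::set_current_device_resource(&mr);
  return 0;
}
```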
We can certainly look into reworking the stream testing utility or reassessing how we use it. I'm skeptical that we are really confident enough yet in our stream hygiene to remove its usage entirely, but maybe we could run it on a more limited basis like we run the compute sanitizer tests. Making the stream checker work with more cases (such as with the async pool issue described above) is probably possible but nontrivial. |
Description
Data from #18599 (comment) shows that we need to mark a few tests as requiring more of the GPU to avoid OOM errors. Conversely, we are currently marking several tests as requiring more than the default (15% of the GPU) even though they should be safe to run in parallel; they probably only failed before because they ran alongside other large-memory tests.
Currently we run our tests in CI with `-j20`, but we default all tests to requiring 15% of the GPU (so a maximum of 6 tests are able to run in parallel). Our lowest-VRAM GPU in CI is an L4 with 24 GB. We should be able to safely run any tests that require <1 GB of memory in parallel (requiring 20 GB at most with `-j20`). We only have 4 tests using more than that (shown below).

This PR proposes to run `ORC_TEST`, `COPYING_TEST`, `BINARYOP_TEST`, and `PARQUET_TEST` in isolation (100% of the GPU), while allowing all other tests to run 14-way parallel (all other tests require 7% of the GPU). This should still allow tests to pass on a 16 GB GPU, provided the 14 parallel tests each consume less than 1 GB of GPU memory. Going forward, any new tests that consume more than 1 GB of memory should be set to run with 100% of the GPU, to keep bookkeeping simpler.

Memory usage by test
Checklist