Train-Test Split with cuPyNumeric and HDF5 on GPU #6

ravankhidirov · 2025-06-23T18:26:43Z

ravankhidirov
Jun 23, 2025

Hi!

We were trying to implement random train-test split operation, which creates boolean masks for train and test splits. However, on the GPU, using boolean masks becomes memory-inefficient, because it creates copies of the data. Is there any efficient way to do this split?

For example, while reading a hdf5 file using legate:

from legate.io.hdf5 import from_file
from_file("random_matrix.hdf5", dataset_name="random_matrix")

Is it possible to split the data into train and test sets before converting it to NumPy or cuPyNumeric arrays (before doing something like numpy.array() or cupynumeric.array())?

Thanks!

manopapad · 2025-06-24T23:04:01Z

manopapad
Jun 24, 2025
Maintainer

I believe fundamentally the way to do a train/test split with minimal temporary allocations would involve doing a "local" shuffle on the portion of the data allocated to each CPU/GPU, then synthesizing the training set out of the first say 75% of each CPU/GPU's local allocation. The problems I see with this approach are:

To do the local shuffle efficiently in parallel, it may be necessary to do some temporary allocations.
The partitioning of the array is not exposed to the cuPyNumeric API, so there's no way to say "give me the array that is synthesized by taking the first 75% of each CPU's/GPU's local allocation". At that point you're looking at writing a custom task implementation. This is not an issue if you're only worried about single-CPU/GPU execution.

If you're OK with temporarily ~2.5x'ing your memory usage, you could:

load your entire dataset into array A
generate a random boolean array B
np.extract(B, A) to get the training set array
np.extract(~B, A) to get the test set array
throw away A

@ipdemes to confirm that cupynumeric.extract will not do scatter gather copies in this case.

Alternatively you could randomly generate a boolean array and write your ML model using masked operations. I.e. always carry around both the "mixed" data array, and the train/test bitmask, and use operations such as mean(mixed_data, where=train_or_test).

I know @RAMitchell has thought about this in the past, so maybe he has a better suggestion.

0 replies

RAMitchell · 2025-06-25T07:54:59Z

RAMitchell
Jun 25, 2025
Collaborator

@manopapad I would also go in this direction - do the sampling locally to avoid globally shuffling the data. I guess you could reduce the peak memory usage by allocating the test/train arrays then reading parts of the HDF5 file at a time. Seems like a really good candidate for streaming if we can stream the HDF5 read. Keep in mind that if you use something like np.random.binomial select the test train split, the size of the test/train arrays will not be known ahead of time - they will vary slightly with each random seed.

0 replies

ravankhidirov · 2025-06-27T20:05:32Z

ravankhidirov
Jun 27, 2025
Author

Thank you for the recommendations @manopapad @RAMitchell !

We’ve tried some of them, but unfortunately, we ran into issues again. Let me first explain the situation:

We have an X matrix that’s approximately 38 GB in size. Our goal is to split it into training and test sets (50%–50%). However, when we used np.extract to split the data, it ended up consuming three times the original memory (effectively copying the original X twice to create X_train and X_test), which exceeded the available GPU memory (97 GB on Vista).

These are some questions that we want to clarify: Why does generating just X_train double the memory usage? Shouldn’t it ideally require only ~1.5× the original size? Similarly, why does creating both splits push the usage to about 2.5–3×? Couldn’t this be done with only ~2× the memory? We observed similar behavior when trying np.compress as well.

Here’s a snippet of the code for context:
Using direct masking doubles the memory, so we tried compress instead

self.X_train = cupynumeric.zeros((self.train_mask.sum(), self.X.shape[1]))
cupynumeric.compress(self.train_mask, self.X, axis=0, out=self.X_train)

We expected that generating X_train would require around 19 GB of additional memory, but it actually reserved another 38 GB, hitting the GPU memory limit.

Is this expected behavior for these functions?

0 replies

manopapad · 2025-06-27T20:42:07Z

manopapad
Jun 27, 2025
Maintainer

At the point where memory allocation fails, you should be getting an error message saying something like "Failed to allocate N bytes" and then a listing of the allocations that are taking up space in the memory, thus causing the allocation to fail. Could you please share that here, and we can maybe see why this is happening?

0 replies

IlgarBaghishov · 2025-07-03T17:40:13Z

IlgarBaghishov
Jul 3, 2025

Thank you for your response. The full script we have is this:

from legate.io.hdf5 import from_file
import cupynumeric as np

a = np.asarray(from_file("../X_big.h5", dataset_name="data"))
train_mask = np.random.rand(a.shape[0]) < 0.5
test_mask = ~train_mask

a_train = a[train_mask]
print("done train")
a_test = a[test_mask]
print("done test")

np.add(a_train,2,out=a_train)
print(a_train[:2,:10])
print("finished train addition")
np.add(a_test,2,out=a_test)
print(a_test[:2,:10])
print("finished test addition")

And the error we get is this:

[0 - 4050dff8f840]    8.491726 {5}{runtime}: LEGION ERROR: Failed to allocate DeferredBuffer/Value/Reduction of 19164422640 bytes for leaf task cupynumeric::AdvancedIndexingTask (UID 9) in GPU_FB_MEM memory because there was insufficient space reserved for dynamic allocations. This was an unbounded memory pool which means you're actually out of space in this memory because it only has 1739864522 remaining free bytes. We strongly recommend all users put bounds on their dynamic memory usage so they can detect if space will be available for task execution and if not select an alternative mapping. (from file /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/legion_context.cc:25986)
Signal 6 received by node 0, process 2150691 (thread 4050dff8f840) - obtaining backtrace
Signal 6 received by process 2150691 (thread 4050dff8f840) at:
stack trace: 22 frames
  [0] = linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x40003bf307f0]
  [1] = /lib64/libc.so.6(+0x84650) [0x40003c114650]
  [2] = /lib64/libc.so.6(raise+0x1c) [0x40003c0cf86c]
  [3] = /lib64/libc.so.6(abort+0xe4) [0x40003c0b7030]
  [4] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1(+0xc12ecc) [0x40005c4c2ecc]
  [5] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1(Legion::Internal::LeafContext::create_task_local_instance(Realm::Memory, Realm::InstanceLayoutGeneric*)+0x5ec) [0x40005c1b95ac]
  [6] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(+0x4eee38) [0x4051d04eee38]
  [7] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(Legion::DeferredBuffer<double, 2, long long, false> Legion::OutputRegion::create_buffer<double, 2, long long, false>(Realm::Point<2, long long> const&, unsigned int, double const*, bool)+0xe8) [0x4051d04ef198]
  [8] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(+0x3915c90) [0x4051d3915c90]
  [9] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(+0x3919af8) [0x4051d3919af8]
  [10] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(+0x3944b4c) [0x4051d3944b4c]
  [11] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(cupynumeric::AdvancedIndexingTask::gpu_variant(legate::TaskContext)+0x14) [0x4051d3946904]
  [12] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.01(legate::detail::legion_task_body(void (*)(legate::TaskContext), legate::VariantCode, std::optional<std::basic_string_view<char, std::char_traits<char> > >, void const*, unsigned long, Realm::Processor)+0x1f4) [0x4000449eae24]
  [13] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.01(legate::detail::task_wrapper(void (*)(legate::TaskContext), legate::VariantCode, std::optional<std::basic_string_view<char, std::char_traits<char> > >, void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x2c) [0x4000449da77c]
  [14] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/libcupynumeric.so(void legate::LegateTask<cupynumeric::AdvancedIndexingTask>::task_wrapper_<&cupynumeric::AdvancedIndexingTask::gpu_variant, (legate::VariantCode)2>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x84) [0x4051d1271574]
  [15] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x794918) [0x40005e3f4918]
  [16] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x794a2c) [0x40005e3f4a2c]
  [17] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x7930c8) [0x40005e3f30c8]
  [18] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x796c88) [0x40005e3f6c88]
  [19] = /home1/08405/ilgar/miniforge3/envs/subdatapy/lib/python3.1/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x79958c) [0x40005e3f958c]
  [20] = /lib64/libc.so.6(+0x82a38) [0x40003c112a38]
  [21] = /lib64/libc.so.6(+0x2bb9c) [0x40003c0bbb9c]

If we remove the last part about addition (even though the error seems to be coming from masking operation) it does not give any errors however it does not stop running either. It just stays silent.

One thing that we noticed (it might help you figure out how to solve this) is if we put printing of the matrix after each masking operation like this:

a_train = a[train_mask]
print("done train")
print(a_train[:2,:10])
a_test = a[test_mask]
print("done test")
print(a_test[:2,:10])

and leave everything else the same, then everything works correctly.

Seems like masking operation is non-blocking and is doubling the memory and when the next masking operation starts and if previous masking did not finish then the memory is not enough.

Do you know what could be a better solution then just printing the matrix? Or are you planning to fix this soon and there will be no need to add anything in between the masking operations.

Thank you

0 replies

IlgarBaghishov · 2025-07-07T03:00:47Z

IlgarBaghishov
Jul 7, 2025

Hi again and thank you so much for your help!
Another issue we have now is with values of the masked a_train matrix. If I print the last rows of the a_train matrix instead of first rows after masking operation I get zeros like this:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Seems like this advanced indexing operation did not finish fully and thus left some of the rows as zeros.
We also tried cupynumeric.compress but it just crashes silently.
Please let us know if there is anything we can provide that will help you determine if the issue is on our side.
Thank you.

0 replies

manopapad · 2025-07-08T00:32:09Z

manopapad
Jul 8, 2025
Maintainer

I have opened nv-legate/cupynumeric#1214 and nv-legate/cupynumeric#1215 to follow up on the error conditions you've identified. Note that @ipdemes is currently traveling, and may not be able to pick this up until next week.

For now one thing to try, that should at least resolve the memory usage issues, is to do something like this:

from legate.core import get_legate_runtime

[...]

a_train = a[train_mask]
print("done train")
get_legate_runtime().issue_execution_fence(block=True)
a_test = a[test_mask]
print("done test")

[...]

This will wait until a_train is fully computed before starting to even schedule work for a_test. Unfortunately at the moment, as you correctly identified, there is no "backpressure" mechanism when scheduling work, that will do this automatically (that is, throttle work submission until more memory becomes available). But it's on the roadmap.

You can also del a after this line, since you're not using a anymore, and that should clear its memory.

0 replies

IlgarBaghishov · 2025-07-08T04:33:05Z

IlgarBaghishov
Jul 8, 2025

Thank you for opening the issues. I just tried get_legate_runtime().issue_execution_fence(block=True) but I still got the memory error.

Yes, we do del a after train and test masking operations in the full code. However, the problem with last rows being 0 is the main issue preventing us from using a_train and a_test later on (for least squares with QR and errors calculation).

0 replies

manopapad · 2025-07-08T13:47:28Z

manopapad
Jul 8, 2025
Maintainer

OK, could you perhaps try with the nightly cupynumeric builds, in case those provide a better error message?

conda create -n myenv -c conda-forge -c legate/label/experimental cupynumeric

0 replies

IlgarBaghishov · 2025-07-08T15:43:51Z

IlgarBaghishov
Jul 8, 2025

I installed the nightly build, here is the line from conda list:

cupynumeric               25.07.00.dev28  cuda12_py313_gpu_gdc3a13e9_28    legate/label/experimental

Here is the code I used, just to be sure if I am doing everything correctly:

from legate.io.hdf5 import from_file
import cupynumeric as np
from legate.core import get_legate_runtime

a = np.asarray(from_file("../X_big.h5", dataset_name="data"))
train_mask = np.random.rand(a.shape[0]) < 0.5
test_mask = ~train_mask

a_train = a[train_mask]
get_legate_runtime().issue_execution_fence(block=True)
print("done train", flush=True)                                   # This line gets executed but then I get the error
a_test = a[test_mask]
get_legate_runtime().issue_execution_fence(block=True)
print("done test", flush=True)

np.add(a_train,2,out=a_train)
print("finished train addition", flush=True)
np.add(a_test,2,out=a_test)
print("finished test addition", flush=True)

And here is the entire output of that example code:

done train
[0 - 402e3f08f840]   51.623784 {5}{runtime}: LEGION ERROR: Failed to allocate DeferredBuffer/Value/Reduction of 19158527520 bytes for leaf task cupynumeric::AdvancedIndexingTask (UID 10) in GPU_FB_MEM memory because there was insufficient space reserved for dynamic allocations. This was an unbounded memory pool which means you're actually out of space in this memory because it only has 1733969402 remaining free bytes. We strongly recommend all users put bounds on their dynamic memory usage so they can detect if space will be available for task execution and if not select an alternative mapping. (from file /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/legion_context.cc:26011)
Signal 6 received by node 0, process 326242 (thread 402e3f08f840) - obtaining backtrace
Signal 6 received by process 326242 (thread 402e3f08f840) at:
stack trace: 22 frames
  [0] = linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x400028c407f0]
  [1] = /lib64/libc.so.6(+0x84650) [0x400028e84650]
  [2] = /lib64/libc.so.6(raise+0x1c) [0x400028e3f86c]
  [3] = /lib64/libc.so.6(abort+0xe4) [0x400028e27030]
  [4] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1(+0xc179cc) [0x40004ad579cc]
  [5] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1(Legion::Internal::LeafContext::create_task_local_instance(Realm::Memory, Realm::InstanceLayoutGeneric*)+0x5ec) [0x40004aa4ba1c]
  [6] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(+0x501ba8) [0x402ea4501ba8]
  [7] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(Legion::DeferredBuffer<double, 2, long long, false> Legion::OutputRegion::create_buffer<double, 2, long long, false>(Realm::Point<2, long long> const&, unsigned int, double const*, bool)+0xe0) [0x402ea4501f00]
  [8] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(+0x395fb10) [0x402ea795fb10]
  [9] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(+0x3968860) [0x402ea7968860]
  [10] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(+0x3994414) [0x402ea7994414]
  [11] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(cupynumeric::AdvancedIndexingTask::gpu_variant(legate::TaskContext)+0x14) [0x402ea7995d28]
  [12] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.07.00(legate::detail::legion_task_body(void (*)(legate::TaskContext), legate::VariantCode, std::optional<std::basic_string_view<char, std::char_traits<char> > >, void const*, unsigned long, Realm::Processor)+0x1dc) [0x400030e4e6bc]
  [13] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.07.00(legate::detail::task_wrapper(void (*)(legate::TaskContext), legate::VariantCode, std::optional<std::basic_string_view<char, std::char_traits<char> > >, void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x2c) [0x400030e3d7fc]
  [14] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/libcupynumeric.so(void legate::LegateTask<cupynumeric::AdvancedIndexingTask>::task_wrapper_<&cupynumeric::AdvancedIndexingTask::gpu_variant, (legate::VariantCode)2>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x84) [0x402ea52891c4]
  [15] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x7a0b54) [0x40004d360b54]
  [16] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x7a0cbc) [0x40004d360cbc]
  [17] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x79f228) [0x40004d35f228]
  [18] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x7a2f18) [0x40004d362f18]
  [19] = /work/08405/ilgar/vista/conda_libraries/cupynumeric/lib/python3.13/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1(+0x7a4e68) [0x40004d364e68]
  [20] = /lib64/libc.so.6(+0x82a38) [0x400028e82a38]
  [21] = /lib64/libc.so.6(+0x2bb9c) [0x400028e2bb9c]

0 replies

Train-Test Split with cuPyNumeric and HDF5 on GPU #6

Uh oh!

ravankhidirov Jun 23, 2025

Replies: 10 comments

Uh oh!

manopapad Jun 24, 2025 Maintainer

Uh oh!

Uh oh!

RAMitchell Jun 25, 2025 Collaborator

Uh oh!

ravankhidirov Jun 27, 2025 Author

Uh oh!

manopapad Jun 27, 2025 Maintainer

Uh oh!

Uh oh!

IlgarBaghishov Jul 3, 2025

Uh oh!

Uh oh!

IlgarBaghishov Jul 7, 2025

Uh oh!

manopapad Jul 8, 2025 Maintainer

Uh oh!

IlgarBaghishov Jul 8, 2025

Uh oh!

manopapad Jul 8, 2025 Maintainer

Uh oh!

Uh oh!

IlgarBaghishov Jul 8, 2025

ravankhidirov
Jun 23, 2025

manopapad
Jun 24, 2025
Maintainer

RAMitchell
Jun 25, 2025
Collaborator

ravankhidirov
Jun 27, 2025
Author

manopapad
Jun 27, 2025
Maintainer

IlgarBaghishov
Jul 3, 2025

IlgarBaghishov
Jul 7, 2025

manopapad
Jul 8, 2025
Maintainer

IlgarBaghishov
Jul 8, 2025

manopapad
Jul 8, 2025
Maintainer

IlgarBaghishov
Jul 8, 2025