Train-Test Split with cuPyNumeric and HDF5 on GPU #6
Replies: 10 comments
-
I believe fundamentally the way to do a train/test split with minimal temporary allocations would involve doing a "local" shuffle on the portion of the data allocated to each CPU/GPU, then synthesizing the training set out of, say, the first 75% of each CPU/GPU's local allocation. There are problems I see with this approach, though.
If you're OK with temporarily ~2.5x'ing your memory usage, you could materialize both splits directly (I'll ask @ipdemes to confirm the details). Alternatively, you could randomly generate a boolean array and write your ML model using masked operations, i.e. always carry around both the "mixed" data array and the train/test bitmask, and apply the mask inside every operation. I know @RAMitchell has thought about this in the past, so maybe he has a better suggestion.
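A minimal sketch of that masked-operations idea, assuming cuPyNumeric's NumPy-compatible API (the shapes and the mean computation are just for illustration):

```python
import cupynumeric as np

# carry the "mixed" data plus a train/test bitmask instead of
# materializing two separate splits
X = np.random.rand(1_000_000, 32)
train_mask = np.random.rand(X.shape[0]) < 0.75

# example: per-feature mean over the training rows only, computed by
# weighting each row with its mask value rather than slicing out X_train
w = train_mask.astype(X.dtype)
train_mean = (X * w[:, None]).sum(axis=0) / w.sum()
```

The upside is that no split is ever materialized; the downside is that every operation in the model has to be written in this mask-aware style.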
-
@manopapad I would also go in this direction - do the sampling locally to avoid globally shuffling the data. I guess you could reduce the peak memory usage by allocating the test/train arrays and then reading parts of the HDF5 file at a time. Seems like a really good candidate for streaming, if we can stream the HDF5 read. Keep in mind that if you use something like np.random.binomial to select the test/train split, the sizes of the test/train arrays will not be known ahead of time - they will vary slightly with each random seed.
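A rough sketch of that idea, assuming plain h5py for the chunked reads (not the legate.io.hdf5 reader used elsewhere in this thread), with the mask generated up front so the train/test arrays can be preallocated at their exact sizes:

```python
import h5py
import numpy as np

def streamed_split(path, dataset_name, train_frac=0.5, chunk_rows=100_000, seed=0):
    # hypothetical helper, not part of any library
    rng = np.random.default_rng(seed)
    with h5py.File(path, "r") as f:
        dset = f[dataset_name]
        n, ncols = dset.shape
        mask = rng.random(n) < train_frac  # one byte per row
        n_train = int(mask.sum())
        train = np.empty((n_train, ncols), dtype=dset.dtype)
        test = np.empty((n - n_train, ncols), dtype=dset.dtype)
        ti = si = 0
        for start in range(0, n, chunk_rows):
            stop = min(start + chunk_rows, n)
            chunk = dset[start:stop]  # only one chunk resident at a time
            m = mask[start:stop]
            k = int(m.sum())
            train[ti:ti + k] = chunk[m]
            test[si:si + len(m) - k] = chunk[~m]
            ti += k
            si += len(m) - k
    return train, test
```

Generating the whole mask first costs only one byte per row and sidesteps the unknown-split-sizes issue mentioned above; the data itself is still read exactly once.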
-
Thank you for the recommendations @manopapad @RAMitchell! We've tried some of them, but unfortunately, we ran into issues again. Let me first explain the situation: we have an X matrix that's approximately 38 GB in size. Our goal is to split it into training and test sets (50%–50%). However, when we created the splits with boolean masks, memory usage grew far beyond what we expected. These are some questions that we want to clarify:

- Why does generating just X_train double the memory usage? Shouldn't it ideally require only ~1.5× the original size?
- Similarly, why does creating both splits push the usage to about 2.5–3×? Couldn't this be done with only ~2× the memory?

We observed similar behavior when trying other approaches as well. Here's a snippet of the code for context:
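Presumably (the snippet was lost here; reconstructed from the full script shared further down):

```python
import cupynumeric as np

# X is the ~38 GB matrix loaded from HDF5
train_mask = np.random.rand(X.shape[0]) < 0.5  # ~50/50 split
X_train = X[train_mask]   # boolean (advanced) indexing
X_test = X[~train_mask]
```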
We expected that generating X_train would require around 19 GB of additional memory, but it actually reserved another 38 GB, hitting the GPU memory limit. Is this expected behavior for these functions?
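For reference, the arithmetic behind those numbers (the "full-size result" reading in the last line is an interpretation of the symptom, not confirmed behavior):

```python
X = 38.0  # GB, size of the full matrix

# expected peaks for a ~50/50 split
expected_train_only = X + 0.5 * X        # ~57 GB, i.e. ~1.5x
expected_both       = X + 2 * (0.5 * X)  # ~76 GB, i.e. ~2x

# observed: generating X_train alone reserved another full 38 GB,
# as if the result were allocated at its upper-bound (full) size
observed_train_only = X + X              # ~76 GB, i.e. ~2x
```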
-
At the point where memory allocation fails, you should be getting an error message saying something like "Failed to allocate N bytes", followed by a listing of the allocations that are taking up space in that memory and causing the allocation to fail. Could you please share that here, so we can maybe see why this is happening?
-
Thank you for your response. The full script we have is this:

from legate.io.hdf5 import from_file
import cupynumeric as np
a = np.asarray(from_file("../X_big.h5", dataset_name="data"))
train_mask = np.random.rand(a.shape[0]) < 0.5
test_mask = ~train_mask
a_train = a[train_mask]
print("done train")
a_test = a[test_mask]
print("done test")
np.add(a_train,2,out=a_train)
print(a_train[:2,:10])
print("finished train addition")
np.add(a_test,2,out=a_test)
print(a_test[:2,:10])
print("finished test addition") And the error we get is this:
If we remove the last part about the addition (even though the error seems to be coming from the masking operation), it does not give any errors, but it does not stop running either; it just stays silent. One thing that we noticed (it might help you figure out how to solve this) is that if we print the matrix after each masking operation, like this:

a_train = a[train_mask]
print("done train")
print(a_train[:2,:10])
a_test = a[test_mask]
print("done test")
print(a_test[:2,:10])

and leave everything else the same, then everything works correctly. It seems like the masking operation is non-blocking and temporarily doubles the memory, so when the next masking operation starts before the previous one has finished, the memory is not enough. Do you know of a better solution than just printing the matrix? Or are you planning to fix this soon, so that there will be no need to add anything in between the masking operations? Thank you
-
Hi again, and thank you so much for your help!

It seems like this advanced indexing operation did not finish fully and thus left some of the rows as zeros.
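A quick way to count that symptom (a hypothetical check, not from the thread, assuming the affected rows really are exactly zero):

```python
# count rows of a_train that are entirely zero; a nonzero count here
# reproduces the symptom described above
num_zero_rows = int((a_train == 0).all(axis=1).sum())
print(f"{num_zero_rows} all-zero rows in a_train")
```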
-
I have opened nv-legate/cupynumeric#1214 and nv-legate/cupynumeric#1215 to follow up on the error conditions you've identified. Note that @ipdemes is currently traveling, and may not be able to pick this up until next week. For now, one thing to try, which should at least resolve the memory usage issues, is to do something like this:
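Presumably (the snippet was lost here; reconstructed from the follow-up code later in the thread):

```python
from legate.core import get_legate_runtime

a_train = a[train_mask]
# block until all previously issued operations, including the
# advanced indexing above, have actually completed
get_legate_runtime().issue_execution_fence(block=True)
a_test = a[test_mask]
get_legate_runtime().issue_execution_fence(block=True)
```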
This will wait until the first indexing operation has fully completed before the next one is issued, so the two operations' temporary allocations are not live at the same time.
-
Thank you for opening the issues. I just tried the suggested execution fence, and yes, we do still run into the error.
-
OK, could you perhaps try with the nightly cupynumeric builds, in case those provide a better error message?
-
I installed the nightly build; here is the line from conda list:
Here is the code I used, just to be sure I am doing everything correctly:

from legate.io.hdf5 import from_file
import cupynumeric as np
from legate.core import get_legate_runtime
a = np.asarray(from_file("../X_big.h5", dataset_name="data"))
train_mask = np.random.rand(a.shape[0]) < 0.5
test_mask = ~train_mask
a_train = a[train_mask]
get_legate_runtime().issue_execution_fence(block=True)
print("done train", flush=True) # This line gets executed but then I get the error
a_test = a[test_mask]
get_legate_runtime().issue_execution_fence(block=True)
print("done test", flush=True)
np.add(a_train,2,out=a_train)
print("finished train addition", flush=True)
np.add(a_test,2,out=a_test)
print("finished test addition", flush=True) And here is the entire output of that example code:
-
Hi!
We were trying to implement a random train-test split operation, which creates boolean masks for the train and test splits. However, on the GPU, using boolean masks becomes memory-inefficient, because it creates copies of the data. Is there an efficient way to do this split?
For example, while reading an HDF5 file using legate:
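Presumably the same pattern as in the scripts above:

```python
from legate.io.hdf5 import from_file
import cupynumeric as np

# read the HDF5 dataset through Legate's HDF5 reader,
# then wrap it as a cuPyNumeric array
X = np.asarray(from_file("X_big.h5", dataset_name="data"))
```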
Is it possible to split the data into train and test sets before converting it to NumPy or cuPyNumeric arrays (before doing something like numpy.array() or cupynumeric.array())?
Thanks!