Random Crashes (No Error) with cupynumeric.load() for Large .npy File (~40GB) #4
-
I'm encountering intermittent crashes, without any error messages, when loading a ~40GB .npy file using cupynumeric.load(). The same file loads consistently with numpy.load(). Converting the NumPy-loaded array via cupynumeric.array() also crashes randomly in the same way. GPU memory should not be the issue: I am using Grace Hopper nodes on Vista with 96 GB of GPU memory each, and I can easily create matrices twice that size. The problem arises specifically when I load the .npy file or when I convert a NumPy array into a cupynumeric array (a minimal sketch is included below).

In addition, my longer-term objective is multi-node/multi-GPU distributed computation (QR/SVD) on datasets larger than a single node's RAM, so pre-loading with NumPy and then converting to cupynumeric in batches is not an ideal solution.

Questions:
Thank you so much for your help!
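For reference, a minimal sketch of the two code paths that crash (the file path is a placeholder):

```python
import numpy as np
import cupynumeric as cnp

path = "/path/to/matrix.npy"  # placeholder for the ~40GB .npy file

# Path 1: direct load through cupynumeric -- crashes intermittently with no error message.
a = cnp.load(path)

# Path 2: load with NumPy (works consistently), then convert -- also crashes intermittently.
a_np = np.load(path)
a = cnp.array(a_np)
```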
Replies: 6 comments
-
I don't have any immediate hunches; could you please share a reproducer? Secondarily, you could try using a debug build of Legate and running with legate --gdb myprog.py, which will start your execution inside gdb (assuming you have it installed), which in turn will hopefully catch the crash and allow you to print a backtrace at the point of failure. We don't upload debug builds very often (due to the large file sizes), but here are instructions to install the latest one I could find:
Generally no: the .npy format doesn't natively encode partition information.
I would suggest HDF5, for which we support distributed reads today. This example shows how to use the lower-level Legate interface: https://github.com/nv-legate/legate/blob/main/share/legate/examples/io/hdf5/ex1.py#L117. You can then wrap the result of from_file with cupynumeric.asarray.
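Roughly, the flow looks like this (the file path and dataset name are placeholders, and the exact keyword argument for the dataset name is an assumption; see the linked example for the authoritative usage):

```python
import cupynumeric as cnp
from legate.io.hdf5 import from_file

# Distributed read of one dataset from the HDF5 file
# ("/path/to/matrix.h5" and "data" are placeholder names).
store = from_file("/path/to/matrix.h5", dataset_name="data")

# Wrap the resulting Legate store as a cupynumeric array and use it as usual.
arr = cnp.asarray(store)
print(arr.shape, arr.dtype)
```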
-
Thank you so much for the reply! I will try the HDF5 format; if it works I will just start using HDF5, and if not I will try to get a reproducer. Meanwhile, I have another related question: I just noticed that cupynumeric.linalg.qr only supports a single GPU or a single CPU. Are there any plans to add multi-node/multi-GPU capability in the near future? That would be really useful!
-
Reading the HDF5 file with legate.io.hdf5.from_file and then wrapping it with cupynumeric.asarray worked. Thank you for that! I even tried it with data larger than the memory of a single node.
One note: I was only able to install the "nompi" version of h5py in the same environment as legate/cupynumeric.
-
Multi-GPU multi-node (MGMN) QR is on the roadmap, but we first need to resolve some packaging issues with cuSolverMp, as we intend to reuse https://docs.nvidia.com/cuda/cusolvermp/usage/functions.html#cusolvermpgeqrf for this. If you don't mind sharing, we'd love to hear more about your use case.
"nompi" version of h5py should be sufficient for the needs of cupynumeric (we only use h5py for reading the file metadata, then we manage parallel execution ourslves, and go directly to the individual file reading APIs of hdf5, not relying on their MPI-IO support). Is the "nompi" version of h5py causing issues for you? |
-
Thanks for the update! It's really great to hear that MGMN QR is planned, and I understand it depends on the cuSolverMp packaging. Do you maybe have a rough idea of when this MGMN QR support could be released?

About the h5py question: you're right, the 'nompi' version works fine for cupynumeric itself. My reason for wanting the MPI version is actually the step before using Legate. I have a use case where I generate a very large matrix, too big for one node's memory. I calculate parts of it on different nodes using another tool, which gives me NumPy arrays on each node. Because this matrix comes from outside Legate, my idea was to use the MPI version of h5py to save all these NumPy parts together into one big HDF5 file using MPI-IO, and then load that HDF5 file with Legate to do the distributed QR.

The small issue now is that I can't install MPI h5py and legate in the same environment, so I would have to save the file in one environment and then switch to another environment just to read it with legate and run the QR. It's not a big problem, just an extra step in the process. Let me know if you know of a workaround! Thanks again for explaining!
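For concreteness, here is a minimal sketch of the writing step I have in mind (shapes, file name, and dataset name are placeholders; it assumes an MPI-enabled h5py build and mpi4py):

```python
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rows_per_rank, n_cols = 1000, 500  # placeholder block sizes
# Stand-in for the block actually computed on this node by the external tool.
local_block = np.random.rand(rows_per_rank, n_cols)

# Open one shared file with the MPI-IO driver; all ranks participate.
with h5py.File("matrix.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", shape=(rows_per_rank * size, n_cols), dtype="f8")
    # Each rank writes its own row slab into the shared dataset.
    dset[rank * rows_per_rank:(rank + 1) * rows_per_rank, :] = local_block
```

The resulting single HDF5 file would then be read back with legate.io.hdf5.from_file in the other environment for the distributed QR.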
-
I was able to reproduce your package conflict. It appears to happen because our packages depend on a recent version of hdf5, whose OpenMPI-compatible build depends on openmpi>5, which our packages are not compatible with. I have asked our package maintainers whether it's possible to loosen our version restriction on the hdf5 and/or openmpi packages.
I don't have a firm estimate at this point, probably at least 1 month away.