Is the error "Error string: /lib64/libcuda.so.1: undefined symbol: cuIpcOpenMemHandle_v2 CUDA-aware support is disabled." due to the nv_peer_mem or nvidia-peermem module being unavailable in the NVIDIA driver? #13326
An error occurred while trying to map in the address of a function.
Function Name: cuIpcOpenMemHandle_v2
Error string: /lib64/libcuda.so.1: undefined symbol: cuIpcOpenMemHandle_v2
CUDA-aware support is disabled.
Build the cuda-11.8 toolkit using gcc-8.2.0, then export its lib64 and bin directories.
Build ucx-1.19.x as CUDA-aware against the built cuda-11.8, then export its lib and bin directories (gcc-8.2.0 compiler used).
Build openmpi-4.1.8 as CUDA-aware against cuda-11.8 and link it with the CUDA-aware ucx-1.19.x (gcc-8.2.0 compiler used).
Build the OSU benchmarks with the CUDA-aware openmpi-4.1.8 (linked with the CUDA-aware ucx-1.19.x) and with cuda-11.8 (gcc-8.2.0 compiler used).
The OSU program picked for benchmarking was osu_bw.
After executing it, I get the error shown above.
One thing I noticed in the CUDA-aware ucx-1.19.x build is that the gdr_copy transport is missing,
though cuda_copy and cuda_ipc are listed when checking for CUDA support with "ucx_info -d | grep -i cuda".
I heard that the gdr_copy transport should also be present if UCX is CUDA-aware,
and that this transport depends on a kernel module called nv_peer_mem or nvidia-peermem.
Later I found out that my driver installation is missing the nv_peer_mem / nvidia-peermem module.
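For reference, this is roughly how the missing module can be confirmed. It is only a sketch: it assumes a Linux node where /proc/modules is readable, and that nvidia-peermem shows up there as nvidia_peermem (the kernel lists module names with underscores). The function names are my own, not part of any tool.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    # The first whitespace-separated field of every line is the module name.
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def peer_memory_modules(proc_modules_text, wanted=("nv_peer_mem", "nvidia_peermem")):
    """Map each wanted module name to True if it is currently loaded."""
    present = loaded_modules(proc_modules_text)
    return {name: name in present for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/modules")
    # Fall back to an empty listing when the file does not exist (non-Linux hosts).
    text = path.read_text() if path.exists() else ""
    print(peer_memory_modules(text))
```

On my node both entries come back as False, which matches what I described above.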
Could the missing nv_peer_mem / nvidia-peermem module also be the reason for the above error, i.e.
An error occurred while trying to map in the address of a function.
Function Name: cuIpcOpenMemHandle_v2
Error string: /lib64/libcuda.so.1: undefined symbol: cuIpcOpenMemHandle_v2
CUDA-aware support is disabled.
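Since the message is about a symbol lookup in /lib64/libcuda.so.1, one check that may help narrow this down is whether that library exports cuIpcOpenMemHandle_v2 at all. Below is a minimal sketch using Python's ctypes. The helper name has_symbol is my own, and the library path is simply the one from the error message; adjust it if the loader picks up a different libcuda.so.1.

```python
import ctypes


def has_symbol(library, symbol):
    """Return True/False if `library` exports `symbol`, or None if it cannot be loaded."""
    try:
        handle = ctypes.CDLL(library)  # dlopen() the shared object
    except OSError:
        return None  # library missing or not loadable on this host
    try:
        getattr(handle, symbol)  # ctypes resolves the name via dlsym()
    except AttributeError:
        return False  # library loaded, but the symbol is not exported
    return True


if __name__ == "__main__":
    # Path taken from the error message above.
    lib = "/lib64/libcuda.so.1"
    for name in ("cuIpcOpenMemHandle", "cuIpcOpenMemHandle_v2"):
        print(name, has_symbol(lib, name))
```

If the unversioned name resolves but the _v2 name does not, that would suggest the installed driver library predates the symbol the cuda-11.8 headers expect, which is a separate matter from whether a peer-memory kernel module is loaded. I have not confirmed that this is the case on my system.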