Core dump in uct_rdmacm_cm_handle_error_event() when handling RDMA_CM_EVENT_DEVICE_REMOVAL #9740
huzhijiang started this conversation in General
Replies: 0 comments
Hi,
I am testing hot-unplug, and the process always crashes while handling the RDMA_CM_EVENT_DEVICE_REMOVAL event.
It seems that event->id->context, which is expected to be a uct_rdmacm_cm_ep_t *cep, is no longer a valid cep pointer by the time the handler receives it.
After adding debug prints, event->id->context appears to point to a uct_rdmacm_listener_t (the id is the same one created in uct_rdmacm_listener_t_init()). Does that mean uct_rdmacm_cm_handle_error_event() should not simply cast every event->id->context to a cep? I do not know whether this is the actual problem or whether it is related to my core dump.
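To make the question concrete, here is a minimal standalone sketch of what a non-blind cast could look like. It is not UCX code and does not use librdmacm: every type and function name below (`ctx_header_t`, `demo_ep_t`, `demo_listener_t`, `demo_cm_id_t`, `demo_handle_device_removal`) is hypothetical, and `demo_cm_id_t` only stands in for the `context` field of an rdma_cm id. The idea, assuming both endpoints and listeners can end up as the context of an id that receives the event, is to put a small type tag at the start of each object and check it before casting.

```c
#include <stddef.h>

/* Hypothetical tag placed as the FIRST member of every object that may be
 * installed as the id's context. This is an illustration, not UCX's layout. */
typedef enum {
    CTX_KIND_EP,
    CTX_KIND_LISTENER
} ctx_kind_t;

typedef struct {
    ctx_kind_t kind;
} ctx_header_t;

/* Hypothetical stand-in for an endpoint object */
typedef struct {
    ctx_header_t hdr;
    int          error_count;
} demo_ep_t;

/* Hypothetical stand-in for a listener object */
typedef struct {
    ctx_header_t hdr;
    int          device_removed;
} demo_listener_t;

/* Hypothetical stand-in for an rdma_cm id: only the context field matters */
typedef struct {
    void *context;
} demo_cm_id_t;

/* Dispatch on the tag instead of casting the context blindly.
 * Returns 0 if an endpoint was handled, 1 if a listener was handled,
 * and -1 if the context is missing or has an unknown kind. */
int demo_handle_device_removal(demo_cm_id_t *id)
{
    ctx_header_t *hdr = id->context;

    if (hdr == NULL) {
        return -1;
    }

    switch (hdr->kind) {
    case CTX_KIND_EP: {
        /* Safe: hdr is the first member of demo_ep_t */
        demo_ep_t *ep = (demo_ep_t *)hdr;
        ep->error_count++;
        return 0;
    }
    case CTX_KIND_LISTENER: {
        /* Safe: hdr is the first member of demo_listener_t */
        demo_listener_t *listener = (demo_listener_t *)hdr;
        listener->device_removed = 1;
        return 1;
    }
    default:
        return -1;
    }
}
```

With this layout, an event delivered on a listener's id never reaches the endpoint branch, so endpoint-only fields are not read from a listener object. Whether something like this is appropriate for uct_rdmacm_cm_handle_error_event() is exactly what I am unsure about.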
The debug prints also confirm that no uct_rdmacm_cm_ep_t or uct_rdmacm_listener_t object is freed before the crash, so event->id->context should point to valid memory. Yet the core dump says it points to a memory area that has already been freed...
Any ideas?
BTW, I am using UCX version 1.12.1.