I'm using xpmem in our home-brewed application (OpenMPI plus our own xpmem-based in-node communication), on an AMD EPYC cluster running RHEL 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes after the application finishes, multiple compute nodes are left with plenty of zombie/defunct processes that never die. Looking at the kernel stack of some of those processes, I see this:
[<ffffffffb6acbf5e>] __synchronize_srcu+0xfe/0x150
[<ffffffffb6acbfcd>] synchronize_srcu+0x1d/0x20
[<ffffffffb6c1c10d>] mmu_notifier_unregister+0xad/0xe0
[<ffffffffc0b5e614>] xpmem_mmu_notifier_unlink+0x54/0x97 [xpmem]
[<ffffffffc0b5a13d>] xpmem_flush+0x13d/0x1c0 [xpmem]
[<ffffffffb6c47ce7>] filp_close+0x37/0x90
[<ffffffffb6c6b0b8>] put_files_struct+0x88/0xe0
[<ffffffffb6c6b1b9>] exit_files+0x49/0x50
[<ffffffffb6aa2022>] do_exit+0x2b2/0xa50
[<ffffffffb6aa283f>] do_group_exit+0x3f/0xa0
[<ffffffffb6aa28b4>] SyS_exit_group+0x14/0x20
[<ffffffffb718dede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
So they seem to be hanging in some XPMEM-related process cleanup. This is strange for a few reasons: I checked, and in our code I match each xpmem_attach with an xpmem_detach (the pattern I follow is sketched below). It also seems strange that the kernel would be unable to terminate a process just because xpmem cannot complete its cleanup.
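For reference, this is roughly the consumer-side pattern we use (a simplified sketch, not our actual code; the segid exchange, sizes, and error handling are placeholders):

#include <xpmem.h>
#include <stddef.h>

/* Consumer side: map a peer's exported segment, use it, then tear it down.
 * Every xpmem_attach is paired with an xpmem_detach, and every xpmem_get
 * with an xpmem_release, before the process exits. */
void use_peer_segment(xpmem_segid_t segid, size_t size)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0666);
    if (apid == -1)
        return;

    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *ptr = xpmem_attach(addr, size, NULL);
    if (ptr == (void *)-1) {
        xpmem_release(apid);
        return;
    }

    /* ... read/write the peer's memory through ptr ... */

    xpmem_detach(ptr);    /* matched with xpmem_attach above */
    xpmem_release(apid);  /* matched with xpmem_get above */
}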
Does anyone have any ideas as to what might be the problem here?
Thanks a lot!