
Zombie/defunct processes caused by xpmem? #45

@angainor

Description


I'm using xpmem in our home-brew application (OpenMPI + our own xpmem-based in-node communication) on an AMD EPYC cluster running RHEL 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes after the application finishes, multiple compute nodes are left with many zombie/defunct processes that never die. Looking at the stack of some of those processes, I see this:

[<ffffffffb6acbf5e>] __synchronize_srcu+0xfe/0x150
[<ffffffffb6acbfcd>] synchronize_srcu+0x1d/0x20
[<ffffffffb6c1c10d>] mmu_notifier_unregister+0xad/0xe0
[<ffffffffc0b5e614>] xpmem_mmu_notifier_unlink+0x54/0x97 [xpmem]
[<ffffffffc0b5a13d>] xpmem_flush+0x13d/0x1c0 [xpmem]
[<ffffffffb6c47ce7>] filp_close+0x37/0x90
[<ffffffffb6c6b0b8>] put_files_struct+0x88/0xe0
[<ffffffffb6c6b1b9>] exit_files+0x49/0x50
[<ffffffffb6aa2022>] do_exit+0x2b2/0xa50
[<ffffffffb6aa283f>] do_group_exit+0x3f/0xa0
[<ffffffffb6aa28b4>] SyS_exit_group+0x14/0x20
[<ffffffffb718dede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

So they seem to be hanging in XPMEM-related cleanup at process exit. This is strange for a few reasons: I have checked, and in our code every xpmem_attach is matched by an xpmem_detach. It also seems strange that the kernel would be unable to terminate a process just because xpmem fails to complete its cleanup.
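For reference, this is the pairing I mean, as a minimal consumer-side sketch. It assumes the xpmem kernel module is loaded and that `segid` was exported by a peer rank via xpmem_make() and exchanged out of band; the permit mode value, sizes, and error handling are illustrative, not our actual code:

```c
/* Consumer-side sketch of the attach/detach pairing described above.
 * segid comes from a peer's xpmem_make(), exchanged out of band. */
#include <stddef.h>
#include <xpmem.h>

static void *map_peer(xpmem_segid_t segid, size_t size,
                      xpmem_apid_t *apid_out)
{
    /* Get access to the peer's exported segment. */
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR,
                                  XPMEM_PERMIT_MODE, (void *)0666);
    if (apid == -1)
        return NULL;

    /* Map the segment (from its start) into our address space. */
    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *ptr = xpmem_attach(addr, size, NULL);
    if (ptr == (void *)-1) {
        xpmem_release(apid);
        return NULL;
    }

    *apid_out = apid;
    return ptr;
}

static void unmap_peer(xpmem_apid_t apid, void *ptr)
{
    xpmem_detach(ptr);    /* matches the xpmem_attach() above   */
    xpmem_release(apid);  /* matches the xpmem_get() above      */
}
```

As far as I can tell we always go through the unmap path before exit, so xpmem_flush() should have little left to tear down.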

Does anyone have any ideas as to what might be the problem here?

Thanks a lot!
