
Zombie/defunct processes caused by xpmem? #45

@angainor

Description


I'm using xpmem in our home-brew application (OpenMPI + our own xpmem-based in-node communication) on an AMD EPYC cluster running RHEL 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes after the application finishes, multiple compute nodes are left with many zombie/defunct processes that never die. Looking at the stack of some of those processes, I see this:

[<ffffffffb6acbf5e>] __synchronize_srcu+0xfe/0x150
[<ffffffffb6acbfcd>] synchronize_srcu+0x1d/0x20
[<ffffffffb6c1c10d>] mmu_notifier_unregister+0xad/0xe0
[<ffffffffc0b5e614>] xpmem_mmu_notifier_unlink+0x54/0x97 [xpmem]
[<ffffffffc0b5a13d>] xpmem_flush+0x13d/0x1c0 [xpmem]
[<ffffffffb6c47ce7>] filp_close+0x37/0x90
[<ffffffffb6c6b0b8>] put_files_struct+0x88/0xe0
[<ffffffffb6c6b1b9>] exit_files+0x49/0x50
[<ffffffffb6aa2022>] do_exit+0x2b2/0xa50
[<ffffffffb6aa283f>] do_group_exit+0x3f/0xa0
[<ffffffffb6aa28b4>] SyS_exit_group+0x14/0x20
[<ffffffffb718dede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

So they seem to be hanging in XPMEM-related cleanup at process exit. This is strange for a few reasons: I have checked, and in our code every xpmem_attach is matched by an xpmem_detach. It also seems strange that the kernel would be unable to terminate a process just because xpmem fails to complete its cleanup.
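For reference, this is the pairing I mean, as a minimal consumer-side sketch. It assumes the xpmem kernel module is loaded and that `segid` was exported by a peer rank via xpmem_make() and exchanged out of band; the permit mode value, sizes, and error handling are illustrative, not our actual code:

```c
/* Consumer-side sketch of the attach/detach pairing described above.
 * segid comes from a peer's xpmem_make(), exchanged out of band. */
#include <stddef.h>
#include <xpmem.h>

static void *map_peer(xpmem_segid_t segid, size_t size,
                      xpmem_apid_t *apid_out)
{
    /* Get access to the peer's exported segment. */
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR,
                                  XPMEM_PERMIT_MODE, (void *)0666);
    if (apid == -1)
        return NULL;

    /* Map the segment (from its start) into our address space. */
    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *ptr = xpmem_attach(addr, size, NULL);
    if (ptr == (void *)-1) {
        xpmem_release(apid);
        return NULL;
    }

    *apid_out = apid;
    return ptr;
}

static void unmap_peer(xpmem_apid_t apid, void *ptr)
{
    xpmem_detach(ptr);    /* matches the xpmem_attach() above   */
    xpmem_release(apid);  /* matches the xpmem_get() above      */
}
```

As far as I can tell we always go through the unmap path before exit, so xpmem_flush() should have little left to tear down.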

Does anyone have any ideas as to what might be the problem here?

Thanks a lot!
