Skip to content

Fix(entryGroups): Prevent deadlock and improve robustness #415

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

arnested
Copy link
Owner

The service could previously enter a stalled state due to a deadlock in the entryGroups struct, which manages Avahi mDNS registrations.

The deadlock occurred if avahiServer.EntryGroupNew() failed. In this scenario, the commit function (called via defer in handleContainer) would attempt to operate on a non-existent or nil entry group within e.groups[containerID]. This would cause a panic inside the commit function before e.mutex.Unlock() was called, leaving the mutex permanently locked. Subsequent calls to handleContainer would then block indefinitely on this mutex.

This commit addresses the issue by:

  1. Making the commit function in entryGroups.get panic-safe:
    • It now defers e.mutex.Unlock() to ensure the mutex is always
      released when commit exits, regardless of internal errors or
      panics.
    • It checks a flag (set by get) indicating whether the Avahi group
      was successfully retrieved or created. If not, it avoids operations
      on the group, preventing the panic.
  2. Ensuring that if EntryGroupNew() fails, the failed group is not
    added to the internal map, and get() returns the error appropriately.

Additionally, this change includes:

  • Enhanced debug logging for various Avahi operations in entry_groups.go, dns.go, and container.go to aid in future diagnostics.
  • Unit tests for entryGroups.get covering success, EntryGroupNew failure, and different logic paths within the commit function.

The investigation into Avahi call timeouts revealed that while the underlying godbus library supports contexts, go-avahi does not currently expose this for per-call timeouts. This is noted for potential future enhancement if Avahi calls themselves prove to be a source of prolonged blocking.

The service could previously enter a stalled state due to a deadlock in
the `entryGroups` struct, which manages Avahi mDNS registrations.

The deadlock occurred if `avahiServer.EntryGroupNew()` failed. In this
scenario, the `commit` function (called via `defer` in `handleContainer`)
would attempt to operate on a non-existent or nil entry group within
`e.groups[containerID]`. This would cause a panic inside the `commit`
function *before* `e.mutex.Unlock()` was called, leaving the mutex
permanently locked. Subsequent calls to `handleContainer` would then
block indefinitely on this mutex.

This commit addresses the issue by:
1.  Making the `commit` function in `entryGroups.get` panic-safe:
    *   It now defers `e.mutex.Unlock()` to ensure the mutex is always
        released when `commit` exits, regardless of internal errors or
        panics.
    *   It checks a flag (set by `get`) indicating whether the Avahi group
        was successfully retrieved or created. If not, it avoids operations
        on the group, preventing the panic.
2.  Ensuring that if `EntryGroupNew()` fails, the failed group is not
    added to the internal map, and `get()` returns the error appropriately.

Additionally, this change includes:
- Enhanced debug logging for various Avahi operations in `entry_groups.go`,
  `dns.go`, and `container.go` to aid in future diagnostics.
- Unit tests for `entryGroups.get` covering success, `EntryGroupNew`
  failure, and different logic paths within the `commit` function.

The investigation into Avahi call timeouts revealed that while the
underlying `godbus` library supports contexts, `go-avahi` does not
currently expose this for per-call timeouts. This is noted for potential
future enhancement if Avahi calls themselves prove to be a source of
prolonged blocking.
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants