RuntimeSDK: Potential deadlock after ExtensionConfig deletion #6863

Open
@sbueringer

Description

Disclaimer: We are not sure whether this deadlock is actually possible. If it is, we can either wait until it is observed in practice, or investigate in controller-runtime whether it can occur and whether it can be reproduced intentionally.

Some context:

  • The ExtensionConfig controller discovers extensions by sending Discovery requests to the service/url configured in an ExtensionConfig
  • This discovery information is then registered in a local registry
  • When we want to call extensions for a hook, e.g. BeforeClusterCreate, we query the registry and then call all extensions registered for that hook
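
To make the context above concrete, here is a minimal sketch of such a registry. It is not the actual Cluster API implementation: the `registration` and `registry` types, their fields, and the example URL are all hypothetical and only model "discovery result goes in, hook lookup comes out".

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// registration is a hypothetical registry entry: one hook handler
// that was discovered for one ExtensionConfig.
type registration struct {
	ExtensionConfig string // name of the ExtensionConfig it came from
	Hook            string // e.g. "BeforeClusterCreate"
	URL             string // endpoint to call for this hook
}

// registry is a hypothetical in-memory registry keyed by ExtensionConfig name.
type registry struct {
	mu      sync.RWMutex
	entries map[string][]registration
}

func newRegistry() *registry {
	return &registry{entries: map[string][]registration{}}
}

// Add stores a discovery result, replacing earlier results for the same name.
func (r *registry) Add(name string, regs []registration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[name] = regs
}

// Remove drops everything registered for an ExtensionConfig.
func (r *registry) Remove(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, name)
}

// List returns all registrations for a hook, sorted by ExtensionConfig name.
func (r *registry) List(hook string) []registration {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []registration
	for _, regs := range r.entries {
		for _, reg := range regs {
			if reg.Hook == hook {
				out = append(out, reg)
			}
		}
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].ExtensionConfig < out[j].ExtensionConfig
	})
	return out
}

func main() {
	r := newRegistry()
	// Pretend discovery for "ext-a" returned one BeforeClusterCreate handler.
	r.Add("ext-a", []registration{
		{ExtensionConfig: "ext-a", Hook: "BeforeClusterCreate", URL: "https://ext-a.example/hooks"},
	})
	for _, reg := range r.List("BeforeClusterCreate") {
		fmt.Println(reg.ExtensionConfig, reg.URL)
	}
}
```

The important property for this issue is that entries only leave the registry when something explicitly calls `Remove`.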

The deadlock can happen in the following situation:

  • User deletes the Runtime Extension including the ExtensionConfig and the corresponding Deployment / Service
  • The ExtensionConfig controller does not get a Reconcile call for the delete event
    • This is the part that we are not sure about!
  • When we now want to call extensions for a hook, the registry can still contain the already removed extension. The call to that extension, and therefore the hook call, will thus always fail.
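
The scenario above can be simulated without a cluster. This is only a sketch under assumptions: `registry`, `callFunc`, and `callAll` are hypothetical stand-ins, and it assumes a hook call fails as soon as any registered extension cannot be reached.

```go
package main

import (
	"errors"
	"fmt"
)

// registry maps an ExtensionConfig name to its discovered URL
// (hypothetical and heavily simplified).
type registry map[string]string

// callFunc stands in for the HTTP call to an extension.
type callFunc func(url string) error

var errUnreachable = errors.New("service not found")

// callAll calls every registered extension and fails if any call fails.
func callAll(reg registry, call callFunc) error {
	for name, url := range reg {
		if err := call(url); err != nil {
			return fmt.Errorf("calling extension %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	reg := registry{"ext-a": "https://ext-a.example"}
	deployed := map[string]bool{"https://ext-a.example": true}
	call := func(url string) error {
		if !deployed[url] {
			return errUnreachable
		}
		return nil
	}

	fmt.Println("before deletion:", callAll(reg, call))

	// The user deletes the Deployment / Service, but the controller never
	// sees the delete event, so the registry entry is left behind.
	delete(deployed, "https://ext-a.example")

	// Retrying does not help: the stale entry makes every attempt fail.
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Printf("attempt %d: %v\n", attempt, callAll(reg, call))
	}
}
```

Nothing in this loop ever removes the stale entry, which is why the failure would be permanent rather than transient.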

I think this case can occur if it's possible that:

  • A watch fails or misses an event
  • Subsequent list/watches don't retrieve an already deleted object
  • A Delete event is never sent to the ExtensionConfig controller

When we want to address this issue, let's definitely first figure out whether it can happen and how we can reproduce it. This will require some investigation in controller-runtime.

Some ideas to address the issue then:

  • We have to make sure that an ExtensionConfig deletion is always reflected in the registry (even if we miss a deletion event)
  • Solutions could be:
    • implement a periodic check to verify all registry entries still have a corresponding ExtensionConfig
    • add a finalizer (we have to think about what happens with multiple controller replicas)
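
A rough sketch of the periodic-check idea, using only the standard library. The names `lister`, `prune`, and `runPeriodically` are hypothetical; in a real controller the lister would be a List call for ExtensionConfigs, and the registry would need locking if hook calls read it concurrently.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// lister returns the names of the ExtensionConfigs that currently exist.
// Hypothetical: a real controller would list them via its client.
type lister func() ([]string, error)

// prune removes registry entries that have no corresponding ExtensionConfig
// and returns the removed names in sorted order.
func prune(reg map[string]string, list lister) ([]string, error) {
	names, err := list()
	if err != nil {
		// Leave the registry untouched if we cannot determine what exists.
		return nil, err
	}
	existing := make(map[string]struct{}, len(names))
	for _, n := range names {
		existing[n] = struct{}{}
	}
	var removed []string
	for name := range reg {
		if _, ok := existing[name]; !ok {
			delete(reg, name) // deleting during range is safe in Go
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	return removed, nil
}

// runPeriodically calls prune on every tick until ctx is cancelled.
func runPeriodically(ctx context.Context, interval time.Duration, reg map[string]string, list lister) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if removed, err := prune(reg, list); err == nil && len(removed) > 0 {
				fmt.Println("pruned stale entries:", removed)
			}
		}
	}
}

func main() {
	// "ext-b" was deleted, but its registry entry was never removed.
	reg := map[string]string{"ext-a": "https://ext-a.example", "ext-b": "https://ext-b.example"}
	list := func() ([]string, error) { return []string{"ext-a"}, nil }

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	runPeriodically(ctx, 10*time.Millisecond, reg, list)
	fmt.Println("entries left:", len(reg))
}
```

This only helps if the list itself is reliable. If the list is served from the same cache that missed the delete event, the check might not detect anything, so that is part of what would need investigating.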

Environment:

  • Cluster-api version: main

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

Metadata

Assignees

Labels

  • area/runtime-sdk: Issues or PRs related to Runtime SDK
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.