Description
Disclaimer: We are not sure whether this deadlock is actually possible. We can either wait until it is actually observed, or investigate in controller-runtime whether it is possible and can be intentionally reproduced.
Some context:
- The `ExtensionConfig` controller discovers extensions by sending Discovery requests to the service/URL configured in an ExtensionConfig
- This discovery information is then registered in a local registry
- When we e.g. want to call all extensions for a `BeforeClusterCreate` hook, we query the registry and then call all registered extensions
The deadlock can happen in the following situation:
- User deletes the Runtime Extension including the ExtensionConfig and the corresponding Deployment / Service
- The `ExtensionConfig` controller does not get a Reconcile call for the delete event
  - This is the part that we are not sure about!
- When we now want to call extensions for a hook this could include an already removed extension. The call will thus always fail.
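The situation above can be sketched in plain Go. This is only an illustration under assumptions: `Registry`, `CallAll`, `ErrUnreachable` and the example URL are hypothetical stand-ins, not the actual Cluster API registry or Runtime SDK client, and the hook is assumed to be blocking so that a single failing extension fails the whole call.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ErrUnreachable is a hypothetical error for an extension whose backing
// Deployment / Service no longer exists.
var ErrUnreachable = errors.New("extension unreachable")

// Registry is a simplified, hypothetical stand-in for the local registry:
// it maps an ExtensionConfig name to the URL discovered for it.
type Registry struct {
	entries map[string]string
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{entries: map[string]string{}}
}

// Register is what the controller would do after a successful discovery.
func (r *Registry) Register(name, url string) { r.entries[name] = url }

// Remove is what the controller would do when it reconciles a deletion.
func (r *Registry) Remove(name string) { delete(r.entries, name) }

// CallAll calls every registered extension in name order and stops at the
// first failure (assumption: the hook is blocking).
func CallAll(r *Registry, call func(url string) error) error {
	names := make([]string, 0, len(r.entries))
	for name := range r.entries {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if err := call(r.entries[name]); err != nil {
			return fmt.Errorf("calling extension %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	reg := NewRegistry()
	reg.Register("ext-a", "https://ext-a.example.svc")

	// reachable simulates whether the extension's Deployment / Service exists.
	reachable := true
	call := func(url string) error {
		if !reachable {
			return ErrUnreachable
		}
		return nil
	}

	fmt.Println("before deletion:", CallAll(reg, call))

	// The user deletes the extension, but the delete event is missed,
	// so reg.Remove("ext-a") is never called and the entry stays registered.
	reachable = false
	fmt.Println("after missed delete:", CallAll(reg, call))
}
```

In this sketch the hook keeps failing until something calls `Remove` for the stale entry, which is exactly the step that would be skipped if the controller never sees the deletion.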
I think this case can occur if all of the following are possible:
- A watch fails or misses an event
- Subsequent list/watches don't retrieve an already deleted object
- A `Delete` event is never sent to the `ExtensionConfig` controller
When we want to address this issue, let's definitely first figure out whether it can happen and how we can reproduce it. This will require some investigation in controller-runtime.
Some ideas to address the issue then:
- We have to make sure that we are reflecting an ExtensionConfig deletion in all cases in the registry (even if we miss a deletion event)
- Solutions could be:
  - implement a periodic check to verify all registry entries still have a corresponding ExtensionConfig
  - add a finalizer (we have to think about what happens with multiple controller replicas)
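The periodic check idea could look roughly like the following. This is a minimal sketch, not a proposal for the actual implementation: `Registry` and `PruneStale` are hypothetical names, and the list of existing ExtensionConfig names is passed in as a plain slice. Where that list comes from (the controller's cache or a live read from the API server) is deliberately left open here.

```go
package main

import (
	"fmt"
	"sort"
)

// Registry is a simplified, hypothetical stand-in for the local registry:
// it maps an ExtensionConfig name to the URL discovered for it.
type Registry struct {
	entries map[string]string
}

// PruneStale removes every registry entry whose name is not in existing
// (the names of the ExtensionConfigs that currently exist) and returns the
// removed names in sorted order.
func PruneStale(reg *Registry, existing []string) []string {
	keep := make(map[string]struct{}, len(existing))
	for _, name := range existing {
		keep[name] = struct{}{}
	}

	var removed []string
	for name := range reg.entries {
		if _, ok := keep[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)

	for _, name := range removed {
		delete(reg.entries, name)
	}
	return removed
}

func main() {
	reg := &Registry{entries: map[string]string{
		"ext-a": "https://ext-a.example.svc",
		"ext-b": "https://ext-b.example.svc",
	}}

	// Assumption for the example: only ext-b still has an ExtensionConfig.
	removed := PruneStale(reg, []string{"ext-b"})
	fmt.Println("removed:", removed, "remaining:", len(reg.entries))
}
```

A function like this could be driven by a `time.Ticker` or a periodic requeue. It is idempotent, so running it in several controller replicas against their own in-memory registries should not conflict, which is one difference from the finalizer idea.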
Environment:
- Cluster-api version: main
/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]