This is a proposal for a k8s-native CMS.
The CMS interacts with an external hardware monitoring/management system through the k8s API, working with node objects.
- We assume there is an external hardware monitoring system that sets taints on faulty nodes.
- The CMS watches for a specified taint type on all nodes, or on a subset of nodes selected by a label filter.
- When the taint appears, the CMS cordons the node, identifies the affected YT pods (via the physical_host attribute), and starts removing load from all pods residing on the node.
- When all the pods are ready (load removed), the CMS drains the node.
- Once the node is drained, the external system can do whatever it needs with the node. When the node is ready to work again, the external system removes the taint and the CMS uncordons the node.
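
To make the flow above easier to discuss, here is a rough sketch of the per-node decision logic as a pure function. It is only an illustration under stated assumptions, not a proposed implementation: the taint key `example.com/hardware-fault`, the `NodeState` fields, and the `Action` names are all hypothetical, and the real CMS would derive this state from node objects and YT pod status through the k8s API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical taint key; the real key would be part of the CMS configuration.
MAINTENANCE_TAINT = "example.com/hardware-fault"


class Action(Enum):
    NONE = "none"                        # healthy node, nothing to do
    CORDON = "cordon"                    # taint seen, stop scheduling new pods
    WAIT_FOR_PODS = "wait_for_pods"      # load is still being removed from YT pods
    DRAIN = "drain"                      # all pods ready, evict them
    WAIT_FOR_REPAIR = "wait_for_repair"  # drained, external system owns the node
    UNCORDON = "uncordon"                # taint removed, return node to service


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of what the CMS knows about one node."""
    taint_keys: frozenset    # keys of the taints currently set on the node
    cordoned_by_cms: bool    # True if the CMS (not an operator) cordoned it
    pods_ready: bool         # all affected YT pods report their load removed
    drained: bool            # the node has already been drained


def next_action(node: NodeState, watched_taint: str = MAINTENANCE_TAINT) -> Action:
    """Return the next step of the cordon/drain/uncordon cycle for a node."""
    if watched_taint not in node.taint_keys:
        # Assumption: once the taint is gone, the CMS only uncordons
        # nodes that it cordoned itself.
        return Action.UNCORDON if node.cordoned_by_cms else Action.NONE
    if not node.cordoned_by_cms:
        return Action.CORDON
    if node.drained:
        return Action.WAIT_FOR_REPAIR
    if not node.pods_ready:
        return Action.WAIT_FOR_PODS
    return Action.DRAIN
```

Keeping the decision separate from the k8s calls would let the reconcile loop be tested without a cluster. Whether the CMS should uncordon a node whose taint disappears before the drain completes is an open design question; the sketch assumes it does.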
Somewhat inspired by https://kured.dev/