Commit 51537e4

alexander-demicev authored and g-gaston committed
Update proposal based on feedback
Signed-off-by: Alexandr Demicev <alexandr.demicev@suse.com>
1 parent 4829556 commit 51537e4

1 file changed, +27 -16 lines changed

docs/proposals/20240807-in-place-updates.md

Lines changed: 27 additions & 16 deletions
@@ -8,8 +8,18 @@ authors:
  - "@@mogliang"
  - "@sbueringer"
  - "@fabriziopandini"
+ - "@Danil-Grigorev"
+ - "@yiannistri"
  reviewers:
- - TBD
+ - "@neolit123"
+ - "@vincepri"
+ - "@enxebre"
+ - "@sbueringer"
+ - "@t-lo"
+ - "@mhrivnak"
+ - "@atanasdinov"
+ - "@elmiko"
+ - "@wking"
  creation-date: "2024-08-07"
  last-updated: "2024-08-07"
  status: experimental
@@ -72,9 +82,9 @@ __Update Extension__: Runtime Extension (Implementation) is a component responsi

  ## Summary

- The proposal introduces update extensions allowing users to execute custom strategies when performing Cluster API rollouts.
+ The proposal introduces update extensions allowing users to execute changes on existing machines without deleting the machines and creating a new one.

- An External Update Extension implementing custom update strategies will report the subset of changes they know how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.
+ An External Update Extension will report the subset of changes they know how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.

  If the totality of the required changes cannot be covered by the defined extensions, Cluster API will fall back to the current behavior (rolling update).
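The coverage check and fallback described in the Summary can be illustrated with a minimal Python sketch. It is not the proposal's implementation: the `Updater` type, the `plan_update` function, the change identifiers, and the choice to offer each extension only the still-uncovered changes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Updater:
    """Hypothetical stand-in for a registered update extension."""
    name: str
    supported: frozenset  # change identifiers this extension can apply in place


def plan_update(desired_changes, updaters):
    """Return ("in-place", assignments) if the updaters jointly cover every
    desired change, otherwise ("rolling-update", {}) as the fallback."""
    remaining = set(desired_changes)
    assignments = {}
    for updater in updaters:
        # Each extension reports the subset of the remaining changes it can
        # handle; that subset may be empty.
        accepted = remaining & updater.supported
        if accepted:
            assignments[updater.name] = accepted
            remaining -= accepted
    if remaining:
        # At least one change is not covered: defer to the immutable rollout.
        return "rolling-update", {}
    return "in-place", assignments
```

With two hypothetical extensions that handle `spec.version` and `spec.sshAuthorizedKeys` respectively, a request for both changes is planned in place, while a request that also contains an unsupported change falls back to a rolling update.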

@@ -97,7 +107,6 @@ Even if the project continues to improve immutable rollouts, most probably there
  * More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
  * Credentials rotation, e.g. rotating authorized keys for SSH.

-
  With this proposal, Cluster API provides a new extensibility point for users willing to implement their own specific solution for these problems by implementing an Update extension.

  With the implementation of an Update extension, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits that you get from immutable infrastructure.
@@ -118,7 +127,7 @@ Cluster API user experience MUST be the same when using default, immutable updat

  #### Fallback to Immutable rollouts

- If external update extensions can not cover the totality of the desired changes, CAPI WILL defer to Cluster API’s default, immutable rollouts. This is important for a couple of reasons:
+ If external update extensions can not cover the totality of the desired changes, CAPI will defer to Cluster API’s default, immutable rollouts. This is important for a couple of reasons:

  * It allows to implement custom rollout strategies incrementally, without the need to cover all use cases up-front.
  * There are cases when replacing the machine will always be necessary:
@@ -128,9 +137,9 @@ If external update extensions can not cover the totality of the desired changes,

  #### Clean separation of concern

- It is the responsibility of the extension to decide if it can perform changes in-place and to perform these changes on a single machine. If the extension decides that it cannot perform changes in-place, CAPI will fall back to rollout.
+ It is the responsibility of the extension to decide if it can perform changes in-place and to perform these changes on a single machine.

- The responsibility to determine which machine should be rolled out as well as the responsibility to handle rollout options like MaxSurge/MaxUnavailable will remain on the controllers owning the machine (e.g. KCP, MD controller).
+ The responsibility to determine which machine should be updated, the responsibility to handle CAPI resources during in-place update or immutable rollouts as well as the responsibility to handle rollout options like MaxSurge/MaxUnavailable will remain on the controllers owning the machine (e.g. KCP, MD controller).

  ### Goals

@@ -141,8 +150,7 @@ The responsibility to determine which machine should be rolled out as well as th
  ### Non-Goals/Future work

  - To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
- - Introduce any changes to KCP (or any other control plane provider), MachineDeployment, MachineSet, Machine APIs.
- - Maintain a coherent user experience for both rolling and in-place updates.
+ - Introduce any API changes both in core Cluster API or in KCP (or any other control plane provider).
  - Allow in-place updates for single-node clusters without the requirement to reprovision hosts (future goal).

  ## Proposal
@@ -252,7 +260,7 @@ sequenceDiagram
  end
  ```

- Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates, they first detect if an update is required and then based on the configured strategy follow the appropiate update logic (note that today there is only one valid strategy, `RollingUpdate`).
+ Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates; as a first step they detect if an update is required.

  With `InPlaceUpdates` feature gate enabled, CAPI controllers will compute the set of desired changes and iterate over the registered external updaters, requesting through the Runtime Hook the set of changes each updater can handle. The changes supported by an updater can be the complete set of desired changes, a subset of them or an empty set, signaling it cannot handle any of the desired changes.

@@ -368,7 +376,7 @@ Once a Machine is marked as pending and `UpToDate` condition is set and the Mach

  The Machine controller currently calls registered external updaters sequentially but without a defined order. We are explicitly not trying to design a solution for ordering of execution at this stage. However, determining a specific ordering mechanism or dependency management between update extensions will need to be addressed in future iterations of this proposal.

- The controller will trigger updaters by hitting a RuntimeHook endpoint (eg. `/UpdateMachine`). The updater could respond saying "update completed", "update failed" or "update in progress" with an optional "retry after X seconds". The CAPI controller will continuously poll the status of the update by hitting the same endpoint until it reaches a terminal state.
+ The controller will trigger updaters by hitting a RuntimeHook endpoint (eg. `/UpdateMachine`). The updater could respond saying "update completed", "update failed" or "update in progress" with an optional "retry after X seconds". The CAPI controller will continuously poll the status of the update by hitting the same endpoint until the operation reports "update completed" or "update failed".

  CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update either completes or fails.
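The polling contract and the idempotency expectation in this hunk can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the proposal's API: `UpdateResponse`, `poll_update`, `FakeUpdater`, the 5-second default delay, and the in-memory progress table are all hypothetical, and a real controller would more likely requeue its reconcile than block in a loop.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# The three answers named in the proposal text.
COMPLETED = "update completed"
FAILED = "update failed"
IN_PROGRESS = "update in progress"


@dataclass
class UpdateResponse:
    """Hypothetical shape of an /UpdateMachine response."""
    status: str
    retry_after_seconds: Optional[float] = None


def poll_update(call_update_machine: Callable[[], UpdateResponse],
                default_retry_seconds: float = 5.0,
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Call the same endpoint repeatedly until it reports a terminal status."""
    while True:
        response = call_update_machine()
        if response.status in (COMPLETED, FAILED):
            return response.status
        # Still in progress: honor the optional retry hint, else use a default.
        delay = response.retry_after_seconds
        sleep(default_retry_seconds if delay is None else delay)


class FakeUpdater:
    """Idempotent toy updater: repeated calls for the same machine and spec
    advance one shared operation instead of starting a new one."""

    def __init__(self, steps_needed: int = 3):
        self.steps_needed = steps_needed
        self.progress = {}  # (machine, spec) -> calls seen so far

    def update_machine(self, machine: str, spec: str) -> UpdateResponse:
        key = (machine, spec)
        done = self.progress.get(key, 0)
        if done >= self.steps_needed:
            # Calls made after completion return the same terminal result.
            return UpdateResponse(COMPLETED)
        self.progress[key] = done + 1
        if self.progress[key] >= self.steps_needed:
            return UpdateResponse(COMPLETED)
        return UpdateResponse(IN_PROGRESS, retry_after_seconds=0.0)
```

Passing `lambda: updater.update_machine("m1", "v2")` to `poll_update` drives the fake updater to completion, and calling the endpoint again afterwards still reports "update completed". Injecting `sleep` as a parameter keeps the loop testable without real delays.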

@@ -911,11 +919,14 @@ we will provide a way to toggle the in-place possibly though the API.

  ## Implementation History

- - [ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]
- - [ ] MM/DD/YYYY: Compile a Google Doc following the CAEP template (link here)
- - [ ] MM/DD/YYYY: First round of feedback from community
- - [ ] MM/DD/YYYY: Present proposal at a [community meeting]
- - [ ] MM/DD/YYYY: Open proposal PR
+ - [x] 2023-09: Proposed idea in an [issue](https://github.com/kubernetes-sigs/cluster-api/issues/9489).
+ - [x] 2023-10: Feature Group is created.
+ - [x] 2023-11: Discussed [preliminary idea](https://docs.google.com/document/d/1CqQ1SAqJD264PsDeMj_Z3HhZxe7DViNkpJ9d5q-2Zck/edit?tab=t.0#heading=h.vum8h55q3k9f) with the community in Kubecon NA.
+ - [x] 2024-02: Compile a Google Doc following the CAEP template [(link here)](https://hackmd.io/fJ9kmuVZSgODjraFWY0kLw?edit).
+ - [x] 2024-03: First round of feedback.
+ - [x] 2024-05: Second round of feedback.
+ - [x] 2024-07: Present proposal at a [community meeting].
+ - [x] 2024-08: Open proposal [PR](https://github.com/kubernetes-sigs/cluster-api/pull/11029).

  <!-- Links -->
  [community meeting]: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI/edit#heading=h.pxsq37pzkbdq
