docs/proposals/20240807-in-place-updates.md
Lines changed: 27 additions & 16 deletions
@@ -8,8 +8,18 @@ authors:
8
8
- "@@mogliang"
9
9
- "@sbueringer"
10
10
- "@fabriziopandini"
11
+
- "@Danil-Grigorev"
12
+
- "@yiannistri"
11
13
reviewers:
12
-
- TBD
14
+
- "@neolit123"
15
+
- "@vincepri"
16
+
- "@enxebre"
17
+
- "@sbueringer"
18
+
- "@t-lo"
19
+
- "@mhrivnak"
20
+
- "@atanasdinov"
21
+
- "@elmiko"
22
+
- "@wking"
13
23
creation-date: "2024-08-07"
14
24
last-updated: "2024-08-07"
15
25
status: experimental
@@ -72,9 +82,9 @@ __Update Extension__: Runtime Extension (Implementation) is a component responsi
72
82
73
83
## Summary
74
84
75
-
The proposal introduces update extensions allowing users to execute custom strategies when performing Cluster API rollouts.
85
+
The proposal introduces update extensions allowing users to execute changes on existing machines without deleting the machines and creating new ones.
76
86
77
-
An External Update Extension implementing custom update strategies will report the subset of changes they know how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.
87
+
An External Update Extension will report the subset of changes it knows how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.
78
88
79
89
If the totality of the required changes cannot be covered by the defined extensions, Cluster API will fall back to the current behavior (rolling update).
80
90
@@ -97,7 +107,6 @@ Even if the project continues to improve immutable rollouts, most probably there
97
107
* More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
98
108
* Credentials rotation, e.g. rotating authorized keys for SSH.
99
109
100
-
101
110
With this proposal, Cluster API provides a new extensibility point for users willing to implement their own specific solution for these problems by implementing an Update extension.
102
111
103
112
With the implementation of an Update extension, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits that you get from immutable infrastructure.
@@ -118,7 +127,7 @@ Cluster API user experience MUST be the same when using default, immutable updat
118
127
119
128
#### Fallback to Immutable rollouts
120
129
121
-
If external update extensions can not cover the totality of the desired changes, CAPI WILL defer to Cluster API’s default, immutable rollouts. This is important for a couple of reasons:
130
+
If external update extensions can not cover the totality of the desired changes, CAPI will defer to Cluster API’s default, immutable rollouts. This is important for a couple of reasons:
122
131
123
132
* It allows to implement custom rollout strategies incrementally, without the need to cover all use cases up-front.
124
133
* There are cases when replacing the machine will always be necessary:
@@ -128,9 +137,9 @@ If external update extensions can not cover the totality of the desired changes,
128
137
129
138
#### Clean separation of concern
130
139
131
-
It is the responsibility of the extension to decide if it can perform changes in-place and to perform these changes on a single machine. If the extension decides that it cannot perform changes in-place, CAPI will fall back to rollout.
140
+
It is the responsibility of the extension to decide if it can perform changes in-place and to perform these changes on a single machine.
132
141
133
-
The responsibility to determine which machine should be rolled out as well as the responsibility to handle rollout options like MaxSurge/MaxUnavailable will remain on the controllers owning the machine (e.g. KCP, MD controller).
142
+
The responsibility to determine which machine should be updated, the responsibility to handle CAPI resources during in-place update or immutable rollouts as well as the responsibility to handle rollout options like MaxSurge/MaxUnavailable will remain on the controllers owning the machine (e.g. KCP, MD controller).
134
143
135
144
### Goals
136
145
@@ -141,8 +150,7 @@ The responsibility to determine which machine should be rolled out as well as th
141
150
### Non-Goals/Future work
142
151
143
152
- To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
144
-
- Introduce any changes to KCP (or any other control plane provider), MachineDeployment, MachineSet, Machine APIs.
145
-
- Maintain a coherent user experience for both rolling and in-place updates.
153
+
- Introduce any API changes, either in core Cluster API or in KCP (or any other control plane provider).
146
154
- Allow in-place updates for single-node clusters without the requirement to reprovision hosts (future goal).
147
155
148
156
## Proposal
@@ -252,7 +260,7 @@ sequenceDiagram
252
260
end
253
261
```
254
262
255
-
Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates, they first detect if an update is required and then based on the configured strategy follow the appropiate update logic (note that today there is only one valid strategy, `RollingUpdate`).
263
+
Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates; as a first step they detect if an update is required.
256
264
257
265
With `InPlaceUpdates` feature gate enabled, CAPI controllers will compute the set of desired changes and iterate over the registered external updaters, requesting through the Runtime Hook the set of changes each updater can handle. The changes supported by an updater can be the complete set of desired changes, a subset of them or an empty set, signaling it cannot handle any of the desired changes.
258
266
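The negotiation described in the paragraph above (compute the desired changes, ask each registered updater which of them it can handle, and fall back to a rolling update if anything is left uncovered) can be sketched as follows. This is a minimal illustration, not the Cluster API implementation (which is written in Go): `plan_update`, the change names, and the choice to offer each updater only the still-uncovered changes are all assumptions made for the example.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical stand-in for the Runtime Hook call that asks an updater
# which of the offered changes it can perform in-place.
CanHandle = Callable[[FrozenSet[str]], Iterable[str]]


def plan_update(
    desired_changes: Iterable[str],
    updaters: List[Tuple[str, CanHandle]],
) -> Tuple[str, List[Tuple[str, Set[str]]]]:
    """Decide between an in-place update and the default rolling update.

    Returns ("in-place", assignments) when the registered updaters cover
    every desired change, otherwise ("rolling-update", []).
    """
    remaining = set(desired_changes)
    assignments: List[Tuple[str, Set[str]]] = []

    for name, can_handle in updaters:
        if not remaining:
            break
        # Assumption: each updater is offered only the changes not yet covered.
        accepted = set(can_handle(frozenset(remaining))) & remaining
        if accepted:
            assignments.append((name, accepted))
            remaining -= accepted

    if remaining:
        # Some change is not covered by any updater: fall back to the
        # current behavior (rolling update).
        return "rolling-update", []
    return "in-place", assignments
```

An updater that returns an empty set simply contributes nothing to the plan, which matches the "cannot handle any of the desired changes" case.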
@@ -368,7 +376,7 @@ Once a Machine is marked as pending and `UpToDate` condition is set and the Mach
368
376
369
377
The Machine controller currently calls registered external updaters sequentially but without a defined order. We are explicitly not trying to design a solution for ordering of execution at this stage. However, determining a specific ordering mechanism or dependency management between update extensions will need to be addressed in future iterations of this proposal.
370
378
371
-
The controller will trigger updaters by hitting a RuntimeHook endpoint (e.g. `/UpdateMachine`). The updater could respond saying "update completed", "update failed" or "update in progress" with an optional "retry after X seconds". The CAPI controller will continuously poll the status of the update by hitting the same endpoint until it reaches a terminal state.
379
+
The controller will trigger updaters by hitting a RuntimeHook endpoint (e.g. `/UpdateMachine`). The updater could respond saying "update completed", "update failed" or "update in progress" with an optional "retry after X seconds". The CAPI controller will continuously poll the status of the update by hitting the same endpoint until the operation reports "update completed" or "update failed".
372
380
373
381
CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update either completes or fails.
374
382
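The polling contract described above can be sketched as a small loop. This is an illustrative sketch only: `UpdateResponse`, `poll_until_terminal`, the status strings, and the default retry interval are hypothetical names and values, and a real controller would most likely requeue its reconcile request rather than block in a loop. Because the endpoint is expected to be idempotent, the same call is simply repeated until a terminal answer arrives.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UpdateResponse:
    """Hypothetical shape of an updater's answer to /UpdateMachine."""
    status: str  # "completed", "failed" or "in-progress"
    retry_after_seconds: Optional[float] = None


def poll_until_terminal(
    call_update_machine: Callable[[], UpdateResponse],
    default_retry: float = 5.0,  # assumed fallback when no retry hint is given
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call the (idempotent) endpoint until it reports a terminal state."""
    while True:
        response = call_update_machine()
        if response.status in ("completed", "failed"):
            return response.status
        if response.status != "in-progress":
            raise ValueError(f"unexpected status: {response.status!r}")
        # Honor the optional "retry after X seconds" hint from the updater.
        delay = response.retry_after_seconds
        sleep(default_retry if delay is None else delay)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller can substitute a function that records the requested delays.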
@@ -911,11 +919,14 @@ we will provide a way to toggle the in-place possibly though the API.
911
919
912
920
## Implementation History
913
921
914
-
-[ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]
915
-
-[ ] MM/DD/YYYY: Compile a Google Doc following the CAEP template (link here)
916
-
-[ ] MM/DD/YYYY: First round of feedback from community
917
-
-[ ] MM/DD/YYYY: Present proposal at a [community meeting]
918
-
-[ ] MM/DD/YYYY: Open proposal PR
922
+
-[x] 2023-09: Proposed idea in an [issue](https://github.com/kubernetes-sigs/cluster-api/issues/9489).
923
+
-[x] 2023-10: Feature Group is created.
924
+
-[x] 2023-11: Discussed [preliminary idea](https://docs.google.com/document/d/1CqQ1SAqJD264PsDeMj_Z3HhZxe7DViNkpJ9d5q-2Zck/edit?tab=t.0#heading=h.vum8h55q3k9f) with the community in Kubecon NA.
925
+
-[x] 2024-02: Compile a Google Doc following the CAEP template [(link here)](https://hackmd.io/fJ9kmuVZSgODjraFWY0kLw?edit).
926
+
-[x] 2024-03: First round of feedback.
927
+
-[x] 2024-05: Second round of feedback.
928
+
-[x] 2024-07: Present proposal at a [community meeting].
929
+
-[x] 2024-08: Open proposal [PR](https://github.com/kubernetes-sigs/cluster-api/pull/11029).