Commit 77c89d0

Merge pull request #74562 from mburke5678/mco-node-disruption-policy
Admin Defined Node Disruptions: Tech Preview
2 parents 59fe791 + 9219609 commit 77c89d0

7 files changed: +365 -2 lines changed

architecture/control-plane.adoc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,8 @@ ifndef::openshift-dedicated,openshift-rosa[]
6262
* For more information about detecting configuration drift, see xref:../post_installation_configuration/machine-configuration-tasks.adoc#machine-config-drift-detection_post-install-machine-configuration-tasks[Understanding configuration drift detection].
6363
6464
* For information about preventing the control plane machines from rebooting after the Machine Config Operator makes changes to the machine configuration, see xref:../support/troubleshooting/troubleshooting-operator-issues.adoc#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues[Disabling Machine Config Operator from automatically rebooting].
65+
66+
* xref:../post_installation_configuration/machine-configuration-tasks.adoc#machine-config-node-disruption_post-install-machine-configuration-tasks[Understanding node restart behaviors after machine config changes]
6567
endif::openshift-dedicated,openshift-rosa[]
6668
6769
include::modules/etcd-overview.adoc[leveloffset=+1]

modules/machine-config-node-disruption-config.adoc

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * post_installation_configuration/machine-configuration-tasks.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="machine-config-node-disruption-config_{context}"]
7+
= Configuring node restart behaviors upon machine config changes
8+
9+
You can create a node disruption policy to define which machine configuration changes cause a disruption to your cluster and which do not.
10+
11+
You can control how your nodes respond to changes in the files in the `/var` or `/etc` directory, the systemd units, the SSH keys, and the `registries.conf` file.
12+
13+
include::snippets/machine-config-node-disruption-actions.adoc[]
14+
15+
.Prerequisites
16+
17+
* You have enabled the `TechPreviewNoUpgrade` feature set by using the feature gates. For more information, see "Enabling features using feature gates".
18+
+
19+
[WARNING]
20+
====
21+
Enabling the `TechPreviewNoUpgrade` feature set on your cluster prevents minor version updates. The `TechPreviewNoUpgrade` feature set cannot be disabled. Do not enable this feature set on production clusters.
22+
====
23+
24+
.Procedure
25+
26+
. Edit the `machineconfigurations.operator.openshift.io` object to define the node disruption policy:
27+
+
28+
[source,terminal]
29+
----
30+
$ oc edit MachineConfiguration cluster -n openshift-machine-config-operator
31+
----
32+
33+
. Add a node disruption policy similar to the following:
34+
+
35+
[source,yaml]
36+
----
37+
apiVersion: operator.openshift.io/v1
38+
kind: MachineConfiguration
39+
metadata:
40+
  name: cluster
41+
# ...
42+
spec:
43+
  nodeDisruptionPolicy: # <1>
44+
    files: # <2>
45+
    - actions: # <3>
46+
      - reload: # <4>
47+
          serviceName: chronyd.service # <5>
48+
        type: Reload
49+
      path: /etc/chrony.conf # <6>
50+
    sshkey: # <7>
51+
      actions:
52+
      - type: Drain
53+
      - reload:
54+
          serviceName: crio.service
55+
        type: Reload
56+
      - type: DaemonReload
57+
      - restart:
58+
          serviceName: crio.service
59+
        type: Restart
60+
    units: # <8>
61+
    - actions:
62+
      - type: Drain
63+
      - reload:
64+
          serviceName: crio.service
65+
        type: Reload
66+
      - type: DaemonReload
67+
      - restart:
68+
          serviceName: crio.service
69+
        type: Restart
70+
      name: test.service
71+
----
72+
<1> Specifies the node disruption policy.
73+
<2> Specifies a list of machine config file definitions and the actions to take upon changes to those paths. This list supports a maximum of 50 entries.
74+
<3> Specifies the series of actions to be executed upon changes to the specified files. Actions are applied in the order that they are set in this list. This list supports a maximum of 10 entries.
75+
<4> Specifies that the listed service is to be reloaded upon changes to the specified files.
76+
<5> Specifies the full name of the service to be acted upon.
77+
<6> Specifies the location of a file that is managed by a machine config. The actions in the policy apply when changes are made to the file in `path`.
78+
<7> Specifies a list of service names and actions to take upon changes to the SSH keys in the cluster.
79+
<8> Specifies a list of systemd unit names and actions to take upon changes to those units.
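
The list limits described in callouts 2 and 3 can be checked before you edit the object. The following Python sketch is only an illustration: the `check_files_limits` helper and the plain-dictionary representation of the policy are assumptions made for this example and are not part of the MCO or the `oc` CLI.

```python
MAX_FILE_ENTRIES = 50  # maximum entries in the `files` list (callout 2)
MAX_ACTIONS = 10       # maximum entries in each `actions` list (callout 3)


def check_files_limits(policy):
    """Return a list of problems found in the `files` section of a policy.

    `policy` is a plain dict shaped like `spec.nodeDisruptionPolicy`.
    This helper is hypothetical; the MCO performs its own validation.
    """
    problems = []
    files = policy.get("files", [])
    if len(files) > MAX_FILE_ENTRIES:
        problems.append(f"files: {len(files)} entries (maximum {MAX_FILE_ENTRIES})")
    for entry in files:
        actions = entry.get("actions", [])
        if len(actions) > MAX_ACTIONS:
            problems.append(
                f"{entry.get('path')}: {len(actions)} actions (maximum {MAX_ACTIONS})"
            )
    return problems


# A deliberately invalid policy: 11 actions for a single path.
too_many_actions = {
    "files": [
        {"path": "/etc/chrony.conf", "actions": [{"type": "None"}] * 11},
    ],
}
print(check_files_limits(too_many_actions))
```

An empty list means that the `files` section is within the documented limits. It does not mean that the policy is otherwise valid.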
80+
81+
.Verification
82+
83+
* View the `MachineConfiguration` object that you edited:
84+
+
85+
----
86+
$ oc get MachineConfiguration/cluster -o yaml
87+
----
88+
+
89+
.Example output
90+
[source,yaml]
91+
----
92+
apiVersion: operator.openshift.io/v1
93+
kind: MachineConfiguration
94+
metadata:
95+
  labels:
96+
    machineconfiguration.openshift.io/role: worker
97+
  name: cluster
98+
# ...
99+
status:
100+
  nodeDisruptionPolicyStatus: # <1>
101+
    clusterPolicies:
102+
      files:
103+
      # ...
104+
      - actions:
105+
        - reload:
106+
            serviceName: chronyd.service
107+
          type: Reload
108+
        path: /etc/chrony.conf
109+
      sshkey:
110+
        actions:
111+
        - type: Drain
112+
        - reload:
113+
            serviceName: crio.service
114+
          type: Reload
115+
        - type: DaemonReload
116+
        - restart:
117+
            serviceName: crio.service
118+
          type: Restart
119+
      units:
120+
      - actions:
121+
        - type: Drain
122+
        - reload:
123+
            serviceName: crio.service
124+
          type: Reload
125+
        - type: DaemonReload
126+
        - restart:
127+
            serviceName: crio.service
128+
          type: Restart
129+
        name: test.se
130+
# ...
131+
----
132+
<1> Specifies the current cluster-validated policies.

modules/machine-config-node-disruption.adoc

Lines changed: 182 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,182 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * post_installation_configuration/machine-configuration-tasks.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="machine-config-node-disruption_{context}"]
7+
= Understanding node restart behaviors after machine config changes
8+
9+
By default, when you make certain changes to the fields in a `MachineConfig` object, the Machine Config Operator (MCO) drains and reboots the nodes associated with that machine config. However, you can create a _node disruption policy_ that defines a set of changes to some Ignition config objects that would require little or no disruption to your workloads.
10+
11+
A node disruption policy allows you to define which configuration changes cause a disruption to your cluster and which do not. This allows you to reduce node downtime when making small machine configuration changes in your cluster. To configure the policy, you modify the `MachineConfiguration` object, which is in the `openshift-machine-config-operator` namespace. See the example node disruption policies in the `MachineConfiguration` objects that follow.
12+
13+
[NOTE]
14+
====
15+
There are machine configuration changes that always require a reboot, regardless of any node disruption policies. For more information, see _About the Machine Config Operator_.
16+
====
17+
18+
After you create the node disruption policy, the MCO validates the policy to search for potential issues, such as problems with formatting. The MCO then merges the policy with the cluster defaults and populates the `status.nodeDisruptionPolicyStatus` fields in the `MachineConfiguration` object with the actions to be performed upon future changes to the machine config. The configurations in your policy always overwrite the cluster defaults.
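
The override behavior can be pictured with a short Python sketch. It assumes that `files` entries are matched by `path`, which this section does not state explicitly, and the `merge_file_policies` function is hypothetical. It is not the MCO implementation. The default entries are taken from the default policy shown later in this section.

```python
def merge_file_policies(defaults, user):
    """Merge two lists of {"path": ..., "actions": [...]} entries.

    Assumption for illustration: entries are keyed by `path`, and a
    user-defined entry replaces the cluster default for the same path.
    """
    merged = {entry["path"]: entry for entry in defaults}
    merged.update({entry["path"]: entry for entry in user})  # user entries win
    return list(merged.values())


cluster_defaults = [
    {"path": "/etc/containers/policy.json",
     "actions": [{"type": "Reload", "reload": {"serviceName": "crio.service"}}]},
    {"path": "/etc/containers/registries.conf",
     "actions": [{"type": "Special"}]},
]
user_policy = [
    {"path": "/etc/containers/registries.conf",
     "actions": [{"type": "None"}]},
    {"path": "/etc/chrony.conf",
     "actions": [{"type": "Reload", "reload": {"serviceName": "chronyd.service"}}]},
]

effective = merge_file_policies(cluster_defaults, user_policy)
for entry in effective:
    print(entry["path"], [action["type"] for action in entry["actions"]])
```

In this sketch, the user-defined `None` action replaces the default `Special` action for `/etc/containers/registries.conf`, and the default for `/etc/containers/policy.json` is left unchanged.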
19+
20+
[IMPORTANT]
21+
====
22+
The MCO does not validate whether a change can be successfully applied by your node disruption policy. Therefore, you are responsible for ensuring the accuracy of your node disruption policies.
23+
====
24+
25+
For example, you can configure a node disruption policy so that sudo configurations do not require a node drain and reboot. Or, you can configure your cluster so that updates to `sshd` are applied with only a reload of that one service.
26+
27+
:FeatureName: The node disruption policy feature
28+
include::snippets/technology-preview.adoc[]
29+
30+
You can control the behavior of the MCO when making changes to the following Ignition configuration objects:
31+
32+
// I used this wording for the objects to match the previous section in the assembly: file:///home/mburke/openshift-docs/_preview/openshift-enterprise/mco-node-disruption-policy/post_installation_configuration/machine-configuration-tasks.html#what-can-you-change-with-machine-configs.
33+
* *configuration files*: You add to or update the files in the `/var` or `/etc` directory.
34+
* *systemd units*: You create and set the status of a systemd service or modify an existing systemd service.
35+
* *users and groups*: You change SSH keys in the `passwd` section post-installation.
36+
* *ICSP*, *ITMS*, *IDMS* objects: You remove mirroring rules from an `ImageContentSourcePolicy` (ICSP), `ImageTagMirrorSet` (ITMS), or `ImageDigestMirrorSet` (IDMS) object.
37+
38+
include::snippets/machine-config-node-disruption-actions.adoc[]
39+
40+
// Examples taken from the test cases: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitems/testcase?query=trello%3AMCO%5C-507
41+
42+
[id="machine-config-node-disruption-example_{context}"]
43+
== Example node disruption policies
44+
45+
The following example `MachineConfiguration` objects contain a node disruption policy.
46+
47+
[TIP]
48+
====
49+
A `MachineConfiguration` object and a `MachineConfig` object are different objects. A `MachineConfiguration` object is a singleton object in the MCO namespace that contains configuration parameters for the MCO. A `MachineConfig` object defines changes that are applied to a machine config pool.
50+
====
51+
52+
The following example `MachineConfiguration` object shows no user-defined policies. The default node disruption policy values are shown in the `status` stanza.
53+
54+
.Default node disruption policy
55+
[source,yaml]
56+
----
57+
apiVersion: operator.openshift.io/v1
58+
kind: MachineConfiguration
59+
metadata:
  name: cluster
60+
spec:
61+
  logLevel: Normal
62+
  managementState: Managed
63+
  operatorLogLevel: Normal
64+
status:
65+
  nodeDisruptionPolicyStatus:
66+
    clusterPolicies:
67+
      files:
68+
      - actions:
69+
        - type: None
70+
        path: /etc/mco/internal-registry-pull-secret.json
71+
      - actions:
72+
        - type: None
73+
        path: /var/lib/kubelet/config.json
74+
      - actions:
75+
        - reload:
76+
            serviceName: crio.service
77+
          type: Reload
78+
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
79+
      - actions:
80+
        - reload:
81+
            serviceName: crio.service
82+
          type: Reload
83+
        path: /etc/containers/policy.json
84+
      - actions:
85+
        - type: Special
86+
        path: /etc/containers/registries.conf
87+
      sshkey:
88+
        actions:
89+
        - type: None
90+
  readyReplicas: 0
91+
----
92+
93+
In the following example, when changes are made to the SSH keys, the MCO drains the cluster nodes, reloads the `crio.service`, reloads the systemd manager configuration, and restarts the `crio.service`.
94+
95+
.Example node disruption policy for an SSH key change
96+
[source,yaml]
97+
----
98+
apiVersion: operator.openshift.io/v1
99+
kind: MachineConfiguration
100+
metadata:
101+
  name: cluster
102+
  namespace: openshift-machine-config-operator
103+
# ...
104+
spec:
105+
  nodeDisruptionPolicy:
106+
    sshkey:
107+
      actions:
108+
      - type: Drain
109+
      - reload:
110+
          serviceName: crio.service
111+
        type: Reload
112+
      - type: DaemonReload
113+
      - restart:
114+
          serviceName: crio.service
115+
        type: Restart
116+
# ...
117+
----
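
Because actions are applied in the order in which they are listed, the SSH key policy above results in four ordered steps. The following Python sketch spells that order out. The `describe_actions` helper and its step descriptions are assumptions made for illustration only and do not reflect how the MCO runs the actions internally.

```python
def describe_actions(actions):
    """Return a plain-language step for each action, in list order.

    Hypothetical helper: the wording of each step is illustrative only.
    """
    steps = []
    for action in actions:
        kind = action["type"]
        if kind in ("Reload", "Restart"):
            # The service name is nested under a key named after the type.
            service = action[kind.lower()]["serviceName"]
            steps.append(f"{kind.lower()} {service}")
        elif kind == "DaemonReload":
            steps.append("reload the systemd manager configuration")
        elif kind == "Drain":
            steps.append("drain the node")
        elif kind == "None":
            steps.append("no further action")
        else:
            steps.append(f"{kind} (not modeled in this sketch)")
    return steps


sshkey_actions = [
    {"type": "Drain"},
    {"type": "Reload", "reload": {"serviceName": "crio.service"}},
    {"type": "DaemonReload"},
    {"type": "Restart", "restart": {"serviceName": "crio.service"}},
]
for number, step in enumerate(describe_actions(sshkey_actions), start=1):
    print(f"{number}. {step}")
```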
118+
119+
In the following example, when changes are made to the `/etc/chrony.conf` file, the MCO reloads the `chronyd.service` on the cluster nodes.
120+
121+
.Example node disruption policy for a configuration file change
122+
[source,yaml]
123+
----
124+
apiVersion: operator.openshift.io/v1
124+
kind: MachineConfiguration
125+
metadata:
126+
  name: cluster
127+
  namespace: openshift-machine-config-operator
128+
# ...
129+
spec:
130+
  nodeDisruptionPolicy:
131+
    files:
132+
    - actions:
133+
      - reload:
134+
          serviceName: chronyd.service
135+
        type: Reload
136+
      path: /etc/chrony.conf
138+
----
139+
140+
In the following example, when changes are made to the `auditd.service` systemd unit, the MCO drains the cluster nodes, reloads the `crio.service`, reloads the systemd manager configuration, and restarts the `crio.service`.
141+
142+
.Example node disruption policy for a systemd unit change
143+
[source,yaml]
144+
----
145+
apiVersion: operator.openshift.io/v1
145+
kind: MachineConfiguration
146+
metadata:
147+
  name: cluster
148+
  namespace: openshift-machine-config-operator
149+
# ...
150+
spec:
151+
  nodeDisruptionPolicy:
152+
    units:
153+
    - name: auditd.service
154+
      actions:
155+
      - type: Drain
156+
      - type: Reload
157+
        reload:
158+
          serviceName: crio.service
159+
      - type: DaemonReload
160+
      - type: Restart
161+
        restart:
162+
          serviceName: crio.service
164+
----
165+
166+
In the following example, when changes are made to the `registries.conf` file, the MCO does not drain or reboot the nodes and applies the changes with no further action.
167+
168+
.Example node disruption policy for a configuration file change
169+
[source,yaml]
170+
----
171+
apiVersion: operator.openshift.io/v1
171+
kind: MachineConfiguration
172+
metadata:
173+
  name: cluster
174+
  namespace: openshift-machine-config-operator
175+
# ...
176+
spec:
177+
  nodeDisruptionPolicy:
    files:
178+
    - actions:
179+
      - type: None
180+
      path: /etc/containers/registries.conf
182+
----

modules/machine-config-overview.adoc

Lines changed: 5 additions & 1 deletion
@@ -70,7 +70,11 @@ endif::openshift-origin[]
7070

7171
The MCO is not the only Operator that can change operating system components on {product-title} nodes. Other Operators can modify operating system-level features as well. One example is the Node Tuning Operator, which allows you to do node-level tuning through Tuned daemon profiles.
7272

73-
Tasks for the MCO configuration that can be done postinstallation are included in the following procedures. See descriptions of {op-system} bare metal installation for system configuration tasks that must be done during or before {product-title} installation.
73+
Tasks for the MCO configuration that can be done after installation are included in the following procedures. See descriptions of {op-system} bare metal installation for system configuration tasks that must be done during or before {product-title} installation. By default, many of the changes you make with the MCO require a reboot.
74+
75+
include::snippets/node-icsp-no-drain.adoc[]
76+
77+
In other cases, you can mitigate the disruption to your workload when the MCO makes changes by using _node disruption policies_. For more information, see _Understanding node restart behaviors after machine config changes_.
7478

7579
There might be situations where the configuration on a node does not fully match what the currently applied machine config specifies. This state is called _configuration drift_. The Machine Config Daemon (MCD) regularly checks the nodes for configuration drift. If the MCD detects configuration drift, the MCO marks the node `degraded` until an administrator corrects the node configuration. A degraded node is online and operational, but it cannot be updated. For more information on configuration drift, see _Understanding configuration drift detection_.
7680

modules/understanding-machine-config-operator.adoc

Lines changed: 3 additions & 1 deletion
@@ -44,7 +44,9 @@ When you perform node management operations, you create or modify a
4444
====
4545
When changes are made to a machine configuration, the Machine Config Operator (MCO) automatically reboots all corresponding nodes in order for the changes to take effect.
4646
47-
To prevent the nodes from automatically rebooting after machine configuration changes, before making the changes, you must pause the autoreboot process by setting the `spec.paused` field to `true` in the corresponding machine config pool. When paused, machine configuration changes are not applied until you set the `spec.paused` field to `false` and the nodes have rebooted into the new configuration.
47+
You can mitigate the disruption caused by some machine config changes by using a node disruption policy. See _Understanding node restart behaviors after machine config changes_.
48+
49+
Alternatively, you can prevent the nodes from automatically rebooting after machine configuration changes. Before making the changes, pause the autoreboot process by setting the `spec.paused` field to `true` in the corresponding machine config pool. When paused, machine configuration changes are not applied until you set the `spec.paused` field to `false` and the nodes have rebooted into the new configuration.
4850
4951
include::snippets/node-icsp-no-drain.adoc[]
5052
====

post_installation_configuration/machine-configuration-tasks.adoc

Lines changed: 19 additions & 0 deletions
@@ -24,6 +24,11 @@ Previously, NetworkManager stored new network configurations to `/etc/sysconfig/
2424

2525
include::modules/machine-config-operator.adoc[leveloffset=+2]
2626
include::modules/machine-config-overview.adoc[leveloffset=+2]
27+
28+
.Additional resources
29+
30+
* xref:../post_installation_configuration/machine-configuration-tasks.adoc#machine-config-node-disruption_post-install-machine-configuration-tasks[Understanding node restart behaviors after machine config changes]
31+
2732
include::modules/machine-config-drift-detection.adoc[leveloffset=+2]
2833
include::modules/checking-mco-status.adoc[leveloffset=+2]
2934
include::modules/checking-mco-node-status.adoc[leveloffset=+2]
@@ -38,6 +43,20 @@ include::modules/mco-update-boot-images.adoc[leveloffset=+1]
3843

3944
include::modules/mco-update-boot-images-disable.adoc[leveloffset=+2]
4045

46+
include::modules/machine-config-node-disruption.adoc[leveloffset=+1]
47+
48+
[role="_additional-resources"]
49+
.Additional resources
50+
51+
* xref:../architecture/control-plane.adoc#about-machine-config-operator_control-plane[About the Machine Config Operator]
52+
53+
include::modules/machine-config-node-disruption-config.adoc[leveloffset=+2]
54+
55+
[role="_additional-resources"]
56+
.Additional resources
57+
58+
* xref:../nodes/clusters/nodes-cluster-enabling-features.adoc#nodes-cluster-enabling[Enabling features using feature gates]
59+
4160
[id="using-machineconfigs-to-change-machines"]
4261
== Using MachineConfig objects to configure nodes
4362
