-You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss.
+You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss. When quorum is lost, the {product-title} API becomes read-only. After quorum is restored, the {product-title} API returns to read/write mode.

 // Restoring etcd quorum for high availability clusters
 * xref:../../../installing/installing_bare_metal/upi/installing-bare-metal.adoc#installing-bare-metal[Installing a user-provisioned cluster on bare metal]
+* xref:../../../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]
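Review note: the quorum loss that the assembly text refers to follows etcd's majority rule, which a small sketch can make concrete. The function names `quorum_size` and `has_quorum` are hypothetical helpers, not part of `quorum-restore.sh` or any {product-title} tooling, and the sketch assumes every member counted is a voting member (learners excluded).

```shell
#!/usr/bin/env bash
# Illustrative only: majority-quorum arithmetic for an etcd cluster.
# Assumption: "members" counts voting members only (no learners).

# Quorum size is a strict majority: floor(N / 2) + 1.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Succeeds (exit 0) when enough healthy voting members remain for quorum.
has_quorum() {
  local members=$1 healthy=$2
  [ "$healthy" -ge "$(quorum_size "$members")" ]
}
```

Under this rule, a three-member control plane that loses two members no longer has a majority, which is the situation the restore procedure addresses.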
modules/dr-restoring-etcd-quorum-ha.adoc
114 additions & 3 deletions
@@ -13,6 +13,11 @@ You can use the `quorum-restore.sh` script to instantly bring back a new single-
 You might experience data loss if the host that runs the restoration does not have all data replicated to it.
 ====

+[IMPORTANT]
+====
+Quorum restoration should not be used to decrease the number of nodes outside of the restoration process. Decreasing the number of nodes results in an unsupported cluster configuration.
+====
+
 .Prerequisites

 * You have SSH access to the node used to restore quorum.
@@ -21,26 +26,132 @@ You might experience data loss if the host that runs the restoration does not ha

 . Select a control plane host to use as the recovery host. You run the restore operation on this host.

-. Using SSH, connect to the chosen recovery node and run the following command to restore etcd quorum:
+.. List the running etcd pods by running the following command:
++
+[source,terminal]
+----
+$ oc get pods -n openshift-etcd -l app=etcd --field-selector="status.phase==Running"
+----
+
+.. Choose a pod and run the following command to obtain its IP address:
+Note the IP address of a member that is not a learner and has the highest Raft index.
+
+.. Run the following command and note the node name that corresponds to the IP address of the chosen etcd member:
++
+[source,terminal]
+----
+$ oc get nodes -o jsonpath='{range .items[*]}[{.metadata.name},{.status.addresses[?(@.type=="InternalIP")].address}]{end}'
+----
+
+. Using SSH, connect to the chosen recovery node and run the following command to restore etcd quorum:
 +
 [source,terminal]
 ----
 $ sudo -E /usr/local/bin/quorum-restore.sh
 ----
++
+After a few minutes, the nodes that went down are automatically synchronized with the node that the recovery script was run on. Any remaining online nodes automatically rejoin the new etcd cluster created by the `quorum-restore.sh` script. This process takes a few minutes.

 . Exit the SSH session.

+. Return to a three-node configuration if any nodes are offline. Repeat the following steps for each node that is offline to delete and re-create them. After the machines are re-created, a new revision is forced and etcd automatically scales up.
++
+** If you use a user-provisioned bare-metal installation, you can re-create a control plane machine by using the same method that you used to originally create it. For more information, see "Installing a user-provisioned cluster on bare metal".
++
+[WARNING]
+====
+Do not delete and re-create the machine for the recovery host.
+====
++
+** If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps:
++
+[WARNING]
+====
+Do not delete and re-create the machine for the recovery host.
+
+For bare-metal installations on installer-provisioned infrastructure, control plane machines are not re-created. For more information, see "Replacing a bare-metal control plane node".
+====
+
+.. Obtain the machine for one of the offline nodes.
++
+In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output:
++
+[source,terminal]
+----
+NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
+<1> The new machine, `clustername-8qw5l-master-3` is being created and is ready after the phase changes from `Provisioning` to `Running`.
++
+It might take a few minutes for the new machine to be created. The etcd cluster Operator will automatically synchronize when the machine or node returns to a healthy state.
+
+.. Repeat these steps for each node that is offline.
+
 . Wait until the control plane recovers by running the following command:
 +
 [source,terminal]
 ----
 $ oc adm wait-for-stable-cluster
 ----
++
+[NOTE]
+====
+It can take up to 15 minutes for the control plane to recover.
+====

 .Troubleshooting

-If you see no progress rolling out the etcd static pods, you can force redeployment from the `cluster-etcd-operator` pod by running the following command:
-
+* If you see no progress rolling out the etcd static pods, you can force redeployment from the etcd cluster Operator by running the following command:
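Review note: the recovery-host selection substeps in this diff (pick a member that is not a learner and has the highest Raft index, then map its IP address to a node name) could be scripted roughly as follows. This is a sketch under stated assumptions, not part of the documented procedure. `pick_recovery_ip` reads a hypothetical plain-text format of one `<ip> <is_learner> <raft_index>` line per member, because the command that produces the member status is not visible in this view. `node_for_ip` parses the `[name,ip][name,ip]` string that the `oc get nodes -o jsonpath=...` command in the diff prints, assuming one InternalIP per node. Both function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; names and the member-list input format are assumptions.

# Reads lines of "<ip> <is_learner> <raft_index>" on stdin (hypothetical format)
# and prints the IP of the non-learner member with the highest Raft index.
pick_recovery_ip() {
  awk '$2 == "false" && (best == "" || $3 + 0 > max) { max = $3 + 0; best = $1 }
       END { if (best != "") print best }'
}

# Usage: node_for_ip <ip> <jsonpath-output>
# Splits "[name,ip][name,ip]..." into one "name,ip" record per line and
# prints the node name whose IP matches exactly.
node_for_ip() {
  local ip=$1 nodes=$2
  printf '%s\n' "$nodes" | tr -d '[' | tr ']' '\n' |
    awk -F, -v ip="$ip" '$2 == ip { print $1 }'
}
```

In use, the output of `pick_recovery_ip` would be passed as the first argument to `node_for_ip`, together with the captured output of the `oc get nodes` command, to obtain the node to connect to over SSH.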