
Commit cbcbfcd

Merge pull request #88791 from kowen-rh/osdocs-13398
OSDOCS#13398: Expand etcd quorum restoration docs
2 parents 4f77f13 + 111299b commit cbcbfcd

File tree

4 files changed: +134 -6 lines changed

backup_and_restore/control_plane_backup_and_restore/disaster_recovery/quorum-restoration.adoc

Lines changed: 9 additions & 2 deletions
@@ -6,7 +6,14 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
-You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss.
+You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss. When quorum is lost, the {product-title} API becomes read-only. After quorum is restored, the {product-title} API returns to read/write mode.
 
 // Restoring etcd quorum for high availability clusters
-include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+1]
+include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+[id="additional-resources_dr-quorum-restoration"]
+== Additional resources
+
+* xref:../../../installing/installing_bare_metal/upi/installing-bare-metal.adoc#installing-bare-metal[Installing a user-provisioned cluster on bare metal]
+* xref:../../../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]
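
As an illustrative aside (not part of this commit), one way to confirm the read-only symptom described in the new abstract is to attempt a harmless write while quorum is lost; the annotation key used here is a hypothetical placeholder:

[source,terminal]
----
# While quorum is lost the API is read-only: reads such as this might still work...
$ oc get nodes

# ...but writes such as this annotation hang or fail until quorum is restored.
$ oc annotate namespace default quorum-check=1 --overwrite

# Remove the test annotation once the cluster is writable again.
$ oc annotate namespace default quorum-check-
----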

modules/dr-restoring-cluster-state-sno.adoc

Lines changed: 6 additions & 1 deletion
@@ -42,4 +42,9 @@ $ sudo -E /usr/local/bin/cluster-restore.sh /home/core/<etcd_backup_directory>
 [source,terminal]
 ----
 $ oc adm wait-for-stable-cluster
-----
+----
++
+[NOTE]
+====
+It can take up to 15 minutes for the control plane to recover.
+====
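
As an illustrative aside (not part of this commit), while `oc adm wait-for-stable-cluster` blocks you can follow recovery progress from a second terminal; the 30-second interval is an arbitrary choice:

[source,terminal]
----
# Poll cluster Operator status until every Operator reports Available=True,
# Progressing=False, and Degraded=False.
$ watch -n 30 'oc get clusteroperators'
----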

modules/dr-restoring-cluster-state.adoc

Lines changed: 5 additions & 0 deletions
@@ -88,6 +88,11 @@ $ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides":
 ----
 $ oc adm wait-for-stable-cluster
 ----
++
+[NOTE]
+====
+It can take up to 15 minutes for the control plane to recover.
+====
 
 . Once recovered, enable the quorum guard by running the following command:
 +
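
As an illustrative aside (not part of this commit), the command that the "enable the quorum guard" step refers to is truncated out of this diff view. Based on the related etcd restore documentation, and stated here only as an assumption, re-enabling the quorum guard typically means clearing the unsupported override:

[source,terminal]
----
# Assumed command: clears the override that disabled the quorum guard earlier in the restore.
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'
----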

modules/dr-restoring-etcd-quorum-ha.adoc

Lines changed: 114 additions & 3 deletions
@@ -13,6 +13,11 @@ You can use the `quorum-restore.sh` script to instantly bring back a new single-
 You might experience data loss if the host that runs the restoration does not have all data replicated to it.
 ====
 
+[IMPORTANT]
+====
+Quorum restoration should not be used to decrease the number of nodes outside of the restoration process. Decreasing the number of nodes results in an unsupported cluster configuration.
+====
+
 .Prerequisites
 
 * You have SSH access to the node used to restore quorum.
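
As an illustrative aside (not part of this commit), the SSH prerequisite typically means being able to log in to the control plane node as the `core` user; the key path and address below are placeholders:

[source,terminal]
----
# Confirm SSH access to the chosen recovery node before you start.
$ ssh -i <ssh_key_path> core@<control_plane_node_ip>
----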
@@ -21,26 +26,132 @@ You might experience data loss if the host that runs the restoration does not have all data replicated to it.
 
 . Select a control plane host to use as the recovery host. You run the restore operation on this host.
 
-. Using SSH, connect to the chosen recovery node and run the following command to restore etcd quorum:
+.. List the running etcd pods by running the following command:
++
+[source,terminal]
+----
+$ oc get pods -n openshift-etcd -l app=etcd --field-selector="status.phase==Running"
+----
+
+.. Choose a pod and run the following command to obtain its IP address:
++
+[source,terminal]
+----
+$ oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table
+----
++
+Note the IP address of a member that is not a learner and has the highest Raft index.
+
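
For orientation (not part of this commit), recent `etcdctl` releases print `IS LEARNER` and `RAFT INDEX` columns in the `-w table` output alongside the endpoint address. A trimmed, hypothetical illustration of the columns that matter for this step might look like the following, in which case you would note the `10.0.143.125` member:

[source,terminal]
----
ENDPOINT                    IS LEADER   IS LEARNER   RAFT INDEX
https://10.0.143.125:2379   true        false        41007
https://10.0.154.194:2379   false       false        40982
https://10.0.131.183:2379   false       true         38211
----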
+.. Run the following command and note the node name that corresponds to the IP address of the chosen etcd member:
++
+[source,terminal]
+----
+$ oc get nodes -o jsonpath='{range .items[*]}[{.metadata.name},{.status.addresses[?(@.type=="InternalIP")].address}]{end}'
+----
+
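
As an illustrative aside (not part of this commit), a variant of the lookup above that prints one node per line and filters on the member IP address you noted can save some scanning; `<member_ip>` is a placeholder:

[source,terminal]
----
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | grep <member_ip>
----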
+. Using SSH, connect to the chosen recovery node and run the following command to restore etcd quorum:
 +
 [source,terminal]
 ----
 $ sudo -E /usr/local/bin/quorum-restore.sh
 ----
++
+The nodes that went down are automatically synchronized with the node that the recovery script was run on, and any remaining online nodes automatically rejoin the new etcd cluster created by the `quorum-restore.sh` script. This process takes a few minutes.
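
As an illustrative aside (not part of this commit), once the script finishes you can confirm that members have rejoined by checking the member list from any running etcd pod; `<etcd-pod>` is a placeholder, as in the earlier step:

[source,terminal]
----
$ oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl member list -w table
----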
 
 . Exit the SSH session.
 
+. Return to a three-node configuration if any nodes are offline. Repeat the following steps for each node that is offline to delete and re-create them. After the machines are re-created, a new revision is forced and etcd automatically scales up.
++
+** If you use a user-provisioned bare-metal installation, you can re-create a control plane machine by using the same method that you used to originally create it. For more information, see "Installing a user-provisioned cluster on bare metal".
++
+[WARNING]
+====
+Do not delete and re-create the machine for the recovery host.
+====
++
+** If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps:
++
+[WARNING]
+====
+Do not delete and re-create the machine for the recovery host.
+
+For bare-metal installations on installer-provisioned infrastructure, control plane machines are not re-created. For more information, see "Replacing a bare-metal control plane node".
+====
+
+.. Obtain the machine for one of the offline nodes.
++
+In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output:
++
+[source,terminal]
+----
+NAME                                        PHASE     TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
+clustername-8qw5l-master-0                  Running   m4.xlarge   us-east-1   us-east-1a   3h37m   ip-10-0-131-183.ec2.internal   aws:///us-east-1a/i-0ec2782f8287dfb7e   stopped <1>
+clustername-8qw5l-master-1                  Running   m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-143-125.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
+clustername-8qw5l-master-2                  Running   m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-154-194.ec2.internal   aws:///us-east-1c/i-02626f1dba9ed5bba   running
+clustername-8qw5l-worker-us-east-1a-wbtgd   Running   m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
+clustername-8qw5l-worker-us-east-1b-lrdxb   Running   m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
+clustername-8qw5l-worker-us-east-1c-pkg26   Running   m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
+----
+<1> This is the control plane machine for the offline node, `ip-10-0-131-183.ec2.internal`.
+
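
As an illustrative aside (not part of this commit), if you already know the node name of the offline node, filtering the machine list directly avoids scanning the whole table; `<offline_node_name>` is a placeholder:

[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide | grep <offline_node_name>
----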
+.. Delete the machine of the offline node by running the following command:
++
+[source,terminal]
+----
+$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
+----
+<1> Specify the name of the control plane machine for the offline node.
++
+A new machine is automatically provisioned after deleting the machine of the offline node.
+
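
As an illustrative aside (not part of this commit), you can watch the replacement machine get provisioned instead of re-running the list command manually:

[source,terminal]
----
# Watch machine phases until the new control plane machine reaches Running.
$ oc get machines -n openshift-machine-api -w
----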
+. Verify that a new machine has been created by running the following command:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output:
++
+[source,terminal]
+----
+NAME                                        PHASE          TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
+clustername-8qw5l-master-1                  Running        m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-143-125.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
+clustername-8qw5l-master-2                  Running        m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-154-194.ec2.internal   aws:///us-east-1c/i-02626f1dba9ed5bba   running
+clustername-8qw5l-master-3                  Provisioning   m4.xlarge   us-east-1   us-east-1a   85s     ip-10-0-173-171.ec2.internal   aws:///us-east-1a/i-015b0888fe17bc2c8   running <1>
+clustername-8qw5l-worker-us-east-1a-wbtgd   Running        m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
+clustername-8qw5l-worker-us-east-1b-lrdxb   Running        m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
+clustername-8qw5l-worker-us-east-1c-pkg26   Running        m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
+----
+<1> The new machine, `clustername-8qw5l-master-3`, is being created and is ready after the phase changes from `Provisioning` to `Running`.
++
+It might take a few minutes for the new machine to be created. The etcd cluster Operator automatically synchronizes when the machine or node returns to a healthy state.
+
+.. Repeat these steps for each node that is offline.
+
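
As an illustrative aside (not part of this commit), after all offline machines have been re-created you can confirm that etcd has scaled back to three running members by reusing the pod listing from the first step:

[source,terminal]
----
# Expect one Running etcd pod per control plane node.
$ oc get pods -n openshift-etcd -l app=etcd --field-selector="status.phase==Running"
----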
 . Wait until the control plane recovers by running the following command:
 +
 [source,terminal]
 ----
 $ oc adm wait-for-stable-cluster
 ----
++
+[NOTE]
+====
+It can take up to 15 minutes for the control plane to recover.
+====
 
 .Troubleshooting
 
-If you see no progress rolling out the etcd static pods, you can force redeployment from the `cluster-etcd-operator` pod by running the following command:
-
+* If you see no progress rolling out the etcd static pods, you can force redeployment from the etcd cluster Operator by running the following command:
++
 [source,terminal]
 ----
 $ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns )"'"}}' --type=merge
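
As an illustrative aside (not part of this commit), after forcing redeployment you can watch the etcd static pods roll out to confirm that the redeployment is progressing:

[source,terminal]
----
# Each etcd pod is re-created in turn as the new revision rolls out.
$ oc get pods -n openshift-etcd -l app=etcd -w
----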
