
Commit 9f73d07

Merge pull request #88498 from kowen-rh/osdocs-12641
OSDOCS#12641: Reorganize and add docs for etcd recovery
2 parents: 93be00f + 586d7bd

9 files changed (+226, -618 lines)

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -3670,6 +3670,8 @@ Topics:
 Topics:
 - Name: About disaster recovery
   File: about-disaster-recovery
+- Name: Quorum restoration
+  File: quorum-restoration
 - Name: Restoring to a previous cluster state
   File: scenario-2-restoring-cluster-state
 - Name: Recovering from expired control plane certificates

backup_and_restore/control_plane_backup_and_restore/disaster_recovery/about-disaster-recovery.adoc

Lines changed: 14 additions & 6 deletions
@@ -17,10 +17,17 @@ state.
 Disaster recovery requires you to have at least one healthy control plane host.
 ====
 
+xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/quorum-restoration.adoc#dr-quorum-restoration[Quorum restoration]:: This solution handles situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. This solution does not require an etcd backup.
++
+[NOTE]
+====
+If you have a majority of your control plane nodes still available and have an etcd quorum, then xref:../../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#replacing-unhealthy-etcd-member[replace a single unhealthy etcd member].
+====
+
 xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]::
 This solution handles situations where you want to restore your cluster to
 a previous state, for example, if an administrator deletes something critical.
-This also includes situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. As long as you have taken an etcd backup, you can follow this procedure to restore your cluster to a previous state.
+If you have taken an etcd backup, you can restore your cluster to a previous state.
 +
 If applicable, you might also need to xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
 +
@@ -30,15 +37,16 @@ Restoring to a previous cluster state is a destructive and destabilizing action t
 
 Prior to performing a restore, see xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state[About restoring cluster state] for more information on the impact to the cluster.
 ====
-+
-[NOTE]
-====
-If you have a majority of your masters still available and have an etcd quorum, then follow the procedure to xref:../../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#replacing-unhealthy-etcd-member[replace a single unhealthy etcd member].
-====
 
 xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[Recovering from expired control plane certificates]::
 This solution handles situations where your control plane certificates have
 expired. For example, if you shut down your cluster before the first certificate
 rotation, which occurs 24 hours after installation, your certificates will not
 be rotated and will expire. You can follow this procedure to recover from
 expired control plane certificates.
+
+// Testing restore procedures
+include::modules/dr-testing-restore-procedures.adoc[leveloffset=+1]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]
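
For reference, the restore paths described in this assembly assume that an etcd backup already exists. A minimal sketch of taking one follows; it is not part of this commit, and the `cluster-backup.sh` path and output directory are assumed conventions rather than text from this change.

[source,terminal]
----
# Assumed invocation: back up etcd and the static pod resources on a control plane host.
# Script path and target directory are assumptions, not taken from this commit.
$ sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----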
backup_and_restore/control_plane_backup_and_restore/disaster_recovery/quorum-restoration.adoc

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="dr-quorum-restoration"]
+= Quorum restoration
+include::_attributes/common-attributes.adoc[]
+:context: dr-quorum-restoration
+
+toc::[]
+
+You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss.
+
+// Restoring etcd quorum for high availability clusters
+include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+1]
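
The new assembly introduces the `quorum-restore.sh` script only by name; the included module carries the full procedure. As a hedged sketch of the typical flow (the host choice, script path, and `sudo -E` invocation are assumptions, not text from this commit), you connect to one surviving control plane host and run the script there:

[source,terminal]
----
# Sketch only: the script path and use of sudo -E are assumed, not taken from this commit.
$ ssh core@<surviving_control_plane_host>
$ sudo -E /usr/local/bin/quorum-restore.sh
----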

backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc

Lines changed: 3 additions & 2 deletions
@@ -11,6 +11,9 @@ To restore the cluster to a previous state, you must have previously xref:../../
 // About restoring to a previous cluster state
 include::modules/dr-restoring-cluster-state-about.adoc[leveloffset=+1]
 
+// Restoring to a previous cluster state for a single node
+include::modules/dr-restoring-cluster-state-sno.adoc[leveloffset=+1]
+
 // Restoring to a previous cluster state
 include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
 
@@ -23,5 +26,3 @@ include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
 * xref:../../../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]
 
 include::modules/dr-scenario-cluster-state-issues.adoc[leveloffset=+1]
-
-
modules/dr-restoring-cluster-state-about.adoc

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Restoring to a previous cluster state is a destructive and destabilizing action t
 If you are able to retrieve data using the Kubernetes API server, then etcd is available and you should not restore using an etcd backup.
 ====
 
-Restoring etcd effectively takes a cluster back in time and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like kubelets, Kubernetes controller managers, persistent volume controllers, and OpenShift operators, including the network operator.
+Restoring etcd effectively takes a cluster back in time and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like kubelets, Kubernetes controller managers, persistent volume controllers, and {product-title} Operators, including the network Operator.
 
 It can cause Operator churn when the content in etcd does not match the actual content on disk, causing Operators for the Kubernetes API server, Kubernetes controller manager, Kubernetes scheduler, and etcd to get stuck when files on disk conflict with content in etcd. This can require manual actions to resolve the issues.
 
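The hunk above keeps the guidance that you should not restore from an etcd backup while the Kubernetes API server can still return data. A minimal, hypothetical check (not part of this commit) is any read through the API server, for example:

[source,terminal]
----
# If this returns node data, the API server can read from etcd; do not restore from a backup.
$ oc get nodes
----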
modules/dr-restoring-cluster-state-sno.adoc

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+// Module included in the following assemblies:
+//
+// * disaster_recovery/scenario-2-restoring-cluster-state.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="dr-restoring-cluster-state-sno_{context}"]
+= Restoring to a previous cluster state for a single node
+
+You can use a saved etcd backup to restore a previous cluster state on a single node.
+
+[IMPORTANT]
+====
+When you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} {product-version}.2 cluster must use an etcd backup that was taken from {product-version}.2.
+====
+
+.Prerequisites
+
+* You have access to the cluster as a user with the `cluster-admin` role through a certificate-based `kubeconfig` file, like the one that was used during installation.
+* You have SSH access to control plane hosts.
+* You have a backup directory containing both the etcd snapshot and the resources for the static pods, which were from the same backup. The file names in the directory must be in the following formats: `snapshot_<datetimestamp>.db` and `static_kuberesources_<datetimestamp>.tar.gz`.
+
+.Procedure
+
+. Use SSH to connect to the single node and copy the etcd backup to the `/home/core` directory by running the following command:
++
+[source,terminal]
+----
+$ cp -r <etcd_backup_directory> /home/core
+----
+
+. Run the following command on the single node to restore the cluster from a previous backup:
++
+[source,terminal]
+----
+$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/<etcd_backup_directory>
+----
+
+. Exit the SSH session.
+
+. Monitor the recovery progress of the control plane by running the following command:
++
+[source,terminal]
+----
+$ oc adm wait-for-stable-cluster
+----
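
As an illustration of the backup directory prerequisite in this module (hypothetical listing, not part of this commit), the directory passed to `cluster-restore.sh` is expected to contain the two files named in the prerequisites:

[source,terminal]
----
# Hypothetical contents; file names follow the formats given in the prerequisites.
$ ls /home/core/<etcd_backup_directory>
snapshot_<datetimestamp>.db
static_kuberesources_<datetimestamp>.tar.gz
----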
