
Commit 9f73d07

Merge pull request #88498 from kowen-rh/osdocs-12641
OSDOCS#12641: Reorganize and add docs for etcd recovery
2 parents: 93be00f + 586d7bd

9 files changed (+226, -618 lines)

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -3670,6 +3670,8 @@ Topics:
 Topics:
 - Name: About disaster recovery
   File: about-disaster-recovery
+- Name: Quorum restoration
+  File: quorum-restoration
 - Name: Restoring to a previous cluster state
   File: scenario-2-restoring-cluster-state
 - Name: Recovering from expired control plane certificates

backup_and_restore/control_plane_backup_and_restore/disaster_recovery/about-disaster-recovery.adoc

Lines changed: 14 additions & 6 deletions
@@ -17,10 +17,17 @@ state.
 Disaster recovery requires you to have at least one healthy control plane host.
 ====
 
+xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/quorum-restoration.adoc#dr-quorum-restoration[Quorum restoration]:: This solution handles situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. This solution does not require an etcd backup.
++
+[NOTE]
+====
+If you have a majority of your control plane nodes still available and have an etcd quorum, then xref:../../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#replacing-unhealthy-etcd-member[replace a single unhealthy etcd member].
+====
+
 xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]::
 This solution handles situations where you want to restore your cluster to
 a previous state, for example, if an administrator deletes something critical.
-This also includes situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. As long as you have taken an etcd backup, you can follow this procedure to restore your cluster to a previous state.
+If you have taken an etcd backup, you can restore your cluster to a previous state.
 +
 If applicable, you might also need to xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[recover from expired control plane certificates].
 +
@@ -30,15 +37,16 @@ Restoring to a previous cluster state is a destructive and destabilizing action t
 
 Prior to performing a restore, see xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-scenario-2-restoring-cluster-state-about_dr-restoring-cluster-state[About restoring cluster state] for more information on the impact to the cluster.
 ====
-+
-[NOTE]
-====
-If you have a majority of your masters still available and have an etcd quorum, then follow the procedure to xref:../../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#replacing-unhealthy-etcd-member[replace a single unhealthy etcd member].
-====
 
 xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-3-expired-certs.adoc#dr-recovering-expired-certs[Recovering from expired control plane certificates]::
 This solution handles situations where your control plane certificates have
 expired. For example, if you shut down your cluster before the first certificate
 rotation, which occurs 24 hours after installation, your certificates will not
 be rotated and will expire. You can follow this procedure to recover from
 expired control plane certificates.
+
+// Testing restore procedures
+include::modules/dr-testing-restore-procedures.adoc[leveloffset=+1]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]
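
For reference, the restore paths described in this assembly assume that an etcd backup already exists. A minimal sketch of taking one follows; it is not part of this commit, and the `cluster-backup.sh` path and output directory are assumed conventions rather than text from this change.

[source,terminal]
----
# Assumed invocation: back up etcd and the static pod resources on a control plane host.
# Script path and target directory are assumptions, not taken from this commit.
$ sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----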
backup_and_restore/control_plane_backup_and_restore/disaster_recovery/quorum-restoration.adoc

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="dr-quorum-restoration"]
+= Quorum restoration
+include::_attributes/common-attributes.adoc[]
+:context: dr-quorum-restoration
+
+toc::[]
+
+You can use the `quorum-restore.sh` script to restore etcd quorum on clusters that are offline due to quorum loss.
+
+// Restoring etcd quorum for high availability clusters
+include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+1]
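
The new assembly introduces the `quorum-restore.sh` script only by name; the included module carries the full procedure. As a hedged sketch of the typical flow (the host choice, script path, and `sudo -E` invocation are assumptions, not text from this commit), you connect to one surviving control plane host and run the script there:

[source,terminal]
----
# Sketch only: the script path and use of sudo -E are assumed, not taken from this commit.
$ ssh core@<surviving_control_plane_host>
$ sudo -E /usr/local/bin/quorum-restore.sh
----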

backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc

Lines changed: 3 additions & 2 deletions
@@ -11,6 +11,9 @@ To restore the cluster to a previous state, you must have previously xref:../../
 // About restoring to a previous cluster state
 include::modules/dr-restoring-cluster-state-about.adoc[leveloffset=+1]
 
+// Restoring to a previous cluster state for a single node
+include::modules/dr-restoring-cluster-state-sno.adoc[leveloffset=+1]
+
 // Restoring to a previous cluster state
 include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
 
@@ -23,5 +26,3 @@ include::modules/dr-restoring-cluster-state.adoc[leveloffset=+1]
 * xref:../../../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]
 
 include::modules/dr-scenario-cluster-state-issues.adoc[leveloffset=+1]
-
-
modules/dr-restoring-cluster-state-about.adoc

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Restoring to a previous cluster state is a destructive and destabilizing action t
 If you are able to retrieve data using the Kubernetes API server, then etcd is available and you should not restore using an etcd backup.
 ====
 
-Restoring etcd effectively takes a cluster back in time and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like kubelets, Kubernetes controller managers, persistent volume controllers, and OpenShift operators, including the network operator.
+Restoring etcd effectively takes a cluster back in time and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like kubelets, Kubernetes controller managers, persistent volume controllers, and {product-title} Operators, including the network Operator.
 
 It can cause Operator churn when the content in etcd does not match the actual content on disk, causing Operators for the Kubernetes API server, Kubernetes controller manager, Kubernetes scheduler, and etcd to get stuck when files on disk conflict with content in etcd. This can require manual actions to resolve the issues.
 
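The hunk above keeps the guidance that you should not restore from an etcd backup while the Kubernetes API server can still return data. A minimal, hypothetical check (not part of this commit) is any read through the API server, for example:

[source,terminal]
----
# If this returns node data, the API server can read from etcd; do not restore from a backup.
$ oc get nodes
----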
modules/dr-restoring-cluster-state-sno.adoc

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+// Module included in the following assemblies:
+//
+// * disaster_recovery/scenario-2-restoring-cluster-state.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="dr-restoring-cluster-state-sno_{context}"]
+= Restoring to a previous cluster state for a single node
+
+You can use a saved etcd backup to restore a previous cluster state on a single node.
+
+[IMPORTANT]
+====
+When you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} {product-version}.2 cluster must use an etcd backup that was taken from {product-version}.2.
+====
+
+.Prerequisites
+
+* You have access to the cluster as a user with the `cluster-admin` role through a certificate-based `kubeconfig` file, like the one that was used during installation.
+* You have SSH access to control plane hosts.
+* You have a backup directory containing both the etcd snapshot and the resources for the static pods, which were from the same backup. The file names in the directory must be in the following formats: `snapshot_<datetimestamp>.db` and `static_kuberesources_<datetimestamp>.tar.gz`.
+
+.Procedure
+
+. Use SSH to connect to the single node and copy the etcd backup to the `/home/core` directory by running the following command:
++
+[source,terminal]
+----
+$ cp -r <etcd_backup_directory> /home/core
+----
+
+. Run the following command on the single node to restore the cluster from a previous backup:
++
+[source,terminal]
+----
+$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/<etcd_backup_directory>
+----
+
+. Exit the SSH session.
+
+. Monitor the recovery progress of the control plane by running the following command:
++
+[source,terminal]
+----
+$ oc adm wait-for-stable-cluster
+----
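
As an illustration of the backup directory prerequisite in this module (hypothetical listing, not part of this commit), the directory passed to `cluster-restore.sh` is expected to contain the two files named in the prerequisites:

[source,terminal]
----
# Hypothetical contents; file names follow the formats given in the prerequisites.
$ ls /home/core/<etcd_backup_directory>
snapshot_<datetimestamp>.db
static_kuberesources_<datetimestamp>.tar.gz
----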
