:_mod-docs-content-type: ASSEMBLY
[id="etcd-backup"]
= Backing up and restoring etcd data
include::_attributes/common-attributes.adoc[]
:context: etcd-backup

toc::[]

As the key-value store for {product-title}, etcd persists the state of all resource objects.

Back up the etcd data for your cluster regularly and store it in a secure location, ideally outside the {product-title} environment. Do not take an etcd backup before the first certificate rotation completes, which occurs 24 hours after installation; otherwise, the backup will contain expired certificates. It is also recommended that you take etcd backups during non-peak usage hours because the etcd snapshot has a high I/O cost.

Be sure to take an etcd backup before you update your cluster. Taking a backup before you update is important because when you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} 4.17.5 cluster must use an etcd backup that was taken from 4.17.5.
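
A quick way to confirm the z-stream release that a backup corresponds to is to check the cluster version before you take the backup. The following is a minimal sketch, assuming you are logged in as a user with `cluster-admin` privileges; record the reported version alongside the backup.

[source,terminal]
----
$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
----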

[IMPORTANT]
====
Back up your cluster's etcd data by performing a single invocation of the backup script on a control plane host. Do not take a backup for each control plane host.
====
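
For orientation, the single-invocation pattern described above amounts to opening a debug shell on one control plane node and running the backup script once. The following is a condensed sketch with `<control_plane_node>` as a placeholder; the module that follows covers the complete procedure and its prerequisites.

[source,terminal]
----
$ oc debug node/<control_plane_node>
sh-4.4# chroot /host
sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----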

After you have an etcd backup, you can xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[restore to a previous cluster state].

// Backing up etcd data
include::modules/backup-etcd.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../hosted_control_planes/hcp_high_availability/hcp-recovering-etcd-cluster.adoc[Recovering an unhealthy etcd cluster for {hcp}]

// Creating automated etcd backups
include::modules/etcd-creating-automated-backups.adoc[leveloffset=+1]

// Creating a single etcd backup
include::modules/creating-single-etcd-backup.adoc[leveloffset=+2]

// Creating recurring etcd backups
include::modules/creating-recurring-etcd-backups.adoc[leveloffset=+2]

[id="replace-unhealthy-etcd-member_{context}"]
== Replacing an unhealthy etcd member

The process to replace a single unhealthy etcd member depends on whether the etcd member is unhealthy because the machine is not running or the node is not ready, or because the etcd pod is crashlooping.

[NOTE]
====
If you have lost the majority of your control plane hosts, follow the disaster recovery procedure to xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[restore to a previous cluster state] instead of this procedure.

If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to xref:../etcd/etcd-backup.adoc#dr-scenario-3-recovering-expired-certs_etcd-backup[recover from expired control plane certificates] instead of this procedure.

If a control plane node is lost and a new one is created, the etcd cluster Operator handles generating the new TLS certificates and adding the node as an etcd member.
====
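
Before choosing a replacement path, it can help to get a quick view of etcd member availability and pod status. The following commands are a brief sketch of that check, assuming `cluster-admin` access; the modules that follow describe identifying the unhealthy member and determining its state in detail.

[source,terminal]
----
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}{end}'
----

[source,terminal]
----
$ oc -n openshift-etcd get pods -l app=etcd
----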

// Identifying an unhealthy etcd member
include::modules/restore-identify-unhealthy-etcd-member.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../etcd/etcd-backup.adoc#backing-up-etcd-data_etcd-backup[Backing up etcd data]

// Determining the state of the unhealthy etcd member
include::modules/restore-determine-state-etcd-member.adoc[leveloffset=+2]

// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+3]

[role="_additional-resources"]
.Additional resources
* xref:../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator]
* link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2023/html/assisted_installer_for_openshift_container_platform/expanding-the-cluster#installing-primary-control-plane-node-unhealthy-cluster_expanding-the-cluster[Installing a primary control plane node on an unhealthy cluster]

// Replacing an unhealthy etcd member whose etcd pod is crashlooping
include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+3]

// Replacing an unhealthy stopped baremetal etcd member
include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+3]

[role="_additional-resources"]
.Additional resources
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]

[id="etcd-disaster-recovery_{context}"]
== Disaster recovery

The disaster recovery documentation provides information for administrators on how to recover from several disaster situations that might occur with their {product-title} cluster. As an administrator, you might need to follow one or more of the following procedures to return your cluster to a working state.

[IMPORTANT]
====
Disaster recovery requires you to have at least one healthy control plane host.
====
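
As a quick sanity check before starting any of these procedures, and if the API server is still reachable, you can confirm that at least one control plane node reports a `Ready` status. This is a minimal sketch, assuming `cluster-admin` access.

[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master
----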

xref:../etcd/etcd-backup.adoc#dr-restoring-etcd-quorum-ha_etcd-backup[Restoring etcd quorum for high availability clusters]:: This solution handles situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. This solution does not require an etcd backup.
+
[NOTE]
====
If you have a majority of your control plane nodes still available and have an etcd quorum, xref:../etcd/etcd-backup.adoc#replace-unhealthy-etcd-member_etcd-backup[replace a single unhealthy etcd member].
====

xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[Restoring to a previous cluster state]:: This solution handles situations where you want to restore your cluster to a previous state, for example, if an administrator deletes something critical. If you have taken an etcd backup, you can restore your cluster to a previous state; a condensed sketch of the restore invocation follows this list.
+
If applicable, you might also need to xref:../etcd/etcd-backup.adoc#dr-scenario-3-recovering-expired-certs_etcd-backup[recover from expired control plane certificates].
+
[WARNING]
====
Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. Use this procedure only as a last resort.

Before performing a restore, see xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[Restoring a cluster state] for more information about the impact to the cluster.
====

xref:../etcd/etcd-backup.adoc#dr-scenario-3-recovering-expired-certs_etcd-backup[Recovering from expired control plane certificates]:: This solution handles situations where your control plane certificates have expired. For example, if you shut down your cluster before the first certificate rotation, which occurs 24 hours after installation, your certificates will not be rotated and will expire. You can follow this procedure to recover from expired control plane certificates.
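
For orientation only, restoring to a previous cluster state centers on running the restore script with the backup directory on the chosen recovery control plane host. The following is a heavily condensed sketch that uses an example backup path; review and follow the full restore modules later in this assembly before attempting it.

[source,terminal]
----
$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup
----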

// Testing restore procedures
include::modules/dr-testing-restore-procedures.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[Restoring to a previous cluster state]

// Restoring etcd quorum for high availability clusters
include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources
* xref:../installing/installing_bare_metal/upi/installing-bare-metal.adoc[Installing a user-provisioned cluster on bare metal]
* xref:../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]

// Restoring to a previous cluster state
include::modules/dr-restoring-cluster-state-about.adoc[leveloffset=+2]

// Restoring to a previous cluster state for a single node
include::modules/dr-restoring-cluster-state-sno.adoc[leveloffset=+3]

// Restoring to a previous cluster state
include::modules/dr-restoring-cluster-state.adoc[leveloffset=+3]

// Restoring a cluster from etcd backup manually
include::modules/manually-restoring-cluster-etcd-backup.adoc[leveloffset=+3]

[role="_additional-resources"]
.Additional resources
* xref:../etcd/etcd-backup.adoc#backing-up-etcd-data_etcd-backup[Backing up etcd data]
* xref:../installing/installing_bare_metal/upi/installing-bare-metal.adoc[Installing a user-provisioned cluster on bare metal]
* xref:../networking/accessing-hosts.adoc#accessing-hosts[Accessing hosts on Amazon Web Services in an installer-provisioned infrastructure cluster]
* xref:../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]

// Issues and workarounds for restoring a persistent storage state
include::modules/dr-scenario-cluster-state-issues.adoc[leveloffset=+3]

// Recovering from expired control plane certificates
include::modules/dr-recover-expired-control-plane-certs.adoc[leveloffset=+2]