
Commit 925b271

Merge pull request #93059 from lahinson/osdocs-12351-etcd-migration-backup-restore
[OSDOCS-12351]: Migrating etcd backup/restore docs to new etcd book
2 parents 1f81a03 + 44616a1 commit 925b271

19 files changed: +181 -14 lines

_topic_maps/_topic_map.yml

Lines changed: 1 addition & 1 deletion
@@ -2519,7 +2519,7 @@ Topics:
   File: etcd-practices
 - Name: Performance considerations for etcd
   File: etcd-performance
-- Name: Backing up etcd data
+- Name: Backing up and restoring etcd data
   File: etcd-backup
 - Name: Encrypting etcd data
   File: etcd-encrypt

etcd/etcd-backup.adoc

Lines changed: 144 additions & 2 deletions
@@ -1,7 +1,149 @@
 :_mod-docs-content-type: ASSEMBLY
 [id="etcd-backup"]
 include::_attributes/common-attributes.adoc[]
-= Backing up etcd
+= Backing up and restoring etcd data
 :context: etcd-backup
 
-// This assembly will contain modules to provide information about backing up and restoring etcd.
+toc::[]
+
+As the key-value store for {product-title}, etcd persists the state of all resource objects.
+
+Back up the etcd data for your cluster regularly and store it in a secure location, ideally outside the {product-title} environment. Do not take an etcd backup before the first certificate rotation completes, which occurs 24 hours after installation; otherwise, the backup will contain expired certificates. Also take etcd backups during non-peak usage hours, because the etcd snapshot has a high I/O cost.
+
+Be sure to take an etcd backup before you update your cluster. This is important because, when you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} 4.17.5 cluster must use an etcd backup that was taken from 4.17.5.
+
+[IMPORTANT]
+====
+Back up your cluster's etcd data by performing a single invocation of the backup script on a control plane host. Do not take a backup for each control plane host.
+====
+
+After you have an etcd backup, you can xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[restore to a previous cluster state].
+
+// Backing up etcd data
+include::modules/backup-etcd.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../hosted_control_planes/hcp_high_availability/hcp-recovering-etcd-cluster.adoc[Recovering an unhealthy etcd cluster for {hcp}]
+
+// Creating automated etcd backups
+include::modules/etcd-creating-automated-backups.adoc[leveloffset=+1]
+
+// Creating a single etcd backup
+include::modules/creating-single-etcd-backup.adoc[leveloffset=+2]
+
+// Creating recurring etcd backups
+include::modules/creating-recurring-etcd-backups.adoc[leveloffset=+2]
+
+[id="replace-unhealthy-etcd-member_{context}"]
+== Replacing an unhealthy etcd member
+
+The process to replace a single unhealthy etcd member depends on whether the etcd member is unhealthy because the machine is not running or the node is not ready, or because the etcd pod is crashlooping.
+
+[NOTE]
+====
+If you have lost the majority of your control plane hosts, follow the disaster recovery procedure to xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[restore to a previous cluster state] instead of this procedure.
+
+If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to xref:../etcd/etcd-backup.adoc#dr-scenario-3-recovering-expired-certs_etcd-backup[recover from expired control plane certificates] instead of this procedure.
+
+If a control plane node is lost and a new one is created, the etcd cluster Operator handles generating the new TLS certificates and adding the node as an etcd member.
+====
+
+// Identifying an unhealthy etcd member
+include::modules/restore-identify-unhealthy-etcd-member.adoc[leveloffset=+2]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../etcd/etcd-backup.adoc#backing-up-etcd-data_etcd-backup[Backing up etcd data]
+
+// Determining the state of the unhealthy etcd member
+include::modules/restore-determine-state-etcd-member.adoc[leveloffset=+2]
+
+// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
+include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+3]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator]
+* link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2023/html/assisted_installer_for_openshift_container_platform/expanding-the-cluster#installing-primary-control-plane-node-unhealthy-cluster_expanding-the-cluster[Installing a primary control plane node on an unhealthy cluster]
+
+// Replacing an unhealthy etcd member whose etcd pod is crashlooping
+include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+3]
+
+// Replacing an unhealthy baremetal stopped etcd member
+include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+3]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]
+
+[id="etcd-disaster-recovery_{context}"]
+== Disaster recovery
+
+The disaster recovery documentation provides information for administrators on how to recover from several disaster situations that might occur with their {product-title} cluster. As an administrator, you might need to follow one or more of the following procedures to return your cluster to a working state.
+
+[IMPORTANT]
+====
+Disaster recovery requires you to have at least one healthy control plane host.
+====
+
+xref:../etcd/etcd-backup.adoc#dr-restoring-etcd-quorum-ha_etcd-backup[Restoring etcd quorum for high availability clusters]:: This solution handles situations where you have lost the majority of your control plane hosts, leading to etcd quorum loss and the cluster going offline. This solution does not require an etcd backup.
++
+[NOTE]
+====
+If you have a majority of your control plane nodes still available and have an etcd quorum, xref:../etcd/etcd-backup.adoc#replace-unhealthy-etcd-member_etcd-backup[replace a single unhealthy etcd member].
+====
+
+xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[Restoring to a previous cluster state]:: This solution handles situations where you want to restore your cluster to a previous state, for example, if an administrator deletes something critical. If you have taken an etcd backup, you can restore your cluster to a previous state.
++
+If applicable, you might also need to xref:../etcd/etcd-backup.adoc#dr-scenario-3-recovering-expired-certs_etcd-backup[recover from expired control plane certificates].
++
+[WARNING]
+====
+Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. Use this procedure only as a last resort.
+
+Before performing a restore, see xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[Restoring to a previous cluster state] for more information about the impact to the cluster.
+====
+
+xref:../etcd/etcd-backup.adoc#dr-scenario-3-recovering-expired-certs_etcd-backup[Recovering from expired control plane certificates]:: This solution handles situations where your control plane certificates have expired. For example, if you shut down your cluster before the first certificate rotation, which occurs 24 hours after installation, your certificates will not be rotated and will expire. You can follow this procedure to recover from expired control plane certificates.
+
+// Testing restore procedures
+include::modules/dr-testing-restore-procedures.adoc[leveloffset=+2]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../etcd/etcd-backup.adoc#dr-scenario-2-restoring-cluster-state-about_etcd-backup[Restoring to a previous cluster state]
+
+// Restoring etcd quorum for high availability clusters
+include::modules/dr-restoring-etcd-quorum-ha.adoc[leveloffset=+2]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../installing/installing_bare_metal/upi/installing-bare-metal.adoc[Installing a user-provisioned cluster on bare metal]
+
+* xref:../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]
+
+// Restoring to a previous cluster state
+include::modules/dr-restoring-cluster-state-about.adoc[leveloffset=+2]
+
+// Restoring to a previous cluster state for a single node
+include::modules/dr-restoring-cluster-state-sno.adoc[leveloffset=+3]
+
+// Restoring to a previous cluster state
+include::modules/dr-restoring-cluster-state.adoc[leveloffset=+3]
+
+// Restoring a cluster from etcd backup manually
+include::modules/manually-restoring-cluster-etcd-backup.adoc[leveloffset=+3]
+
+[role="_additional-resources"]
+.Additional resources
+* xref:../etcd/etcd-backup.adoc#backing-up-etcd-data_etcd-backup[Backing up etcd data]
+* xref:../installing/installing_bare_metal/upi/installing-bare-metal.adoc[Installing a user-provisioned cluster on bare metal]
+* xref:../networking/accessing-hosts.adoc#accessing-hosts[Accessing hosts on Amazon Web Services in an installer-provisioned infrastructure cluster]
+* xref:../installing/installing_bare_metal/ipi/ipi-install-expanding-the-cluster.adoc#replacing-a-bare-metal-control-plane-node_ipi-install-expanding[Replacing a bare-metal control plane node]
+
+// Issues and workarounds for restoring a persistent storage state
+include::modules/dr-scenario-cluster-state-issues.adoc[leveloffset=+3]
+
+// Recovering from expired control plane certificates
+include::modules/dr-recover-expired-control-plane-certs.adoc[leveloffset=+2]
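
The IMPORTANT admonition in the new assembly text tells readers to invoke the backup script exactly once, on one control plane host. A minimal sketch of that invocation, assuming the `cluster-backup.sh` helper that ships on control plane nodes and placeholder values for the node name and output directory:

[source,terminal]
----
$ oc debug --as-root node/<control_plane_node>

sh-5.1# chroot /host

sh-5.1# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----

The script saves an etcd snapshot and an archive of static pod resources to the target directory; keep the two files together, because a restore needs both.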

hosted_control_planes/hcp_high_availability/hcp-recovering-etcd-cluster.adoc

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 :_mod-docs-content-type: ASSEMBLY
 [id="hcp-recovering-etcd-cluster"]
-= Recovering an unhealthy etcd cluster
+= Recovering an unhealthy etcd cluster for {hcp}
 include::_attributes/common-attributes.adoc[]
 :context: hcp-recovering-etcd-cluster

modules/backup-etcd.adoc

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 //
 // * backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc
 // * post_installation_configuration/cluster-tasks.adoc
+// * etcd/etcd-backup.adoc
 
 :_mod-docs-content-type: PROCEDURE
 [id="backing-up-etcd-data_{context}"]

modules/creating-recurring-etcd-backups.adoc

Lines changed: 9 additions & 4 deletions
@@ -1,5 +1,10 @@
+//Module included in the following assemblies:
+//
+// * etcd/etcd-backup.adoc
+
+:_mod-docs-content-type: PROCEDURE
 [id="creating-recurring-etcd-backups_{context}"]
-= Creating recurring etcd backups
+= Creating recurring automated etcd backups
 
 Follow these steps to create automated recurring backups of etcd.
 
@@ -190,7 +195,7 @@ $ oc apply -f etcd-backup-pvc.yaml
 
 . Create a custom resource definition (CRD) file named `etcd-recurring-backups.yaml`. The contents of the created CRD define the schedule and retention type of automated backups.
 +
-For the default retention type of `RetentionNumber` with 15 retained backups, use contents such as the following example:
+** For the default retention type of `RetentionNumber` with 15 retained backups, use contents such as the following example:
 +
 [source,yaml]
 ----
@@ -206,7 +211,7 @@ spec:
 ----
 <1> The `CronTab` schedule for recurring backups. Adjust this value for your needs.
 +
-To use retention based on the maximum number of backups, add the following key-value pairs to the `etcd` key:
+** To use retention based on the maximum number of backups, add the following key-value pairs to the `etcd` key:
 +
 [source,yaml]
 ----
@@ -225,7 +230,7 @@ spec:
 A known issue causes the number of retained backups to be one greater than the configured value.
 ====
 +
-For retention based on the file size of backups, use the following:
+** For retention based on the file size of backups, use the following:
 +
 [source,yaml]
 ----
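
The YAML bodies that these hunks convert into a nested list are elided from the diff. For orientation, a sketch of a recurring-backup custom resource matching the `RetentionNumber` case described above, assuming the Tech Preview `Backup` API in `config.openshift.io/v1alpha1` (the schedule, time zone, and PVC name here are illustrative):

[source,yaml]
----
apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: etcd-recurring-backup
spec:
  etcd:
    schedule: "20 4 * * *"            # CronTab schedule for recurring backups
    timeZone: "UTC"
    retentionPolicy:
      retentionType: RetentionNumber  # the default retention type
      retentionNumber:
        maxNumberOfBackups: 15        # keep 15 backups, per the example above
    pvcName: etcd-backup-pvc          # PVC created earlier from etcd-backup-pvc.yaml
----

The `RetentionSize` variant in the last hunk would swap the `retentionNumber` block for a size-based one.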

modules/creating-single-etcd-backup.adoc

Lines changed: 2 additions & 1 deletion
@@ -1,10 +1,11 @@
 // Module included in the following assemblies:
 //
 // * backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc
+// * etcd/etcd-backup.adoc
 
 :_mod-docs-content-type: PROCEDURE
 [id="creating-single-etcd-backup_{context}"]
-= Creating a single etcd backup
+= Creating a single automated etcd backup
 
 Follow these steps to create a single etcd backup by creating and applying a custom resource (CR).
 
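For comparison with the recurring case, a single on-demand backup is requested with one CR. A sketch, assuming the Tech Preview `EtcdBackup` API in `operator.openshift.io/v1alpha1` and a pre-provisioned PVC (names are placeholders):

[source,yaml]
----
apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: etcd-single-backup
  namespace: openshift-etcd
spec:
  pvcName: etcd-backup-pvc  # PVC that receives the snapshot
----

Applying it with `oc apply -f etcd-single-backup.yaml` triggers one backup; progress is reported in the CR status.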
modules/dr-recover-expired-control-plane-certs.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 // Module included in the following assemblies:
 //
 // * disaster_recovery/scenario-3-expired-certs.adoc
+// * etcd/etcd-backup.adoc
 
 :_mod-docs-content-type: PROCEDURE
 [id="dr-scenario-3-recovering-expired-certs_{context}"]

modules/dr-restoring-cluster-state-about.adoc

Lines changed: 5 additions & 2 deletions
@@ -1,10 +1,13 @@
 // Module included in the following assemblies:
 //
 // * disaster_recovery/scenario-2-restoring-cluster-state.adoc
+// * etcd/etcd-backup.adoc
 
 :_mod-docs-content-type: CONCEPT
 [id="dr-scenario-2-restoring-cluster-state-about_{context}"]
-= About restoring cluster state
+= Restoring to a previous cluster state
+
+To restore the cluster to a previous state, you must have previously backed up the `etcd` data by creating a snapshot. You use this snapshot to restore the cluster state. For more information, see "Backing up etcd data".
 
 You can use an etcd backup to restore your cluster to a previous state. This can be used to recover from the following situations:
 
@@ -22,4 +25,4 @@ Restoring etcd effectively takes a cluster back in time and all clients will exp
 
 It can cause Operator churn when the content in etcd does not match the actual content on disk, causing Operators for the Kubernetes API server, Kubernetes controller manager, Kubernetes scheduler, and etcd to get stuck when files on disk conflict with content in etcd. This can require manual actions to resolve the issues.
 
-In extreme cases, the cluster can lose track of persistent volumes, delete critical workloads that no longer exist, reimage machines, and rewrite CA bundles with expired certificates.
+In extreme cases, the cluster can lose track of persistent volumes, delete critical workloads that no longer exist, reimage machines, and rewrite CA bundles with expired certificates.

modules/dr-restoring-cluster-state-sno.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 // Module included in the following assemblies:
 //
 // * disaster_recovery/scenario-2-restoring-cluster-state.adoc
+// * etcd/etcd-backup.adoc
 
 :_mod-docs-content-type: PROCEDURE
 [id="dr-restoring-cluster-state-sno_{context}"]

modules/dr-restoring-cluster-state.adoc

Lines changed: 2 additions & 1 deletion
@@ -2,6 +2,7 @@
 //
 // * disaster_recovery/scenario-2-restoring-cluster-state.adoc
 // * post_installation_configuration/cluster-tasks.adoc
+// * etcd/etcd-backup.adoc
 
 // Contributors: The documentation for this section changed drastically for 4.18+.
 
@@ -14,7 +15,7 @@
 
 :_mod-docs-content-type: PROCEDURE
 [id="dr-scenario-2-restoring-cluster-state_{context}"]
-= Restoring to a previous cluster state
+= Restoring to a previous cluster state for more than one node
 
 You can use a saved etcd backup to restore a previous cluster state or restore a cluster that has lost the majority of control plane hosts.
 
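The restore flow itself is outside these hunks, and the contributor comment above flags that it changed substantially for 4.18+. As a rough sketch of the long-standing flow on the recovery host (the helper path and backup directory below are the commonly documented ones, assumed here; the rewritten 4.18+ procedure may differ):

[source,terminal]
----
$ oc debug --as-root node/<recovery_control_plane_node>

sh-5.1# chroot /host

sh-5.1# /usr/local/bin/cluster-restore.sh /home/core/assets/backup
----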