Commit 047fcd5

Merge pull request #86873 from bergerhoffer/OSDOCS-12867
OSDOCS#12867: Docs for hibernating a cluster
2 parents c969ec4 + f6710a4 commit 047fcd5

5 files changed

+278
-0
lines changed


_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -3540,6 +3540,8 @@ Topics:
35403540
File: graceful-cluster-shutdown
35413541
- Name: Restarting a cluster gracefully
35423542
File: graceful-cluster-restart
3543+
- Name: Hibernating a cluster
3544+
File: hibernating-cluster
35433545
- Name: OADP Application backup and restore
35443546
Dir: application_backup_and_restore
35453547
Topics:
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
:_mod-docs-content-type: ASSEMBLY
2+
[id="hibernating-cluster"]
3+
= Hibernating an {product-title} cluster
4+
include::_attributes/common-attributes.adoc[]
5+
:context: hibernating-cluster
6+
7+
toc::[]
8+
9+
You can hibernate your {product-title} cluster for up to 90 days.
10+
11+
// About hibernating a cluster
12+
include::modules/hibernating-cluster-about.adoc[leveloffset=+1]
13+
14+
[id="hibernating-cluster_prerequisites_{context}"]
15+
== Prerequisites
16+
17+
* Take an xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd backup] prior to hibernating the cluster.
18+
+
19+
[IMPORTANT]
20+
====
21+
It is important to take an etcd backup before hibernating so that your cluster can be restored if you encounter any issues when resuming the cluster.
22+
23+
For example, the following conditions can cause the resumed cluster to malfunction:
24+
25+
* etcd data corruption during hibernation
26+
* Node failure due to hardware problems
27+
* Network connectivity issues
28+
29+
If your cluster fails to recover, follow the steps to xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
30+
====
31+
32+
// Hibernating a cluster
33+
include::modules/hibernating-cluster-hibernate.adoc[leveloffset=+1]
34+
35+
[role="_additional-resources"]
36+
.Additional resources
37+
38+
* xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backup-etcd[Backing up etcd]
39+
40+
// Resuming a hibernated cluster
41+
include::modules/hibernating-cluster-resume.adoc[leveloffset=+1]
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * backup_and_restore/hibernating-cluster.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="hibernating-cluster-about_{context}"]
7+
= About cluster hibernation
8+
9+
You can hibernate {product-title} clusters to reduce cloud hosting costs. A cluster can stay hibernated for up to 90 days and still be expected to resume successfully.
10+
11+
You must wait at least 24 hours after cluster installation before hibernating your cluster to allow the first certificate rotation to complete.
12+
13+
[IMPORTANT]
14+
====
15+
If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
16+
====
17+
18+
When hibernating a cluster, you must hibernate all cluster nodes. Suspending only certain nodes is not supported.
19+
20+
After resuming, it can take up to 45 minutes for the cluster to become ready.
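
The 24-hour waiting period can be checked from the command line. The following is a minimal sketch, not part of the documented procedure: `hours_since` is a hypothetical helper name, it assumes GNU `date` (for the `-d` option), and it assumes that the creation timestamp of the `ClusterVersion` object named `version` is a reasonable proxy for the installation time.

```shell
# Hypothetical helper: print the whole hours elapsed since an RFC 3339 timestamp.
# The optional second argument is the "now" epoch in seconds (defaults to the current time).
# Assumes GNU date, which accepts the -d option.
hours_since() {
  created="$1"
  now_epoch="${2:-$(date -u +%s)}"
  created_epoch=$(date -u -d "$created" +%s) || return 1
  echo $(( (now_epoch - created_epoch) / 3600 ))
}

# Example usage (assumption: the ClusterVersion object is created at install time):
#   created=$(oc get clusterversion version -o jsonpath='{.metadata.creationTimestamp}')
#   [ "$(hours_since "$created")" -ge 24 ] && echo "Cluster is at least 24 hours old"
```

If the reported age is below 24, the first certificate rotation might not have completed yet, so the standard hibernation procedure does not apply.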
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * backup_and_restore/hibernating-cluster.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="hibernating-cluster-hibernate_{context}"]
7+
= Hibernating a cluster
8+
9+
You can hibernate a cluster for up to 90 days. The cluster can recover if certificates expire while the cluster is in hibernation.
10+
11+
.Prerequisites
12+
13+
* The cluster has been running for at least 24 hours to allow the first certificate rotation to complete.
14+
+
15+
[IMPORTANT]
16+
====
17+
If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
18+
====
19+
20+
* You have taken an etcd backup.
21+
22+
* You have access to the cluster as a user with the `cluster-admin` role.
23+
24+
.Procedure
25+
26+
. Confirm that your cluster has been installed for at least 24 hours.
27+
28+
. Ensure that all nodes are in a good state by running the following command:
29+
+
30+
[source,terminal]
31+
----
32+
$ oc get nodes
33+
----
34+
+
35+
.Example output
36+
[source,terminal]
37+
----
38+
NAME STATUS ROLES AGE VERSION
39+
ci-ln-812tb4k-72292-8bcj7-master-0 Ready control-plane,master 32m v1.31.3
40+
ci-ln-812tb4k-72292-8bcj7-master-1 Ready control-plane,master 32m v1.31.3
41+
ci-ln-812tb4k-72292-8bcj7-master-2 Ready control-plane,master 32m v1.31.3
42+
ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk Ready worker 19m v1.31.3
43+
ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv Ready worker 19m v1.31.3
44+
ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 Ready worker 19m v1.31.3
45+
----
46+
+
47+
All nodes should show `Ready` in the `STATUS` column.
48+
49+
. Ensure that all cluster Operators are in a good state by running the following command:
50+
+
51+
[source,terminal]
52+
----
53+
$ oc get clusteroperators
54+
----
55+
+
56+
.Example output
57+
[source,terminal]
58+
----
59+
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
60+
authentication 4.18.0-0 True False False 51m
61+
baremetal 4.18.0-0 True False False 72m
62+
cloud-controller-manager 4.18.0-0 True False False 75m
63+
cloud-credential 4.18.0-0 True False False 77m
64+
cluster-api 4.18.0-0 True False False 42m
65+
cluster-autoscaler 4.18.0-0 True False False 72m
66+
config-operator 4.18.0-0 True False False 72m
67+
console 4.18.0-0 True False False 55m
68+
...
69+
----
70+
+
71+
All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
72+
73+
. Ensure that all machine config pools are in a good state by running the following command:
74+
+
75+
[source,terminal]
76+
----
77+
$ oc get mcp
78+
----
79+
+
80+
.Example output
81+
[source,terminal]
82+
----
83+
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
84+
master rendered-master-87871f187930e67233c837e1d07f49c7 True False False 3 3 3 0 96m
85+
worker rendered-worker-3c4c459dc5d90017983d7e72928b8aed True False False 3 3 3 0 96m
86+
----
87+
+
88+
All machine config pools should show `UPDATING`=`False` and `DEGRADED`=`False`.
89+
90+
. Stop the cluster virtual machines:
91+
+
92+
Use the tools native to your cluster's cloud environment to shut down the cluster's virtual machines.
93+
+
94+
[IMPORTANT]
95+
====
96+
If you use a bastion virtual machine, do not shut down this virtual machine.
97+
====
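
The three health checks in this procedure can be scanned automatically instead of by eye. The following is a rough sketch under stated assumptions: the function names are hypothetical, each function reads the default tabular output of the corresponding `oc` command (including its header row) on standard input, and the column positions match the example outputs shown in the procedure. A resource with an empty column, such as a cluster Operator that has not yet reported a version, would shift the columns and be misread.

```shell
# Hypothetical helpers. Each one prints the names of resources that are NOT
# in the expected state, so empty output means the check passed.
# The first input line is assumed to be the header row and is skipped.

# Nodes whose STATUS column is not exactly "Ready"
# (a value such as "Ready,SchedulingDisabled" is also reported).
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Cluster Operators that are not AVAILABLE=True, PROGRESSING=False, DEGRADED=False.
unhealthy_operators() {
  awk 'NR > 1 && !($3 == "True" && $4 == "False" && $5 == "False") { print $1 }'
}

# Machine config pools that are not UPDATING=False and DEGRADED=False.
unsettled_pools() {
  awk 'NR > 1 && !($4 == "False" && $5 == "False") { print $1 }'
}

# Example usage:
#   oc get nodes | not_ready_nodes
#   oc get clusteroperators | unhealthy_operators
#   oc get mcp | unsettled_pools
```

Parsing the table output keeps the sketch close to the commands used in the procedure. For anything beyond a quick check, structured output such as `-o json` is likely to be more robust.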
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * backup_and_restore/hibernating-cluster.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="hibernating-cluster-resume_{context}"]
7+
= Resuming a hibernated cluster
8+
9+
When you resume a hibernated cluster within 90 days, you might have to approve certificate signing requests (CSRs) for the nodes to become ready.
10+
11+
It can take around 45 minutes for the cluster to resume, depending on the size of your cluster.
12+
13+
.Prerequisites
14+
15+
* You hibernated your cluster less than 90 days ago.
16+
* You have access to the cluster as a user with the `cluster-admin` role.
17+
18+
.Procedure
19+
20+
. Within 90 days of cluster hibernation, resume the cluster virtual machines:
21+
+
22+
Use the tools native to your cluster's cloud environment to resume the cluster's virtual machines.
23+
24+
. Wait about 5 minutes, depending on the number of nodes in your cluster.
25+
26+
. Approve CSRs for the nodes:
27+
28+
.. Check that there is a CSR for each node in the `NotReady` state:
29+
+
30+
[source,terminal]
31+
----
32+
$ oc get csr
33+
----
34+
+
35+
.Example output
36+
[source,terminal]
37+
----
38+
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
39+
csr-4dwsd 37m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 24h Pending
40+
csr-4vrbr 49m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-master-1 24h Pending
41+
csr-4wk5x 51m kubernetes.io/kubelet-serving system:node:ci-ln-812tb4k-72292-8bcj7-master-1 <none> Pending
42+
csr-84vb6 51m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
43+
----
44+
45+
.. Approve each valid CSR by running the following command:
46+
+
47+
[source,terminal]
48+
----
49+
$ oc adm certificate approve <csr_name>
50+
----
51+
52+
.. Verify that all necessary CSRs were approved by running the following command:
53+
+
54+
[source,terminal]
55+
----
56+
$ oc get csr
57+
----
58+
+
59+
.Example output
60+
[source,terminal]
61+
----
62+
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
63+
csr-4dwsd 37m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 24h Approved,Issued
64+
csr-4vrbr 49m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-master-1 24h Approved,Issued
65+
csr-4wk5x 51m kubernetes.io/kubelet-serving system:node:ci-ln-812tb4k-72292-8bcj7-master-1 <none> Approved,Issued
66+
csr-84vb6 51m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Approved,Issued
67+
----
68+
+
69+
CSRs should show `Approved,Issued` in the `CONDITION` column.
70+
71+
. Verify that all nodes now show as ready by running the following command:
72+
+
73+
[source,terminal]
74+
----
75+
$ oc get nodes
76+
----
77+
+
78+
.Example output
79+
[source,terminal]
80+
----
81+
NAME STATUS ROLES AGE VERSION
82+
ci-ln-812tb4k-72292-8bcj7-master-0 Ready control-plane,master 32m v1.31.3
83+
ci-ln-812tb4k-72292-8bcj7-master-1 Ready control-plane,master 32m v1.31.3
84+
ci-ln-812tb4k-72292-8bcj7-master-2 Ready control-plane,master 32m v1.31.3
85+
ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk Ready worker 19m v1.31.3
86+
ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv Ready worker 19m v1.31.3
87+
ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 Ready worker 19m v1.31.3
88+
----
89+
+
90+
All nodes should show `Ready` in the `STATUS` column. It might take a few minutes for all nodes to become ready after approving the CSRs.
91+
92+
. Wait for cluster Operators to restart to load the new certificates.
93+
+
94+
This might take 5 to 10 minutes.
95+
96+
. Verify that all cluster Operators are in a good state by running the following command:
97+
+
98+
[source,terminal]
99+
----
100+
$ oc get clusteroperators
101+
----
102+
+
103+
.Example output
104+
[source,terminal]
105+
----
106+
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
107+
authentication 4.18.0-0 True False False 51m
108+
baremetal 4.18.0-0 True False False 72m
109+
cloud-controller-manager 4.18.0-0 True False False 75m
110+
cloud-credential 4.18.0-0 True False False 77m
111+
cluster-api 4.18.0-0 True False False 42m
112+
cluster-autoscaler 4.18.0-0 True False False 72m
113+
config-operator 4.18.0-0 True False False 72m
114+
console 4.18.0-0 True False False 55m
115+
...
116+
----
117+
+
118+
All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
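
When many nodes resume at once, listing only the pending CSRs can make the approval step easier to follow. The following is a minimal sketch, assuming the default tabular output of `oc get csr` with `CONDITION` as the last column. `pending_csrs` is a hypothetical helper name, not an `oc` feature.

```shell
# Hypothetical helper: read `oc get csr` table output on stdin and print the
# names of CSRs whose last column (CONDITION) is exactly "Pending".
# The first input line is assumed to be the header row and is skipped.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example usage:
#   oc get csr | pending_csrs
# Review the listed requests, then approve each valid one:
#   oc adm certificate approve <csr_name>
```

The sketch only lists names and leaves the approval as a manual step, because the procedure asks you to approve each valid CSR rather than every pending one.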
