Merge pull request #86873 from bergerhoffer/OSDOCS-12867

bergerhoffer · web-flow · commit 047fcd55409e · 2025-02-05T08:17:09.000-05:00
OSDOCS#12867: Docs for hibernating a cluster
diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
@@ -3540,6 +3540,8 @@ Topics:
   File: graceful-cluster-shutdown
 - Name: Restarting a cluster gracefully
   File: graceful-cluster-restart
+- Name: Hibernating a cluster
+  File: hibernating-cluster
 - Name: OADP Application backup and restore
   Dir: application_backup_and_restore
   Topics:
diff --git a/backup_and_restore/hibernating-cluster.adoc b/backup_and_restore/hibernating-cluster.adoc
@@ -0,0 +1,41 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="hibernating-cluster"]
+= Hibernating an {product-title} cluster
+include::_attributes/common-attributes.adoc[]
+:context: hibernating-cluster
+
+toc::[]
+
+You can hibernate your {product-title} cluster for up to 90 days.
+
+// About hibernating a cluster
+include::modules/hibernating-cluster-about.adoc[leveloffset=+1]
+
+[id="hibernating-cluster_prerequisites_{context}"]
+== Prerequisites
+
+* Take an xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd backup] before hibernating the cluster.
++
+[IMPORTANT]
+====
+It is important to take an etcd backup before hibernating so that your cluster can be restored if you encounter any issues when resuming the cluster.
+
+For example, the following conditions can cause the resumed cluster to malfunction:
+
+* etcd data corruption during hibernation
+* Node failure due to hardware
+* Network connectivity issues
+
+If your cluster fails to recover, follow the steps to xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
+====
+
+// Hibernating a cluster
+include::modules/hibernating-cluster-hibernate.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+
+* xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backup-etcd[Backing up etcd]
+
+// Resuming a hibernated cluster
+include::modules/hibernating-cluster-resume.adoc[leveloffset=+1]
diff --git a/modules/hibernating-cluster-about.adoc b/modules/hibernating-cluster-about.adoc
@@ -0,0 +1,20 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/hibernating-cluster.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="hibernating-cluster-about_{context}"]
+= About cluster hibernation
+
+You can hibernate your {product-title} cluster to save money on cloud hosting costs. A cluster can remain hibernated for up to 90 days and still be expected to resume successfully.
+
+You must wait at least 24 hours after cluster installation before hibernating your cluster to allow for the first certificate rotation.
+
+[IMPORTANT]
+====
+If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
+====
+
+When hibernating a cluster, you must hibernate all cluster nodes. Suspending only certain nodes is not supported.
+
+After resuming, it can take up to 45 minutes for the cluster to become ready.
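
Reviewer note (not part of the patch): the two time limits described in this module, the 24-hour wait after installation and the 90-day hibernation window, can be checked with simple date arithmetic. The following is a minimal sketch only. The function names are hypothetical, and it assumes GNU `date` (coreutils) is available.

```shell
#!/bin/sh
# Hypothetical helpers for the time limits in this module. Assumes GNU date.

# Print the date (UTC, YYYY-MM-DD) that falls 90 days after the
# hibernation date given as $1.
latest_resume_date() {
  date -u -d "$1 +90 days" +%F
}

# Succeed if at least 24 hours separate the creation time ($1) from the
# check time ($2, defaults to the current time).
installed_at_least_24h() {
  created=$(date -u -d "$1" +%s) || return 2
  now=$(date -u -d "${2:-now}" +%s) || return 2
  [ $((now - created)) -ge 86400 ]
}
```

One possible source for the creation time is `oc get clusterversion version -o jsonpath='{.metadata.creationTimestamp}'`. Treating that timestamp as the installation time is an assumption of this sketch, not something the module states.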
diff --git a/modules/hibernating-cluster-hibernate.adoc b/modules/hibernating-cluster-hibernate.adoc
@@ -0,0 +1,97 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/hibernating-cluster.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="hibernating-cluster-hibernate_{context}"]
+= Hibernating a cluster
+
+You can hibernate a cluster for up to 90 days. The cluster can recover if certificates expire while the cluster is in hibernation.
+
+.Prerequisites
+
+* The cluster has been running for at least 24 hours to allow the first certificate rotation to complete.
++
+[IMPORTANT]
+====
+If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
+====
+
+* You have taken an etcd backup.
+
+* You have access to the cluster as a user with the `cluster-admin` role.
+
+.Procedure
+
+. Confirm that your cluster has been installed for at least 24 hours.
+
+. Ensure that all nodes are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get nodes
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                      STATUS  ROLES                 AGE   VERSION
+ci-ln-812tb4k-72292-8bcj7-master-0        Ready   control-plane,master  32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-1        Ready   control-plane,master  32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-2        Ready   control-plane,master  32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk  Ready   worker                19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv  Ready   worker                19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2  Ready   worker                19m   v1.31.3
+----
++
+All nodes should show `Ready` in the `STATUS` column.
+
+. Ensure that all cluster Operators are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get clusteroperators
+----
++
+.Example output
+[source,terminal]
+----
+NAME                      VERSION   AVAILABLE  PROGRESSING  DEGRADED  SINCE   MESSAGE
+authentication            4.18.0-0  True       False        False     51m
+baremetal                 4.18.0-0  True       False        False     72m
+cloud-controller-manager  4.18.0-0  True       False        False     75m
+cloud-credential          4.18.0-0  True       False        False     77m
+cluster-api               4.18.0-0  True       False        False     42m
+cluster-autoscaler        4.18.0-0  True       False        False     72m
+config-operator           4.18.0-0  True       False        False     72m
+console                   4.18.0-0  True       False        False     55m
+...
+----
++
+All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
+
+. Ensure that all machine config pools are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get mcp
+----
++
+.Example output
+[source,terminal]
+----
+NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
+master  rendered-master-87871f187930e67233c837e1d07f49c7  True     False     False     3             3                  3                    0                     96m
+worker  rendered-worker-3c4c459dc5d90017983d7e72928b8aed  True     False     False     3             3                  3                    0                     96m
+----
++
+All machine config pools should show `UPDATING`=`False` and `DEGRADED`=`False`.
+
+. Stop the cluster virtual machines:
++
+Use the tools native to your cluster's cloud environment to shut down the cluster's virtual machines.
++
+[IMPORTANT]
+====
+If you use a bastion virtual machine, do not shut down this virtual machine.
+====
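
Reviewer note (not part of the patch): the three health checks in this procedure are done by reading table output by eye. On larger clusters it might help to filter the output for rows that fail the stated conditions. The following is a rough sketch, not a supported tool. The function names are hypothetical, and the filters assume the default column layout shown in the example outputs of this module.

```shell
#!/bin/sh
# Hypothetical filters. Each one reads default table output on stdin and
# prints the NAME of every row that fails the check described in the
# procedure. Empty output means that nothing was flagged.

# Nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Cluster Operators that are not AVAILABLE=True, PROGRESSING=False,
# DEGRADED=False. Assumes the VERSION column is populated in every row.
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Machine config pools that are not UPDATING=False and DEGRADED=False.
unsettled_pools() {
  awk 'NR > 1 && ($4 != "False" || $5 != "False") { print $1 }'
}
```

For example, `oc get nodes | not_ready_nodes`, `oc get clusteroperators | unhealthy_operators`, and `oc get mcp | unsettled_pools` should each print nothing on a cluster that is ready to hibernate. Because the match on `Ready` is exact, a cordoned node that reports `Ready,SchedulingDisabled` is also flagged.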
diff --git a/modules/hibernating-cluster-resume.adoc b/modules/hibernating-cluster-resume.adoc
@@ -0,0 +1,118 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/hibernating-cluster.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="hibernating-cluster-resume_{context}"]
+= Resuming a hibernated cluster
+
+When you resume a hibernated cluster within 90 days, you might have to approve certificate signing requests (CSRs) for the nodes to become ready.
+
+It can take around 45 minutes for the cluster to resume, depending on the size of your cluster.
+
+.Prerequisites
+
+* You hibernated your cluster less than 90 days ago.
+* You have access to the cluster as a user with the `cluster-admin` role.
+
+.Procedure
+
+. Within 90 days of cluster hibernation, resume the cluster virtual machines:
++
+Use the tools native to your cluster's cloud environment to resume the cluster's virtual machines.
+
+. Wait about 5 minutes, depending on the number of nodes in your cluster.
+
+. Approve CSRs for the nodes:
+
+.. Check that there is a CSR for each node in the `NotReady` state by running the following command:
++
+[source,terminal]
+----
+$ oc get csr
+----
++
+.Example output
+[source,terminal]
+----
+NAME       AGE  SIGNERNAME                                   REQUESTOR                                                                  REQUESTEDDURATION  CONDITION
+csr-4dwsd  37m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2                       24h                Pending
+csr-4vrbr  49m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             24h                Pending
+csr-4wk5x  51m  kubernetes.io/kubelet-serving                system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             <none>             Pending
+csr-84vb6  51m  kubernetes.io/kube-apiserver-client-kubelet  system:serviceaccount:openshift-machine-config-operator:node-bootstrapper  <none>             Pending
+----
+
+.. Approve each valid CSR by running the following command:
++
+[source,terminal]
+----
+$ oc adm certificate approve <csr_name>
+----
+
+.. Verify that all necessary CSRs were approved by running the following command:
++
+[source,terminal]
+----
+$ oc get csr
+----
++
+.Example output
+[source,terminal]
+----
+NAME       AGE  SIGNERNAME                                   REQUESTOR                                                                  REQUESTEDDURATION  CONDITION
+csr-4dwsd  37m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2                       24h                Approved,Issued
+csr-4vrbr  49m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             24h                Approved,Issued
+csr-4wk5x  51m  kubernetes.io/kubelet-serving                system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             <none>             Approved,Issued
+csr-84vb6  51m  kubernetes.io/kube-apiserver-client-kubelet  system:serviceaccount:openshift-machine-config-operator:node-bootstrapper  <none>             Approved,Issued
+----
++
+CSRs should show `Approved,Issued` in the `CONDITION` column.
+
+. Verify that all nodes now show as ready by running the following command:
++
+[source,terminal]
+----
+$ oc get nodes
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                      STATUS  ROLES                 AGE   VERSION
+ci-ln-812tb4k-72292-8bcj7-master-0        Ready   control-plane,master  32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-1        Ready   control-plane,master  32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-2        Ready   control-plane,master  32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk  Ready   worker                19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv  Ready   worker                19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2  Ready   worker                19m   v1.31.3
+----
++
+All nodes should show `Ready` in the `STATUS` column. It might take a few minutes for all nodes to become ready after approving the CSRs.
+
+. Wait for cluster Operators to restart to load the new certificates.
++
+This might take 5 to 10 minutes.
+
+. Verify that all cluster Operators are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get clusteroperators
+----
++
+.Example output
+[source,terminal]
+----
+NAME                      VERSION   AVAILABLE  PROGRESSING  DEGRADED  SINCE   MESSAGE
+authentication            4.18.0-0  True       False        False     51m
+baremetal                 4.18.0-0  True       False        False     72m
+cloud-controller-manager  4.18.0-0  True       False        False     75m
+cloud-credential          4.18.0-0  True       False        False     77m
+cluster-api               4.18.0-0  True       False        False     42m
+cluster-autoscaler        4.18.0-0  True       False        False     72m
+config-operator           4.18.0-0  True       False        False     72m
+console                   4.18.0-0  True       False        False     55m
+...
+----
++
+All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
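
Reviewer note (not part of the patch): the CSR approval step asks the user to approve each CSR by name. When many nodes come back at once, listing only the pending requests first might make the review easier. The following is a minimal sketch under the assumption that `CONDITION` is the last column of the default `oc get csr` output, as in the example output of this module. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter. Reads default `oc get csr` table output on stdin and
# prints the NAME of every request whose last column (CONDITION) is
# exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}
```

Running `oc get csr | pending_csrs` lists the candidates to review. After you confirm that each one is valid, the names can be passed to the approve command from the procedure, for example `oc get csr | pending_csrs | xargs --no-run-if-empty oc adm certificate approve` (the `--no-run-if-empty` flag assumes GNU `xargs`). This sketch does not validate requestors, so it is not a substitute for reviewing the list. New requests might appear after the first round of approvals, so the listing step might need to be repeated.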