Commit 7eeb2d5

Merge pull request #81438 from lahinson/osdocs-11001-hcp-troubleshooting
[OSDOCS-11001]: Moving HCP troubleshooting docs to OCP
2 parents 8631abc + 3730de6 commit 7eeb2d5

15 files changed (+355 -59 lines)

hosted_control_planes/hcp-troubleshooting.adoc

Lines changed: 33 additions & 0 deletions
@@ -9,6 +9,39 @@ toc::[]
If you encounter issues with {hcp}, see the following information to guide you through troubleshooting.

include::modules/hosted-control-planes-troubleshooting.adoc[leveloffset=+1]
include::modules/hcp-must-gather-dc.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/clusters/cluster_mce_overview#install-on-disconnected-networks[Install on disconnected networks]

[id="hcp-ts-ocp-virt"]
== Troubleshooting hosted clusters on {VirtProductName}

When you troubleshoot a hosted cluster on {VirtProductName}, start with the top-level `HostedCluster` and `NodePool` resources and then work down the stack until you find the root cause. The following steps can help you discover the root cause of common issues.

include::modules/hcp-ts-hc-stuck.adoc[leveloffset=+2]
include::modules/hcp-ts-no-nodes-reg.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/clusters/cluster_mce_overview#identifying-vm-console-logs[Identifying the problem: Access the VM console logs]

include::modules/hcp-ts-nodes-stuck.adoc[leveloffset=+2]
include::modules/hcp-ts-ingress-not-online.adoc[leveloffset=+2]
include::modules/hcp-ts-load-balancer-svcs.adoc[leveloffset=+2]
include::modules/hcp-ts-pvcs-not-avail.adoc[leveloffset=+2]
include::modules/hcp-ts-vm-nodes.adoc[leveloffset=+2]
include::modules/hcp-ts-rhcos.adoc[leveloffset=+2]
include::modules/hcp-ts-non-bm.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/clusters/cluster_mce_overview#remove-managed-cluster[Removing a cluster from management]

include::modules/hosted-restart-hcp-components.adoc[leveloffset=+1]

include::modules/hosted-control-planes-pause-reconciliation.adoc[leveloffset=+1]

include::modules/scale-down-data-plane.adoc[leveloffset=+1]

modules/hcp-must-gather-dc.adoc

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-must-gather-dc_{context}"]
= Entering the must-gather command in a disconnected environment

Complete the following steps to run the `must-gather` command in a disconnected environment.

.Procedure

. In a disconnected environment, mirror the Red{nbsp}Hat operator catalog images into your mirror registry. For more information, see _Install on disconnected networks_.

. Run the following command to extract logs, which reference the image from your mirror registry:
+
[source,terminal]
----
REGISTRY=registry.example.com:5000
IMAGE=$REGISTRY/multicluster-engine/must-gather-rhel8@sha256:ff9f37eb400dc1f7d07a9b6f2da9064992934b69847d17f59e385783c071b9d8

$ oc adm must-gather \
  --image=$IMAGE /usr/bin/gather \
  hosted-cluster-namespace=HOSTEDCLUSTERNAMESPACE \
  hosted-cluster-name=HOSTEDCLUSTERNAME \
  --dest-dir=./data
----

modules/hcp-ts-hc-stuck.adoc

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-hc-stuck_{context}"]
= HostedCluster resource is stuck in a partial state

If a hosted control plane is not coming fully online because a `HostedCluster` resource is pending, identify the problem by checking prerequisites, resource conditions, and node and Operator status.

.Procedure

* Ensure that you meet all of the prerequisites for a hosted cluster on {VirtProductName}.
* View the conditions on the `HostedCluster` and `NodePool` resources for validation errors that prevent progress.
* By using the `kubeconfig` file of the hosted cluster, inspect the status of the hosted cluster, as shown in the example after this list:
** View the output of the `oc get clusteroperators` command to see which cluster Operators are pending.
** View the output of the `oc get nodes` command to ensure that worker nodes are ready.
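
The following is a minimal sketch of that inspection. It assumes that the hosted cluster was created in the default `clusters` namespace and that its admin kubeconfig is stored in a `<hosted_cluster_name>-admin-kubeconfig` secret; adjust the namespace and names for your environment.

[source,terminal]
----
# Extract the hosted cluster kubeconfig (assumes the default "clusters" namespace)
$ oc extract -n clusters secret/<hosted_cluster_name>-admin-kubeconfig --to=- > hosted-kubeconfig

# Check which cluster Operators are still progressing or degraded
$ oc --kubeconfig=hosted-kubeconfig get clusteroperators

# Confirm that worker nodes are ready
$ oc --kubeconfig=hosted-kubeconfig get nodes
----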
modules/hcp-ts-ingress-not-online.adoc

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-ingress-not-online_{context}"]
= Ingress and console cluster operators are not coming online

If a hosted control plane is not coming fully online because the Ingress and console cluster Operators are not online, check the wildcard DNS routes and load balancer.

.Procedure

* If the cluster uses the default Ingress behavior, enter the following command to ensure that wildcard DNS routes are enabled on the {product-title} cluster that the virtual machines (VMs) are hosted on:
+
[source,terminal]
----
$ oc patch ingresscontroller -n openshift-ingress-operator \
  default --type=json -p \
  '[{ "op": "add", "path": "/spec/routeAdmission", "value": {"wildcardPolicy": "WildcardsAllowed"}}]'
----

* If you use a custom base domain for the hosted control plane, complete the following steps (see the sketch after this list):
** Ensure that the load balancer is targeting the VM pods correctly.
** Ensure that the wildcard DNS entry is targeting the load balancer IP address.
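
The following is a minimal sketch of those two checks. It assumes a `<hosted_control_plane_namespace>` placeholder for the namespace that hosts the VM pods, a wildcard record for `*.apps.<custom_base_domain>`, and that `dig` is available; the exact service and record names depend on your environment.

[source,terminal]
----
# List services and note the EXTERNAL-IP of the load balancer that fronts the hosted cluster Ingress
$ oc get services -n <hosted_control_plane_namespace> -o wide

# Confirm that an arbitrary hostname under the wildcard record resolves to that load balancer IP address
$ dig +short test.apps.<custom_base_domain>
----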
modules/hcp-ts-load-balancer-svcs.adoc

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-load-balancer-svcs_{context}"]
= Load balancer services for the hosted cluster are not available

If a hosted control plane is not coming fully online because the load balancer services are not becoming available, check events, details, and the KubeVirt cloud controller manager (KCCM) pod.

.Procedure

* Look for events and details that are associated with the load balancer service within the hosted cluster.

* By default, load balancers for the hosted cluster are handled by the kubevirt-cloud-controller-manager within the hosted control plane namespace. Ensure that the KCCM pod is online and view its logs for errors or warnings. To identify the KCCM pod in the hosted control plane namespace, enter the following command:
+
[source,terminal]
----
$ oc get pods -n <hosted_control_plane_namespace> -l app=cloud-controller-manager
----
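
After you identify the KCCM pod, you might view its logs with a command similar to the following sketch. The label selector matches the one in the previous command, and the `--tail` value is only an example.

[source,terminal]
----
# Review recent KCCM logs for errors or warnings about load balancer provisioning
$ oc logs -n <hosted_control_plane_namespace> -l app=cloud-controller-manager --tail=100
----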

modules/hcp-ts-no-nodes-reg.adoc

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-no-nodes-reg_{context}"]
= No worker nodes are registered

If a hosted control plane is not coming fully online because the hosted control plane has no worker nodes registered, identify the problem by checking the status of various parts of the hosted control plane.

.Procedure

* View the `HostedCluster` and `NodePool` conditions for failures that indicate what the problem might be, as shown in the example after this list.

* Enter the following command to view the KubeVirt worker node virtual machine (VM) status for the `NodePool` resource:
+
[source,terminal]
----
$ oc get vm -n <namespace>
----

* If the VMs are stuck in the provisioning state, enter the following command to view the CDI import pods within the VM namespace for clues about why the importer pods have not completed:
+
[source,terminal]
----
$ oc get pods -n <namespace> | grep "import"
----

* If the VMs are stuck in the starting state, enter the following command to view the status of the virt-launcher pods:
+
[source,terminal]
----
$ oc get pods -n <namespace> -l kubevirt.io=virt-launcher
----
+
If the virt-launcher pods are in a pending state, investigate why the pods are not being scheduled. For example, not enough resources might exist to run the virt-launcher pods.

* If the VMs are running but they are not registered as worker nodes, use the web console to gain VNC access to one of the affected VMs. The VNC output indicates whether the ignition configuration was applied. If a VM cannot access the hosted control plane ignition server on startup, the VM cannot be provisioned correctly.

* If the ignition configuration was applied but the VM is still not registering as a node, see _Identifying the problem: Access the VM console logs_ to learn how to access the VM console logs during startup.
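
For the first step, the following is a minimal sketch of viewing the resource conditions. It assumes that the `HostedCluster` and `NodePool` resources live in the default `clusters` namespace; substitute your own namespace and resource names.

[source,terminal]
----
# Print each condition type, status, and message for the HostedCluster resource
$ oc get hostedcluster <hosted_cluster_name> -n clusters \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'

# Print each condition type, status, and message for the NodePool resource
$ oc get nodepool <nodepool_name> -n clusters \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'
----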

modules/hcp-ts-nodes-stuck.adoc

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-nodes-stuck_{context}"]
= Worker nodes are stuck in the NotReady state

During cluster creation, nodes enter the `NotReady` state temporarily while the networking stack is rolled out. This part of the process is normal. However, if this part of the process takes longer than 15 minutes, an issue might have occurred.

.Procedure

Identify the problem by investigating the node object and pods:

* Enter the following command to view the conditions on the node object and determine why the node is not ready:
+
[source,terminal]
----
$ oc get nodes -o yaml
----

* Enter the following command to look for failing pods within the cluster:
+
[source,terminal]
----
$ oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
----
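
If you prefer a more readable view of a single node's conditions than the full YAML output, the following command is a common alternative; `<node_name>` is a placeholder, and this step is not part of the documented procedure.

[source,terminal]
----
# Show the Conditions table and recent events for one node
$ oc describe node <node_name>
----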

modules/hcp-ts-non-bm.adoc

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-non-bm_{context}"]
= Return non-bare-metal clusters to the late binding pool

If you are using late binding managed clusters without `BareMetalHosts`, you must complete additional manual steps to delete a late binding cluster and return the nodes back to the Discovery ISO.

For late binding managed clusters without `BareMetalHosts`, removing cluster information does not automatically return all nodes to the Discovery ISO.

.Procedure

To unbind the non-bare-metal nodes with late binding, complete the following steps:

. Remove the cluster information. For more information, see _Removing a cluster from management_.

. Clean the root disks (see the sketch after these steps).

. Reboot manually with the Discovery ISO.
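
The documentation does not prescribe a specific tool for cleaning the root disks. The following is only a hedged sketch that assumes shell access to each node and that `/dev/sda` is the installation disk; verify the device name before you wipe anything, because the operation is destructive.

[source,terminal]
----
# Identify which disk holds the previous installation
$ lsblk

# Remove the file system signatures from the root disk (destructive; adjust the device name)
$ sudo wipefs -af /dev/sda

# Reboot the node so that it boots from the Discovery ISO again
$ sudo reboot
----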

modules/hcp-ts-pvcs-not-avail.adoc

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-pvcs-not-avail_{context}"]
= Hosted cluster PVCs are not available

If a hosted control plane is not coming fully online because the persistent volume claims (PVCs) for a hosted cluster are not available, check the PVC events and details, and component logs.

.Procedure

* Look for events and details that are associated with the PVC to understand which errors are occurring, as shown in the example after this list.

* If a PVC is failing to attach to a pod, view the logs for the kubevirt-csi-node `daemonset` component within the hosted cluster to further investigate the problem. To identify the kubevirt-csi-node pods for each node, enter the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-cluster-csi-drivers -o wide -l app=kubevirt-csi-driver
----

* If a PVC cannot bind to a persistent volume (PV), view the logs of the kubevirt-csi-controller component within the hosted control plane namespace. To identify the kubevirt-csi-controller pod within the hosted control plane namespace, enter the following command:
+
[source,terminal]
----
$ oc get pods -n <hosted_control_plane_namespace> -l app=kubevirt-csi-driver
----
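
The following sketch shows one way to read the PVC events and the controller logs. The `<pvc_name>`, `<namespace>`, and `<kubevirt_csi_controller_pod_name>` values are placeholders, and the pod name comes from the output of the previous command.

[source,terminal]
----
# Review events for the PVC that is not binding or attaching
$ oc describe pvc <pvc_name> -n <namespace>

# Inspect the logs of the controller pod that you identified with the previous command
$ oc logs -n <hosted_control_plane_namespace> <kubevirt_csi_controller_pod_name>
----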

modules/hcp-ts-rhcos.adoc

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-rhcos_{context}"]
= {op-system} image mirroring fails

For {hcp} on {VirtProductName} in a disconnected environment, `oc-mirror` fails to automatically mirror the {op-system-first} image to the internal registry. When you create your first hosted cluster, the KubeVirt virtual machine does not boot, because the boot image is not available in the internal registry.

To resolve this issue, manually mirror the {op-system} image to the internal registry.

.Procedure

. Get the internal registry name by running the following command:
+
[source,terminal]
----
$ oc get imagecontentsourcepolicy -o json | jq -r '.items[].spec.repositoryDigestMirrors[0].mirrors[0]'
----

. Get a payload image by running the following command:
+
[source,terminal]
----
$ oc get clusterversion version -ojsonpath='{.status.desired.image}'
----

. Extract the `0000_50_installer_coreos-bootimages.yaml` file that contains boot images from your payload image on the hosted cluster. Replace `<payload_image>` with the name of your payload image. Run the following command:
+
[source,terminal]
----
$ oc image extract --file /release-manifests/0000_50_installer_coreos-bootimages.yaml <payload_image> --confirm
----

. Get the {op-system} image by running the following command:
+
[source,terminal]
----
$ cat 0000_50_installer_coreos-bootimages.yaml | yq -r .data.stream | jq -r '.architectures.x86_64.images.kubevirt."digest-ref"'
----

. Mirror the {op-system} image to your internal registry. Replace `<rhcos_image>` with your {op-system} image; for example, `quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d9643ead36b1c026be664c9c65c11433c6cdf71bfd93ba229141d134a4a6dd94`. Replace `<internal_registry>` with the name of your internal registry; for example, `virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev`. Run the following command:
+
[source,terminal]
----
$ oc image mirror <rhcos_image> <internal_registry>
----

. Create a YAML file named `rhcos-boot-kubevirt.yaml` that defines the `ImageDigestMirrorSet` object. See the following example configuration:
+
[source,yaml]
----
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: rhcos-boot-kubevirt
spec:
  repositoryDigestMirrors:
  - mirrors:
    - <rhcos_image_no_digest> <1>
    source: virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev <2>
----
+
<1> Specify your {op-system} image without its digest, for example, `quay.io/openshift-release-dev/ocp-v4.0-art-dev`.
<2> Specify the name of your internal registry, for example, `virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev`.

. Apply the `rhcos-boot-kubevirt.yaml` file to create the `ImageDigestMirrorSet` object by running the following command:
+
[source,terminal]
----
$ oc apply -f rhcos-boot-kubevirt.yaml
----
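
As an optional verification that is not part of the documented procedure, you can confirm that the `ImageDigestMirrorSet` object exists after you apply the file:

[source,terminal]
----
# Confirm that the mirror configuration was created
$ oc get imagedigestmirrorset rhcos-boot-kubevirt -o yaml
----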
