Commit 7eeb2d5

Merge pull request #81438 from lahinson/osdocs-11001-hcp-troubleshooting
[OSDOCS-11001]: Moving HCP troubleshooting docs to OCP
2 parents 8631abc + 3730de6 commit 7eeb2d5

15 files changed (+355 -59 lines)

hosted_control_planes/hcp-troubleshooting.adoc

Lines changed: 33 additions & 0 deletions
@@ -9,6 +9,39 @@ toc::[]
If you encounter issues with {hcp}, see the following information to guide you through troubleshooting.

include::modules/hosted-control-planes-troubleshooting.adoc[leveloffset=+1]
include::modules/hcp-must-gather-dc.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/clusters/cluster_mce_overview#install-on-disconnected-networks[Install on disconnected networks]

[id="hcp-ts-ocp-virt"]
== Troubleshooting hosted clusters on {VirtProductName}

When you troubleshoot a hosted cluster on {VirtProductName}, start with the top-level `HostedCluster` and `NodePool` resources and then work down the stack until you find the root cause. The following steps can help you discover the root cause of common issues.

include::modules/hcp-ts-hc-stuck.adoc[leveloffset=+2]
include::modules/hcp-ts-no-nodes-reg.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/clusters/cluster_mce_overview#identifying-vm-console-logs[Identifying the problem: Access the VM console logs]

include::modules/hcp-ts-nodes-stuck.adoc[leveloffset=+2]
include::modules/hcp-ts-ingress-not-online.adoc[leveloffset=+2]
include::modules/hcp-ts-load-balancer-svcs.adoc[leveloffset=+2]
include::modules/hcp-ts-pvcs-not-avail.adoc[leveloffset=+2]
include::modules/hcp-ts-vm-nodes.adoc[leveloffset=+2]
include::modules/hcp-ts-rhcos.adoc[leveloffset=+2]
include::modules/hcp-ts-non-bm.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/clusters/cluster_mce_overview#remove-managed-cluster[Removing a cluster from management]

include::modules/hosted-restart-hcp-components.adoc[leveloffset=+1]

include::modules/hosted-control-planes-pause-reconciliation.adoc[leveloffset=+1]

include::modules/scale-down-data-plane.adoc[leveloffset=+1]

modules/hcp-must-gather-dc.adoc

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-must-gather-dc_{context}"]
= Entering the must-gather command in a disconnected environment

Complete the following steps to run the `must-gather` command in a disconnected environment.

.Procedure

. In a disconnected environment, mirror the Red{nbsp}Hat operator catalog images into your mirror registry. For more information, see _Install on disconnected networks_.

. Run the following command to extract logs, which reference the image from your mirror registry:
+
[source,terminal]
----
REGISTRY=registry.example.com:5000
IMAGE=$REGISTRY/multicluster-engine/must-gather-rhel8@sha256:ff9f37eb400dc1f7d07a9b6f2da9064992934b69847d17f59e385783c071b9d8

$ oc adm must-gather \
  --image=$IMAGE /usr/bin/gather \
  hosted-cluster-namespace=HOSTEDCLUSTERNAMESPACE \
  hosted-cluster-name=HOSTEDCLUSTERNAME \
  --dest-dir=./data
----

modules/hcp-ts-hc-stuck.adoc

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-hc-stuck_{context}"]
= HostedCluster resource is stuck in a partial state

If a hosted control plane is not coming fully online because a `HostedCluster` resource is pending, identify the problem by checking prerequisites, resource conditions, and node and Operator status.

.Procedure

* Ensure that you meet all of the prerequisites for a hosted cluster on {VirtProductName}.
* View the conditions on the `HostedCluster` and `NodePool` resources for validation errors that prevent progress.
* By using the `kubeconfig` file of the hosted cluster, inspect the status of the hosted cluster, as shown in the example after this list:
** View the output of the `oc get clusteroperators` command to see which cluster Operators are pending.
** View the output of the `oc get nodes` command to ensure that worker nodes are ready.
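
The following is a minimal sketch of that inspection. It assumes that the hosted cluster was created in the default `clusters` namespace and that its admin kubeconfig is stored in a `<hosted_cluster_name>-admin-kubeconfig` secret; adjust the namespace and names for your environment.

[source,terminal]
----
# Extract the hosted cluster kubeconfig (assumes the default "clusters" namespace)
$ oc extract -n clusters secret/<hosted_cluster_name>-admin-kubeconfig --to=- > hosted-kubeconfig

# Check which cluster Operators are still progressing or degraded
$ oc --kubeconfig=hosted-kubeconfig get clusteroperators

# Confirm that worker nodes are ready
$ oc --kubeconfig=hosted-kubeconfig get nodes
----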
modules/hcp-ts-ingress-not-online.adoc

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-ingress-not-online_{context}"]
= Ingress and console cluster operators are not coming online

If a hosted control plane is not coming fully online because the Ingress and console cluster Operators are not online, check the wildcard DNS routes and load balancer.

.Procedure

* If the cluster uses the default Ingress behavior, enter the following command to ensure that wildcard DNS routes are enabled on the {product-title} cluster that the virtual machines (VMs) are hosted on:
+
[source,terminal]
----
$ oc patch ingresscontroller -n openshift-ingress-operator \
  default --type=json -p \
  '[{ "op": "add", "path": "/spec/routeAdmission", "value": {"wildcardPolicy": "WildcardsAllowed"}}]'
----

* If you use a custom base domain for the hosted control plane, complete the following steps (see the sketch after this list):
** Ensure that the load balancer is targeting the VM pods correctly.
** Ensure that the wildcard DNS entry is targeting the load balancer IP address.
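
The following is a minimal sketch of those two checks. It assumes a `<hosted_control_plane_namespace>` placeholder for the namespace that hosts the VM pods, a wildcard record for `*.apps.<custom_base_domain>`, and that `dig` is available; the exact service and record names depend on your environment.

[source,terminal]
----
# List services and note the EXTERNAL-IP of the load balancer that fronts the hosted cluster Ingress
$ oc get services -n <hosted_control_plane_namespace> -o wide

# Confirm that an arbitrary hostname under the wildcard record resolves to that load balancer IP address
$ dig +short test.apps.<custom_base_domain>
----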
modules/hcp-ts-load-balancer-svcs.adoc

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-load-balancer-svcs_{context}"]
= Load balancer services for the hosted cluster are not available

If a hosted control plane is not coming fully online because the load balancer services are not becoming available, check events, details, and the KubeVirt cloud controller manager (KCCM) pod.

.Procedure

* Look for events and details that are associated with the load balancer service within the hosted cluster.

* By default, load balancers for the hosted cluster are handled by the kubevirt-cloud-controller-manager within the hosted control plane namespace. Ensure that the KCCM pod is online and view its logs for errors or warnings. To identify the KCCM pod in the hosted control plane namespace, enter the following command:
+
[source,terminal]
----
$ oc get pods -n <hosted_control_plane_namespace> -l app=cloud-controller-manager
----
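
After you identify the KCCM pod, you might view its logs with a command similar to the following sketch. The label selector matches the one in the previous command, and the `--tail` value is only an example.

[source,terminal]
----
# Review recent KCCM logs for errors or warnings about load balancer provisioning
$ oc logs -n <hosted_control_plane_namespace> -l app=cloud-controller-manager --tail=100
----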

modules/hcp-ts-no-nodes-reg.adoc

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-no-nodes-reg_{context}"]
= No worker nodes are registered

If a hosted control plane is not coming fully online because the hosted control plane has no worker nodes registered, identify the problem by checking the status of various parts of the hosted control plane.

.Procedure

* View the `HostedCluster` and `NodePool` conditions for failures that indicate what the problem might be, as shown in the example after this list.

* Enter the following command to view the KubeVirt worker node virtual machine (VM) status for the `NodePool` resource:
+
[source,terminal]
----
$ oc get vm -n <namespace>
----

* If the VMs are stuck in the provisioning state, enter the following command to view the CDI import pods within the VM namespace for clues about why the importer pods have not completed:
+
[source,terminal]
----
$ oc get pods -n <namespace> | grep "import"
----

* If the VMs are stuck in the starting state, enter the following command to view the status of the virt-launcher pods:
+
[source,terminal]
----
$ oc get pods -n <namespace> -l kubevirt.io=virt-launcher
----
+
If the virt-launcher pods are in a pending state, investigate why the pods are not being scheduled. For example, not enough resources might exist to run the virt-launcher pods.

* If the VMs are running but they are not registered as worker nodes, use the web console to gain VNC access to one of the affected VMs. The VNC output indicates whether the ignition configuration was applied. If a VM cannot access the hosted control plane ignition server on startup, the VM cannot be provisioned correctly.

* If the ignition configuration was applied but the VM is still not registering as a node, see _Identifying the problem: Access the VM console logs_ to learn how to access the VM console logs during startup.
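
For the first step, the following is a minimal sketch of viewing the resource conditions. It assumes that the `HostedCluster` and `NodePool` resources live in the default `clusters` namespace; substitute your own namespace and resource names.

[source,terminal]
----
# Print each condition type, status, and message for the HostedCluster resource
$ oc get hostedcluster <hosted_cluster_name> -n clusters \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'

# Print each condition type, status, and message for the NodePool resource
$ oc get nodepool <nodepool_name> -n clusters \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'
----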

modules/hcp-ts-nodes-stuck.adoc

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-nodes-stuck_{context}"]
= Worker nodes are stuck in the NotReady state

During cluster creation, nodes enter the `NotReady` state temporarily while the networking stack is rolled out. This part of the process is normal. However, if this part of the process takes longer than 15 minutes, an issue might have occurred.

.Procedure

Identify the problem by investigating the node object and pods:

* Enter the following command to view the conditions on the node object and determine why the node is not ready:
+
[source,terminal]
----
$ oc get nodes -o yaml
----

* Enter the following command to look for failing pods within the cluster:
+
[source,terminal]
----
$ oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
----
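
If you prefer a more readable view of a single node's conditions than the full YAML output, the following command is a common alternative; `<node_name>` is a placeholder, and this step is not part of the documented procedure.

[source,terminal]
----
# Show the Conditions table and recent events for one node
$ oc describe node <node_name>
----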

modules/hcp-ts-non-bm.adoc

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-non-bm_{context}"]
= Return non-bare-metal clusters to the late binding pool

If you are using late binding managed clusters without `BareMetalHosts`, you must complete additional manual steps to delete a late binding cluster and return the nodes back to the Discovery ISO.

For late binding managed clusters without `BareMetalHosts`, removing cluster information does not automatically return all nodes to the Discovery ISO.

.Procedure

To unbind the non-bare-metal nodes with late binding, complete the following steps:

. Remove the cluster information. For more information, see _Removing a cluster from management_.

. Clean the root disks (see the sketch after these steps).

. Reboot manually with the Discovery ISO.
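
The documentation does not prescribe a specific tool for cleaning the root disks. The following is only a hedged sketch that assumes shell access to each node and that `/dev/sda` is the installation disk; verify the device name before you wipe anything, because the operation is destructive.

[source,terminal]
----
# Identify which disk holds the previous installation
$ lsblk

# Remove the file system signatures from the root disk (destructive; adjust the device name)
$ sudo wipefs -af /dev/sda

# Reboot the node so that it boots from the Discovery ISO again
$ sudo reboot
----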

modules/hcp-ts-pvcs-not-avail.adoc

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-pvcs-not-avail_{context}"]
= Hosted cluster PVCs are not available

If a hosted control plane is not coming fully online because the persistent volume claims (PVCs) for a hosted cluster are not available, check the PVC events and details, and component logs.

.Procedure

* Look for events and details that are associated with the PVC to understand which errors are occurring, as shown in the example after this list.

* If a PVC is failing to attach to a pod, view the logs for the kubevirt-csi-node `daemonset` component within the hosted cluster to further investigate the problem. To identify the kubevirt-csi-node pods for each node, enter the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-cluster-csi-drivers -o wide -l app=kubevirt-csi-driver
----

* If a PVC cannot bind to a persistent volume (PV), view the logs of the kubevirt-csi-controller component within the hosted control plane namespace. To identify the kubevirt-csi-controller pod within the hosted control plane namespace, enter the following command:
+
[source,terminal]
----
$ oc get pods -n <hosted_control_plane_namespace> -l app=kubevirt-csi-driver
----
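
The following sketch shows one way to read the PVC events and the controller logs. The `<pvc_name>`, `<namespace>`, and `<kubevirt_csi_controller_pod_name>` values are placeholders, and the pod name comes from the output of the previous command.

[source,terminal]
----
# Review events for the PVC that is not binding or attaching
$ oc describe pvc <pvc_name> -n <namespace>

# Inspect the logs of the controller pod that you identified with the previous command
$ oc logs -n <hosted_control_plane_namespace> <kubevirt_csi_controller_pod_name>
----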

modules/hcp-ts-rhcos.adoc

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-troubleshooting.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-ts-rhcos_{context}"]
= {op-system} image mirroring fails

For {hcp} on {VirtProductName} in a disconnected environment, `oc-mirror` fails to automatically mirror the {op-system-first} image to the internal registry. When you create your first hosted cluster, the KubeVirt virtual machine does not boot, because the boot image is not available in the internal registry.

To resolve this issue, manually mirror the {op-system} image to the internal registry.

.Procedure

. Get the internal registry name by running the following command:
+
[source,terminal]
----
$ oc get imagecontentsourcepolicy -o json | jq -r '.items[].spec.repositoryDigestMirrors[0].mirrors[0]'
----

. Get a payload image by running the following command:
+
[source,terminal]
----
$ oc get clusterversion version -ojsonpath='{.status.desired.image}'
----

. Extract the `0000_50_installer_coreos-bootimages.yaml` file that contains boot images from your payload image on the hosted cluster. Replace `<payload_image>` with the name of your payload image. Run the following command:
+
[source,terminal]
----
$ oc image extract --file /release-manifests/0000_50_installer_coreos-bootimages.yaml <payload_image> --confirm
----

. Get the {op-system} image by running the following command:
+
[source,terminal]
----
$ cat 0000_50_installer_coreos-bootimages.yaml | yq -r .data.stream | jq -r '.architectures.x86_64.images.kubevirt."digest-ref"'
----

. Mirror the {op-system} image to your internal registry. Replace `<rhcos_image>` with your {op-system} image; for example, `quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d9643ead36b1c026be664c9c65c11433c6cdf71bfd93ba229141d134a4a6dd94`. Replace `<internal_registry>` with the name of your internal registry; for example, `virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev`. Run the following command:
+
[source,terminal]
----
$ oc image mirror <rhcos_image> <internal_registry>
----

. Create a YAML file named `rhcos-boot-kubevirt.yaml` that defines the `ImageDigestMirrorSet` object. See the following example configuration:
+
[source,yaml]
----
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: rhcos-boot-kubevirt
spec:
  repositoryDigestMirrors:
  - mirrors:
    - <rhcos_image_no_digest> <1>
    source: virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev <2>
----
+
<1> Specify your {op-system} image without its digest, for example, `quay.io/openshift-release-dev/ocp-v4.0-art-dev`.
<2> Specify the name of your internal registry, for example, `virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev`.

. Apply the `rhcos-boot-kubevirt.yaml` file to create the `ImageDigestMirrorSet` object by running the following command:
+
[source,terminal]
----
$ oc apply -f rhcos-boot-kubevirt.yaml
----
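
As an optional verification that is not part of the documented procedure, you can confirm that the `ImageDigestMirrorSet` object exists after you apply the file:

[source,terminal]
----
# Confirm that the mirror configuration was created
$ oc get imagedigestmirrorset rhcos-boot-kubevirt -o yaml
----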
