Commit d7791c9

Merge pull request #751 from run-ai/shaibi/RUN-16065-v2.17
RUN-16065 Revamp Cluster Installation Troubleshooting
2 parents 8d09bc7 + 0a4f3ef commit d7791c9

File tree

4 files changed

+75
-88
lines changed


docs/admin/runai-setup/cluster-setup/cluster-install.md

Lines changed: 7 additions & 86 deletions
@@ -25,93 +25,14 @@ On the next page:
 
 * (SaaS and remote self-hosted cluster only) Install a trusted certificate to the domain entered above.
 * Run the [Helm](https://helm.sh/docs/intro/install/) command provided in the wizard.
+* In case of a failure, see the [Installation troubleshooting guide](../../troubleshooting/troubleshooting.md#installation).
+
+## Verify your cluster's health
+
+* Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
+* Go to the [Overview Dashboard](../../admin-ui-setup/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
+* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).
 
-## Verify your Installation
-
-* Go to `<company-name>.run.ai/dashboards/now`.
-* Verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
-* Run: `kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P`
-
-    (assumes that [yq](https://mikefarah.gitbook.io/yq/v/v3.x/){target=_blank} is instaled)
-
-    Example output:
-
-    ``` YAML
-    cluster-version: 2.9.0
-    runai-public:
-      version: 2.9.0
-      runaiConfigStatus:
-        conditions:
-        - type: DependenciesFulfilled # (1)
-          status: "True"
-          reason: dependencies_fulfilled
-          message: Dependencies are fulfilled
-        - type: Deployed
-          status: "True"
-          reason: deployed
-          message: Resources Deployed
-        - type: Available
-          status: "True"
-          reason: available
-          message: System Available
-        - type: Reconciled # (2)
-          status: "True"
-          reason: reconciled
-          message: Reconciliation completed successfully
-      optional: # (3)
-        knative: # (4)
-          components:
-            hpa:
-              available: true
-            knative:
-              available: true
-            kourier:
-              available: true
-        mpi: # (5)
-          available: true
-    ```
-
-1. Verifies that all mandatory dependencies are met: NVIDIA GPU Operator, Prometheus and NGINX controller.
-2. Checks whether optional product dependencies have been met.
-3. See [Inference prerequisites](cluster-prerequisites.md#inference).
-4. See [distributed training prerequisites](cluster-prerequisites.md#distributed-training).
-
-<!-- For a more extensive verification of cluster health, see [Determining the health of a cluster](../../troubleshooting/cluster-health-check.md). -->
-
-### Troubleshooting your installation
-
-#### Dependencies are not fulfilled
-
-1. Make sure to install the missing dependencies.
-2. If dependencies are installed, make sure that the CRDs of said dependency are installed, and that the version is supported
-3. Make sure there are no necessary adjustments for specific flavors as noted in the [Cluster prerequisites](cluster-prerequisites.md)
-
-#### Resources not deployed / System Unavailable / Reconciliation Failed
-
-1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
-2. Run
-
-    ```
-    kubectl get pods -n runai
-    kubectl get pods -n monitoring
-    ```
-
-    You can also run `kubectl logs <pod_name>` to get logs from any failing pod.
-
-#### Common Issues
-
-1. Run:ai was previously installed in the cluster and was deleted unsuccessfully, resulting in remaining CRDs.
-    1. Diagnosis: This can be detected by running `kubectl get crds` in the relevant namespaces (or adding `-A` and checking for Run:ai CRDs).
-    2. Solution: Force delete the listed CRDs and reinstall.
-2. One or more of the pods have issues around valid certificates.
-    1. Diagnosis: The logs contains a message similar to the following `failed to verify certificate: x509: certificate signed by unknown authority`.
-    2. Solution:
-        1. This is usually due to an expired or invalid certificate in the cluster, and if so, renew the certificate.
-        2. If the certificate is valid, but is signed by a local CA, make sure you have followed the procedure for a [local certificate authority](../config/org-cert.md).
-
-#### Get Installation Logs
-
-You can use the [get instllation logs](https://github.com/run-ai/public/blob/main/installation/get-installation-logs.sh) script to obtain any relevant installation logs in case of an error.
 ## Researcher Authentication
 
 If you will be using the Run:ai [command-line interface](../../researcher-setup/cli-install.md) or sending [YAMLs directly](../../../developer/cluster-api/submit-yaml.md) to Kubernetes, you must now set up [Researcher Access Control](../authentication/researcher-authentication.md).
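The `runaiConfigStatus` conditions stored in the `runai-public` ConfigMap lend themselves to a quick scripted check. A minimal sketch, using inlined sample data so it runs anywhere; on a live cluster you would first capture the status with `kubectl get cm runai-public -n runai -o jsonpath='{.data}'` (the file path and the `"False"` condition below are hypothetical):

```shell
# Hypothetical sample of the runaiConfigStatus conditions; on a live cluster,
# capture them with: kubectl get cm runai-public -n runai -o jsonpath='{.data}'
cat > /tmp/runai-status.yaml <<'EOF'
conditions:
- type: DependenciesFulfilled
  status: "True"
- type: Deployed
  status: "True"
- type: Available
  status: "False"
EOF

# Print any condition whose status is not "True"
awk '/- type:/ {t=$3} /status:/ {gsub(/"/, "", $2); if ($2 != "True") print t " is " $2}' /tmp/runai-status.yaml
```

Any condition other than `True` points at the matching troubleshooting step; for example, a failed `DependenciesFulfilled` condition sends you back to the prerequisites checklist.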

docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md

Lines changed: 4 additions & 0 deletions
@@ -142,6 +142,7 @@ Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-nati
 * NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flags `--set driver.enabled=false`. [DGX OS](https://docs.nvidia.com/dgx/index.html){target=_blank} is one such example as it comes bundled with NVIDIA Drivers.
 <!-- * To work with *containerd* (e.g. for Tanzu), use the [defaultRuntime](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#chart-customization-options){target=_blank} flag accordingly. -->
 * To use [Dynamic MIG](../../../Researcher/scheduling/fractions.md#dynamic-mig), the GPU Operator must be installed with the flag `mig.strategy=mixed`. If the GPU Operator is already installed, edit the clusterPolicy by running ```kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'```
+* For troubleshooting information, see the [NVIDIA GPU Operator Troubleshooting Guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html){target=_blank}.
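After applying the `mig.strategy=mixed` patch above, you can confirm it took effect by reading the strategy back from the clusterPolicy spec. A minimal sketch over an illustrative saved copy so it is self-contained; on a live cluster you would instead run `kubectl get clusterpolicy cluster-policy -n gpu-operator -o jsonpath='{.spec.mig.strategy}'`:

```shell
# Illustrative saved fragment of the clusterPolicy spec; on a live cluster:
# kubectl get clusterpolicy cluster-policy -n gpu-operator -o json
cat > /tmp/cluster-policy.json <<'EOF'
{"spec": {"mig": {"strategy": "mixed"}}}
EOF

# Confirm the MIG strategy was set to "mixed" by the patch
grep -o '"strategy": *"[^"]*"' /tmp/cluster-policy.json
```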
 
 ### Ingress Controller
 
@@ -225,6 +226,9 @@ helm install prometheus prometheus-community/kube-prometheus-stack \
 
 1. The Grafana component is not required for Run:ai.
 
+!!! Note
+    For troubleshooting information, see the [Prometheus Troubleshooting Guide](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md){target=_blank}.
+
 ## Optional Software Requirements
 
 The following software enables specific features of Run:ai

docs/admin/troubleshooting/cluster-health-check.md

Lines changed: 35 additions & 1 deletion
@@ -8,6 +8,7 @@ date: 2024-Jan-17
 ---
 
 This troubleshooting guide helps you diagnose and resolve issues you may find in your cluster.
+The cluster status is displayed in the Run:ai Control Plane. See [Cluster Status](../runai-setup/cluster-setup/cluster-install.md#cluster-status) for a list of possible statuses.
 
 ## Cluster is disconnected
 
@@ -62,7 +63,7 @@ Use the following steps to troubleshoot the issue:
 !!! Note
     The previous steps can be used if you installed the cluster and the status is stuck in *Waiting to connect* for a long time.
 
-## Cluster has service issues
+## Cluster has *service issues*
 
 When a cluster's status shows *Service issues*, this means that one or more Run:ai services that are running in the cluster are not available.
 
@@ -90,6 +91,16 @@ When a cluster's status shows *Service issues*, this means that one or more Run:
 
 3. If the issue persists, contact Run:ai support for assistance.
 
+## Cluster has *missing prerequisites*
+
+When a cluster's status displays *Missing prerequisites*, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.
+
+If you have ensured that all prerequisites are installed and the status still shows *Missing prerequisites*, follow these steps:
+
+1. Check the message in the Control Plane for further details regarding the missing prerequisites.
+2. Inspect the [runai-public ConfigMap](#runai-public-configmap) and look for the `dependencies.required` field to obtain detailed information about the missing resources.
+3. If the issue persists, contact Run:ai support for assistance.
+
 ## General tests to verify the Run:ai cluster health
 
 Use the following tests regularly to determine the health of the Run:ai cluster, regardless of the cluster status and the troubleshooting options previously described.
@@ -161,3 +172,26 @@ Submitting a Job allows you to verify that the Run:ai scheduling service is runn
 
 Log into the Run:ai user interface, and verify that you have a `Researcher` or `Research Manager` role.
 Go to the `Jobs` area. On the top right, press the button to create a Job. Once the form opens, you can submit a Job.
+
+## Advanced troubleshooting
+
+### Run:ai public ConfigMap
+
+Run:ai services use the `runai-public` ConfigMap to store information about the cluster status. This ConfigMap can be helpful in troubleshooting issues with Run:ai services.
+Inspect the ConfigMap by running:
+
+```bash
+kubectl get cm runai-public -n runai -o yaml
+```
+
+### Resources not deployed / System unavailable / Reconciliation failed
+
+1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
+2. Run:
+
+    ```
+    kubectl get pods -n runai
+    kubectl get pods -n monitoring
+    ```
+
+    Look for any failing pods, then inspect each one with `kubectl describe pod -n <pod_namespace> <pod_name>` and retrieve its logs with `kubectl logs -n <pod_namespace> <pod_name>`.
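The pod check above can be narrowed to only the failing pods. A minimal sketch, run here against inlined sample `kubectl get pods` output (the pod names and states are illustrative) so it is self-contained; on a live cluster you would pipe `kubectl get pods -n runai` straight into the `awk` filter:

```shell
# Illustrative `kubectl get pods` output; on a live cluster, pipe the real
# command output into the awk filter below instead.
cat > /tmp/pods.txt <<'EOF'
NAME                  READY   STATUS             RESTARTS   AGE
runai-agent-7f9d      1/1     Running            0          3d
runai-operator-5c8b   0/1     CrashLoopBackOff   12         3d
EOF

# Print pods that are in neither Running nor Completed state
awk 'NR > 1 && $3 !~ /Running|Completed/ {print $1, $3}' /tmp/pods.txt
```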

docs/admin/troubleshooting/troubleshooting.md

Lines changed: 29 additions & 1 deletion
@@ -15,7 +15,35 @@
    ingress-nginx:
      enabled: false
    ```
-
+
+??? "How to get installation logs"
+    __Symptom:__ Installation fails and you need to troubleshoot the issue.
+
+    __Resolution:__ Run the following command to obtain any relevant installation logs in case of an error.
+
+    ```bash
+    curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash
+    ```
+
+??? "Upgrade fails with 'rendered manifests contain a resource that already exists' error"
+    __Symptom:__ The installation fails with error: `Error: rendered manifests contain a resource that already exists. Unable to continue with install:...`
+
+    __Root cause:__ The Run:ai installation is trying to create a resource that already exists, which may be due to a previous installation that was not properly removed.
+
+    __Resolution:__ Run the following command to remove all Run:ai resources:
+
+    ```bash
+    helm template <release-name> <chart-name> --namespace <namespace> | kubectl delete -f -
+    ```
+
+    Then reinstall Run:ai.
+
+??? "Pods are failing due to certificate issues"
+    __Symptom:__ Pods are failing with certificate issues.
+
+    __Root cause:__ The certificate provided during the Control Plane's installation is not valid.
+
+    __Resolution:__ Verify that the certificate is valid and trusted. If the certificate is valid, but is signed by a local CA, make sure you have followed the procedure for a [local certificate authority](../runai-setup/config/org-cert.md).
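One way to act on the root cause above is to check the certificate's validity window with `openssl`. A minimal sketch; it generates a throwaway self-signed certificate so the example is runnable as-is, whereas in practice you would point `openssl x509 -in` at your own certificate file (for example, one extracted from the relevant TLS secret):

```shell
# Generate a short-lived self-signed certificate so the check below is
# runnable; in practice, point -in at your own certificate file instead.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example" \
  -keyout /tmp/key.pem -out /tmp/cert.pem -days 1 2>/dev/null

# Show the expiry date and verify the certificate has not expired
openssl x509 -in /tmp/cert.pem -noout -enddate
openssl x509 -in /tmp/cert.pem -noout -checkend 0 >/dev/null && echo "certificate is still valid"
```

An expired certificate fails the `-checkend 0` test; a certificate signed by a local CA passes it but still needs the local-CA procedure linked above.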
 
 ## Dashboard Issues
 