Commit d7791c9

Merge pull request #751 from run-ai/shaibi/RUN-16065-v2.17
RUN-16065 Revamp Cluster Installation Troubleshooting
2 parents 8d09bc7 + 0a4f3ef commit d7791c9

File tree

4 files changed

+75
-88
lines changed


docs/admin/runai-setup/cluster-setup/cluster-install.md

Lines changed: 7 additions & 86 deletions
@@ -25,93 +25,14 @@ On the next page:
 
 * (SaaS and remote self-hosted cluster only) Install a trusted certificate to the domain entered above.
 * Run the [Helm](https://helm.sh/docs/intro/install/) command provided in the wizard.
+* In case of a failure, see the [Installation troubleshooting guide](../../troubleshooting/troubleshooting.md#installation).
+
+## Verify your cluster's health
+
+* Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
+* Go to the [Overview Dashboard](../../admin-ui-setup/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
+* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).
 
-## Verify your Installation
-
-* Go to `<company-name>.run.ai/dashboards/now`.
-* Verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
-* Run: `kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P`
-
-    (assumes that [yq](https://mikefarah.gitbook.io/yq/v/v3.x/){target=_blank} is instaled)
-
-    Example output:
-
-    ``` YAML
-    cluster-version: 2.9.0
-    runai-public:
-      version: 2.9.0
-      runaiConfigStatus:
-        conditions:
-        - type: DependenciesFulfilled # (1)
-          status: "True"
-          reason: dependencies_fulfilled
-          message: Dependencies are fulfilled
-        - type: Deployed
-          status: "True"
-          reason: deployed
-          message: Resources Deployed
-        - type: Available
-          status: "True"
-          reason: available
-          message: System Available
-        - type: Reconciled # (2)
-          status: "True"
-          reason: reconciled
-          message: Reconciliation completed successfully
-      optional: # (3)
-        knative: # (4)
-          components:
-            hpa:
-              available: true
-            knative:
-              available: true
-            kourier:
-              available: true
-        mpi: # (5)
-          available: true
-    ```
-
-1. Verifies that all mandatory dependencies are met: NVIDIA GPU Operator, Prometheus and NGINX controller.
-2. Checks whether optional product dependencies have been met.
-3. See [Inference prerequisites](cluster-prerequisites.md#inference).
-4. See [distributed training prerequisites](cluster-prerequisites.md#distributed-training).
-
-<!-- For a more extensive verification of cluster health, see [Determining the health of a cluster](../../troubleshooting/cluster-health-check.md). -->
-
-### Troubleshooting your installation
-
-#### Dependencies are not fulfilled
-
-1. Make sure to install the missing dependencies.
-2. If dependencies are installed, make sure that the CRDs of said dependency are installed, and that the version is supported
-3. Make sure there are no necessary adjustments for specific flavors as noted in the [Cluster prerequisites](cluster-prerequisites.md)
-
-#### Resources not deployed / System Unavailable / Reconciliation Failed
-
-1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
-2. Run
-
-    ```
-    kubectl get pods -n runai
-    kubectl get pods -n monitoring
-    ```
-
-    You can also run `kubectl logs <pod_name>` to get logs from any failing pod.
-
-#### Common Issues
-
-1. Run:ai was previously installed in the cluster and was deleted unsuccessfully, resulting in remaining CRDs.
-    1. Diagnosis: This can be detected by running `kubectl get crds` in the relevant namespaces (or adding `-A` and checking for Run:ai CRDs).
-    2. Solution: Force delete the listed CRDs and reinstall.
-2. One or more of the pods have issues around valid certificates.
-    1. Diagnosis: The logs contains a message similar to the following `failed to verify certificate: x509: certificate signed by unknown authority`.
-    2. Solution:
-        1. This is usually due to an expired or invalid certificate in the cluster, and if so, renew the certificate.
-        2. If the certificate is valid, but is signed by a local CA, make sure you have followed the procedure for a [local certificate authority](../config/org-cert.md).
-
-#### Get Installation Logs
-
-You can use the [get instllation logs](https://github.com/run-ai/public/blob/main/installation/get-installation-logs.sh) script to obtain any relevant installation logs in case of an error.
 ## Researcher Authentication
 
 If you will be using the Run:ai [command-line interface](../../researcher-setup/cli-install.md) or sending [YAMLs directly](../../../developer/cluster-api/submit-yaml.md) to Kubernetes, you must now set up [Researcher Access Control](../authentication/researcher-authentication.md).
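The `runaiConfigStatus` conditions stored in the `runai-public` ConfigMap lend themselves to a quick scripted check. A minimal sketch, using inlined sample data so it runs anywhere; on a live cluster you would first capture the status with `kubectl get cm runai-public -n runai -o jsonpath='{.data}'` (the file path and the `"False"` condition below are hypothetical):

```shell
# Hypothetical sample of the runaiConfigStatus conditions; on a live cluster,
# capture them with: kubectl get cm runai-public -n runai -o jsonpath='{.data}'
cat > /tmp/runai-status.yaml <<'EOF'
conditions:
- type: DependenciesFulfilled
  status: "True"
- type: Deployed
  status: "True"
- type: Available
  status: "False"
EOF

# Print any condition whose status is not "True"
awk '/- type:/ {t=$3} /status:/ {gsub(/"/, "", $2); if ($2 != "True") print t " is " $2}' /tmp/runai-status.yaml
```

Any condition other than `True` points at the matching troubleshooting step; for example, a failed `DependenciesFulfilled` condition sends you back to the prerequisites checklist.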

docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md

Lines changed: 4 additions & 0 deletions
@@ -142,6 +142,7 @@ Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-nati
 * NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flags `--set driver.enabled=false`. [DGX OS](https://docs.nvidia.com/dgx/index.html){target=_blank} is one such example as it comes bundled with NVIDIA Drivers.
 <!-- * To work with *containerd* (e.g. for Tanzu), use the [defaultRuntime](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#chart-customization-options){target=_blank} flag accordingly. -->
 * To use [Dynamic MIG](../../../Researcher/scheduling/fractions.md#dynamic-mig), the GPU Operator must be installed with the flag `mig.strategy=mixed`. If the GPU Operator is already installed, edit the clusterPolicy by running ```kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'```
+* For troubleshooting information, see the [NVIDIA GPU Operator Troubleshooting Guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html){target=_blank}.
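After applying the `mig.strategy=mixed` patch above, you can confirm it took effect by reading the strategy back from the clusterPolicy spec. A minimal sketch over an illustrative saved copy so it is self-contained; on a live cluster you would instead run `kubectl get clusterpolicy cluster-policy -n gpu-operator -o jsonpath='{.spec.mig.strategy}'`:

```shell
# Illustrative saved fragment of the clusterPolicy spec; on a live cluster:
# kubectl get clusterpolicy cluster-policy -n gpu-operator -o json
cat > /tmp/cluster-policy.json <<'EOF'
{"spec": {"mig": {"strategy": "mixed"}}}
EOF

# Confirm the MIG strategy was set to "mixed" by the patch
grep -o '"strategy": *"[^"]*"' /tmp/cluster-policy.json
```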
 
 ### Ingress Controller
 
@@ -225,6 +226,9 @@ helm install prometheus prometheus-community/kube-prometheus-stack \
 
 1. The Grafana component is not required for Run:ai.
 
+!!! Note
+    For troubleshooting information, see the [Prometheus Troubleshooting Guide](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md){target=_blank}.
+
 ## Optional Software Requirements
 
 The following software enables specific features of Run:ai

docs/admin/troubleshooting/cluster-health-check.md

Lines changed: 35 additions & 1 deletion
@@ -8,6 +8,7 @@ date: 2024-Jan-17
 ---
 
 This troubleshooting guide helps you diagnose and resolve issues you may find in your cluster.
+The cluster status is displayed in the Run:ai Control Plane. See [Cluster Status](../runai-setup/cluster-setup/cluster-install.md#cluster-status) for a list of possible statuses.
 
 ## Cluster is disconnected
 
@@ -62,7 +63,7 @@ Use the following steps to troubleshoot the issue:
 !!! Note
     The previous steps can be used if you installed the cluster and the status is stuck in *Waiting to connect* for a long time.
 
-## Cluster has service issues
+## Cluster has *service issues*
 
 When a cluster's status shows *Service issues*, this means that one or more Run:ai services that are running in the cluster are not available.
 
@@ -90,6 +91,16 @@ When a cluster's status shows *Service issues*, this means that one or more Run:
 
 3. If the issue persists, contact Run:ai support for assistance.
 
+## Cluster has *missing prerequisites*
+
+When a cluster's status displays *Missing prerequisites*, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.
+
+If you have ensured that all prerequisites are installed and the status still shows *Missing prerequisites*, follow these steps:
+
+1. Check the message in the Control Plane for further details regarding the missing prerequisites.
+2. Inspect the [runai-public ConfigMap](#runai-public-configmap) and look for the `dependencies.required` field to obtain detailed information about the missing resources.
+3. If the issue persists, contact Run:ai support for assistance.
+
 ## General tests to verify the Run:ai cluster health
 
 Use the following tests regularly to determine the health of the Run:ai cluster, regardless of the cluster status and the troubleshooting options previously described.
@@ -161,3 +172,26 @@ Submitting a Job allows you to verify that the Run:ai scheduling service is runn
 
 Log into the Run:ai user interface, and verify that you have a `Researcher` or `Research Manager` role.
 Go to the `Jobs` area. On the top right, press the button to create a Job. Once the form opens, you can submit a Job.
+
+## Advanced troubleshooting
+
+### Run:ai public ConfigMap
+
+Run:ai services use the `runai-public` ConfigMap to store information about the cluster status. This ConfigMap can be helpful in troubleshooting issues with Run:ai services.
+Inspect the ConfigMap by running:
+
+```bash
+kubectl get cm runai-public -n runai -o yaml
+```
+
+### Resources not deployed / System unavailable / Reconciliation failed
+
+1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
+2. Run:
+
+    ```
+    kubectl get pods -n runai
+    kubectl get pods -n monitoring
+    ```
+
+    Look for any failing pods, then inspect each one with `kubectl describe pod -n <pod_namespace> <pod_name>` and retrieve its logs with `kubectl logs -n <pod_namespace> <pod_name>`.
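The pod check above can be narrowed to only the failing pods. A minimal sketch, run here against inlined sample `kubectl get pods` output (the pod names and states are illustrative) so it is self-contained; on a live cluster you would pipe `kubectl get pods -n runai` straight into the `awk` filter:

```shell
# Illustrative `kubectl get pods` output; on a live cluster, pipe the real
# command output into the awk filter below instead.
cat > /tmp/pods.txt <<'EOF'
NAME                  READY   STATUS             RESTARTS   AGE
runai-agent-7f9d      1/1     Running            0          3d
runai-operator-5c8b   0/1     CrashLoopBackOff   12         3d
EOF

# Print pods that are in neither Running nor Completed state
awk 'NR > 1 && $3 !~ /Running|Completed/ {print $1, $3}' /tmp/pods.txt
```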

docs/admin/troubleshooting/troubleshooting.md

Lines changed: 29 additions & 1 deletion
@@ -15,7 +15,35 @@
    ingress-nginx:
      enabled: false
    ```
-
+
+??? "How to get installation logs"
+    __Symptom:__ Installation fails and you need to troubleshoot the issue.
+
+    __Resolution:__ Run the following command to obtain any relevant installation logs in case of an error.
+
+    ```bash
+    curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash
+    ```
+
+??? "Upgrade fails with 'rendered manifests contain a resource that already exists' error"
+    __Symptom:__ The installation fails with error: `Error: rendered manifests contain a resource that already exists. Unable to continue with install:...`
+
+    __Root cause:__ The Run:ai installation is trying to create a resource that already exists, which may be due to a previous installation that was not properly removed.
+
+    __Resolution:__ Run the following command to remove all Run:ai resources:
+
+    ```bash
+    helm template <release-name> <chart-name> --namespace <namespace> | kubectl delete -f -
+    ```
+
+    Then reinstall Run:ai.
+
+??? "Pods are failing due to certificate issues"
+    __Symptom:__ Pods are failing with certificate issues.
+
+    __Root cause:__ The certificate provided during the Control Plane's installation is not valid.
+
+    __Resolution:__ Verify that the certificate is valid and trusted. If the certificate is valid, but is signed by a local CA, make sure you have followed the procedure for a [local certificate authority](../runai-setup/config/org-cert.md).
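One way to act on the root cause above is to check the certificate's validity window with `openssl`. A minimal sketch; it generates a throwaway self-signed certificate so the example is runnable as-is, whereas in practice you would point `openssl x509 -in` at your own certificate file (for example, one extracted from the relevant TLS secret):

```shell
# Generate a short-lived self-signed certificate so the check below is
# runnable; in practice, point -in at your own certificate file instead.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example" \
  -keyout /tmp/key.pem -out /tmp/cert.pem -days 1 2>/dev/null

# Show the expiry date and verify the certificate has not expired
openssl x509 -in /tmp/cert.pem -noout -enddate
openssl x509 -in /tmp/cert.pem -noout -checkend 0 >/dev/null && echo "certificate is still valid"
```

An expired certificate fails the `-checkend 0` test; a certificate signed by a local CA passes it but still needs the local-CA procedure linked above.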
 
 ## Dashboard Issues
 