docs/admin/runai-setup/cluster-setup/cluster-install.md
7 additions & 86 deletions
@@ -25,93 +25,14 @@ On the next page:
* (SaaS and remote self-hosted cluster only) Install a trusted certificate to the domain entered above.
* Run the [Helm](https://helm.sh/docs/intro/install/) command provided in the wizard (an illustrative sketch follows this list).
* In case of a failure, see the [Installation troubleshooting guide](../../troubleshooting/troubleshooting.md#installation).
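The exact command is generated per cluster by the wizard and contains tenant-specific values. The sketch below is illustrative only: the repository URL, chart name, and `--set` keys are assumptions based on a typical Run:ai cluster installation and must be replaced by the wizard-generated command.

```bash
# Illustrative sketch only -- always copy the exact command from the wizard.
# Repository URL, chart name, and --set keys below are assumed placeholders.
helm repo add runai https://run-ai-charts.storage.googleapis.com
helm repo update
helm upgrade -i runai-cluster runai/runai-cluster \
  -n runai --create-namespace \
  --set controlPlane.url=<company-name>.run.ai \
  --set controlPlane.clientSecret=<client-secret> \
  --set cluster.uid=<cluster-uid> \
  --set cluster.url=<cluster-domain>
```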
## Verify your cluster's health
* Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
* Go to the [Overview Dashboard](../../admin-ui-setup/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs at the top right reflects the GPU resources in your cluster, and that the list of machines with GPU resources appears at the bottom (a CLI cross-check is sketched after this list).
* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).
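To cross-check the dashboard numbers from the command line, you can list the GPU capacity that Kubernetes reports per node. This assumes the NVIDIA GPU Operator (or device plugin) has already advertised the `nvidia.com/gpu` resource:

```bash
# Cross-check: GPU capacity reported by Kubernetes per node
# (assumes the nvidia.com/gpu resource is advertised by the GPU Operator / device plugin).
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUs:.status.capacity.nvidia\.com/gpu'
```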
## Verify your Installation
* Go to `<company-name>.run.ai/dashboards/now`.
* Verify that the number of GPUs at the top right reflects the GPU resources in your cluster and that the list of machines with GPU resources appears at the bottom.
* Run: `kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P`

    (assumes that [yq](https://mikefarah.gitbook.io/yq/v/v3.x/){target=_blank} is installed)
Example output:

```YAML
cluster-version: 2.9.0
runai-public:
  version: 2.9.0
  runaiConfigStatus:
    conditions:
      - type: DependenciesFulfilled # (1)
        status: "True"
        reason: dependencies_fulfilled
        message: Dependencies are fulfilled
      - type: Deployed
        status: "True"
        reason: deployed
        message: Resources Deployed
      - type: Available
        status: "True"
        reason: available
        message: System Available
      - type: Reconciled # (2)
        status: "True"
        reason: reconciled
        message: Reconciliation completed successfully
  optional: # (3)
    knative: # (4)
      components:
        hpa:
          available: true
        knative:
          available: true
        kourier:
          available: true
    mpi: # (5)
      available: true
```
1. Verifies that all mandatory dependencies are met: NVIDIA GPU Operator, Prometheus and NGINX controller.
2. Checks whether optional product dependencies have been met.
3. See [Inference prerequisites](cluster-prerequisites.md#inference).
4. See [distributed training prerequisites](cluster-prerequisites.md#distributed-training).
<!-- For a more extensive verification of cluster health, see [Determining the health of a cluster](../../troubleshooting/cluster-health-check.md). -->
### Troubleshooting your installation
#### Dependencies are not fulfilled
1. Make sure to install the missing dependencies.
86
-
2. If the dependencies are installed, make sure that their CRDs are installed and that the versions are supported (see the illustrative checks after this list).
87
-
3. Make sure that any adjustments required for specific Kubernetes flavors, as noted in the [Cluster prerequisites](cluster-prerequisites.md), have been applied.
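For example, a quick way to confirm that the mandatory dependencies are present is to look for their CRDs and workloads. The component names below are typical defaults and may differ depending on how the dependencies were installed in your cluster:

```bash
# Typical default names -- adjust to your installation.
kubectl get crds | grep -E 'nvidia.com|monitoring.coreos.com'          # GPU Operator and Prometheus CRDs
kubectl get pods -A | grep -Ei 'gpu-operator|prometheus|ingress-nginx' # dependency workloads
```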
88
-
89
-
#### Resources not deployed / System Unavailable / Reconciliation Failed
90
-
91
-
1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
92
-
2. Run:

    ```
    kubectl get pods -n runai
    kubectl get pods -n monitoring
    ```
98
-
99
-
You can also run `kubectl logs -n <namespace> <pod_name>` to get logs from any failing pod.
#### Common Issues
1. Run:ai was previously installed in the cluster and was not removed cleanly, leaving Run:ai CRDs behind.
    1. Diagnosis: This can be detected by running `kubectl get crds` and checking for leftover Run:ai CRDs.
    2. Solution: Force delete the listed CRDs and reinstall (an illustrative cleanup sketch follows this list).
2. One or more of the pods have issues with certificate validity.
    1. Diagnosis: The logs contain a message similar to the following: `failed to verify certificate: x509: certificate signed by unknown authority`.
    2. Solution:
        1. This is usually due to an expired or invalid certificate in the cluster; if so, renew the certificate.
        2. If the certificate is valid but is signed by a local CA, make sure you have followed the procedure for a [local certificate authority](../config/org-cert.md).
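A minimal cleanup sketch for the first issue above, assuming you have already identified the leftover Run:ai CRDs (the resource name is a placeholder):

```bash
# List leftover Run:ai CRDs, then force delete them before reinstalling.
kubectl get crds | grep runai
kubectl delete crd <runai-crd-name>
# If deletion hangs, clear the finalizers first (use with care):
kubectl patch crd <runai-crd-name> --type=merge -p '{"metadata":{"finalizers":[]}}'
```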
#### Get Installation Logs
You can use the [get installation logs](https://github.com/run-ai/public/blob/main/installation/get-installation-logs.sh) script to obtain any relevant installation logs in case of an error.
## Researcher Authentication
If you will be using the Run:ai [command-line interface](../../researcher-setup/cli-install.md) or sending [YAMLs directly](../../../developer/cluster-api/submit-yaml.md) to Kubernetes, you must now set up [Researcher Access Control](../authentication/researcher-authentication.md).
docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
4 additions & 0 deletions
@@ -142,6 +142,7 @@ Follow the [Getting Started guide](https://docs.nvidia.com/datacenter/cloud-nati
* NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag `--set driver.enabled=false` (an illustrative install sketch follows this list). [DGX OS](https://docs.nvidia.com/dgx/index.html){target=_blank} is one such example, as it comes bundled with NVIDIA drivers.
<!-- * To work with *containerd* (e.g. for Tanzu), use the [defaultRuntime](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#chart-customization-options){target=_blank} flag accordingly. -->
* To use [Dynamic MIG](../../../Researcher/scheduling/fractions.md#dynamic-mig), the GPU Operator must be installed with the flag `mig.strategy=mixed`. If the GPU Operator is already installed, edit the clusterPolicy by running ```kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'```
* For troubleshooting information, see the [NVIDIA GPU Operator Troubleshooting Guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html){target=_blank}.
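As an illustration of the pre-installed-drivers case mentioned above, a GPU Operator installation along the lines of NVIDIA's Getting Started guide would look roughly as follows. The repository URL and chart name should be verified against the current NVIDIA documentation:

```bash
# Illustrative sketch for nodes that already have NVIDIA drivers (e.g. DGX OS);
# verify the repository URL and chart name against NVIDIA's current documentation.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
```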
1. The Grafana component is not required for Run:ai.
!!! Notes
    * For troubleshooting information, see the [Prometheus Troubleshooting Guide](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md){target=_blank}.
## Optional Software Requirements
The following software enables specific features of Run:ai
docs/admin/troubleshooting/cluster-health-check.md
35 additions & 1 deletion
@@ -8,6 +8,7 @@ date: 2024-Jan-17
---
This troubleshooting guide helps you diagnose and resolve issues you may find in your cluster.
The cluster status is displayed in the Run:ai Control Plane. See [Cluster Status](../runai-setup/cluster-setup/cluster-install.md#cluster-status) for a list of possible statuses.
## Cluster is disconnected
@@ -62,7 +63,7 @@ Use the following steps to troubleshoot the issue:
!!! Note
    The previous steps can be used if you installed the cluster and the status is stuck in *Waiting to connect* for a long time.
## Cluster has *service issues*
When a cluster's status shows *Service issues*, this means that one or more Run:ai services that are running in the cluster are not available.
@@ -90,6 +91,16 @@ When a cluster's status shows *Service issues*, this means that one or more Run:
3. If the issue persists, contact Run:ai support for assistance.
## Cluster has *missing prerequisites*
When a cluster's status displays *Missing prerequisites*, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.
If you have ensured that all prerequisites are installed and the status still shows *Missing prerequisites*, follow these steps:
1. Check the message in the Control Plane for further details regarding the missing prerequisites.
2. Inspect the [runai-public ConfigMap](#runai-public-configmap) and look for the `dependencies.required` field to obtain detailed information about the missing resources (an illustrative command follows this list).
3. If the issue persists, contact Run:ai support for assistance.
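For example, you can dump the ConfigMap with the command documented earlier and search for the dependency status. The exact field layout may vary between Run:ai versions:

```bash
# Dump the runai-public ConfigMap and look at the dependency status
# (field layout may differ between Run:ai versions; requires yq).
kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P | grep -i -A 20 'dependencies'
```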
## General tests to verify the Run:ai cluster health
Use the following tests regularly to determine the health of the Run:ai cluster, regardless of the cluster status and the troubleshooting options previously described.
@@ -161,3 +172,26 @@ Submitting a Job allows you to verify that the Run:ai scheduling service is runn
Log into the Run:ai user interface, and verify that you have a `Researcher` or `Research Manager` role.
Go to the `Jobs` area. On the top right, press the button to create a Job. Once the form opens, you can submit a Job.
## Advanced troubleshooting
### Run:ai public ConfigMap
Run:ai services use the `runai-public` ConfigMap to store information about the cluster status. This ConfigMap can be helpful in troubleshooting issues with Run:ai services.
Inspect the ConfigMap by running:
```bash
kubectl get cm runai-public -n runai -o yaml
```
### Resources not deployed / System unavailable / Reconciliation failed
1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
2. Run:

    ```
    kubectl get pods -n runai
    kubectl get pods -n monitoring
    ```
Look for any failing pods and get more information about them by running `kubectl describe pod -n <pod_namespace> <pod_name>`.
??? "Upgrade fails with "rendered manifests contain a resource that already exists" error"
    __Symptom:__ The installation fails with error: `Error: rendered manifests contain a resource that already exists. Unable to continue with install:...`
    __Root cause:__ The Run:ai installation is trying to create a resource that already exists, which may be due to a previous installation that was not properly removed.
    __Resolution:__ Run the following script to remove all Run:ai resources and reinstall:
__Symptom:__ Pods are failing with certificate issues.
__Root cause:__ The certificate provided during the Control Plane's installation is not valid.
__Resolution:__ Verify that the certificate is valid and trusted. If the certificate is valid, but is signed by a local CA, make sure you have followed the procedure for a [local certificate authority](../runai-setup/config/org-cert.md).