docs/admin/runai-setup/cluster-setup/cluster-install.md
Lines changed: 3 additions & 1 deletion
@@ -25,11 +25,13 @@ On the next page:
 * (SaaS and remote self-hosted cluster only) Install a trusted certificate to the domain entered above.
 * Run the [Helm](https://helm.sh/docs/intro/install/) command provided in the wizard.
+* In case of a failure, see the [Installation troubleshooting guide](../../troubleshooting/troubleshooting.md#installation).

-## Verify your Installation
+## Verify your cluster's health

 * Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
 * Go to the [Overview Dashboard](../../admin-ui-setup/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
+* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).
docs/admin/troubleshooting/cluster-health-check.md
Lines changed: 8 additions & 8 deletions
@@ -63,7 +63,7 @@ Use the following steps to troubleshoot the issue:
 !!! Note
 The previous steps can be used if you installed the cluster and the status is stuck in *Waiting to connect* for a long time.

-## Cluster has service issues
+## Cluster has *service issues*

 When a cluster's status shows *Service issues*, this means that one or more Run:ai services that are running in the cluster are not available.
@@ -91,13 +91,13 @@ When a cluster's status shows *Service issues*, this means that one or more Run:
 3. If the issue persists, contact Run:ai support for assistance.

-## Cluster has missing prerequisites
+## Cluster has *missing prerequisites*

-When a cluster's status displays Missing prerequisites, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.
+When a cluster's status displays *Missing prerequisites*, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.

-If you have ensured that all prerequisites are installed and the status still shows Missing prerequisites, follow these steps:
+If you have ensured that all prerequisites are installed and the status still shows *Missing prerequisites*, follow these steps:

-1. Check the message in the UI for further details regarding the missing prerequisites.
+1. Check the message in the Control Plane for further details regarding the missing prerequisites.
 2. Inspect the [runai-public ConfigMap](#runai-public-configmap) and look for the `dependencies.required` field to obtain detailed information about the missing resources.
 3. If the issue persists, contact Run:ai support for assistance.
@@ -173,18 +173,18 @@ Submitting a Job allows you to verify that the Run:ai scheduling service is runn
 Log into the Run:ai user interface, and verify that you have a `Researcher` or `Research Manager` role.
 Go to the `Jobs` area. On the top right, press the button to create a Job. Once the form opens, you can submit a Job.

-## Advanced Troubleshooting
+## Advanced troubleshooting

 ### Run:ai public ConfigMap

 Run:ai services use the `runai-public` ConfigMap to store information about the cluster status. This ConfigMap can be helpful in troubleshooting issues with Run:ai services.
 Inspect the ConfigMap by running:

 ```bash
-kubectl get cm runai-public -oyaml| yq .data.runai-public
+kubectl get cm runai-public -oyaml
 ```

-### Resources not deployed / System Unavailable / Reconciliation Failed
+### Resources not deployed / System unavailable / Reconciliation failed

 1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
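
Note on the `kubectl get cm runai-public` change in this diff: dropping the `| yq .data.runai-public` filter removes the `yq` dependency, but the command then prints the whole ConfigMap rather than only the `runai-public` key. A minimal sketch of one way to keep the narrower output without `yq`, assuming kubectl's built-in JSONPath output is acceptable here. The helper name `runai_public_data` is hypothetical and not part of the Run:ai docs, and like the documented command it does not set a namespace.

```bash
# Hypothetical helper (an illustration, not part of the Run:ai docs):
# print only the `runai-public` key of the ConfigMap using kubectl's
# built-in JSONPath output instead of yq. Any extra arguments (for
# example a namespace flag) are passed through to kubectl unchanged.
runai_public_data() {
  kubectl get cm runai-public -o "jsonpath={.data.runai-public}" "$@"
}

# Usage (requires access to the cluster):
#   runai_public_data
```

The `dependencies.required` field mentioned in the *Missing prerequisites* steps can then be read from the printed value.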