Commit 0a4f3ef

1 parent 7cffe5c commit 0a4f3ef

2 files changed: +11 -9 lines changed

docs/admin/runai-setup/cluster-setup/cluster-install.md

Lines changed: 3 additions & 1 deletion
@@ -25,11 +25,13 @@ On the next page:

 * (SaaS and remote self-hosted cluster only) Install a trusted certificate to the domain entered above.
 * Run the [Helm](https://helm.sh/docs/intro/install/) command provided in the wizard.
+* In case of a failure, see the [Installation troubleshooting guide](../../troubleshooting/troubleshooting.md#installation).

-## Verify your Installation
+## Verify your cluster's health

 * Verify that the cluster status in the Run:ai Control Plane's [Clusters Table](#cluster-table) is `Connected`.
 * Go to the [Overview Dashboard](../../admin-ui-setup/dashboard-analysis.md#overview-dashboard) and verify that the number of GPUs on the top right reflects your GPU resources on your cluster and the list of machines with GPU resources appears on the bottom line.
+* In case of issues, see the [Troubleshooting guide](../../troubleshooting/cluster-health-check.md).

 ## Researcher Authentication

docs/admin/troubleshooting/cluster-health-check.md

Lines changed: 8 additions & 8 deletions
@@ -63,7 +63,7 @@ Use the following steps to troubleshoot the issue:
 !!! Note
 The previous steps can be used if you installed the cluster and the status is stuck in *Waiting to connect* for a long time.

-## Cluster has service issues
+## Cluster has *service issues*

 When a cluster's status shows *Service issues*, this means that one or more Run:ai services that are running in the cluster are not available.

@@ -91,13 +91,13 @@ When a cluster's status shows *Service issues*, this means that one or more Run:

 3. If the issue persists, contact Run:ai support for assistance.

-## Cluster has missing prerequisites
+## Cluster has *missing prerequisites*

-When a cluster's status displays Missing prerequisites, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.
+When a cluster's status displays *Missing prerequisites*, it indicates that at least one of the [Mandatory Prerequisites](../runai-setup/cluster-setup/cluster-prerequisites.md#prerequisites-in-a-nutshell) has not been fulfilled. In such cases, Run:ai services may not function properly.

-If you have ensured that all prerequisites are installed and the status still shows Missing prerequisites, follow these steps:
+If you have ensured that all prerequisites are installed and the status still shows *Missing prerequisites*, follow these steps:

-1. Check the message in the UI for further details regarding the missing prerequisites.
+1. Check the message in the Control Plane for further details regarding the missing prerequisites.
 2. Inspect the [runai-public ConfigMap](#runai-public-configmap) and look for the `dependencies.required` field to obtain detailed information about the missing resources.
 3. If the issue persists, contact Run:ai support for assistance.

@@ -173,18 +173,18 @@ Submitting a Job allows you to verify that the Run:ai scheduling service is runn
 Log into the Run:ai user interface, and verify that you have a `Researcher` or `Research Manager` role.
 Go to the `Jobs` area. On the top right, press the button to create a Job. Once the form opens, you can submit a Job.

-## Advanced Troubleshooting
+## Advanced troubleshooting

 ### Run:ai public ConfigMap

 Run:ai services use the `runai-public` ConfigMap to store information about the cluster status. This ConfigMap can be helpful in troubleshooting issues with Run:ai services.
 Inspect the ConfigMap by running:

 ```bash
-kubectl get cm runai-public -oyaml | yq .data.runai-public
+kubectl get cm runai-public -oyaml
 ```

-### Resources not deployed / System Unavailable / Reconciliation Failed
+### Resources not deployed / System unavailable / Reconciliation failed

 1. Run the [Preinstall diagnostic script](cluster-prerequisites.md#pre-install-script) and check for issues.
 2. Run
