| **Impact** | Possible impact: no information, or only partial information, from the cluster is synced back to the control plane |
| **Severity** | Critical |
| **Diagnosis** | Run `kubectl get pod -n runai` and check whether the `cluster-sync` pod is running |
| **Troubleshooting/Mitigation** | To diagnose issues with the `cluster-sync` pod, follow these steps: **Describe the Deployment**: Run the following command to get detailed information about the `cluster-sync` deployment: `kubectl describe deployment cluster-sync -n runai`. **Check the Logs**: Use the following command to view the logs of the `cluster-sync` deployment: `kubectl logs deployment/cluster-sync -n runai`. **Analyze the Logs and Pod Details**: From the information provided by the logs and the deployment details, attempt to identify why the `cluster-sync` pod is not functioning correctly. **Check Connectivity**: Ensure there is a stable network connection between the cluster and the Run:ai control plane; a connectivity issue may be the root cause of the problem. **Contact Support**: If the network connection is stable and you are still unable to resolve the issue, contact Run:ai support for further assistance |
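The diagnostic commands above, collected for convenience. This is a minimal sketch using the deployment and namespace names given in this entry:

```sh
# Inspect the cluster-sync deployment and its recent events
kubectl describe deployment cluster-sync -n runai

# View the cluster-sync logs and look for sync or connectivity errors
kubectl logs deployment/cluster-sync -n runai
```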
Runai agent pull rate low
| Meaning | The `runai-agent` pod may be overloaded and slow to process data (possible in very large clusters), or the `runai-agent` pod in the `runai` namespace may not be functioning properly. |
| :---- | :---- |
| **Impact** | Possible impact: no information, or only partial information, from the control plane is synced to the cluster |
| **Severity** | Critical |
| **Diagnosis** | Run `kubectl get pod -n runai` and check whether the `runai-agent` pod is running. |
| **Troubleshooting/Mitigation** | To diagnose issues with the `runai-agent` pod, follow these steps: **Describe the Deployment**: Run the following command to get detailed information about the `runai-agent` deployment: `kubectl describe deployment runai-agent -n runai`. **Check the Logs**: Use the following command to view the logs of the `runai-agent` deployment: `kubectl logs deployment/runai-agent -n runai`. **Analyze the Logs and Pod Details**: From the information provided by the logs and the deployment details, attempt to identify why the `runai-agent` pod is not functioning correctly; there may be a connectivity issue with the control plane. **Check Connectivity**: Ensure there is a stable network connection between the `runai-agent` and the control plane; a connectivity issue may be the root cause of the problem. **Consider Cluster Load**: If the `runai-agent` appears to be functioning properly but the cluster is very large and heavily loaded, it may take more time for the agent to process data from the control plane. **Adjust Alert Threshold**: If the cluster load is causing the alert to fire, you can lower the threshold at which the alert triggers. The default value is 0.05; try a lower value (e.g., 0.045 or 0.04). To edit the value, run `kubectl edit runaiconfig -n runai` and set `spec.prometheus.agentPullPushRateMinForAlert` to the new value; if `agentPullPushRateMinForAlert` does not exist, add it under `spec.prometheus` (see the sketch below) |
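A minimal sketch of the threshold change described above; the value 0.04 is only an illustrative choice, not a recommendation:

```sh
# Open the Run:ai configuration for editing (as described above)
kubectl edit runaiconfig -n runai

# In the editor, set (or add) the threshold under spec.prometheus, for example:
#
#   spec:
#     prometheus:
#       agentPullPushRateMinForAlert: 0.04   # default is 0.05
```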
Runai container memory usage critical
Runai container restarting
| :---- | :---- |
| **Impact** | The container might become unavailable and impact the Run:ai system |
| **Severity** | Warning |
| **Diagnosis** | To diagnose the issue and identify the problematic pods, run `kubectl get pods -n runai` and `kubectl get pods -n runai-backend`. One or more of the pods will have a restart count >= 2. |
| **Troubleshooting/Mitigation** | Run `kubectl logs -n NAMESPACE POD_NAME`, replacing `NAMESPACE` and `POD_NAME` with the relevant pod information from the previous step. Check the logs for any standout issues and verify that the container has sufficient resources. If you need further assistance, contact Run:ai support |
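A minimal sketch of the diagnosis and log collection above; the pod name `cluster-sync-xxxxx` is a hypothetical placeholder for whichever pod shows a restart count >= 2:

```sh
# List pods in both Run:ai namespaces; the RESTARTS column shows the restart count
kubectl get pods -n runai
kubectl get pods -n runai-backend

# Fetch the logs of a pod that has restarted (hypothetical pod name)
kubectl logs -n runai cluster-sync-xxxxx

# The logs of the previously crashed container often hold the actual error
kubectl logs -n runai cluster-sync-xxxxx --previous
```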
| **Impact** | Fractional GPU workloads are not supported |
| **Severity** | Critical |
| **Diagnosis** | Run `kubectl get daemonset -n runai-backend`. In the output, identify the daemonset(s) that don’t have any running pods |
| **Troubleshooting/Mitigation** | Run `kubectl describe daemonset <DAEMONSET_NAME> -n runai`, replacing `<DAEMONSET_NAME>` with the problematic daemonset from the previous step, and look for the specific error that prevents it from creating pods. Possible reasons might be: **Node Resource Constraints**: The nodes in the cluster may lack sufficient resources (CPU, memory, etc.) to accommodate new pods from the daemonset. **Node Selector or Affinity Rules**: The daemonset may have node selector or affinity rules that do not match any nodes currently available in the cluster, thus preventing pod creation. |
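A minimal sketch of the describe step above; `<DAEMONSET_NAME>` is a placeholder for the daemonset identified during diagnosis:

```sh
# Describe the problematic daemonset; the Events section usually shows
# why pods cannot be created (resources, node selectors, affinity rules)
kubectl describe daemonset <DAEMONSET_NAME> -n runai
```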
Runai deployment insufficient replicas / Runai deployment no available replicas / RunaiDeploymentUnavailableReplicas
| Meaning | `Runai` deployment has one or more unavailable pods |
| :---- | :---- |
| **Impact** | When this happens, there may be scale issues. Additionally, new versions cannot be deployed, potentially resulting in missing features. |
| **Severity** | Critical |
| **Diagnosis** | To get the status of the deployments in the `runai` and `runai-backend` namespaces, run `kubectl get deployment -n runai` and `kubectl get deployment -n runai-backend`. Identify any deployments that have missing pods: look for discrepancies in the `DESIRED` and `AVAILABLE` columns. If the number of `AVAILABLE` pods is less than `DESIRED`, pods are missing. |
| **Troubleshooting/Mitigation** | To get detailed information about the problematic deployment, run `kubectl describe deployment <DEPLOYMENT_NAME> -n runai` or `kubectl describe deployment <DEPLOYMENT_NAME> -n runai-backend`. To check the replicaset associated with the deployment, run `kubectl describe replicaset <REPLICASET_NAME> -n runai` or `kubectl describe replicaset <REPLICASET_NAME> -n runai-backend`. To retrieve the deployment logs and identify any errors, run `kubectl logs deployment/<DEPLOYMENT_NAME> -n runai` or `kubectl logs deployment/<DEPLOYMENT_NAME> -n runai-backend`. From the logs and the output of the `describe` commands, analyze why the deployment is unable to create pods. Look for common issues such as resource constraints (CPU, memory), misconfigured deployment settings or replicasets, and node selector or affinity rules preventing pod scheduling. If the issue persists, contact Run:ai. |
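A minimal sketch of the sequence above for a deployment in the `runai` namespace; `<DEPLOYMENT_NAME>` and `<REPLICASET_NAME>` are placeholders (use `-n runai-backend` for deployments in that namespace):

```sh
# Find deployments whose available replica count is lower than desired
kubectl get deployment -n runai
kubectl get deployment -n runai-backend

# Inspect the problematic deployment, its replicaset, and its logs
kubectl describe deployment <DEPLOYMENT_NAME> -n runai
kubectl describe replicaset <REPLICASET_NAME> -n runai
kubectl logs deployment/<DEPLOYMENT_NAME> -n runai
```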
Runai project controller reconcile failure
| Meaning | The `project-controller` in `runai` namespace had errors while reconciling projects |
| :---- | :---- |
| **Impact** | Some projects might not be in the “Ready” state. This means that they are not fully operational and may not have all the necessary components running or configured correctly. |
| **Severity** | Critical |
| **Diagnosis** | Retrieve the logs for the `project-controller` deployment by running `kubectl logs deployment/project-controller -n runai`. Carefully examine the logs for any errors or warning messages; these help you understand what might be going wrong with the project controller. |
| **Troubleshooting/Mitigation** | Once errors in the log have been identified, follow these steps to mitigate the issue: The error messages in the logs should provide detailed information about the problem; read through them to understand the nature of the issue. If the logs indicate which project failed to reconcile, investigate further by checking the status of that specific project. Run `kubectl get project <PROJECT_NAME> -o yaml`, replacing `<PROJECT_NAME>` with the name of the problematic project. Review the status section in the YAML output; it describes the current state of the project and provides insights into what might be causing the failure. If the issue persists, contact Run:ai. |
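A minimal sketch of the steps above; the project name `team-a` is only a hypothetical example:

```sh
# Check the project-controller logs for reconcile errors
kubectl logs deployment/project-controller -n runai

# Inspect the status of the project that failed to reconcile (hypothetical name)
kubectl get project team-a -o yaml
```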
Runai StatefulSet insufficient replicas / Runai StatefulSet no available replicas
| Meaning | `Runai` statefulset has no available pods |
| :---- | :---- |
| **Impact** | Absence of metrics; database unavailability |
| **Severity** | Critical |
| **Diagnosis** | Check the status of the stateful sets in the `runai-backend` namespace by running `kubectl get statefulset -n runai-backend`. Identify any stateful sets that have no running pods; these are the ones that might be causing the problem. |
| **Troubleshooting/Mitigation** | Once you've identified the problematic stateful sets, describe each one to get detailed information on why it cannot create pods: `kubectl describe statefulset <STATEFULSET_NAME> -n runai-backend`, replacing `<STATEFULSET_NAME>` with the name of the stateful set. Review the output to understand the root cause; look for events or error messages that explain why the pods are not being created. If you're unable to resolve the issue based on the information gathered, contact Run:ai support for further assistance. |
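A minimal sketch of the diagnosis and describe steps above; `<STATEFULSET_NAME>` is a placeholder:

```sh
# Find stateful sets in runai-backend with no running pods
kubectl get statefulset -n runai-backend

# Describe the problematic stateful set; the Events section usually explains
# why pods cannot be created (for example, unbound volume claims or insufficient resources)
kubectl describe statefulset <STATEFULSET_NAME> -n runai-backend
```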