Repositories list
19 repositories
deep_learning_examples (Public)
aegis (Public): Aegis is an LLM-powered autonomous operations system for AI clusters, focused on intelligent capabilities such as Fault Diagnosis, Self-healing, Root Cause Analysis, Cluster Inspection, and Alert Optimization.
sichek (Public): Sichek is a tool for detecting and diagnosing node-level issues in AI environments, ensuring the reliability and high performance of GPU-intensive workloads. It proactively identifies hardware and software problems and triggers timely automated corrective actions, including task retries and operational maintenance.
scitix-ib-exporter (Public)
arks (Public)
netpulse (Public)
Megatron-LM (Public)
optimize-hpc-nic (Public)
MEAP (Public)
ofed_docker (Public)
rdma-service (Public)
siperf-common (Public)
roce-operator (Public)
nccl-tests (Public)
SiLLM (Public)
nccl (Public)
argo-workflows (Public)
sidiag (Public)
nccl-exts (Public)