    Repositories list

    • arks

      Public
      Arks is a cloud-native inference framework running on Kubernetes
      Go
      44300 Updated Oct 14, 2025
    • Collect IB counters in service
      Go
      1000 Updated Oct 14, 2025
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
      Python
      671000 Updated Oct 13, 2025
    • aegis

      Public
      Aegis is an LLM-powered AI cluster autonomous operations system, focused on intelligent capabilities such as Fault Diagnosis, Self-healing, Root Cause Analysis, Cluster Inspection, and Alert Optimization.
      Go
      1900 Updated Sep 30, 2025
    • Contains example scripts for deep learning
      Shell
      2000 Updated Sep 15, 2025
    • Python
      0000 Updated Sep 12, 2025
    • SiLLM

      Public
      Python
      0000 Updated Aug 20, 2025
    • sichek

      Public
      Sichek is a tool for detecting and diagnosing node-level issues in AI environments, ensuring the reliability and high performance of GPU-intensive workloads. It proactively identifies hardware and software problems and triggers timely automated corrective actions, including task retries and operational maintenance.
      Go
      21303 Updated Aug 7, 2025
    • netpulse

      Public
      API Server for Network Automation
      Python
      21100 Updated Jul 20, 2025
    • Ongoing research training transformer models at scale
      Python
      3.2k000 Updated Jun 14, 2025
    • Automatically sets the HPC NIC ring buffer to match the NIC spec
      Go
      0000 Updated May 19, 2025
    • MEAP

      Public
      Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
      Python
      33410 Updated May 17, 2025
    • Dockerfile for building OFED inside an image
      Dockerfile
      0000 Updated May 16, 2025
    • Check RDMA status
      Go
      0000 Updated Apr 25, 2025
    • Launching, tuning and monitoring tools and scripts for benchmarking at Scitix
      Shell
      0000 Updated Mar 22, 2025
    • 0000 Updated Mar 11, 2025
    • NCCL Tests
      Cuda
      321000 Updated Feb 5, 2025
    • nccl

      Public
      Optimized primitives for collective multi-GPU communication
      C++
      1k000 Updated Jan 6, 2025
    • Workflow Engine for Kubernetes
      Go
      3.4k000 Updated Dec 13, 2024
    • sidiag

      Public
      Fast diagnosis and problem resolution for SiCL-based distributed jobs
      0000 Updated Dec 12, 2024
    • nccl-exts

      Public
      SiCL extensions for NCCL
      C++
      0200 Updated Dec 9, 2024