    Repositories list

    • Contains example scripts for deep learning
      Shell
      2000 Updated Aug 10, 2025
    • aegis

      Public
      Aegis is an LLM-powered AI cluster autonomous operations system, focused on intelligent capabilities such as Fault Diagnosis, Self-healing, Root Cause Analysis, Cluster Inspection, and Alert Optimization.
      Go
      1900 Updated Aug 10, 2025
    • sichek

      Public
      Sichek is a tool for detecting and diagnosing node-level issues in AI environments, ensuring the reliability and high performance of GPU-intensive workloads. It proactively identifies hardware and software problems and triggers timely automated corrective actions, including task retries and operational maintenance.
      Go
      21103 Updated Aug 7, 2025
    • Collects IB counters in service
      Go
      1000 Updated Jul 30, 2025
    • arks

      Public
      Arks is a cloud-native inference framework running on Kubernetes
      Go
      44301 Updated Jul 29, 2025
    • netpulse

      Public
      API Server for Network Automation
      Python
      21000 Updated Jul 20, 2025
    • Ongoing research training transformer models at scale
      Python
      3k000 Updated Jun 14, 2025
    • Automatically sets the HPC NIC ring buffer to match the NIC spec
      Go
      0000 Updated May 19, 2025
    • MEAP

      Public
      Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
      Python
      33310 Updated May 17, 2025
    • Dockerfile for building OFED inside an image
      Dockerfile
      0000 Updated May 16, 2025
    • Checks RDMA status
      Go
      0000 Updated Apr 25, 2025
    • Launching, tuning and monitoring tools and scripts for benchmarking at Scitix
      Shell
      0000 Updated Mar 22, 2025
    • 0000 Updated Mar 11, 2025
    • NCCL Tests
      Cuda
      302000 Updated Feb 5, 2025
    • SiLLM

      Public
      Python
      0000 Updated Jan 14, 2025
    • nccl

      Public
      Optimized primitives for collective multi-GPU communication
      C++
      980000 Updated Jan 6, 2025
    • Workflow Engine for Kubernetes
      Go
      3.4k000 Updated Dec 13, 2024
    • sidiag

      Public
      Fast diagnosis and problem resolution for SiCL-based distributed jobs
      0000 Updated Dec 12, 2024
    • nccl-exts

      Public
      SiCL extensions for NCCL
      C++
      0200 Updated Dec 9, 2024