
RoCE Networking‐Based LLM Post‐Training Solution with Intel Enterprise AI Foundation for OpenShift

Martin Xu
martin.xu.nl@gmail.com

Introduction

OpenAI’s GPT-3 marked the dawn of a new era, proving that AI could profoundly transform human life. Built on the Transformer architecture introduced in the seminal paper, 'Attention Is All You Need,' GPT-3 demonstrated the unprecedented potential of scaling laws and illuminated the path for generative AI (GenAI) and artificial general intelligence (AGI).

This field is still evolving rapidly, with research papers emerging almost weekly that introduce innovative approaches such as Mixture of Experts (MoE), GRPO-based reinforcement learning, and Multi-head Latent Attention (MLA). Industry adoption follows swiftly, as seen in the proliferation of both open-source models (e.g., Llama 4, DeepSeek-R1, Qwen3) and proprietary systems (e.g., GPT-4.1, Gemini 2.5, Grok-3, Claude 4).

However, pre-training these models on homogenized public Internet data with massive computing resources has hit a performance bottleneck, causing them to exhibit strikingly similar behaviors and limiting further breakthroughs. See Will LLMs Scaling Hit the Wall.

Having worked in the AI field for years, we boldly envision that the next wave of innovation will leverage private, domain-specific data for post-training and fine-tuning of foundation models. By injecting industry-specific knowledge through cutting-edge supervised fine-tuning (SFT) and reinforcement learning (RL) algorithms, these models can achieve enhanced reasoning performance. In addition, Mixture of Experts (MoE) technology will augment transformer models, enabling them to specialize in narrow domains (e.g., healthcare diagnostics or legal contract analysis). Combined with widely adopted knowledge distillation techniques, this approach will yield efficient, compact enterprise models capable of fast inference even in resource-constrained environments. Microsoft's Phi-3 small language models and DeepSeek's distilled reasoning models support this vision.

To fulfill this vision, this paper introduces an efficient, affordable, scalable, and production-grade enterprise training solution based on Intel AI hardware and software technology, seamlessly integrated with the Red Hat AI platform.

Distributed Training & AI Network

To efficiently post-train a Large Language Model (LLM), just as with pre-training, we must balance computation, communication, and memory through distributed parallelism algorithms.

As training clusters rapidly scale up to accommodate growing model sizes, various parallelism strategies, such as data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP), and context parallelism (CP), have been developed, alongside optimizations like DeepSpeed ZeRO and PyTorch FSDP. These techniques significantly improve training efficiency by maximizing the utilization of expensive hardware resources.
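As a minimal illustration of how these techniques appear in code, the sketch below wraps a PyTorch model with FSDP so that parameters, gradients, and optimizer state are sharded across data-parallel workers; the model, optimizer, and process-group setup are assumed to be provided by the surrounding training script.

```python
# Minimal FSDP sketch (illustrative): shard parameters, gradients, and optimizer
# state across the data-parallel workers in the current process group.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_fsdp_model(model: torch.nn.Module) -> FSDP:
    # Assumes torch.distributed has already been initialized (e.g., via torchrun).
    assert dist.is_initialized()
    return FSDP(model)  # each rank now holds only a shard of the full parameters

def train_step(fsdp_model, batch, optimizer, loss_fn):
    # A training step looks like ordinary PyTorch; FSDP gathers and reshards
    # parameters under the hood and reduce-scatters gradients across ranks.
    optimizer.zero_grad()
    loss = loss_fn(fsdp_model(batch["input"]), batch["target"])
    loss.backward()
    optimizer.step()  # each rank updates only its own parameter shard
    return loss.detach()
```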

Nowadays, these distributed training technologies are widely used not only for pre-training but also for post-training tasks, such as fine-tuning models on specialized data and reinforcement learning (RL) to enhance reasoning performance.

All of these algorithms rely on collective communication algorithms, which are supported by the underlying AI network. Thus, a reliable, low-latency, and high-throughput network, scaling both intra-node (scale-up) and inter-node (scale-out), is critical to the overall post-training process. See Demystifying NCCL.
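To make the dependency on collectives concrete, the hedged sketch below averages a local gradient tensor across all ranks with torch.distributed; the backend name and the CCL_BACKEND environment variable are assumptions (the backend would typically be "hccl" on Gaudi, "nccl" on NVIDIA GPUs, or "gloo" on CPU), and the script is expected to be launched with a tool such as torchrun.

```python
# Minimal collective-communication sketch: average a tensor across all ranks.
# Launch with a tool such as torchrun so that rank/world-size env vars are set.
import os
import torch
import torch.distributed as dist

def init_and_allreduce():
    # Backend is an assumption: "hccl" on Gaudi, "nccl" on NVIDIA GPUs, "gloo" on CPU.
    dist.init_process_group(backend=os.environ.get("CCL_BACKEND", "gloo"))
    grad = torch.ones(4) * dist.get_rank()       # stand-in for a local gradient shard
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # summed over the AI network
    grad /= dist.get_world_size()                # data-parallel gradient averaging
    return grad
```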

The major AI network technologies include RoCE (RDMA over Converged Ethernet) and InfiniBand.

RoCE leverages existing Ethernet fabric and switches as the physical layer, with RoCEv2 building connectivity over traditional UDP/IP (network and transport layers). This cost-effective, reliable approach is well suited to post-training in the enterprise AI space.

Notably, Meta uses RoCE to train its Llama models.

Intel AI Technologies for Distributed Training

Intel delivers a full suite of cost-effective, scalable and high-performance computing, networking, and memory solutions optimized for distributed parallel training in enterprise AI environments.

Intel® Gaudi® AI Accelerators, tailored for distributed parallel training workloads, feature a heterogeneous architecture comprising a cluster of fully programmable VLIW SIMD Tensor Processing Cores (TPCs) and a configurable Matrix Math Engine, along with onboard HBM2E memory and local SRAM.

As the first AI accelerator with RoCE-v2 engines built directly on-chip, Gaudi delivers unified scaling via standard Ethernet switches. Its three-tier network topology supports flexible cluster scaling, from 2 worker nodes (16 accelerators) up to 512 nodes (4,096 accelerators), addressing diverse workload requirements.

These Gaudi computing and networking resources are provisioned on OpenShift/Kubernetes using Operator technology. The parallel training algorithms, together with the Collective Communication Library (CCL), are integrated into the Gaudi software suite, which is seamlessly provisioned with Red Hat OpenShift AI.
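As a rough illustration of how a workload consumes the provisioned accelerators, the sketch below creates a pod that requests Gaudi devices through the Kubernetes Python client; the container image name is a placeholder, and the habana.ai/gaudi resource key follows the usual Habana device-plugin convention and should be verified against your cluster.

```python
# Illustrative only: create a pod that requests 8 Gaudi accelerators.
# The image name is a placeholder; "habana.ai/gaudi" is the resource key
# commonly exposed by the Habana device plugin, but verify it on your cluster.
from kubernetes import client, config

def launch_training_pod(namespace: str = "ai-training"):
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gaudi-train"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="example.com/intel-ai-sw-notebook:latest",  # placeholder image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"habana.ai/gaudi": "8"},  # 8 accelerators on one node
                ),
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)
```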

Now let’s describe the scalable and production-grade enterprise training solution in detail.

Intel Enterprise AI Foundation for OpenShift Post-Training Solution

This end-to-end solution, comprising the Infrastructure, AI Computing and Networking Provisioning with OpenShift, and Intel Distributed Training Software with OpenShift AI, is illustrated in the diagram below.

Fig 1. Intel Enterprise AI Foundation for OpenShift Training Solution

Infrastructure

The solution features two independent networks:

  • Host Networking – The OpenShift/Kubernetes primary network, defined by the Kubernetes network model and managed by Container Network Interface (CNI) plugins (e.g., OVN-Kubernetes, used by OpenShift). It is the traditional network that connects the nodes to the datacenter fabric for management, orchestration, and storage traffic such as data ingestion, checkpointing, and logging. For conciseness, host networking is not shown in the diagram in Fig. 1; please refer to the OpenShift documentation for details on host networking.

  • AI Networking – A secondary network configured and managed by the network operator. It is the fabric dedicated to AI computing, based on distributed inferencing and parallel training algorithms. The Gaudi RoCE-v2 network is used for this AI networking.

RoCE-v2 establishes connectivity over traditional UDP/IP at the transport and network (L3) layers, while leveraging standard Ethernet fabric and switches for the physical (L1) and data link (L2) layers. With RoCE-v2 natively embedded in the Gaudi processor, the networking solution is highly cost-efficient, reliable, low-latency, and easy to manage.

The training solution diagram shown in Fig. 1 is inspired by Accelerate model training on OpenShift AI with NVIDIA GPUDirect RDMA. Compared to RDMA scale-out over host NICs, Gaudi's native built-in Ethernet ports simplify the solution by eliminating the dependency on third-party Ethernet NICs and removing one data-copy hop between accelerators and Ethernet NICs, which lowers latency and improves throughput.

Note: While scale-out via host NIC is also supported, this solution focuses on scale-out using Gaudi’s built-in Ethernet ports.

In addition, the built-in Ethernet ports simplify network configuration compared to host NIC-based solutions, enabling the use of a lightweight network operator.

Each Gaudi accelerator features 21 Ethernet ports for scale-up all-to-all connectivity within a node, plus 3 dedicated Ethernet ports for scale-out connectivity between nodes. Per Intel's Gaudi network configuration, an 8-accelerator Gaudi3 node delivers 1,200 GB/s of bidirectional scale-out bandwidth, and each accelerator delivers 1,050 GB/s of bidirectional scale-up bandwidth.
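These figures are consistent with 200 Gb/s Ethernet ports counted bidirectionally, as the back-of-the-envelope check below shows; the per-port line rate is an assumption based on published Gaudi 3 specifications.

```python
# Back-of-the-envelope check of the quoted bandwidth figures,
# assuming 200 Gb/s Ethernet ports counted bidirectionally.
PORT_GBPS = 200              # assumed per-port line rate in Gb/s (Gaudi 3 uses 200 GbE ports)
ACCELERATORS_PER_NODE = 8

# Scale-out: 3 ports per accelerator, summed across the node, both directions.
scale_out_gb_per_s = 3 * ACCELERATORS_PER_NODE * PORT_GBPS * 2 / 8  # Gb -> GB
print(scale_out_gb_per_s)    # -> 1200.0 GB/s per node

# Scale-up: 21 ports per accelerator, both directions (per accelerator).
scale_up_gb_per_s = 21 * PORT_GBPS * 2 / 8
print(scale_up_gb_per_s)     # -> 1050.0 GB/s per accelerator
```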

For enterprise AI use cases, the leaf-switch tier of the three-tier topology normally meets user requirements. For even larger clusters, spine switches can be added to scale the system up to 512 nodes (4,096 Gaudi accelerators), accommodating diverse workload demands.

A three-ply network configuration is shown in the diagram below.

Fig. 2 Three-ply Gaudi RoCE-v2 Network Topology

Red Hat OpenShift Container Platform is a production-grade Kubernetes solution widely adopted by Cloud Service Providers (CSPs) and enterprise users. In this solution, high availability is typically achieved by using three Intel Xeon CPU-based nodes as control plane nodes. For OpenShift provisioning and host networking configuration, please refer to the OpenShift Container Platform documentation.

AI Computing and Networking Provisioning with OpenShift

Provisioning AI computing and networking resources on a scalable OpenShift cluster while ensuring the manageability of the infrastructure and platforms presents significant challenges. To address this, the general Operator concept has been proposed and implemented.

Instead of relying on a single monolithic operator to handle everything, the operator best practice of "do one thing and do it well" is applied. This industry-leading approach significantly simplifies both operator development and the AI resource provisioning process.

In the future, a Converged AI Operator will simplify the use of the general operators and provision AI features through a unified single entry point, a Custom Resource Definition (CRD). Users can select from a configurable, extensible portfolio of general operators tailored to specific AI features. For details, see AI Accelerators & Network Provisioning.

The RoCE-v2-based L3 networking layer is configured and managed by the Intel Network Operator. Thanks to the built-in Ethernet ports, the Gaudi networking configuration is much more straightforward. The operator leverages LLDP (Link Layer Discovery Protocol) to acquire network configurations from the connected switches, then configures the scale-out Ethernet ports with IP addresses and routes within the OpenShift pod. This enables the Collective Communication Library (HCCL/oneCCL)-based distributed parallel training algorithms in the container image (Intel AI SW Notebook Container) to run seamlessly.
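Purely as an illustration of the result, a training job could verify before launch that the operator has assigned addresses to the scale-out interfaces inside the pod; the snippet below does this with iproute2's JSON output on modern Linux, and the interface-name prefix is a placeholder.

```python
# Illustrative preflight check inside a training pod: confirm the scale-out
# interfaces have been assigned addresses (the interface-name prefix is a placeholder).
import json
import subprocess

def scale_out_interfaces(prefix: str = "eth"):
    # `ip -j addr` emits the interface/address table as JSON on modern Linux.
    data = json.loads(subprocess.check_output(["ip", "-j", "addr"]))
    found = {}
    for iface in data:
        if iface["ifname"].startswith(prefix):
            addrs = [a["local"] for a in iface.get("addr_info", [])
                     if a.get("family") == "inet"]
            if addrs:
                found[iface["ifname"]] = addrs
    return found

if __name__ == "__main__":
    print(scale_out_interfaces())
```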

The IP subnets of the switches must be carefully configured to avoid conflicts with the host networking.

Intel Distributed Training Software with OpenShift AI

The Habana Collective Communications Library (HCCL) is Gaudi's emulation layer for the NVIDIA Collective Communication Library (NCCL) and is included in the Intel® Gaudi® software suite.

The Gaudi software suite is rapidly evolving, with support for Fully Sharded Data Parallel (FSDP) and DeepSpeed alongside HCCL for distributed parallel training with PyTorch and Hugging Face Transformers.
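A minimal sketch of what this looks like from a training script is shown below, following the commonly documented Gaudi PyTorch pattern in which importing the HCCL bindings registers an "hccl" backend for torch.distributed; exact module paths and device names may vary across Gaudi software releases.

```python
# Sketch of initializing PyTorch distributed training with the HCCL backend on
# Gaudi. Module paths and the "hpu" device name may differ between releases.
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

def init_distributed():
    dist.init_process_group(backend="hccl")  # collectives now run over HCCL/RoCE-v2
    device = torch.device("hpu")             # Gaudi devices are exposed as "hpu"
    return device, dist.get_rank(), dist.get_world_size()
```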

This distributed training portfolio, together with the graph compiler, runtime, and optimized kernel libraries, is packaged into the Intel AI SW Notebook container image, which is seamlessly integrated into OpenShift AI by the Intel Gaudi AI SW Tools Operator.

For details on fine-tuning Large Language Models (LLMs) using this solution, refer to the guide: Fine-tune LLMs with Kubeflow Trainer on OpenShift AI.
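For orientation only, the sketch below shows roughly how such a fine-tuning job might be submitted from Python; it uses the classic kubeflow.training client, so the class and parameter names are assumptions that should be checked against the Kubeflow Trainer version used in the referenced guide.

```python
# Rough sketch (names are assumptions): submit a distributed fine-tuning job
# from Python with the classic kubeflow.training SDK. The referenced guide may
# use a newer Kubeflow Trainer API with different class and parameter names.
from kubernetes import client as k8s
from kubeflow.training import TrainingClient

def fine_tune():
    # Runs inside each worker pod; real code would load the model and dataset,
    # wrap the model with FSDP or DeepSpeed, and train over the RoCE-v2 network.
    print("fine-tuning step placeholder")

TrainingClient().create_job(
    name="llm-finetune",
    train_func=fine_tune,
    num_workers=2,  # two Gaudi worker nodes
    resources_per_worker=k8s.V1ResourceRequirements(
        limits={"habana.ai/gaudi": "8"},  # resource key follows the Habana device plugin
    ),
)
```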

Work ongoing

Cluster and Network manageability for performance and availability

In enterprise AI environments, the following tools must be integrated into the solution to help users triage and diagnose networking and cluster issues. These tools are based on HCCL (Gaudi's NCCL-compatible library) and run on the Gaudi RoCE-v2 network:

(Figures omitted: bisection bandwidth testing on a single leaf switch; on leaf and spine switches; across all nodes with leaf and spine switches; across all Gaudis with leaf and spine switches.)
Fig. 3 Bisection Collective Communication Performance Test on the Gaudi RoCE-v2 Network

(Figures omitted: congestion test on a single leaf switch; congestion test on leaf and spine switches.)
Fig. 4 Congestion Test on the Gaudi RoCE-v2 Network
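As a rough sketch of the kind of check behind the bisection tests in Fig. 3, the snippet below times a large all-reduce across ranks placed on opposite sides of the leaf or spine switches and converts it to an effective bus bandwidth; the backend, payload size, and device placement are assumptions.

```python
# Rough bisection-style bandwidth probe: time a large all-reduce across ranks
# placed on opposite sides of the leaf/spine switches and report the effective
# bus bandwidth. Backend, payload size, and device placement are assumptions.
import time
import torch
import torch.distributed as dist

def allreduce_bus_bandwidth(num_elems: int = 256 * 1024 * 1024, iters: int = 10) -> float:
    # ~1 GiB fp32 payload; move to the accelerator device (e.g., torch.device("hpu"))
    # for a meaningful measurement over the RoCE-v2 network.
    buf = torch.ones(num_elems, dtype=torch.float32)
    dist.barrier()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    dist.barrier()
    elapsed = (time.perf_counter() - start) / iters
    n = dist.get_world_size()
    bytes_moved = buf.numel() * buf.element_size()
    # Standard ring all-reduce bus-bandwidth estimate: 2 * (n - 1) / n * size / time.
    return 2 * (n - 1) / n * bytes_moved / elapsed / 1e9  # GB/s
```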

HCCL tuning and optimizing

To improve network performance, a great deal of research has been done, and all of it points to HCCL optimization and tuning. The ongoing work is summarized below:

  • HCCL testing and optimization for the three-ply topology (planned)
  • The HCCL CTS mechanism to avoid congestion
  • The HCCL load-balancing enhancement with QP for E-ECMP
  • Other optimizations and tuning, such as the multi-channel and multi-slot mechanism for overlapping computation and communication
  • Other optimizations, such as the level switch optimization

Key post-training workloads

The field of distributed parallel training in Enterprise AI is evolving rapidly. The following workloads require further research as part of an end-to-end solution:

1. Mixture of Experts (MoE): With MoE adopted in Llama 4, it has become the dominant architecture for Transformer models. Our plan includes leveraging MoE techniques to build domain-specific, cost-efficient small models tailored for enterprise AI applications; a minimal MoE routing sketch follows this list.

2. Reinforcement Learning (RL): OpenAI has demonstrated that RL is a critical step in enhancing reasoning performance for large language models (LLMs); see Training language models to follow instructions with human feedback. GRPO-based reinforcement learning, validated by DeepSeek in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, shows that RL can be executed with reduced compute and networking resources. This makes RL feasible for enterprise AI adoption. R&D efforts are underway to develop a post-training solution supporting this cost-effective RL algorithm; a sketch of GRPO's group-relative advantage computation is also shown below.
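To ground the MoE item above, here is a minimal, illustrative token-level MoE layer with top-1 routing; it deliberately omits load-balancing losses, capacity limits, and expert parallelism.

```python
# Minimal token-level MoE layer with top-1 routing (illustrative only: no
# load-balancing loss, capacity limits, or expert parallelism).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        gate = self.router(x).softmax(dim=-1)             # routing probabilities
        expert_idx = gate.argmax(dim=-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out
```

And to illustrate why GRPO lowers resource requirements for RL, the sketch below computes group-relative advantages: rewards for a group of sampled completions of the same prompt are normalized within the group, which removes the need for a separate critic model (a simplified reading of the DeepSeekMath formulation).

```python
# Sketch of GRPO's group-relative advantage: for each prompt, sample a group of
# completions, score them, and normalize rewards within the group. This replaces
# the learned critic used by PPO-style RLHF (simplified from DeepSeekMath).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: [num_prompts, group_size] scalar rewards for each sampled completion
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # advantage of each completion within its group

# Example: one prompt, four sampled completions scored by a reward model.
adv = group_relative_advantages(torch.tensor([[0.2, 0.9, 0.4, 0.7]]))
```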

Conclusion

The RoCE Networking-Based LLM Post-Training Solution with Intel Enterprise AI Foundation for OpenShift delivers an end-to-end solution for large language model (LLM) post-training workloads in the enterprise AI space. By harnessing Intel's advanced hardware, such as Gaudi accelerators with their built-in RoCE engines, it provides a cost-efficient, scalable, and high-performance solution.

The seamless integration with Red Hat OpenShift and OpenShift AI provides a production-grade platform for deploying and managing AI workloads, offering enterprises a robust, flexible, and user-friendly environment.