PaddlePaddle
diff --git a/build.sh b/build.sh
Lines changed: 8 additions & 0 deletions
diff --git a/custom_ops/setup_ops.py b/custom_ops/setup_ops.py
Lines changed: 11 additions & 0 deletions
diff --git a/docs/get_started/installation/Enflame_gcu.md b/docs/get_started/installation/Enflame_gcu.md
Lines changed: 23 additions & 17 deletions
diff --git a/docs/zh/get_started/installation/Enflame_gcu.md b/docs/zh/get_started/installation/Enflame_gcu.md
Lines changed: 22 additions & 16 deletions
diff --git a/fastdeploy/model_executor/layers/activation.py b/fastdeploy/model_executor/layers/activation.py
Lines changed: 18 additions & 1 deletion
diff --git a/fastdeploy/model_executor/layers/backends/__init__.py b/fastdeploy/model_executor/layers/backends/__init__.py
Lines changed: 19 additions & 9 deletions
diff --git a/fastdeploy/model_executor/layers/backends/gcu/__init__.py b/fastdeploy/model_executor/layers/backends/gcu/__init__.py
Lines changed: 31 additions & 0 deletions
diff --git a/fastdeploy/model_executor/layers/backends/gcu/attention/__init__.py b/fastdeploy/model_executor/layers/backends/gcu/attention/__init__.py
Lines changed: 21 additions & 0 deletions
@@ -113,6 +113,14 @@ function copy_ops(){
       return
     fi
 
+    is_gcu=`$python -c "import paddle; print(paddle.is_compiled_with_custom_device('gcu'))"`
+    if [ "$is_gcu" = "True" ]; then
+      DEVICE_TYPE="gcu"
+      cp -r ${OPS_TMP_DIR}/${WHEEL_NAME}/* ../fastdeploy/model_executor/ops/gcu
+      echo -e "gcu ops have been copied to fastdeploy"
+      return
+    fi
+
     DEVICE_TYPE="cpu"
     cp -r ./${OPS_TMP_DIR_BASE}/${WHEEL_BASE_NAME}/* ../fastdeploy/model_executor/ops/base
     cd ../../../../
 
@@ -501,6 +501,17 @@ def find_end_files(directory, end_str):
             ],
         ),
     )
+elif paddle.is_compiled_with_custom_device("gcu"):
+    setup(
+        name="fastdeploy_ops",
+        ext_modules=CppExtension(
+            sources=[
+                "gpu_ops/save_with_output_msg.cc",
+                "gpu_ops/get_output.cc",
+                "gpu_ops/get_output_msg_with_topk.cc",
+            ]
+        ),
+    )
 else:
     use_bf16 = envs.FD_CPU_USE_BF16 == "True"
 
 
@@ -1,8 +1,8 @@
-# Running ERNIE-4.5-21B-A3B with FastDeploy
+# Running ERNIE 4.5 Series Models with FastDeploy
 
 The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.
 
-FastDeploy has deeply adapted and optimized the ernie-4_5-21b-a3b-bf16-paddle model for the Enflame S60, achieving a unified inference interface between GCU and GPU. This allows seamless migration of inference tasks without code modifications.
+FastDeploy has deeply adapted and optimized the ERNIE 4.5 Series Models for the Enflame S60, achieving a unified inference interface between GCU and GPU. This allows seamless migration of inference tasks without code modifications.
 
 ## 🚀 Quick Start 🚀
 
@@ -27,15 +27,15 @@ lspci | grep S60
 3b:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
 3c:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
 ```
-### 1. Environment Setup (Estimated time: 5–10 minutes)
+### 1. Environment Setup (Estimated time: 5-10 minutes)
 1. Pull the Docker image
 ```bash
 # Note: This image only contains the Paddle development environment, not precompiled PaddlePaddle packages
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.5.102-ubuntu20-x86_64-gcc84
 ```
 2. Start the container
 ```bash
-docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84 /bin/bash
+docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.5.102-ubuntu20-x86_64-gcc84 /bin/bash
 ```
 3. Obtain and install drivers<br/>
 **Full software packages are preloaded in the Docker container. Copy them to an external directory, e.g., ```/home/workspace/deps/```**
@@ -67,39 +67,45 @@ python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.c
 7. Install FastDeploy and dependencies
 ```bash
 python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
-apt install python3.10-distutils
+# Alternatively, to build from source, run the following steps
+git clone https://github.com/PaddlePaddle/FastDeploy
+cd FastDeploy
+python -m pip install -r requirements.txt --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
+bash build.sh 1
 ```
-### 2. Data Preparation (Estimated time: 2–5 minutes)
+### 2. Data Preparation (Estimated time: 2-5 minutes)
 Use a trained model for inference on the GSM8K dataset:
 ```bash
 mkdir -p /home/workspace/benchmark/ && cd /home/workspace/benchmark/
 wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
 ```
-Place model weights in a directory, e.g., ```/work/models/ernie-4_5-21b-a3b-bf16-paddle/```
-### 3. Inference (Estimated time: 2–5 minutes)
+Place model weights in a directory, e.g., ```/work/models/ERNIE-4.5-300B-A47B-Paddle/```
+### 3. Inference (Estimated time: 2-5 minutes)
 Start the inference service:
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
-    --model "/work/models/ernie-4_5-21b-a3b-bf16-paddle/" \
+    --model "/work/models/ERNIE-4.5-300B-A47B-Paddle/" \
     --port 8188 \
     --metrics-port 8200 \
-    --tensor-parallel-size 4 \
-    --max-model-len 8192 \
-    --num-gpu-blocks-override 1024
+    --tensor-parallel-size 8 \
+    --max-model-len 32768 \
+    --num-gpu-blocks-override 4096 \
+    --max-num-batched-tokens 32768 \
+    --quantization "wint4"
 ```
 Query the model service:
 ```bash
 curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
 -H "Content-Type: application/json" \
 -d '{
   "messages": [
-    {"role": "user", "content": "The largest ocean is"}
+    {"role": "user", "content": "Where is Beijing?"}
   ]
 }'
 ```
 Successful execution returns inference results, e.g.:
 ```json
-{"id":"chatcmpl-5cd96f3b-eff3-4dc0-8aa2-8b5d7b7b86f2","object":"chat.completion","created":1751167862,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"3. **Pacific Ocean**: The Pacific Ocean is the largest and deepest of the world's oceans. It covers an area of approximately 181,344,000 square kilometers, which is more than 30% of the Earth's surface. It is located between the Americas to the west and east, and Asia and Australia to the north and south. The Pacific Ocean is known for its vastness, diverse marine life, and numerous islands.\n\nIn summary, the largest ocean in the world is the Pacific Ocean.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":127,"completion_tokens":116,"prompt_tokens_details":{"cached_tokens":0}}}
+{"id":"chatcmpl-20f1210d-6943-4110-ad2d-c76ba11604ad","object":"chat.completion","created":1751621261,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"Beijing is the capital city of the People's Republic of China, located in the northern part of the country. It is situated in the North China Plain, bordered by the mountains to the west, north, and northeast. Beijing serves as China's political, cultural, and international exchange center, playing a crucial role in the nation's development and global interactions.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":88,"completion_tokens":77,"prompt_tokens_details":{"cached_tokens":0}}}
 ```
 ### 4. Accuracy Testing (Estimated time: 60–180 minutes)
 Place the accuracy script ```bench_gsm8k.py``` in ```/home/workspace/benchmark/``` and modify sampling parameters, e.g.:
@@ -120,10 +126,10 @@ data = {
 Run accuracy tests:
 ```bash
 cd /home/workspace/benchmark/
-python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 2
+python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 8
 ```
 Upon completion, accuracy results are saved in ```result.jsonl```, e.g.:
 ```json
-{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 365.548, "accuracy": 0.967, "num_requests": 30, "other": {"num_questions": 30, "parallel": 2}}
+{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 13446.01, "accuracy": 0.956, "num_requests": 1319, "other": {"num_questions": 1319, "parallel": 8}}
 ```
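Review note: the `curl` query in the doc above can also be issued from Python. The sketch below is a minimal illustration using only the standard library. The helper names `build_chat_request` and `extract_reply` are hypothetical, and the host `127.0.0.1` is an assumption; the port, path, and payload shape are taken from the doc.

```python
import json
import urllib.request


def build_chat_request(prompt: str, host: str = "127.0.0.1", port: int = 8188) -> urllib.request.Request:
    """Build the POST request that the curl example in the doc sends."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Return the assistant text from a chat.completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


request = build_chat_request("Where is Beijing?")
print(request.full_url)

# To actually send it (requires the service from step 3 to be running):
# with urllib.request.urlopen(request) as response:
#     print(extract_reply(response.read().decode("utf-8")))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs.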
 
@@ -1,8 +1,8 @@
-# Running the ERNIE-4.5-21B-A3B Model on Enflame S60 with FastDeploy
+# Running ERNIE 4.5 Series Models on Enflame S60 with FastDeploy
 
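 The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It serves large language models, search/advertising/recommendation systems, and traditional models, and offers broad model coverage, ease of use, and easy migration and deployment. It applies to mainstream inference scenarios such as image and text generation, search and recommendation, and text, image, and speech recognition.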
 
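-FastDeploy has deeply adapted and optimized the ernie-4_5-21b-a3b-bf16-paddle model for the Enflame S60, unifying the GCU inference entry point with the GPU one, so inference tasks can be migrated without modification.
+FastDeploy has deeply adapted and optimized the ERNIE 4.5 series models for the Enflame S60, unifying the GCU inference entry point with the GPU one, so inference tasks can be migrated without modification.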
 
 ## 🚀 Quick Start 🚀
 
@@ -30,11 +30,11 @@ lspci | grep S60
 1. Pull the image
 ```bash
 # Note: this image only contains the Paddle development environment; it does not include a precompiled PaddlePaddle package
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.5.102-ubuntu20-x86_64-gcc84
 ```
 2. Start the container with the following command
 ```bash
-docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84 /bin/bash
+docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.5.102-ubuntu20-x86_64-gcc84 /bin/bash
 ```
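 3. Obtain and install the driver<br/>
 **The full software packages are preloaded inside the Docker container and must be copied to a directory outside it, e.g. ```/home/workspace/deps/```**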
@@ -63,42 +63,48 @@ python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/p
 python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
 # To build and install from source, see https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
 ```
-7. Install FastDeploy and dependencies<br/>
+7. Install FastDeploy <br/>
 ```bash
 python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
-apt install python3.10-distutils
+# To build and install from source, follow these steps
+git clone https://github.com/PaddlePaddle/FastDeploy
+cd FastDeploy
+python -m pip install -r requirements.txt --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
+bash build.sh 1
 ```
 ### 2. Data Preparation (Estimated time: 2-5 minutes)
 Use a trained model to run inference on GSM8K
 ```bash
 mkdir -p /home/workspace/benchmark/ && cd /home/workspace/benchmark/
 wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
 ```
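-Prepare the model and weights and place them in a working directory, e.g. ```/work/models/ernie-4_5-21b-a3b-bf16-paddle/```
+Prepare the model and weights and place them in a working directory, e.g. ```/work/models/ERNIE-4.5-300B-A47B-Paddle/```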
 ### 3. Inference (Estimated time: 2-5 minutes)
 Run the following command to start the inference service
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
-    --model "/work/models/ernie-4_5-21b-a3b-bf16-paddle/" \
+    --model "/work/models/ERNIE-4.5-300B-A47B-Paddle/" \
     --port 8188 \
     --metrics-port 8200 \
-    --tensor-parallel-size 4 \
-    --max-model-len 8192 \
-    --num-gpu-blocks-override 1024
+    --tensor-parallel-size 8 \
+    --max-model-len 32768 \
+    --num-gpu-blocks-override 4096 \
+    --max-num-batched-tokens 32768 \
+    --quantization "wint4"
 ```
 Use the following command to query the model service
 ```bash
 curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
 -H "Content-Type: application/json" \
 -d '{
   "messages": [
-    {"role": "user", "content": "The largest ocean is"}
+    {"role": "user", "content": "Where is Beijing?"}
   ]
 }'
 ```
 After a successful run, the generated inference result is returned; a sample follows
 ```json
-{"id":"chatcmpl-5cd96f3b-eff3-4dc0-8aa2-8b5d7b7b86f2","object":"chat.completion","created":1751167862,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"3. **Pacific Ocean**: The Pacific Ocean is the largest and deepest of the world's oceans. It covers an area of approximately 181,344,000 square kilometers, which is more than 30% of the Earth's surface. It is located between the Americas to the west and east, and Asia and Australia to the north and south. The Pacific Ocean is known for its vastness, diverse marine life, and numerous islands.\n\nIn summary, the largest ocean in the world is the Pacific Ocean.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":127,"completion_tokens":116,"prompt_tokens_details":{"cached_tokens":0}}}
+{"id":"chatcmpl-20f1210d-6943-4110-ad2d-c76ba11604ad","object":"chat.completion","created":1751621261,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"Beijing is the capital city of the People's Republic of China, located in the northern part of the country. It is situated in the North China Plain, bordered by the mountains to the west, north, and northeast. Beijing serves as China's political, cultural, and international exchange center, playing a crucial role in the nation's development and global interactions.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":88,"completion_tokens":77,"prompt_tokens_details":{"cached_tokens":0}}}
 ```
 ### 4. Accuracy Testing (Estimated time: 60-180 minutes)
 Prepare the accuracy script ```bench_gsm8k.py```, place it in ```/home/workspace/benchmark/```, and modify the sampling parameters, e.g.:
@@ -119,10 +125,10 @@ data = {
 Run the following command to start the accuracy test
 ```bash
 cd /home/workspace/benchmark/
-python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 2
+python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 8
 ```
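-After a successful run, the accuracy result is generated in the current directory as ```result.jsonl```; a sample follows (partial dataset, for illustration only)
+After a successful run, the accuracy result is generated in the current directory as ```result.jsonl```; a sample follows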
 ```json
-{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 365.548, "accuracy": 0.967, "num_requests": 30, "other": {"num_questions": 30, "parallel": 2}}
+{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 13446.01, "accuracy": 0.956, "num_requests": 1319, "other": {"num_questions": 1319, "parallel": 8}}
 ```
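Review note: a line of `result.jsonl` is plain JSON, so it can be summarized with a few lines of Python. This is a minimal sketch, assuming that `latency` is the total wall-clock time of the run in seconds (the doc does not state the unit); the helper name `summarize` is hypothetical, and the sample line is copied from the updated doc.

```python
import json

# Sample line copied from the updated doc.
SAMPLE = (
    '{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, '
    '"latency": 13446.01, "accuracy": 0.956, "num_requests": 1319, '
    '"other": {"num_questions": 1319, "parallel": 8}}'
)


def summarize(line: str) -> dict:
    """Derive per-request and total-run figures from one result line."""
    record = json.loads(line)
    latency = record["latency"]  # assumed: total seconds for the whole run
    return {
        "accuracy": record["accuracy"],
        "seconds_per_request": latency / record["num_requests"],
        "minutes_total": latency / 60.0,
    }


summary = summarize(SAMPLE)
print(f"accuracy={summary['accuracy']:.3f}")
print(f"seconds_per_request={summary['seconds_per_request']:.3f}")
print(f"minutes_total={summary['minutes_total']:.1f}")
```

Under that assumption the sample run corresponds to roughly 224 minutes in total, which may be worth comparing against the time estimate given in the section heading.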
 
@@ -19,7 +19,7 @@
 
 import paddle
 from paddle import nn
-from paddle.incubate.nn.functional import fused_bias_act
+from paddle.incubate.nn.functional import fused_bias_act, swiglu
 
 from fastdeploy.config import FDConfig
 from fastdeploy.platforms import current_platform
@@ -66,6 +66,8 @@ def __init__(
         if current_platform.is_cuda() or current_platform.is_xpu(
         ) or current_platform.is_iluvatar():
             self.forward = self.forward_cuda
+        elif current_platform.is_gcu():
+            self.forward = self.forward_gcu
         else:
             raise NotImplementedError
 
@@ -123,3 +125,18 @@ def forward_cuda(self, x: paddle.Tensor) -> paddle.Tensor:
             quant_max_bound=self.quant_max_bound,
             quant_min_bound=self.quant_min_bound,
         )
+
+    def forward_gcu(self, x: paddle.Tensor) -> paddle.Tensor:
+        """
+        GCU forward pass: apply SwiGLU, then add the bias if one is set.
+
+        Args:
+            x (Tensor): Input tensor to the activation layer.
+
+        Returns:
+            Tensor: Output tensor.
+        """
+        out = swiglu(x)
+        if self.bias is not None:
+            out = out + self.bias
+        return out
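Review note: for readers unfamiliar with the single-argument form of `swiglu`, the sketch below shows the computation in plain Python on one row of values. It assumes the usual definition, in which the input is split in half along the last axis and the result is `silu(first_half) * second_half`. The names `swiglu_reference` and `forward_reference` are hypothetical and are not part of the PR.

```python
import math
from typing import List, Optional


def silu(value: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return value / (1.0 + math.exp(-value))


def swiglu_reference(row: List[float]) -> List[float]:
    """Split the row into two halves and gate the second half with SiLU of the first."""
    if len(row) % 2 != 0:
        raise ValueError("SwiGLU needs an even-sized last dimension")
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


def forward_reference(row: List[float], bias: Optional[List[float]] = None) -> List[float]:
    """Mirror the order used in forward_gcu: activation first, then optional bias."""
    out = swiglu_reference(row)
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


print(forward_reference([0.0, 1.0, 2.0, 3.0]))
print(forward_reference([0.0, 1.0, 2.0, 3.0], bias=[1.0, 1.0]))
```

Note that the output has half the width of the input, so in this sketch a bias must match the halved width.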
@@ -16,14 +16,24 @@
 all backends methods
 """
 
-from .xpu import *
-from .npu import *
+from fastdeploy.platforms import current_platform
 
 __all__ = []
-from . import npu
-if hasattr(npu, '__all__'):
-    __all__.extend(npu.__all__)
-    
-from . import xpu
-if hasattr(xpu, '__all__'):
-    __all__.extend(xpu.__all__)
+
+if current_platform.is_xpu():
+    from . import xpu
+    from .xpu import *
+    if hasattr(xpu, '__all__'):
+        __all__.extend(xpu.__all__)
+
+if current_platform.is_npu():
+    from . import npu
+    from .npu import *
+    if hasattr(npu, '__all__'):
+        __all__.extend(npu.__all__)
+
+if current_platform.is_gcu():
+    from . import gcu
+    from .gcu import *
+    if hasattr(gcu, '__all__'):
+        __all__.extend(gcu.__all__)
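Review note: the change above replaces unconditional backend imports with platform-gated ones, so only the active platform's backend is imported and re-exported. The sketch below illustrates that pattern in isolation with stand-in modules. All helper names (`make_backend`, `collect_exports`, `loader`) and the `XPUExampleBackend` / `NPUExampleBackend` symbols are hypothetical; only the two GCU names come from the diff.

```python
from types import ModuleType
from typing import Callable, Dict, List


def make_backend(name: str, exports: List[str]) -> ModuleType:
    """Build a stand-in backend module that declares `__all__`."""
    module = ModuleType(name)
    for symbol in exports:
        setattr(module, symbol, type(symbol, (), {}))
    module.__all__ = list(exports)
    return module


def collect_exports(active: str, loaders: Dict[str, Callable[[], ModuleType]]) -> List[str]:
    """Load only the backend for the active platform and return its public names."""
    exported: List[str] = []
    for platform, load in loaders.items():
        if platform != active:
            continue  # inactive backends are never loaded
        module = load()
        if hasattr(module, "__all__"):
            exported.extend(module.__all__)
    return exported


loaded: List[str] = []  # records which stand-in backends were actually loaded


def loader(name: str, exports: List[str]) -> Callable[[], ModuleType]:
    """Return a lazy loader so that loading can be observed."""
    def load() -> ModuleType:
        loaded.append(name)
        return make_backend(name, exports)
    return load


loaders = {
    "xpu": loader("xpu", ["XPUExampleBackend"]),
    "npu": loader("npu", ["NPUExampleBackend"]),
    "gcu": loader("gcu", ["GCUFlashAttnBackend", "GCUMemEfficientAttnBackend"]),
}

exports = collect_exports("gcu", loaders)
print(exports)  # ['GCUFlashAttnBackend', 'GCUMemEfficientAttnBackend']
print(loaded)   # ['gcu']
```

The practical benefit is that a backend whose device-specific dependencies are not installed cannot break import on another platform.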
@@ -0,0 +1,31 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+gcu backend methods
+"""
+
+from .attention.flash_attn_backend import GCUFlashAttnBackend
+from .attention.mem_efficient_attn_backend import GCUMemEfficientAttnBackend
+from .moe.fused_moe_method_gcu_backend import (GCUFusedMoeMethod,
+                                               GCUWeightOnlyMoEMethod)
+from .quantization.weight_only import GCUWeightOnlyLinearMethod
+
+__all__ = [
+    'GCUFlashAttnBackend',
+    'GCUMemEfficientAttnBackend',
+    'GCUFusedMoeMethod',
+    'GCUWeightOnlyMoEMethod',
+    'GCUWeightOnlyLinearMethod',
+]
@@ -0,0 +1,21 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .flash_attn_backend import GCUFlashAttnBackend
+from .mem_efficient_attn_backend import GCUMemEfficientAttnBackend
+
+__all__ = [
+    "GCUFlashAttnBackend",
+    "GCUMemEfficientAttnBackend",
+]