Replies: 6 comments 1 reply
-
Run log:
(base) root@lf:~/ktransformers# cat start.sh
(kt) root@lf:~/ktransformers# bash start.sh
-
Is this a DEBUG build you compiled?
-
Hello, when building kt on the Hygon DCU platform, did you ever hit the hipcc compiler error saying "neither the NVIDIA nor the AMD platform was specified"? Looking at the compile command, the flags do include "-D__HIP_PLATFORM_AMD__=1", yet the error claims that macro was never defined. Relevant log:
/opt/dtk/bin/hipcc -I/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include -I/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include/TH -I/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include/THC -I/root/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include/THH -I/opt/dtk/include -I/root/miniconda3/envs/kt/include/python3.11 -c ktransformers/ktransformers_ext/hip/custom_gguf/dequant.hip -o /tmp/tmpaup26x2z.build-temp/ktransformers/ktransformers_ext/hip/custom_gguf/dequant.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -use_fast_math -Xcompiler -fPIC -DKTRANSFORMERS_USE_CUDA -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1014" -DTORCH_EXTENSION_NAME=KTransformersOps -D_GLIBCXX_USE_CXX11_ABI=1 --offload-arch=gfx906 --offload-arch=gfx926 --offload-arch=gfx928 --offload-arch=gfx936 -fno-gpu-rdc -std=c++17
In file included from ktransformers/ktransformers_ext/hip/custom_gguf/dequant.hip:12:
/opt/dtk/include/hip/hip_runtime.h:66:2: error: ("Must define exactly one of HIP_PLATFORM_AMD or HIP_PLATFORM_NVIDIA");
-
I haven't run into that. Note that I built version 0.23post1; is the version you're building perhaps newer? After finding the performance unsatisfactory, I didn't pursue this hardware platform further.
-
7975wx 32C
8x DDR5-5200 64GB
1x DCU K100AI
Performance ~2 tps (for comparison, a 4090D reaches 10+)
(In reply to: "Did you build from the official repository or from the south-ocean repository? And what are your hardware configuration and model performance?")
-
NVIDIA 30-series and later GPUs have marlin kernel support; ROCm/HIP and DTK do not have marlin yet, so a direct comparison is not possible.
-
AMD Ryzen Threadripper PRO 7975WX 32-Cores
DDR5-5200 64GB*8
Hygon DCU K100AI 64GB @ 400W
It runs, but very slowly; hoping for optimizations. GPU utilization stays at 100%.
Here is my installation (tinkering) process, based overall on doc/en/rocm.md:
Fresh install of Ubuntu 22.04 Server.
Install cuda-tool-kit 11.7; this step cannot be skipped, and some guides seem to get it wrong.
Install dtk.
Install the vendor's prebuilt DCU packages for pytorch/torchvision/torchaudio.
Go through requirements.txt by hand; where the vendor provides prebuilt packages, prefer those.
If the build errors out, it is likely missing google gflags and google glog; try installing them.
Once the build finally completes, verify with pip show ktransformers.
Try running the DeepSeek-R1-Q4 model; it complains about some missing packages, so pip install openai pytest.
On Hygon DCU, use --optimize_config_path ktransformers/optimize/optimize_rules/rocm/DeepSeek-V3-Chat.yaml
You may hit this bug: https://github.com/kvcache-ai/ktransformers/issues/983; comment out that line.
Then it finally runs.