GLM4.5v fails when deployed across 4 GPUs with 48 GB each #154

### System Info

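I hit an error deploying GLM4.5v-fp8 with tp=4 on RTX 5880 GPUs. In principle, VRAM should be sufficient for loading the weights and for the KV cache, but this looks like an MoE issue? How can I resolve it, or do I have to wait for an official update or optimization?

Without `--disable-cuda-graph`, the error occurs after the weights are loaded;
with `--disable-cuda-graph`, the weights load successfully, but the error occurs during inference.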
```
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 147456, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
```
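
To make the numbers in the error easier to reason about, here is a rough back-of-the-envelope sketch of how tile sizes and `num_stages` drive shared-memory demand in a software-pipelined tiled matmul. It is a simplified model, not Triton's actual allocator, and the tile sizes below are hypothetical: they were picked only because they happen to reproduce the 147456-byte figure, and nothing in this issue confirms they are what the failing kernel uses. The function names are illustrative as well.

```python
# Rough shared-memory estimate for a software-pipelined tiled matmul kernel.
# Simplified model (an assumption): each pipeline stage buffers one A tile
# and one B tile in shared memory. Real kernels may allocate more or less.

HARDWARE_LIMIT = 101376  # bytes, "Hardware limit" from the error message
REQUIRED = 147456        # bytes, "Required" from the error message


def estimate_shared_memory(block_m, block_n, block_k, elem_bytes, num_stages):
    """Bytes for num_stages copies of an A tile (M x K) and a B tile (K x N)."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return per_stage * num_stages


def max_stages_that_fit(block_m, block_n, block_k, elem_bytes, limit=HARDWARE_LIMIT):
    """Largest num_stages whose estimate stays within the limit."""
    per_stage = estimate_shared_memory(block_m, block_n, block_k, elem_bytes, 1)
    return limit // per_stage


# Hypothetical config: 16x128x128 tiles with 2-byte elements.
config = dict(block_m=16, block_n=128, block_k=128, elem_bytes=2)

for stages in (4, 3, 2):
    need = estimate_shared_memory(num_stages=stages, **config)
    status = "fits" if need <= HARDWARE_LIMIT else "exceeds limit"
    print(f"num_stages={stages}: {need} bytes ({status})")

print("max stages that fit:", max_stages_that_fit(**config))
```

Under this model, the hypothetical config needs 147456 bytes at four stages and only drops below the 101376-byte limit at two stages, which is consistent with the error's hint that reducing block sizes or `num_stages` may help.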

[sglang issue 9393](https://github.com/sgl-project/sglang/issues/9393)

### Who can help?

@zRzRzRzRzRzRzR 

### Information

- [ ] The official example scripts
- [ ] My own modified scripts and tasks

### Reproduction

*

### Expected behavior

The model should run inference normally.
