Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs #14741

ORippler · 2025-07-17T16:29:52Z

Gemma3n uses matrix-matrix addition as part of project_per_layer_input, which erroneously disables CUDA_GRAPH on NVGPUs even when a batch size of 1 is used. This PR fixes that, while still detecting batched execution for graphs with > 1 GGML_OP_ADD node.
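
To illustrate the idea, here is a minimal sketch of the detection rule as I read it from the description above. It is not the actual ggml-cuda code: `Op`, `Node`, `looks_batched` and `use_cuda_graph` are hypothetical stand-ins, and `src1_rows` is assumed to play the role of the second operand's row count. The assumption is that a single matrix-matrix ADD (as in Gemma3n's per-layer input projection) should no longer disable graph capture, while more than one such ADD is still treated as batched execution.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for graph nodes; these are NOT the real ggml types.
enum class Op { Add, MulMat, Other };

struct Node {
    Op      op;
    int64_t src1_rows; // assumed: number of rows of the node's second operand
};

// Assumed rule: the graph is considered batched only if MORE than one ADD node
// has a second operand with more than one row.
bool looks_batched(const std::vector<Node>& graph) {
    int batched_adds = 0;
    for (const Node& n : graph) {
        if (n.op == Op::Add && n.src1_rows > 1) {
            ++batched_adds;
        }
    }
    return batched_adds > 1;
}

// Graph capture stays enabled unless the graph looks batched.
bool use_cuda_graph(const std::vector<Node>& graph) {
    return !looks_batched(graph);
}
```

Under this sketch, a graph with exactly one matrix-matrix ADD keeps CUDA Graphs enabled, whereas a graph with two or more falls back to regular execution.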

Perf before:

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |    pp1000+tg200 |         47.86 ± 1.27 |

Perf after:

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |    pp1000+tg200 |        133.08 ± 0.23 |

In the long run, I feel we should either fully support batched inference with CUDA Graphs or refactor the way batch sizes are detected (maybe moving ownership elsewhere?), but I'm still too unfamiliar with the code base to make suggestions here.

Thoughts?

slaren · 2025-07-17T18:00:34Z

In the long run, the solution will be to move the implementation to the graph plan API; the heuristics that determine whether the graph should be captured will then be removed. I cannot tell if this workaround will break something else.

Author : Olivier Simons

Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

8e35380

Gemma3n uses Matrix-Matrix addition as part of their input processing, wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size of 1 is used.

github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Jul 17, 2025

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 17, 2025

Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs ggml-org#14741

d781bed

Author : Olivier Simons
