
Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs #14741

Status: Open — wants to merge 1 commit into base: master

Conversation

ORippler commented:
Gemma3n uses Matrix-Matrix addition as part of project_per_layer_input, erroneously triggering CUDA_GRAPH disablement on NVGPUs even when a batch-size of 1 is used. This PR fixes this issue, while still detecting batched execution for graphs with > 1 GGML_OP_ADD node.

Perf before:

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |    pp1000+tg200 |         47.86 ± 1.27 |

Perf after:

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |    pp1000+tg200 |        133.08 ± 0.23 |

In the long run, I feel we should either fully support batched inference with CUDA Graphs or refactor the way batch sizes are detected (maybe moving ownership elsewhere?), but I'm still too unfamiliar with the code base to make suggestions here.

Thoughts?

Gemma3n uses matrix-matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch
size of 1 is used.
@github-actions bot added labels: "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) — Jul 17, 2025

slaren (Member) commented Jul 17, 2025:

In the long run the solution will be to move the implementation to the graph plan API, then the heuristics to determine if the graph should be captured or not will be removed. I cannot tell if this workaround will break something else.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 17, 2025