-
I think this is due to a combination of the following factors:
-
The Gemma models have a head size of 256. I am trying the Gemma-7B model: https://huggingface.co/google/gemma-7b
With an RTX 2060, enabling FA leads to a performance regression:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
build: 7736837 (4274)
@JohannesGaessler Is this expected? I thought that head sizes of 128 and 256 should have good FA performance.
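For reference, a comparison like this can be reproduced with llama-bench from the same build, running the model once with FA disabled and once with it enabled (a sketch only: the GGUF file name and the prompt/generation lengths below are placeholders):
# hypothetical reproduction: same model and GPU, FA off vs. FA on
./llama-bench -m gemma-7b.Q4_K_M.gguf -fa 0 -p 512 -n 128
./llama-bench -m gemma-7b.Q4_K_M.gguf -fa 1 -p 512 -n 128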