-
I think this is due to a combination of the following factors:
-
The Gemma models have a head size of 256. I am trying the Gemma-7B model: https://huggingface.co/google/gemma-7b
With an RTX 2060, enabling FA leads to a performance regression:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
build: 7736837 (4274)
@JohannesGaessler Is this expected? I thought that head sizes of 128 and 256 should have good FA performance.
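For reference, a comparison like this can be reproduced with llama-bench from the same build, running the model once with FA disabled and once with it enabled (a sketch only: the GGUF file name and the prompt/generation lengths below are placeholders):
# hypothetical reproduction: same model and GPU, FA off vs. FA on
./llama-bench -m gemma-7b.Q4_K_M.gguf -fa 0 -p 512 -n 128
./llama-bench -m gemma-7b.Q4_K_M.gguf -fa 1 -p 512 -n 128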