-
Try adding the `--keep 1` argument (see `llama-server --help`).
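For context, `--keep` sets how many tokens from the start of the prompt llama.cpp retains when the context window fills and old tokens are evicted, which matches the "forgets the system prompt" symptom in the original question below; `-1` keeps the entire initial prompt. A minimal sketch of the per-request equivalent, `n_keep`, on llama-server's `/completion` endpoint (the URL, prompt, and values here are assumptions):

```python
# Hedged sketch: pass n_keep per request to llama-server's /completion API.
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # assumed local llama-server address
    json={
        "prompt": "<start_of_turn>user\nknock knock<end_of_turn>\n"
                  "<start_of_turn>model\n",
        "n_predict": 128,
        # Number of prompt tokens to retain when the context size is
        # exceeded; -1 retains the entire initial prompt (e.g. to preserve
        # a system prompt across context shifts).
        "n_keep": -1,
    },
)
print(resp.json()["content"])
```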
-
There is a problem with the activation range for f16 quants in Gemma 3 which might be influencing this: https://www.unsloth.ai/blog/gemma3. One option is to move to bf16 or f32 for the activations.
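For intuition on that range issue, a generic (not Gemma-specific) illustration: float16 overflows to inf above ~65504, while bf16 shares float32's 8-bit exponent and keeps such values finite. numpy has no native bfloat16, so float32 stands in for the wider formats below:

```python
# Minimal illustration of the f16 activation-range problem.
import numpy as np

activation = np.float32(1.0e5)   # larger than float16's max of ~65504

print(np.float16(activation))    # inf  -- overflows in f16
print(np.float32(activation))    # 100000.0 -- fine in f32; bf16 would also
                                 # be fine, as it has float32's exponent range
```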
-
Hi,
I'm testing Gemma 3 27B via llama.cpp, but running into attention/context issues during chat. After a few turns, it seems to forget the system prompt and loses track of the conversation.
Performance-wise, llama.cpp is ~3x faster than Ollama, but Ollama handles context much better for Gemma 3.
I couldn't find any official chat template for Gemma 3, so I'm sending the conversation as a raw prompt, with messages formatted like this:
```
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>
```
Anyone found a better way to structure prompts or maintain context? Tips appreciated!
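Not an authoritative answer, but two things may help. First, the prompt sent for generation should end with an open `<start_of_turn>model` line so the model continues in its own turn, and since Gemma has no dedicated system role, a system prompt is conventionally folded into the first user turn. A minimal sketch of building the prompt this way (the helper and message list are illustrative):

```python
# Hedged sketch of hand-building a Gemma-style prompt. llama.cpp normally
# adds the BOS token during tokenization, so <bos> is omitted here.

def build_gemma_prompt(messages, system=None):
    """messages: list of {"role": "user" | "model", "content": str}."""
    parts = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if i == 0 and system and msg["role"] == "user":
            # Gemma has no system role: fold the system prompt into the
            # first user turn instead.
            content = f"{system}\n\n{content}"
        parts.append(f"<start_of_turn>{msg['role']}\n{content}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # leave the model's turn open
    return "".join(parts)

prompt = build_gemma_prompt(
    [
        {"role": "user", "content": "knock knock"},
        {"role": "model", "content": "who is there"},
        {"role": "user", "content": "Gemma"},
    ],
    system="You are a helpful assistant.",
)
print(prompt)
```

Second, llama-server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint that applies the chat template embedded in the GGUF (when one is present), which avoids hand-rolling the format and keeps turn boundaries consistent.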