-
Try adding the `--keep 1` argument (see `llama-server --help`).
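For context, `--keep` sets how many tokens from the start of the prompt llama.cpp retains when the context window fills and old tokens are evicted, which matches the "forgets the system prompt" symptom in the original question below; `-1` keeps the entire initial prompt. A minimal sketch of the per-request equivalent, `n_keep`, on llama-server's `/completion` endpoint (the URL, prompt, and values here are assumptions):

```python
# Hedged sketch: pass n_keep per request to llama-server's /completion API.
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # assumed local llama-server address
    json={
        "prompt": "<start_of_turn>user\nknock knock<end_of_turn>\n"
                  "<start_of_turn>model\n",
        "n_predict": 128,
        # Number of prompt tokens to retain when the context size is
        # exceeded; -1 retains the entire initial prompt (e.g. to preserve
        # a system prompt across context shifts).
        "n_keep": -1,
    },
)
print(resp.json()["content"])
```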
-
There is a problem with the activation range for f16 quants in Gemma 3 which might be influencing this: https://www.unsloth.ai/blog/gemma3. One option is to move to bf16 or f32 for the activations.
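For intuition on that range issue, a generic (not Gemma-specific) illustration: float16 overflows to inf above ~65504, while bf16 shares float32's 8-bit exponent and keeps such values finite. numpy has no native bfloat16, so float32 stands in for the wider formats below:

```python
# Minimal illustration of the f16 activation-range problem.
import numpy as np

activation = np.float32(1.0e5)   # larger than float16's max of ~65504

print(np.float16(activation))    # inf  -- overflows in f16
print(np.float32(activation))    # 100000.0 -- fine in f32; bf16 would also
                                 # be fine, as it has float32's exponent range
```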
-
Hi,
I'm testing Gemma 3 27B via llama.cpp, but running into attention/context issues during chat. After a few turns, it seems to forget the system prompt and loses track of the conversation.
Performance-wise, llama.cpp is ~3x faster than Ollama, but Ollama handles context much better for Gemma 3.
I couldn't find any official chat template for Gemma 3, so I'm sending the conversation as a raw prompt, with messages formatted like this:
```
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>
```
Anyone found a better way to structure prompts or maintain context? Tips appreciated!
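Not an authoritative answer, but two things may help. First, the prompt sent for generation should end with an open `<start_of_turn>model` line so the model continues in its own turn, and since Gemma has no dedicated system role, a system prompt is conventionally folded into the first user turn. A minimal sketch of building the prompt this way (the helper and message list are illustrative):

```python
# Hedged sketch of hand-building a Gemma-style prompt. llama.cpp normally
# adds the BOS token during tokenization, so <bos> is omitted here.

def build_gemma_prompt(messages, system=None):
    """messages: list of {"role": "user" | "model", "content": str}."""
    parts = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if i == 0 and system and msg["role"] == "user":
            # Gemma has no system role: fold the system prompt into the
            # first user turn instead.
            content = f"{system}\n\n{content}"
        parts.append(f"<start_of_turn>{msg['role']}\n{content}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # leave the model's turn open
    return "".join(parts)

prompt = build_gemma_prompt(
    [
        {"role": "user", "content": "knock knock"},
        {"role": "model", "content": "who is there"},
        {"role": "user", "content": "Gemma"},
    ],
    system="You are a helpful assistant.",
)
print(prompt)
```

Second, llama-server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint that applies the chat template embedded in the GGUF (when one is present), which avoids hand-rolling the format and keeps turn boundaries consistent.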