Replies: 2 comments
-
Happened to me with some weights/LLM combinations, but most work well, so I did not bother. E.g. the one I run now is mistral-7b-instruct-v0.2-q4_k_s, and --flash-attn works well with -c 32768. Which are you running?
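For comparison, a rough sketch of launching the working combination mentioned in this reply; the server binary name (./server in llama.cpp builds of that period, ./llama-server in newer ones) and the model path are assumptions, not taken from the comment:

```sh
# Known-good combination from this reply: Mistral 7B Instruct v0.2 (Q4_K_S quant),
# flash attention enabled, 32k context window. Model path is a placeholder.
./server -m ./models/mistral-7b-instruct-v0.2.Q4_K_S.gguf --flash-attn -c 32768
```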
-
Seems to happen for me too, on Smaug Llama 3 70B 32K.
-
With flash attention enabled via -fa and the maximum context length set above 8k (-c 16384, -c 24576, etc.), the llama.cpp server can easily accumulate more than 8k tokens of context after several turns of conversation.
However, once the prompt exceeds 8k tokens the response turns to gibberish, with a few simple characters repeated over and over. Does this only happen to me? I am on Apple Silicon, building with LLAMA_METAL=1.
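For reference, the failing setup described above would be built and launched roughly like this; the server binary name, model path, and port are assumptions, not taken from the post:

```sh
# Build with Metal support on Apple Silicon, as described above
make LLAMA_METAL=1

# Start the HTTP server with flash attention (-fa) and a 16k context window (-c 16384).
# Model path and port are placeholders.
./server -m ./models/model.gguf -fa -c 16384 --port 8080
```

Responses are reportedly coherent until the accumulated context passes roughly 8k tokens, after which the output degrades into repeated characters.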