Replies: 2 comments
-
Happened to me with some weights/LLM combinations, but most work well, so I did not bother. E.g. the one I run now is mistral-7b-instruct-v0.2-q4_k_s, and --flash-attn works well with -c 32768. Which are you running?
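For comparison, a rough sketch of launching the working combination mentioned in this reply; the server binary name (./server in llama.cpp builds of that period, ./llama-server in newer ones) and the model path are assumptions, not taken from the comment:

```sh
# Known-good combination from this reply: Mistral 7B Instruct v0.2 (Q4_K_S quant),
# flash attention enabled, 32k context window. Model path is a placeholder.
./server -m ./models/mistral-7b-instruct-v0.2.Q4_K_S.gguf --flash-attn -c 32768
```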
-
Seems to happen for me too, on Smaug Llama 3 70B 32K.
-
With flash attention enabled via -fa and the maximum context length set above 8k (-c 16384, -c 24576, etc.), the llama.cpp server can easily accumulate more than 8k tokens of context after several turns of conversation.
However, once the prompt exceeds 8k tokens the response turns to gibberish, with a few simple characters repeated over and over. Does this only happen to me? I am on Apple Silicon, building with LLAMA_METAL=1.
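For reference, the failing setup described above would be built and launched roughly like this; the server binary name, model path, and port are assumptions, not taken from the post:

```sh
# Build with Metal support on Apple Silicon, as described above
make LLAMA_METAL=1

# Start the HTTP server with flash attention (-fa) and a 16k context window (-c 16384).
# Model path and port are placeholders.
./server -m ./models/model.gguf -fa -c 16384 --port 8080
```

Responses are reportedly coherent until the accumulated context passes roughly 8k tokens, after which the output degrades into repeated characters.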