Issues with YARN #4019
-
If you mean you're writing a program using llama.cpp as an API, it's doubtful anyone could really help you without seeing the source. If it's large (or not something you can share), what I'd recommend is making a minimal example that reproduces your problem. It's likely something is going wrong with KV cache manipulation, adding batches, etc. in your app.
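For concreteness, here is the shape such a minimal reproduction could take: plain llama.cpp C API, a single sequence, an explicit n_past counter, and greedy sampling so runs are deterministic. This is a sketch, not the poster's code; the signatures follow the llama.h of roughly this period (llama_batch_get_one still taking a position and sequence id, llama_backend_init taking a NUMA flag), and the model path and prompt are placeholders.

```cpp
// Sketch of a minimal reproduction: plain llama.cpp C API, one sequence,
// an explicit n_past counter, greedy sampling, no cache surgery.
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init(false);

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 16384;
    // ... set the YaRN fields here exactly as in the failing app ...
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // a multi-turn style prompt; contents are a placeholder
    const char * prompt = "USER: Hello\nASSISTANT:";

    std::vector<llama_token> toks(cparams.n_ctx);
    const int n_prompt = llama_tokenize(model, prompt, (int) strlen(prompt),
                                        toks.data(), (int) toks.size(), true, false);
    if (n_prompt < 0) return 1;
    toks.resize(n_prompt);

    int n_past = 0;

    // evaluate the prompt as a single batch at positions [0, n_prompt)
    llama_decode(ctx, llama_batch_get_one(toks.data(), n_prompt, n_past, 0));
    n_past += n_prompt;

    // generate token by token; greedy argmax keeps runs deterministic
    for (int i = 0; i < 2000 && n_past < (int) cparams.n_ctx; ++i) {
        const float * logits = llama_get_logits(ctx);
        const int n_vocab = llama_n_vocab(model);

        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[best]) best = t;
        }
        if (best == llama_token_eos(model)) break;

        printf("%d ", best); // token ids only, to keep the sketch version-agnostic

        llama_decode(ctx, llama_batch_get_one(&best, 1, n_past, 0));
        n_past += 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

If a skeleton like this stays coherent well past ~1000 generated tokens under the same YaRN settings, that would point toward the app's batch and position bookkeeping rather than toward llama.cpp itself.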
-
I should have added: this issue is specific to YARN. The same code using NTK scaling or no scaling works perfectly fine; literally all I do is switch to YARN and the whole thing falls apart. It's the act of passing in the YARN parameters when creating the context that causes the issue; everything else is unchanged. As a result, I don't believe it's the result of batching or cache management, unless there's some fundamental difference in how YARN interacts with these things. I've been trying to get a "minimum reproduction" for about a week now, but haven't been having much luck. That's why I was hoping someone might have a brilliant idea based on the description and the fact that everything else works perfectly fine. A long shot, for sure.
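For reference, this is roughly what "passing in the YARN parameters when creating the context" amounts to through the llama.cpp C API. It is a sketch under assumptions, not the code from the failing app: the field and enum names track a recent llama.h (older headers spell the enum LLAMA_ROPE_SCALING_YARN), and the 16K-window-over-4K-native numbers are purely illustrative.

```cpp
// Sketch only: the context fields that differ when YaRN is enabled.
#include "llama.h"

llama_context * make_yarn_context(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();

    cparams.n_ctx             = 16384;                        // extended window
    cparams.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_YARN;
    cparams.rope_freq_scale   = 4096.0f / 16384.0f;           // native / extended
    cparams.yarn_orig_ctx     = 4096;                         // model's training context
    cparams.yarn_ext_factor   = 1.0f;                         // full YaRN interpolation
    cparams.yarn_attn_factor  = 1.0f;
    cparams.yarn_beta_fast    = 32.0f;                        // llama.cpp defaults
    cparams.yarn_beta_slow    = 1.0f;

    // Everything else (batching, KV cache handling, sampling) is left at the
    // same values as the non-scaled setup.
    return llama_new_context_with_model(model, cparams);
}
```

Per the description above, these fields are the only delta between the working and failing configurations.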
-
I've been trying to implement YARN and it's giving me the weirdest issues, and I can't nail down exactly why.
I have two test cases: the first is a creative-writing generation, and the second takes the form of a multi-turn conversation. That is the only real difference between the two.
The first gen appears to work perfectly fine. I've run it a few times and never seen any errors.
The second gen fails in the exact same way, after the same number of tokens, every time. It's the way the second gen fails that's confusing the hell out of me, because it doesn't seem to make any sense.
After approximately 1000 tokens, when I'm using the Llama.dll directly, the output becomes garbage. Not PURE garbage, but rather as if the attention is skipping around the context randomly. The model will start responding as the user, respond to messages from early in the conversation, regurgitate parts of its prompt verbatim, etc., all within the space of a single message. It's like someone took a bunch of responses and shuffled them around before writing them down. At a high level the PARTS of the responses make sense: the tokens are valid and many of the words are contextually sound, but the "context" switches rapidly.
To add greatly to the confusion, it doesn't seem to be a position within the context window causing the issue; it's literally the number of back-and-forth messages. If the prompt is 50 tokens, the issue will start around 1050. If the prompt is 7000 tokens, the issue will start around 7050.
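One hedged way to pin that observation down (not something from the thread, just an instrumentation sketch): log, for every decode call, the absolute position being written, the offset since the end of the prompt, and the turn index, so the candidate triggers (absolute position, tokens generated since the prompt, number of messages) can be separated in the failing run. The counters n_past, n_prompt, and turn are assumed to be whatever bookkeeping the app already keeps.

```cpp
#include <cstdio>

// Hypothetical helper: call right before each llama_decode.
// n_past   - absolute position of the first new token in the KV cache
// n_prompt - token count of the original prompt
// turn     - index of the current back-and-forth message
// n_new    - number of tokens in this batch
void log_decode(int n_past, int n_prompt, int turn, int n_new) {
    fprintf(stderr, "decode: turn=%d abs_pos=%d..%d since_prompt=%d\n",
            turn, n_past, n_past + n_new - 1, n_past - n_prompt);
}
```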
The only thing I can think of is that maybe it has something to do with cache fragmentation, but after trying a few tests with pumping data through, I can't seem to replicate it there either. No matter what I try, I can't seem to pin down what's actually causing the problem.
I'm at a loss. Aside from debugging line by line, which could take days, I'm not sure what to check next. It appears as though all of my parameters match, but there must be something I'm doing wrong here.