Issues with YARN #4019
-
If you mean you're writing a program using llama.cpp as an API, it's doubtful anyone could really help you without seeing the source. If it's large (or not something you can share), what I'd recommend is making a minimal example that reproduces your problem. It's likely something is going wrong with KV cache manipulation, adding batches, etc. in your app.
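For concreteness, here is the shape such a minimal reproduction could take: plain llama.cpp C API, a single sequence, an explicit n_past counter, and greedy sampling so runs are deterministic. This is a sketch, not the poster's code; the signatures follow the llama.h of roughly this period (llama_batch_get_one still taking a position and sequence id, llama_backend_init taking a NUMA flag), and the model path and prompt are placeholders.

```cpp
// Sketch of a minimal reproduction: plain llama.cpp C API, one sequence,
// an explicit n_past counter, greedy sampling, no cache surgery.
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init(false);

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 16384;
    // ... set the YaRN fields here exactly as in the failing app ...
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // a multi-turn style prompt; contents are a placeholder
    const char * prompt = "USER: Hello\nASSISTANT:";

    std::vector<llama_token> toks(cparams.n_ctx);
    const int n_prompt = llama_tokenize(model, prompt, (int) strlen(prompt),
                                        toks.data(), (int) toks.size(), true, false);
    if (n_prompt < 0) return 1;
    toks.resize(n_prompt);

    int n_past = 0;

    // evaluate the prompt as a single batch at positions [0, n_prompt)
    llama_decode(ctx, llama_batch_get_one(toks.data(), n_prompt, n_past, 0));
    n_past += n_prompt;

    // generate token by token; greedy argmax keeps runs deterministic
    for (int i = 0; i < 2000 && n_past < (int) cparams.n_ctx; ++i) {
        const float * logits = llama_get_logits(ctx);
        const int n_vocab = llama_n_vocab(model);

        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[best]) best = t;
        }
        if (best == llama_token_eos(model)) break;

        printf("%d ", best); // token ids only, to keep the sketch version-agnostic

        llama_decode(ctx, llama_batch_get_one(&best, 1, n_past, 0));
        n_past += 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

If a skeleton like this stays coherent well past ~1000 generated tokens under the same YaRN settings, that would point toward the app's batch and position bookkeeping rather than toward llama.cpp itself.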
-
I should have added: this issue is specific to YARN. The same code using NTK scaling or no scaling works perfectly fine; literally all I do is switch to YARN and the whole thing falls apart. It's the act of passing in the YARN parameters when creating the context that causes the issue; everything else is unchanged. As a result, I don't believe it's the result of batching or cache management, unless there's some fundamental difference in how YARN interacts with these things. I've been trying to get a "minimum reproduction" for about a week now, but haven't been having much luck. That's why I was hoping someone might have a brilliant idea based on the description and the fact that everything else works perfectly fine. A long shot, for sure.
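For reference, this is roughly what "passing in the YARN parameters when creating the context" amounts to through the llama.cpp C API. It is a sketch under assumptions, not the code from the failing app: the field and enum names track a recent llama.h (older headers spell the enum LLAMA_ROPE_SCALING_YARN), and the 16K-window-over-4K-native numbers are purely illustrative.

```cpp
// Sketch only: the context fields that differ when YaRN is enabled.
#include "llama.h"

llama_context * make_yarn_context(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();

    cparams.n_ctx             = 16384;                        // extended window
    cparams.rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_YARN;
    cparams.rope_freq_scale   = 4096.0f / 16384.0f;           // native / extended
    cparams.yarn_orig_ctx     = 4096;                         // model's training context
    cparams.yarn_ext_factor   = 1.0f;                         // full YaRN interpolation
    cparams.yarn_attn_factor  = 1.0f;
    cparams.yarn_beta_fast    = 32.0f;                        // llama.cpp defaults
    cparams.yarn_beta_slow    = 1.0f;

    // Everything else (batching, KV cache handling, sampling) is left at the
    // same values as the non-scaled setup.
    return llama_new_context_with_model(model, cparams);
}
```

Per the description above, these fields are the only delta between the working and failing configurations.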
-
I've been trying to implement YARN and it's giving me the weirdest issues, and I can't nail down exactly why.
I have two test cases: the first is a creative-writing generation, and the second takes the form of a multi-turn conversation. That is the only real difference between the two.
The first gen appears to work perfectly fine. I've run it a few times and never seen any errors.
The second gen fails in the exact same way, after the same number of tokens, every time. It's the way the second gen fails that's confusing the hell out of me, because it doesn't seem to make any sense.
After approximately 1000 tokens, when I'm using the Llama.dll directly, the output becomes garbage. Not PURE garbage, but rather as if the attention is skipping around the context randomly. The model will start responding as the user, respond to messages from early in the conversation, regurgitate parts of its prompt verbatim, etc., all within the space of a single message. It's like someone took a bunch of responses and shuffled them around before writing them down. At a high level the PARTS of the responses make sense: the tokens are valid and many of the words are contextually sound, but the "context" switches rapidly.
To add greatly to the confusion, it doesn't seem to be a position within the context window causing the issue; it's literally the number of back-and-forth messages. If the prompt is 50 tokens, the issue will start around 1050. If the prompt is 7000 tokens, the issue will start around 7050.
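One hedged way to pin that observation down (not something from the thread, just an instrumentation sketch): log, for every decode call, the absolute position being written, the offset since the end of the prompt, and the turn index, so the candidate triggers (absolute position, tokens generated since the prompt, number of messages) can be separated in the failing run. The counters n_past, n_prompt, and turn are assumed to be whatever bookkeeping the app already keeps.

```cpp
#include <cstdio>

// Hypothetical helper: call right before each llama_decode.
// n_past   - absolute position of the first new token in the KV cache
// n_prompt - token count of the original prompt
// turn     - index of the current back-and-forth message
// n_new    - number of tokens in this batch
void log_decode(int n_past, int n_prompt, int turn, int n_new) {
    fprintf(stderr, "decode: turn=%d abs_pos=%d..%d since_prompt=%d\n",
            turn, n_past, n_past + n_new - 1, n_past - n_prompt);
}
```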
The only thing I can think of is that maybe it has something to do with cache fragmentation, but after trying a few tests with pumping data through, I can't seem to replicate it there either. No matter what I try, I can't seem to pin down what's actually causing the problem.
I'm at a loss. Aside from debugging line by line, which could take days, I'm not sure what to check next. It appears as though all of my parameters match, but there must be something I'm doing wrong here.