Hey Ladies & Gents,
Like many of us, including some regular contributors, I'm often wrestling with RoPE settings to get the best perplexity curve out of the model I'm running.
As basic background: Llama 1 & 2 have a base RoPE frequency (theta) of 10,000, while Code Llama has a base of 1,000,000.
Llama 1 is trained on 2,048-token sequences, Llama 2 on 4,096, and Code Llama on 16,384. Some people pretrain/train/finetune (what's the difference?) custom models on longer sequences, notably on top of Llama 1 & 2.
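For anyone who wants to see where that base number actually enters the math, here is a minimal sketch of the per-dimension RoPE frequencies. It assumes a Llama-style head dimension of 128 (an assumption, not read from a model file) and is only meant to show why a larger base slows the rotation down:

```python
# Minimal sketch of the per-pair RoPE frequencies: theta_i = base^(-2i/d).
# head_dim = 128 is an assumption (the usual Llama head size).
def rope_inv_freqs(base: float, head_dim: int = 128) -> list[float]:
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

llama_freqs = rope_inv_freqs(10_000.0)         # Llama 1 & 2 default base
codellama_freqs = rope_inv_freqs(1_000_000.0)  # Code Llama default base

# The larger base makes the low-frequency components rotate much more slowly,
# which is what lets Code Llama keep positions distinguishable over 16k tokens.
print(llama_freqs[-1], codellama_freqs[-1])
```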
A scale factor, i.e. linear RoPE scaling or Position Interpolation (PI, what SuperHOT uses; I believe these are the same thing), basically works on extended context / original context: scale 2 = 2,048 × 2 = 4,096 context, at the cost of an overall loss in perplexity. The RoPE frequency scale is 1 / scale factor.
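To make that concrete, here is a small sketch of what linear scaling does to the rotation angle, as I understand it (the numbers are illustrative):

```python
# Sketch of linear RoPE scaling (Position Interpolation / SuperHOT-style):
# the position index is simply divided by the scale factor before rotating.
def pi_rope_angle(pos: int, inv_freq: float, scale: float = 2.0) -> float:
    """Rotation angle for one frequency; scale > 1 compresses positions so an
    extended context maps back into the range the model was trained on."""
    return (pos / scale) * inv_freq

# With scale = 2, position 4096 produces the same angles that position 2048
# did during training, which is why llama.cpp expresses this as
# rope_freq_scale = 1 / scale = 0.5.
```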
NTK v1 works differently: instead of a scale factor you set a different frequency base (or an alpha value). There's an equation linking the two for Llama 1 & 2, and another (approximate) one linking alpha / the base frequency to the optimal max context (that's where I am for now). If I understand properly, that hasn't been figured out for Code Llama, but Code Llama is much more robust on its base of 1,000,000 no matter what the context length is.
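For reference, the alpha-to-base relation I've seen quoted for NTK-aware (v1) scaling on Llama-style models is new_base = base × alpha^(d / (d − 2)). The sketch below assumes head_dim d = 128 and should be read as illustrative rather than authoritative:

```python
# Commonly quoted NTK-aware (v1) relation between alpha and the adjusted base.
# Assumes head_dim = 128 (standard Llama); the exponent depends on that choice.
def ntk_base_from_alpha(alpha: float, base: float = 10_000.0,
                        head_dim: int = 128) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(ntk_base_from_alpha(2.0))  # ~20_200, often treated as roughly a 2x context target
print(ntk_base_from_alpha(4.0))  # ~40_900, roughly a 4x context target
```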
On Llama 1 & 2, we can even use together PI and NTK to reach a higher context length without too much damages on perplexity, but that makes an even more complex equation to link both and chose the correct couple of base and scale, and I'm not algebra savvy.
My question is simple, but calls for a complex answer 👍
Could the RoPE experts around here write a wiki page about the various RoPE techniques, how to use them and how to combine them according to the Llama model in use (its underlying base model and any later customization), or, even better, integrate a reliable RoPE calculation system into the llama.cpp engine based on all the relevant parameters for Llama 1, Llama 2 and Code Llama (this one is trickier)?