Description
Hello,
I am currently working with llama.cpp for a personal project. My absolute hats off to whoever is looking after code quality, structure, and readability: it is an absolute delight to work with and a very rare example of C/C++ code that I am not afraid to touch despite not knowing it inside and out. My thanks to anyone doing code structure policing, BDFLing, or the like.
I'll likely open a few issue tickets here as I notice things while working on the lower-level parts of the code and getting more involved with it. An explanation is below. For technical qualifications, I've worked with neural networks for years and have a couple of successful open source projects.
I am volunteering to fix this code, should any suggestions be approved.
Expected Behavior
In the llama_sample_repetition_penalty function, we expect to penalize a token based upon how many times it has been used. This is done by dividing the token's logit by the penalty if it is above zero, and multiplying it by the penalty if it is below zero. What this is doing is not really documented in the command-line help, so one has to read the code to find it out. It is somewhat implied that the penalty needs to be > 1, but I think we can make that a bit more clear.
What we expect to happen is a fairly uniform behavior where a token's likelihood smoothly goes down based upon how often it is repeated.
In practice, this seems to hit a few snags and is mathematically incorrect; looking at the function, we can see why.
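For reference, the branch in question looks roughly like this (a paraphrased sketch rather than a quote of the source; `token_was_seen` is just a stand-in for the lookup against the recent-token list):

```cpp
// Paraphrased sketch of the current rule in llama_sample_repetition_penalty:
// positive logits are divided by the penalty, negative ones are multiplied by it.
for (size_t i = 0; i < candidates->size; ++i) {
    if (!token_was_seen(candidates->data[i].id)) { // stand-in for the recent-token lookup
        continue;
    }
    if (candidates->data[i].logit <= 0.0f) {
        candidates->data[i].logit *= penalty; // negative logits get pushed further down
    } else {
        candidates->data[i].logit /= penalty; // positive logits get pulled toward zero
    }
}
```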
Current Behavior
If we look at the piecewise-linear behavior of this pre-softmax transformation, we can see that it changes drastically around zero.

One problem with this is that softmax is supposed to be shift-invariant (adding the same constant to every logit leaves the output unchanged), and other code depends upon this, for example the trick where we subtract the maximum value from the logits before softmaxing for numerical stability. Because the penalty treats positive and negative logits differently, it breaks this invariance, which can cause some bizarre inflection points where the token-suppressing behavior does not work as expected. This is likely an unintended side effect.
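Here is a small standalone demonstration (made-up numbers, plain C++ rather than the llama.cpp API): two logit vectors that differ only by a constant, and therefore encode the same distribution, come out of the current penalty rule with different distributions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Current-style penalty: divide positive logits, multiply negative ones.
static std::vector<float> penalize(std::vector<float> logits, float penalty) {
    for (float & l : logits) {
        l = (l <= 0.0f) ? l * penalty : l / penalty;
    }
    return logits;
}

// Standard numerically-stable softmax.
static std::vector<float> softmax(const std::vector<float> & logits) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;
    }
    return probs;
}

int main() {
    const float penalty = 1.3f;
    // b is just a shifted down by 2, so both encode the same distribution.
    const std::vector<float> a = {3.0f,  1.0f,  0.0f};
    const std::vector<float> b = {1.0f, -1.0f, -2.0f};

    // If the penalty were shift-invariant, these two rows would be identical.
    for (float p : softmax(penalize(a, penalty))) {
        printf("%.4f ", p);
    }
    printf("\n");
    for (float p : softmax(penalize(b, penalty))) {
        printf("%.4f ", p);
    }
    printf("\n");
    return 0;
}
```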
Possible Solutions
-- Scale based on minimum value
The frequency penalty already adds a bias based upon the repeat counts, so falling back to a simple additive bias to remove the fixed point around 0 would just duplicate that behavior in a separate function (see the sketch below).
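For context, that penalty behaves roughly like an additive bias of this shape (a simplified sketch, not the exact source; `counts[i]` is how often token `i` appears in the recent context):

```cpp
#include <vector>

// Simplified sketch of an additive frequency/presence-style penalty:
// the bias grows with the repeat count and does not depend on the sign of the logit.
static void penalize_frequency(std::vector<float> & logits, const std::vector<int> & counts,
                               float alpha_frequency, float alpha_presence) {
    for (size_t i = 0; i < logits.size(); ++i) {
        if (counts[i] > 0) {
            logits[i] -= (float) counts[i] * alpha_frequency + alpha_presence;
        }
    }
}
```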
What we could do instead, similar to the softmax stability trick, is shift the logits so that the minimum becomes zero (subtract the minimum), scale them multiplicatively by a single factor (no piecewise divide-or-multiply), then shift back (add the minimum). This makes the minimum value of the incoming logits the fixed point, and prevents any bizarrely-scaled 'division in the middle'.
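A minimal sketch of the idea on a plain logit vector (illustrative only; the real function would operate on llama_token_data_array and only touch tokens that appear in the recent context):

```cpp
#include <algorithm>
#include <vector>

// Proposed min-anchored penalty: the minimum logit is the fixed point, and every
// penalized logit is scaled toward it by the same factor (equivalently, multiplied
// by 1/penalty), so the transform commutes with adding a constant to all logits.
static void penalize_min_anchored(std::vector<float> & logits, float penalty) {
    const float min_logit = *std::min_element(logits.begin(), logits.end());
    for (float & l : logits) {
        // shift so the minimum sits at zero, scale, shift back
        l = (l - min_logit) / penalty + min_logit;
    }
}
```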
-- Scale based on maximum value
This is also possible; however, it might be unintuitive due to 0 becoming the upper fixed point after the shift by the maximum.
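One way this variant could look, sketched in the same style (again illustrative; the real function would only touch repeated tokens):

```cpp
#include <algorithm>
#include <vector>

// Max-anchored variant: the maximum logit is the fixed point, and penalized
// logits are pushed further below it by the penalty factor.
static void penalize_max_anchored(std::vector<float> & logits, float penalty) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    for (float & l : logits) {
        l = (l - max_logit) * penalty + max_logit;
    }
}
```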
Impact From Changes
Fixing this will likely have some impact on the behavior of currently-deployed scripts, including people's parameter tuning around the bug. It should have a net long-term benefit, however.
Code contributions
I am willing to write and walk the code through a PR/code review, as well as document the original paper this is based on (if someone can arxiv link me please).
Conclusion
Thank you for your attention to this matter. I plan on making at least one of these changes (or something similar, if I find a more fitting solution) in my personal version of llama.cpp, and I hope to see them in the main version as well. This is an excellent program and I am glad to be a potential contributor to it.