New or modified tokenizer: it currently struggles and fails on large words, which is quite a concern. [done]
X) GGML threads
I'd like a global mutex flag that calms down all threads in ggml (set while GPU operations take place)
In general there is something very wrong with the ggml threads; they interrupt each other too much and hurt performance.
It makes little sense that 2 threads work better than 4 on an 8/16-core CPU; the atomic mutex/work loops should be investigated.
Maybe the number of threads could be modulated; after all, ggml knows exactly which operations are upcoming.
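One cheap way to quiet the workers while the GPU runs is a single global atomic flag that the spin loops check. This is only a sketch with C11 atomics; the flag name and the hook functions are assumptions, not ggml's actual API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global "calm down" flag -- not part of ggml's API.
 * Set before launching a GPU operation, cleared afterwards; worker
 * threads check it in their spin loop and back off instead of
 * burning CPU while the GPU does the work. */
static atomic_bool g_gpu_busy = false;

void gpu_section_begin(void) { atomic_store(&g_gpu_busy, true); }
void gpu_section_end(void)   { atomic_store(&g_gpu_busy, false); }

/* Called from the worker spin loop: if true, the worker should
 * sleep or yield instead of polling for work. */
bool worker_should_yield(void) {
    return atomic_load(&g_gpu_busy);
}
```

A worker that sees the flag set could `sched_yield()` or sleep briefly, which would also be a natural place to experiment with modulating the effective thread count per operation.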
X) CUDA tensor offloading
A full integer mat_mul kernel would be nice to have, one day.
A function to offload tensors 'again', skipping those that are already offloaded. That would allow utilizing the cuBLAS temporary buffers. Should be just a couple of lines if done well.
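The skip-if-already-offloaded pass could look roughly like this. The tensor struct and the offload call below are minimal stand-ins for ggml's real ones, just to illustrate the logic:

```c
#include <stddef.h>

/* Minimal stand-in for ggml's tensor; the real struct and the real
 * upload path (via the CUDA backend) differ -- this only shows the
 * skip-if-already-offloaded logic. */
struct tensor {
    int on_gpu;   /* 1 if already resident on the GPU */
};

static void offload_to_gpu(struct tensor * t) {
    t->on_gpu = 1;   /* placeholder for the actual CUDA upload */
}

/* Offload every tensor that is not yet on the GPU; returns how many
 * tensors were actually moved. Already-offloaded tensors are skipped,
 * so the pass is safe to run a second time after cuBLAS has used
 * temporary buffers. */
int offload_remaining(struct tensor * tensors, size_t n) {
    int moved = 0;
    for (size_t i = 0; i < n; ++i) {
        if (tensors[i].on_gpu) continue;   /* skip: already offloaded */
        offload_to_gpu(&tensors[i]);
        ++moved;
    }
    return moved;
}
```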
X) The Python script generates a very old GGML binary (V0) without "token scores", which produces a warning during conversion to V3.
Do we need those scores? Our tensors appear to be 1:1 identical to the official release.
Are they used only for sampling purposes? If anyone knows, it would be nice to either remove that message or add the token scores. Done (scores are only used by SentencePiece)
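For intuition on why SentencePiece cares about the scores: when several vocab pieces match the start of the input, the per-token score decides which one wins. A toy sketch with invented names (the real vocab layout and matching in ggml differ):

```c
#include <string.h>

/* Illustrative only: a SentencePiece-style vocab entry carrying a
 * score. Higher score = preferred piece when multiple entries match. */
struct vocab_entry {
    const char * piece;
    float        score;
};

/* Return the index of the best-scoring entry whose piece is a prefix
 * of `text`, or -1 if nothing matches. */
int best_match(const struct vocab_entry * vocab, int n, const char * text) {
    int   best       = -1;
    float best_score = -1e30f;
    for (int i = 0; i < n; ++i) {
        size_t len = strlen(vocab[i].piece);
        if (strncmp(text, vocab[i].piece, len) == 0 &&
            vocab[i].score > best_score) {
            best_score = vocab[i].score;
            best       = i;
        }
    }
    return best;
}
```

Without the scores, ties like this are broken arbitrarily, which is why the converter warns when they are missing.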
X) Multi GPU support
Currently the flags to split tensors up are there, but the implementation is not. Looks like a small thing, though it will grow with full GPU support. Done
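Splitting a tensor's rows across GPUs by fractional weights might look like the following sketch. The function and the interpretation of the split fractions are assumptions for illustration, not the actual implementation:

```c
/* Assign each GPU a contiguous range of rows, proportional to its
 * split fraction (e.g. {3, 1} gives GPU 0 three quarters of the rows).
 * The last GPU absorbs any rounding remainder. */
void split_rows(int nrows, const float * split, int ngpu,
                int * row_begin, int * row_end) {
    float total = 0.f;
    for (int i = 0; i < ngpu; ++i) total += split[i];

    int   begin = 0;
    float acc   = 0.f;
    for (int i = 0; i < ngpu; ++i) {
        acc += split[i];
        int end = (i == ngpu - 1) ? nrows : (int)(nrows * acc / total);
        row_begin[i] = begin;
        row_end[i]   = end;
        begin = end;
    }
}
```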
X) Smaller stuff
I'd like to expose the main application's params struct to libfalcon.cpp; it would make some things more convenient, but it's quite a mess to get it through the layers.
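One low-mess option is to forward-declare the struct and pass an opaque pointer down, so the library never needs the application's headers. All names here are hypothetical:

```c
#include <stddef.h>

/* Forward declaration only: the full definition lives in the
 * application; the library treats it as an opaque handle. */
struct app_params;

struct falcon_eval_ctx {
    const struct app_params * app;   /* NULL if not provided */
};

void falcon_set_app_params(struct falcon_eval_ctx * ctx,
                           const struct app_params * p) {
    ctx->app = p;
}

const struct app_params *
falcon_get_app_params(const struct falcon_eval_ctx * ctx) {
    return ctx->app;
}
```

The trade-off is that the library can only carry the pointer, not read the fields; any field the library actually needs still has to be copied into its own config struct at the boundary.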
Note: these items are not necessarily sequential.