New or modified tokenizer: it currently struggles and fails on large words, which is quite a concern. [done]
X) GGML threads
I'd like a global mutex flag that calms down all threads in ggml (set while GPU operations take place)
In general there is something very wrong with the ggml threads; they interrupt each other too much and hurt performance.
It makes little sense that 2 threads work better than 4 on an 8/16-core CPU; the atomic mutex/work loops should be investigated.
Maybe the number of threads could be modulated; after all, ggml knows exactly which operations are upcoming.
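One cheap way to quiet the workers while the GPU runs is a single global atomic flag that the spin loops check. This is only a sketch with C11 atomics; the flag name and the hook functions are assumptions, not ggml's actual API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global "calm down" flag -- not part of ggml's API.
 * Set before launching a GPU operation, cleared afterwards; worker
 * threads check it in their spin loop and back off instead of
 * burning CPU while the GPU does the work. */
static atomic_bool g_gpu_busy = false;

void gpu_section_begin(void) { atomic_store(&g_gpu_busy, true); }
void gpu_section_end(void)   { atomic_store(&g_gpu_busy, false); }

/* Called from the worker spin loop: if true, the worker should
 * sleep or yield instead of polling for work. */
bool worker_should_yield(void) {
    return atomic_load(&g_gpu_busy);
}
```

A worker that sees the flag set could `sched_yield()` or sleep briefly, which would also be a natural place to experiment with modulating the effective thread count per operation.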
X) CUDA tensor offloading
A full integer mat_mul kernel would be nice to have, one day.
A function to offload tensors 'again', skipping those that are already offloaded. That would allow utilizing the cuBLAS temporary buffers. Should be just a couple of lines if done well.
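The skip-if-already-offloaded pass could look roughly like this. The tensor struct and the offload call below are minimal stand-ins for ggml's real ones, just to illustrate the logic:

```c
#include <stddef.h>

/* Minimal stand-in for ggml's tensor; the real struct and the real
 * upload path (via the CUDA backend) differ -- this only shows the
 * skip-if-already-offloaded logic. */
struct tensor {
    int on_gpu;   /* 1 if already resident on the GPU */
};

static void offload_to_gpu(struct tensor * t) {
    t->on_gpu = 1;   /* placeholder for the actual CUDA upload */
}

/* Offload every tensor that is not yet on the GPU; returns how many
 * tensors were actually moved. Already-offloaded tensors are skipped,
 * so the pass is safe to run a second time after cuBLAS has used
 * temporary buffers. */
int offload_remaining(struct tensor * tensors, size_t n) {
    int moved = 0;
    for (size_t i = 0; i < n; ++i) {
        if (tensors[i].on_gpu) continue;   /* skip: already offloaded */
        offload_to_gpu(&tensors[i]);
        ++moved;
    }
    return moved;
}
```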
X) The Python script generates a very old GGML binary (V0) without "token scores", which produces a warning during conversion to V3.
Do we need those scores? Our tensors appear to be 1:1 identical to the official release.
Are they used only for sampling purposes? If anyone knows, it would be nice to either remove that message or add the token scores. Done (scores are only used by SentencePiece)
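For intuition on why SentencePiece cares about the scores: when several vocab pieces match the start of the input, the per-token score decides which one wins. A toy sketch with invented names (the real vocab layout and matching in ggml differ):

```c
#include <string.h>

/* Illustrative only: a SentencePiece-style vocab entry carrying a
 * score. Higher score = preferred piece when multiple entries match. */
struct vocab_entry {
    const char * piece;
    float        score;
};

/* Return the index of the best-scoring entry whose piece is a prefix
 * of `text`, or -1 if nothing matches. */
int best_match(const struct vocab_entry * vocab, int n, const char * text) {
    int   best       = -1;
    float best_score = -1e30f;
    for (int i = 0; i < n; ++i) {
        size_t len = strlen(vocab[i].piece);
        if (strncmp(text, vocab[i].piece, len) == 0 &&
            vocab[i].score > best_score) {
            best_score = vocab[i].score;
            best       = i;
        }
    }
    return best;
}
```

Without the scores, ties like this are broken arbitrarily, which is why the converter warns when they are missing.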
X) Multi GPU support
Currently the flags to split tensors up are there, but the implementation is not. Looks like a small thing, though it will grow with full GPU support. Done
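Splitting a tensor's rows across GPUs by fractional weights might look like the following sketch. The function and the interpretation of the split fractions are assumptions for illustration, not the actual implementation:

```c
/* Assign each GPU a contiguous range of rows, proportional to its
 * split fraction (e.g. {3, 1} gives GPU 0 three quarters of the rows).
 * The last GPU absorbs any rounding remainder. */
void split_rows(int nrows, const float * split, int ngpu,
                int * row_begin, int * row_end) {
    float total = 0.f;
    for (int i = 0; i < ngpu; ++i) total += split[i];

    int   begin = 0;
    float acc   = 0.f;
    for (int i = 0; i < ngpu; ++i) {
        acc += split[i];
        int end = (i == ngpu - 1) ? nrows : (int)(nrows * acc / total);
        row_begin[i] = begin;
        row_end[i]   = end;
        begin = end;
    }
}
```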
X) Smaller stuff
I'd like to expose the main application's params struct to libfalcon.cpp; it would make some things more convenient, but it's quite a mess to get it through the layers.
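One low-mess option is to forward-declare the struct and pass an opaque pointer down, so the library never needs the application's headers. All names here are hypothetical:

```c
#include <stddef.h>

/* Forward declaration only: the full definition lives in the
 * application; the library treats it as an opaque handle. */
struct app_params;

struct falcon_eval_ctx {
    const struct app_params * app;   /* NULL if not provided */
};

void falcon_set_app_params(struct falcon_eval_ctx * ctx,
                           const struct app_params * p) {
    ctx->app = p;
}

const struct app_params *
falcon_get_app_params(const struct falcon_eval_ctx * ctx) {
    return ctx->app;
}
```

The trade-off is that the library can only carry the pointer, not read the fields; any field the library actually needs still has to be copied into its own config struct at the boundary.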
Note: these items are not necessarily sequential.