Replies: 2 comments 5 replies
This isn't exactly the same, but I've recently been thinking that an unpacked GGUF format could be really useful. By that I mean the various GGUF metadata keys could just be text files in a simple format, and the actual weights could be separate binary files; to load it, you'd just read the directory instead of parsing one monolithic file.
The way it relates to this discussion is that with that approach you could easily swap out layers/weights, change metadata, etc. without having to rewrite huge files or deal with file-format organization issues. Or include extra weights, of course.
It could also be really useful for testing stuff like different approaches to quantizing different layers or weights. Currently that kind of testing (especially on bigger models) is really awkward and time consuming. If you could just swap tensors with different versions, testing a bunch of permutations would be way easier. I actually don't think this would be too hard to implement, either.
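To make the idea concrete, here is a minimal sketch of what an unpacked layout could look like. Everything here is an assumption for illustration (the file names, the `key = value` metadata format, the `.bin`/`.shape` sidecar convention); it is not a real llama.cpp or GGUF layout:

```python
import os
import struct

# Hypothetical "unpacked" model layout (all names invented for this sketch):
#   model_dir/metadata.txt        one "key = value" per line
#   model_dir/tensors/NAME.bin    raw little-endian float32 data
#   model_dir/tensors/NAME.shape  space-separated dims as text

def save_tensor(model_dir, name, shape, values):
    """Write one tensor as a raw .bin file plus a .shape sidecar."""
    tdir = os.path.join(model_dir, "tensors")
    os.makedirs(tdir, exist_ok=True)
    with open(os.path.join(tdir, name + ".bin"), "wb") as f:
        f.write(struct.pack("<%df" % len(values), *values))
    with open(os.path.join(tdir, name + ".shape"), "w") as f:
        f.write(" ".join(str(d) for d in shape))

def load_model(model_dir):
    """Read metadata and every tensor back into plain Python objects."""
    meta = {}
    with open(os.path.join(model_dir, "metadata.txt")) as f:
        for line in f:
            if "=" in line:
                key, _, val = line.partition("=")
                meta[key.strip()] = val.strip()
    tensors = {}
    tdir = os.path.join(model_dir, "tensors")
    for fname in sorted(os.listdir(tdir)):
        if not fname.endswith(".bin"):
            continue
        name = fname[: -len(".bin")]
        with open(os.path.join(tdir, name + ".shape")) as f:
            shape = tuple(int(d) for d in f.read().split())
        with open(os.path.join(tdir, fname), "rb") as f:
            data = f.read()
        values = list(struct.unpack("<%df" % (len(data) // 4), data))
        tensors[name] = (shape, values)
    return meta, tensors
```

The point of the layout is that swapping one tensor is just overwriting one small file (another `save_tensor` call, or even `cp` from the shell); nothing else in the model directory has to be rewritten, which is exactly what makes permutation testing cheap.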
I don't think there's going to be much interest in this idea, but I thought I'd put it out there for any feedback.
I've been working on a weight-layer block system. The motivation is that lately a number of merged models have shipped in several versions that differ only in the order of the weight layers taken from the merged parent models, or in which blocks of layers come from each parent. So I was thinking about a system where a single model file could contain more layers than would normally be used at one time, with separate definition files specifying which layers from that larger file are used, and how.
I wrote a (rough work in progress) overview of the idea here: https://pastebin.com/6VT3SUy9
That also includes a link to the (extremely rough) work-in-progress code. That code is very much just a proof of concept and needs a lot of work. However, I don't know whether there's any interest in or use for this idea, and with the parallel decoding + continuous batching commits coming, which I believe will break what I've done, I'm not sure whether it's worth continuing.
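If it helps the discussion, the definition-file concept can be sketched in a few lines. The layer names, the list-of-names definition format, and the `resolve` helper below are all my own illustrative assumptions, not anything from the linked pastebin or code:

```python
# Sketch of the "superset model + definition file" idea (names invented).
# A single stored file holds layer weight sets from several merged parents,
# i.e. more layers than any one configuration uses at a time.
stored_layers = {
    "parentA.blk.0": "weights-A0",  # strings stand in for real tensor data
    "parentA.blk.1": "weights-A1",
    "parentB.blk.0": "weights-B0",
    "parentB.blk.1": "weights-B1",
}

# A definition file lists, in order, which stored layer fills each runtime
# slot. Two definitions can reorder or mix the same stored data differently.
definition_v1 = ["parentA.blk.0", "parentA.blk.1", "parentB.blk.1"]
definition_v2 = ["parentB.blk.0", "parentA.blk.0", "parentA.blk.1"]

def resolve(definition, stored):
    """Build the runtime layer list a loader would actually execute."""
    missing = [name for name in definition if name not in stored]
    if missing:
        raise KeyError("definition references unknown layers: %r" % missing)
    return [stored[name] for name in definition]
```

Under this scheme, switching model variants means pointing the loader at a different definition file; the large tensor data is never rewritten, which is the same property that makes the merge-permutation experiments described above much cheaper.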