Replies: 2 comments 2 replies
-
Yes, you can basically run layers in whatever order you want, but the results may or may not be good. If you're just running a full copy of an existing layer, that may avoid the graph memory requirement changes that skipping them seemed to cause. The only thing you'd be buying by doing this is reduced memory usage, though. I'm not sure how likely it is to work, since models are trained running through the layers sequentially.
I'm not really sure what you mean. The hacked perplexity tool in that branch iterated through the layers, skipping one at a time, found which skip resulted in the lowest perplexity, then added that as a permanent skip, rinse, repeat. You could potentially run an existing layer there instead of skipping one. I won't be the one to fix the merge conflicts or add features like that, though, but you're welcome to do whatever you can with that code.
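For anyone reading along, this is roughly what that greedy search looks like. It's a minimal sketch in Python rather than the actual C++ from that branch, and `evaluate_perplexity` is a hypothetical helper standing in for the hacked perplexity tool (it runs the eval with the given set of layers skipped, or substituted, and returns the perplexity):

```python
# Minimal sketch of the greedy skip search described above.
# evaluate_perplexity(skip_layers) is a hypothetical helper, not an
# existing function in that branch.

def greedy_layer_search(n_layers, evaluate_perplexity, max_skips):
    """Repeatedly find the single layer whose removal hurts perplexity least."""
    skipped = set()
    baseline = evaluate_perplexity(skipped)

    for _ in range(max_skips):
        best_layer, best_ppl = None, float("inf")

        # Try skipping each remaining layer on top of the permanent skips so far.
        for layer in range(n_layers):
            if layer in skipped:
                continue
            ppl = evaluate_perplexity(skipped | {layer})
            if ppl < best_ppl:
                best_layer, best_ppl = layer, ppl

        # Make the least-damaging skip permanent, then rinse and repeat.
        skipped.add(best_layer)
        print(f"skip layer {best_layer}: perplexity {baseline:.3f} -> {best_ppl:.3f}")
        baseline = best_ppl

    return skipped
```

Swapping "skip this layer" for "run some other already-loaded layer here" would only change what `evaluate_perplexity` does with the candidate; the search loop itself stays the same.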
-
How do LoRAs work? Could you apply them on the fly to a group of layers? There are tools to extract a LoRA from a fine-tuned LLM. Could applying LoRAs to layers transform a non-similar layer for better perplexity?
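To make the question concrete: a LoRA is just a low-rank additive update to a layer's weight matrices, so "applying it on the fly" amounts to adding a small delta to the weights you already have loaded. The snippet below is a generic illustration of that math with made-up shapes, not llama.cpp's actual LoRA loading code:

```python
# Rough illustration of the LoRA math (shapes and names are examples only):
# a fine-tune's change to a weight matrix W is approximated by two small
# low-rank matrices A and B, so the adapted weight is W + (alpha / r) * B @ A.
import numpy as np

d_out, d_in, r, alpha = 4096, 4096, 8, 16   # r is the LoRA rank

W = np.random.randn(d_out, d_in).astype(np.float32)   # frozen base weight
A = np.random.randn(r, d_in).astype(np.float32)       # LoRA "down" projection
B = np.zeros((d_out, r), dtype=np.float32)            # LoRA "up" projection

# Applying the adapter is just adding the low-rank delta, which is why it can
# in principle be merged into (or stripped from) individual layers selectively.
W_adapted = W + (alpha / r) * (B @ A)
```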
-
@KerfuffleV2 I was interested in your layer-skipping work and wanted to pose a question: if we can identify the layers that are doing "more or less" the same thing as another layer, and then run that other layer again while it's still loaded into VRAM, can we effectively skip layers by 'substituting' them?
Essentially, what I'm thinking is this: certain layers might share a lot of similarities, or may be nearly identical in how they change the activations in isolation, but the order in which they are applied still inherently matters. For example, if there's a shared group of 'grammar layers' (theoretically speaking; we don't really know precisely what the hidden layers are doing) and every couple of layers the model 'checks for grammar', but those layers all serve the same ultimate 'purpose', that redundancy is what this would aim to exploit.
I recognize I'm probably making a lot of assumptions here, and I could be extremely off on the practicality, but I think it's worth exploring, especially for larger models that might have more 'redundancy'.
The impact of each layer is context-dependent, yes, but how many of the layers are truly 'shared' in what they are attempting to achieve, especially in models with 40+ layers? The fact that the hidden layers are applied sequentially is what led me to this concept.
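If it helps make the idea concrete, here's a minimal sketch of one way to score how "interchangeable" layers look before trying substitution. It assumes a hypothetical `run_layer(i, hidden_state)` helper and per-layer input activations collected from a forward pass; none of this is existing llama.cpp API:

```python
# Sketch: compare what each layer *adds* to the residual stream on the same
# inputs, to find candidate pairs where one layer might stand in for another.
# run_layer(i, h) and hidden_states[i] (activations entering layer i) are
# hypothetical stand-ins for whatever a patched eval loop would expose.
import numpy as np

def layer_delta_similarity(hidden_states, run_layer, n_layers):
    deltas = []
    for i in range(n_layers):
        h = hidden_states[i]                 # activations entering layer i
        deltas.append(run_layer(i, h) - h)   # the residual update layer i applies

    # Cosine similarity between flattened, normalized per-layer updates;
    # high off-diagonal values mark candidate substitution pairs.
    flat = [d.ravel() / (np.linalg.norm(d.ravel()) + 1e-8) for d in deltas]
    return np.array([[float(a @ b) for b in flat] for a in flat])
```

A greedy search like the one in the branch could then try substituting each high-similarity pair instead of skipping, and keep whichever substitutions leave perplexity closest to the baseline.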