Replies: 2 comments 2 replies
-
Yes, you can basically run layers in whatever order you want, but the results may or may not be good. If you're just running a full copy of an existing layer, that may avoid the graph memory requirement changes that skipping them seemed to cause. The only thing you'd be buying by doing this is reduced memory usage, though. I'm not sure how likely it is to work, since models are trained running through the layers sequentially.
I'm not really sure what you mean. The hacked perplexity tool in that branch iterated through the layers, skipping one at a time, found which skip resulted in the lowest perplexity, then added that as a permanent skip, rinse, repeat. You could potentially run an existing layer there instead of skipping one. I won't be the one to fix the merge conflicts or add features like that, though, but you're welcome to do whatever you can with that code.
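For anyone reading along, this is roughly what that greedy search looks like. It's a minimal sketch in Python rather than the actual C++ from that branch, and `evaluate_perplexity` is a hypothetical helper standing in for the hacked perplexity tool (it runs the eval with the given set of layers skipped, or substituted, and returns the perplexity):

```python
# Minimal sketch of the greedy skip search described above.
# evaluate_perplexity(skip_layers) is a hypothetical helper, not an
# existing function in that branch.

def greedy_layer_search(n_layers, evaluate_perplexity, max_skips):
    """Repeatedly find the single layer whose removal hurts perplexity least."""
    skipped = set()
    baseline = evaluate_perplexity(skipped)

    for _ in range(max_skips):
        best_layer, best_ppl = None, float("inf")

        # Try skipping each remaining layer on top of the permanent skips so far.
        for layer in range(n_layers):
            if layer in skipped:
                continue
            ppl = evaluate_perplexity(skipped | {layer})
            if ppl < best_ppl:
                best_layer, best_ppl = layer, ppl

        # Make the least-damaging skip permanent, then rinse and repeat.
        skipped.add(best_layer)
        print(f"skip layer {best_layer}: perplexity {baseline:.3f} -> {best_ppl:.3f}")
        baseline = best_ppl

    return skipped
```

Swapping "skip this layer" for "run some other already-loaded layer here" would only change what `evaluate_perplexity` does with the candidate; the search loop itself stays the same.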
-
How do LoRAs work? Could you apply them on the fly to a group of layers? There are tools to extract a LoRA from a fine-tuned LLM. Could applying LoRAs to layers transform a non-similar layer for better perplexity?
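To make the question concrete: a LoRA is just a low-rank additive update to a layer's weight matrices, so "applying it on the fly" amounts to adding a small delta to the weights you already have loaded. The snippet below is a generic illustration of that math with made-up shapes, not llama.cpp's actual LoRA loading code:

```python
# Rough illustration of the LoRA math (shapes and names are examples only):
# a fine-tune's change to a weight matrix W is approximated by two small
# low-rank matrices A and B, so the adapted weight is W + (alpha / r) * B @ A.
import numpy as np

d_out, d_in, r, alpha = 4096, 4096, 8, 16   # r is the LoRA rank

W = np.random.randn(d_out, d_in).astype(np.float32)   # frozen base weight
A = np.random.randn(r, d_in).astype(np.float32)       # LoRA "down" projection
B = np.zeros((d_out, r), dtype=np.float32)            # LoRA "up" projection

# Applying the adapter is just adding the low-rank delta, which is why it can
# in principle be merged into (or stripped from) individual layers selectively.
W_adapted = W + (alpha / r) * (B @ A)
```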
-
@KerfuffleV2 I was interested in your layer-skipping work and wanted to pose a question: if we can identify the layers that are doing "more or less" the same thing as another layer, and then run that other layer again while it's still loaded into VRAM, can we effectively skip layers by 'substituting' them?
Essentially, what I'm thinking is this: certain layers might share a lot of similarities, or may be nearly identical in how they change the activations in isolation, but the order in which they are applied still inherently matters. For example, if there's a shared group of 'grammar layers' (theoretically speaking; we don't really know precisely what the hidden layers are doing) and every couple of layers the model 'checks for grammar', but those layers all serve the same ultimate 'purpose', that redundancy is what this would aim to exploit.
I recognize I'm probably making a lot of assumptions here, and I could be extremely off on the practicality, but I think it's worth exploring, especially for larger models that might have more 'redundancy'.
The impact of each layer is context-dependent, yes, but how many of the layers are truly 'shared' in what they are attempting to achieve, especially in models with 40+ layers? The fact that the hidden layers are applied sequentially is what led me to this concept.
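If it helps make the idea concrete, here's a minimal sketch of one way to score how "interchangeable" layers look before trying substitution. It assumes a hypothetical `run_layer(i, hidden_state)` helper and per-layer input activations collected from a forward pass; none of this is existing llama.cpp API:

```python
# Sketch: compare what each layer *adds* to the residual stream on the same
# inputs, to find candidate pairs where one layer might stand in for another.
# run_layer(i, h) and hidden_states[i] (activations entering layer i) are
# hypothetical stand-ins for whatever a patched eval loop would expose.
import numpy as np

def layer_delta_similarity(hidden_states, run_layer, n_layers):
    deltas = []
    for i in range(n_layers):
        h = hidden_states[i]                 # activations entering layer i
        deltas.append(run_layer(i, h) - h)   # the residual update layer i applies

    # Cosine similarity between flattened, normalized per-layer updates;
    # high off-diagonal values mark candidate substitution pairs.
    flat = [d.ravel() / (np.linalg.norm(d.ravel()) + 1e-8) for d in deltas]
    return np.array([[float(a @ b) for b in flat] for a in flat])
```

A greedy search like the one in the branch could then try substituting each high-similarity pair instead of skipping, and keep whichever substitutions leave perplexity closest to the baseline.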