Replacing the Hugging Face LlamaDecoderLayer Class With a New LongNet Layer #94
Replies: 1 comment
-
Thanks for posting, but questions about the Hugging Face transformers library would be out of scope for this book. I think this question might be a better fit for the Hugging Face forums.
-
Hey, I hope you are doing great this weekend!
I would like to ask you a technical question, please.
I am working on the CodeLlama model, which uses a decoder-only transformer architecture. My main task is to replace the decoder-only blocks, which use masked self-attention and a KV cache, with my own encoder-only blocks that use the dilated attention from LongNet (my code is based on it).
I planned to replace the LlamaDecoderLayer block with an encoder-only block: the original decoder-only block used in CodeLlama is swapped for my own class, which inherits from the Hugging Face base class. Here is the process I followed to replace it with an encoder-only block (for reference, a small sketch of how to inspect the original layer structure follows).
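A minimal way to look at the original decoder-only block, assuming the standard `codellama/CodeLlama-7b-hf` checkpoint (the exact checkpoint is an assumption here):

```python
from transformers import AutoModelForCausalLM

# load the pretrained CodeLlama checkpoint (checkpoint name is an assumption)
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

# each of the 32 blocks is a LlamaDecoderLayer containing self_attn
# (q_proj, k_proj, v_proj, o_proj), the mlp, and two RMSNorm layers
print(model.model.layers[0])
```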
Step 1: Inherit from LlamaConfig to add the new configuration parameters needed by my encoder model, which uses dilated multi-head attention.
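A minimal sketch of such a config subclass (the parameter names `segment_lengths`, `dilation_rates`, and `use_flash_attention_2`, and their defaults, are illustrative placeholders rather than the exact ones used):

```python
from transformers import LlamaConfig


class DilatedLlamaConfig(LlamaConfig):
    """LlamaConfig extended with the extra parameters the dilated-attention
    encoder needs (names are placeholders)."""

    model_type = "dilated_llama"

    def __init__(
        self,
        segment_lengths=(2048, 4096, 8192),
        dilation_rates=(1, 2, 4),
        use_flash_attention_2=False,
        **kwargs,
    ):
        # keep all original CodeLlama parameters, then add the new ones
        super().__init__(**kwargs)
        self.segment_lengths = list(segment_lengths)
        self.dilation_rates = list(dilation_rates)
        self.use_flash_attention_2 = use_flash_attention_2
```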
Step 2: The only part I want to replace is self_attn. My own multi-head dilated attention follows the LongNet mechanism. Using flash_attention_2 for the dilated attention is optional and depends on the GPU architecture (e.g., an A100 supports it, a T4 does not). The multi-head dilated attention is sketched below.
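A minimal, simplified sketch of LongNet-style dilated attention: the sequence is split into segments of length w, only every r-th position inside a segment attends to the others, and the branches for the different (w, r) pairs are averaged (LongNet itself weights the branches by their softmax denominators and spreads offsets across heads). `torch.nn.functional.scaled_dot_product_attention` is used so flash attention is picked up automatically on GPUs that support it (e.g., A100), with a fallback to the standard kernels (e.g., on a T4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiheadDilatedAttention(nn.Module):
    """Simplified LongNet-style dilated attention (non-causal)."""

    def __init__(self, hidden_size, num_heads, segment_lengths, dilation_rates):
        super().__init__()
        assert len(segment_lengths) == len(dilation_rates)
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.segment_lengths = segment_lengths
        self.dilation_rates = dilation_rates
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def _dilated_branch(self, q, k, v, w, r):
        # q, k, v: (batch, heads, seq, head_dim); seq is assumed divisible by w
        b, h, n, d = q.shape
        # split into segments of length w, then keep every r-th position
        q = q.reshape(b, h, n // w, w, d)[..., ::r, :]
        k = k.reshape(b, h, n // w, w, d)[..., ::r, :]
        v = v.reshape(b, h, n // w, w, d)[..., ::r, :]
        # full (bidirectional) attention inside each sparsified segment;
        # SDPA dispatches to flash attention when the hardware supports it
        out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
        # scatter the attended positions back into a dense (b, h, n, d) tensor
        dense = torch.zeros(b, h, n // w, w, d, dtype=out.dtype, device=out.device)
        dense[..., ::r, :] = out
        return dense.reshape(b, h, n, d)

    def forward(self, hidden_states):
        b, n, _ = hidden_states.shape

        def split_heads(x):
            return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(hidden_states))
        k = split_heads(self.k_proj(hidden_states))
        v = split_heads(self.v_proj(hidden_states))
        # average the (segment length, dilation rate) branches; LongNet uses a
        # softmax-denominator-weighted sum, this plain mean is a simplification
        out = sum(self._dilated_branch(q, k, v, w, r)
                  for w, r in zip(self.segment_lengths, self.dilation_rates))
        out = out / len(self.segment_lengths)
        out = out.transpose(1, 2).reshape(b, n, self.num_heads * self.head_dim)
        return self.o_proj(out)
```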
Step 3: To replace the layer itself, I inherit from the Hugging Face base class, as sketched below.
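A sketch of the replacement layer, reusing the MultiheadDilatedAttention from above. Note that recent transformers versions construct LlamaDecoderLayer with `(config, layer_idx)`, and the simplified forward below is an assumption: it drops the attention-mask, KV-cache, and position-embedding arguments that the stock layer accepts.

```python
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


class DilatedEncoderLayer(LlamaDecoderLayer):
    """Keeps the LlamaDecoderLayer structure (norms + MLP + residuals) but
    swaps the masked self-attention for the dilated, non-causal attention."""

    def __init__(self, config, layer_idx):
        super().__init__(config, layer_idx)
        # the only replaced part: self_attn
        self.self_attn = MultiheadDilatedAttention(
            hidden_size=config.hidden_size,
            num_heads=config.num_attention_heads,
            segment_lengths=config.segment_lengths,
            dilation_rates=config.dilation_rates,
        )

    def forward(self, hidden_states, **kwargs):
        # pre-norm -> dilated attention -> residual
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = residual + self.self_attn(hidden_states)
        # pre-norm -> MLP -> residual
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = residual + self.mlp(hidden_states)
        return (hidden_states,)
```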
Note: As long as is_causal=None, the attention mechanism is not masked, so the model learns a fully bidirectional representation when producing the embedding space of token vectors. In other words, the encoder-only model learns the relevant feature relationships between all tokens attended to during the dot-product similarity, instead of the masked attention used by the decoder-only model, which I am not interested in at this point.
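To illustrate the difference with `torch.nn.functional.scaled_dot_product_attention` (where the flag is a boolean rather than None):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = k = v = torch.randn(1, 1, 4, 8)

# bidirectional: every token attends to every other token
full = F.scaled_dot_product_attention(q, k, v, is_causal=False)
# masked: token i attends only to tokens 0..i
masked = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(full[:, :, -1], masked[:, :, -1]))  # True: the last token sees everything either way
print(torch.allclose(full[:, :, 0], masked[:, :, 0]))    # False: masked token 0 only sees itself
```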
Step 4: I reconstructed the model using the adjusted config class.
I did the following (see the sketch below). Note: I adjusted num_hidden_layers only for the showcase, setting config.num_hidden_layers = 2; the original parameter is num_hidden_layers = 32.
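In outline, using the DilatedLlamaConfig and DilatedEncoderLayer sketches above (the checkpoint name is again an assumption):

```python
import torch.nn as nn
from transformers.models.llama.modeling_llama import LlamaModel

# load the original CodeLlama config and add the dilated-attention parameters
config = DilatedLlamaConfig.from_pretrained("codellama/CodeLlama-7b-hf")
config.num_hidden_layers = 2   # reduced from 32 only for the showcase

# build a fresh LlamaModel, then swap every decoder layer for the encoder layer
encoder = LlamaModel(config)
encoder.layers = nn.ModuleList(
    DilatedEncoderLayer(config, layer_idx=i)
    for i in range(config.num_hidden_layers)
)
print(encoder)
```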
Note: I did not use rotary embeddings, because the attention I use is linear (dilated).
Question 1: Please correct me if I need to keep the rotary embeddings in my encoder-only model.
Final step: transfer the pretrained weights of the ["q_proj", "k_proj", "v_proj", "o_proj"] layers from the decoder-only model to the encoder-only model. Comparing the new encoder-only model (CodeLlama with my adjustments) against the decoder-only model used in CodeLlama, both have the same linear layers ["q_proj", "k_proj", "v_proj", "o_proj"], so the weights can be copied directly. The code I built to transfer the weights is sketched below.
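A sketch of the weight transfer, where `decoder` is the pretrained CodeLlama model and `encoder` is the rebuilt model from the previous sketch; it assumes num_key_value_heads equals num_attention_heads (true for CodeLlama-7B), so the k/v projection shapes match:

```python
import torch
from transformers import LlamaForCausalLM

PROJ_NAMES = ["q_proj", "k_proj", "v_proj", "o_proj"]

# pretrained decoder-only CodeLlama (checkpoint name is an assumption)
decoder = LlamaForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

with torch.no_grad():
    for enc_layer, dec_layer in zip(encoder.layers, decoder.model.layers):
        for name in PROJ_NAMES:
            src = getattr(dec_layer.self_attn, name).weight
            dst = getattr(enc_layer.self_attn, name).weight
            assert src.shape == dst.shape, f"shape mismatch in {name}"
            dst.copy_(src)   # transfer the pretrained projection weights
        # the MLP and norm weights have identical shapes, so they can be copied too
        enc_layer.mlp.load_state_dict(dec_layer.mlp.state_dict())
        enc_layer.input_layernorm.load_state_dict(dec_layer.input_layernorm.state_dict())
        enc_layer.post_attention_layernorm.load_state_dict(
            dec_layer.post_attention_layernorm.state_dict()
        )

# the embedding table can be reused as well
encoder.embed_tokens.load_state_dict(decoder.model.embed_tokens.state_dict())
```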
Please correct me if I have misunderstood anything about transforming CodeLlama into an encoder-only model to learn the embeddings.
Thank you so much in advance!