Unlimiformer #1357

xaedes · 2023-05-07T19:06:24Z

xaedes
May 7, 2023
Collaborator

From the paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"

Transformer-based models typically have a predefined bound to their input length, because of their need to potentially attend to every token in the input. In this work, we propose Unlimiformer: a general approach that can wrap any existing pretrained encoder-decoder transformer, and offload the attention computation across all layers to a single k-nearest-neighbor index; this index can be kept on either the GPU or CPU memory and queried in sub-linear time. This way, we can index extremely long input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We demonstrate Unlimiformers's efficacy on several long-document and multi-document summarization benchmarks, showing that it can summarize even 350k token-long inputs from the BookSum dataset, without any input truncation at test time. Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code.

This probably trades away some quality of the generation, but unlimited length input sounds fantastic. No training required and can work with any pretrained model.

Would be nice to have for llama! Biggest hurdle would probably be the knn index.

Green-Sky · 2023-05-07T20:14:42Z

Green-Sky
May 7, 2023
Collaborator

their code is here https://github.com/abertsch72/unlimiformer

0 replies

FNsi · 2023-05-08T10:47:45Z

FNsi
May 8, 2023

~~"encoder-decoder" let me assume that may not working on llama?~~

0 replies

SlyEcho · 2023-05-08T11:01:52Z

SlyEcho
May 8, 2023
Collaborator

It mentions this paper: https://openreview.net/forum?id=TrjbxzRcnf-

They put the lookup on only one layer and there are some parameters also that need training.

1 reply

FNsi May 8, 2023

I see the paper

4.2 EXPERIMENTAL METHOD

We used a 12-layer decoder-only transformer (with and without Transformer-XL cache) with an embedding size of 1024, 8 attention heads of dimension 128, and an FFN hidden layer of size 4096. For all of our experiments, we used k = 32. Unless specified otherwise, we use the 9th layer as the kNN augmented attention layer. We used a sentence-piece (Kudo & Richardson, 2018) tokenizer with a vocabulary size of 32K.

So that should also work with llama.

ggerganov · 2023-05-08T17:08:14Z

ggerganov
May 8, 2023
Maintainer

Looks interesting - we should probably make a PoC

Does anyone understand if the kNN works per-token or per-context or something in-between?

For example, let's assume a LLaMA model with 2048 context size.
We create a datastore for a context of 32*2048 tokens by encoding 32 blocks of 2048 tokens.
Then for each decoder layer, do we search across the 32 blocks and select the best one, or do we search across all 65536 tokens independently from their block participation and select the top 2048.

2 replies

SlyEcho May 8, 2023
Collaborator

It is more like there are 32 KV caches and you select the closest one. This is what Wu describes.

Unlimiformer:

The use of a datastore for the encoded tokens, pioneered by Wu et al. (2022), increases the maximum
input length significantly. However, this naïve approach requires constructing separate datastores for
the attention keys and values at each layer and each head, for a total of 2 × L × H datastores, where L
is the number of decoder layers and H is the number of attention heads. A separate datastore for
each attention head in each decoder layer would be both time-intensive to create and space-intensive to
store. So, not surprisingly, Wu et al. (2022) apply their memory layer to only a single decoder layer.

Then they say their version is more efficient because they are doing the lookup per attention head

Thus, rather than constructing 2 × L × H datastores and retrieving from every datastore during
each decoding step, we construct a single data storeand retrieve from it by just projecting the decoder
hidden states to per-head $h_d W_q W_k^T$

So that means they operate with the attention head vector $[128]$. I am unsure how this maps to LLaMa's decoder only architecture? Is $h_e$ same as $h_d$?

FNsi May 20, 2023

The author mentioned in Github issue Question about decoder models #2

There's nothing that practically prevents this from being applied to decoder-only models.
Just implementation and making sure that the right tensors fall into the right locations.

transformer blog

On a final side-note, auto-regressive models, such as GPT2, have the same architecture as transformer-based decoder models if one removes the cross-attention layer because stand-alone auto-regressive models are not conditioned on any encoder outputs. So auto-regressive models are essentially the same as auto-encoding models but replace bi-directional attention with uni-directional attention.

also

Similar to RNN-based encoder-decoder models, the transformer-based encoder-decoder models define a conditional distribution of target vectors $$(\mathbf{Y}{1:n})$$ given an input sequence $$(\mathbf{X}{1:n})$$:

$$ p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{Y}{1:m} | \mathbf{X}{1:n}). $$

The transformer-based encoder part encodes the input sequence $$(\mathbf{X}{1:n}) $$to a sequence of hidden states $$(\mathbf{\overline{X}}{1:n})$$, thus defining the mapping:

$$ f_{\theta_{\text{enc}}}: \mathbf{X}{1:n} \to \mathbf{\overline{X}}{1:n}$$.

The transformer-based decoder part then models the conditional probability distribution of the target vector sequence $$(\mathbf{Y}{1:n})$$ given the sequence of encoded hidden states $$(\mathbf{\overline{X}}{1:n})$$:

$$ p_{\theta_{dec}}(\mathbf{Y}{1:n} | \mathbf{\overline{X}}{1:n}).$$

By Bayes' rule, this distribution can be factorized to a product of conditional probability distribution of the target vector $$(\mathbf{y}i)$$ given the encoded hidden states $$(\mathbf{\overline{X}}{1:n})$$ and all previous target vectors $$(\mathbf{Y}_{0:i-1})$$:

$$ p_{\theta_{dec}}(\mathbf{Y}{1:n} | \mathbf{\overline{X}}{1:n}) = \prod_{i=1}^{n} p_{\theta_{\text{dec}}}(\mathbf{y}i | \mathbf{Y}{0: i-1}, \mathbf{\overline{X}}_{1:n}). $$

The transformer-based decoder hereby maps the sequence of encoded hidden states $$(\mathbf{\overline{X}}{1:n})$$ and all previous target vectors $$(\mathbf{Y}{0:i-1})$$ to the logit vector $$(\mathbf{l}i)$$. The logit vector $$(\mathbf{l}i)$$ is then processed by the softmax operation to define the conditional distribution $$(p{\theta{\text{dec}}}(\mathbf{y}i | \mathbf{Y}{0: i-1}$$, $$\mathbf{\overline{X}}_{1:n}))$$, just as it is done for RNN-based decoders. However, in contrast to RNN-based decoders, the distribution of the target vector $$(\mathbf{y}_i)$$ is explicitly (or directly) conditioned on all previous target vectors $$(\mathbf{y}0, \ldots, \mathbf{y}{i-1})$$ as we will see later in more detail. The 0th target vector$$ (\mathbf{y}_0) $$is hereby represented by a special "begin-of-sentence" $$(\text{BOS}) $$ vector.

Having defined the conditional distribution $$(p_{\theta_{\text{dec}}}(\mathbf{y}i | \mathbf{Y}{0: i-1}, \mathbf{\overline{X}}{1:n}))$$, we can now auto-regressively generate the output and thus define a mapping of an input sequence $$(\mathbf{X}{1:n})$$ to an output sequence $$(\mathbf{Y}_{1:m})$$ at inference.

FNsi · 2023-05-09T15:53:20Z

FNsi
May 9, 2023

Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:

I saw mpt-7b mention that AliBi, any clue of it?

3 replies

SlyEcho May 9, 2023
Collaborator

It requires the model be trained with it from the beginning. Probably not useful for LLaMa. Unlimiformer uses an external memory, but I have doubts how it would work in LLaMa as well since it stores and retrieves the encoder hidden vector in the datastore, but LLaMa is decoder-only.

FNsi May 9, 2023

It requires the model be trained with it from the beginning. Probably not useful for LLaMa. Unlimiformer uses an external memory, but I have doubts how it would work in LLaMa as well since it stores and retrieves the encoder hidden vector in the datastore, but LLaMa is decoder-only.

~~I see, and their words shows that Alibi could be use in a funetuning step?~~

~~And the uniformer, which also require training for better results.~~

~~If we take a conclusion here, it shows loras support long context are coming soon?~~

~~And I suppose that would be high priory let GGML to be ready for all those long ctx loras?~~

FNsi May 9, 2023

I see #1348, wondering if that architecture could be hacked?

xaedes · 2023-05-20T16:23:37Z

xaedes
May 20, 2023
Collaborator Author

I think we could integrate an knn index in llama.cpp by modifing the K and V matrices in eval_internal:

https://github.com/ggerganov/llama.cpp/blob/b8ee340abe4a46143eb8cc4a135b976856c9136c/llama.cpp#L1260-L1305

Q shape [n_embd/n_head, N, n_head] is the roped query for each new token that we also use to query the knn index.
K shape [n_embd/n_head, n_past + N, n_head] are the roped keys for all tokens that are also used as keys to insert in the knn index.
KQ shape [n_past + N, N, n_head] measures how strongly each token key correlates with each query with dot(key_row, query_row).
V shape [n_past + N, n_embd/n_head, n_head] are the values for each token, each column corresponding to a key row.
KQV shape [n_embd/n_head, N, n_head] is the result for each new token of the query lookup Q in the K,V 'database', by activating the values based on KQ.

To enhance this with an knn index we could insert K,V pairs, as calculated above, in the index. For example when they get pushed out of n_ctx.

Note that K is roped.. This makes things a bit ambiguous... for which position in context window (each resulting in different rope) should it be inserted? Maybe for all? Seems wasteful, maybe some regular samples over context length. For example roped at positions [0, n_ctx*1/4, n_ctx*2/4, n_ctx*3/4]

Then we modify K and V in one or more layers.
Lookup n_knn items for the queries Q. Note that there are many queries.
This returns K_knn,V_knn from the knn index.
These are tensors of shape K_knn [n_embd/n_head, n_knn] and V_knn [n_knn, n_embd/n_head].

Augment K and V by concatinating K_knn and V_knn before K_knn:
K := concat(K_knn, K, axis=1) resulting in shape [n_embd/n_head, n_knn + n_past + N, n_head]
V := concat(V_knn, V, axis=0) resulting in shape [n_knn + n_past + N, n_embd/n_head, n_head]

Then proceed with KQV as in original code.

The whole KQV setup as above is in itself some kind of K,V database.
So maybe we could use similar for simulating a knn_index in a proof-of-concept without external dependencies.
Something like this:

Make big K_index tensor of shape [n_embd/n_head, n_index].
Make big V_index tensor of shape [n_index, n_embd/n_head].

Insert new K,V vector pairs by writing into key rows and value columns at current size and increase size by one.

For each key row, track some statistics about usage of the entries so we have some data to decide which to replace when the index is full.

To query the index:
Compute K_indexQ = K_index*Q with shape [n_index, N, n_head]
Sum or max columns of K_indexQ to measure how active each index entry is. Resulting in shape [n_index, 1, n_head]
Sort by index activity to select top n_knn corresponding rows of K and columns of V.
Update usage statistics.
Resulting in K_knn of shape [n_embd/n_head, n_knn, n_head] and V_knn of shape [n_knn, n_embd/n_head, n_head]

Computing K_indexQ results in a big matrix multiplication when evaluating a lot of new tokens at once, in this case this can will be a bottleneck. But when only generating a single token each time (N = 1) it should be ok.

0 replies

okpatil4u · 2023-05-20T16:57:09Z

okpatil4u
May 20, 2023

That is pretty impressive. But do you think if this should come under scope of this repo ? You could already use something like Langchain or Vector search engine in front of Llama.cpp to achieve this.

…

On Sat, 20 May 2023 at 9:53 PM, xaedes ***@***.***> wrote: I think we could integrate an knn index in llama.cpp by modifing the K and V matrices in eval_internal: https://github.com/ggerganov/llama.cpp/blob/b8ee340abe4a46143eb8cc4a135b976856c9136c/llama.cpp#L1260-L1305 Q shape [n_embd/n_head, N, n_head] is the roped query for each new token that we also use to query the knn index. K shape [n_embd/n_head, n_past + N, n_head] are the roped keys for all tokens that are also used as keys to insert in the knn index. KQ shape [n_past + N, N, n_head] measures how strongly each token key correlates with each query with dot(key_row, query_row). V shape [n_past + N, n_embd/n_head, n_head] are the values for each token, each column corresponding to a key row. KQV shape [n_embd/n_head, N, n_head] is the result for each new token of the query lookup Q in the K,V 'database', by activating the values based on KQ. To enhance this with an knn index we could insert K,V pairs, as calculated above, in the index. For example when they get pushed out of n_ctx. Note that K is roped.. This makes things a bit ambiguous... for which position in context window (each resulting in different rope) should it be inserted? Maybe for all? Seems wasteful, maybe some regular samples over context length. For example roped at positions [0, n_ctx*1/4, n_ctx*2/4, n_ctx*3/4] Then we modify K and V in one or more layers. Lookup n_knn items for the queries Q. Note that there are many queries. This returns K_knn,V_knn from the knn index. These are tensors of shape K_knn [n_embd/n_head, n_knn] and V_knn [n_knn, n_embd/n_head]. Augment K and V by concatinating K_knn and V_knn before K_knn: K := concat(K_knn, K, axis=1) resulting in shape [n_embd/n_head, n_knn + n_past + N, n_head] V := concat(V_knn, V, axis=0) resulting in shape [n_knn + n_past + N, n_embd/n_head, n_head] Then proceed with KQV as in original code. ------------------------------ The whole KQV setup as above is in itself some kind of K,V database. So maybe we could use similar for simulating a knn_index in a proof-of-concept without external dependencies. Something like this: Make big K_index tensor of shape [n_embd/n_head, n_index]. Make big V_index tensor of shape [n_index, n_embd/n_head]. Insert new K,V vector pairs by writing into key rows and value columns at current size and increase size by one. For each key row, track some statistics about usage of the entries so we have some data to decide which to replace when the index is full. To query the index: Compute K_indexQ = K_index*Q with shape [n_index, N, n_head] Sum or max columns of K_indexQ to measure how active each index entry is. Resulting in shape [n_index, 1, n_head] Sort by index activity to select top n_knn corresponding rows of K and columns of V. Update usage statistics. Resulting in K_knn of shape [n_embd/n_head, n_knn, n_head] and V_knn of shape [n_knn, n_embd/n_head, n_head] Computing K_indexQ results in a big matrix multiplication when evaluating a lot of new tokens at once, in this case this can will be a bottleneck. But when only generating a single token each time (N = 1) it should be ok. — Reply to this email directly, view it on GitHub <#1357 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXGU4AVTLCEVMHJYIUOTPLXHDVZNANCNFSM6AAAAAAXZCFDXI> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.*** com>

1 reply

FNsi May 21, 2023

Transformer also can use cpu, and doesn't matter us to choose ggml.

FNsi · 2023-05-22T03:06:30Z

FNsi
May 22, 2023

Maybe a small llm can use to locate the tensors in the right place so extend the context length?

0 replies

rtkclouds · 2023-05-23T07:24:29Z

rtkclouds
May 23, 2023

I have been developing unconventional transformer platforms using segmentation logic for large contexts and low processing.
I use some cryptographic techniques such as addler-type sums to generate intermediate tokens and segment the data, I believe that they are making mistakes in the technology when viewing the data, I believe in only one transformer, being used stacked as a projector and stacking the readings , but multiplying by a dimensional variable that changes the intensity and importance of the detection itself.
Regarding the autonomous detection of open domains, I have anticipated the ideas of the bigtechs and I believe that the way out is in the union of the current system, with a better data access capacity. If you're interested, I can post some prototypes, I've never participated in anything on github, because I've noticed that bigtechs often purposely use them to strangle innovations and other points of view. Much of what I develop is completely the opposite of what is used, often seeming that what is done is taught wrong on purpose to delay development. But I've been enjoying the development and the purpose that people here have had, if anyone is interested just leave a reply or message and I'll send

1 reply

rtkclouds May 23, 2023

I believe that the use of recurrence with a module application manages to propagate the characteristics of the data through simple operations, in the same way that addler32 is not used because it has a vulnerability, this vulnerability, if exploited, can add similar items as I have been researching, especially when used as a sublayer, it can normalize quickly when converted back into an embbed from a token generated from a dimension. This has no lines of research, and I believe it has a lot to use, at least I have used it quite often, and it has replaced with much better results, all types of normalization, not to mention the speed of convergence

rtkclouds · 2023-05-25T19:30:26Z

rtkclouds
May 25, 2023

in the last months I have been working with the creation of autonomous learning methods. one that is working very well is a different way of structuring the data.
I used 50 billion tokens to generate a frequency of the words, with the frequency I used a sum as if it were addler9 to generate an autonomous class from the rounded number of the accumulated sum. with that I recorded the values that were most recurrent for each word. After normalizing the frequency from 0 to 10, in 10 scales, for each word I generated a corresponding vector of size 1024 for the addle and 1024 for the frequency.
every precision word in adder's class result has a probability equal to 90% of its accuracy to be shown as a class and not as a word.
every word checks if there is a class (word that satisfied the previously mentioned conditional) and adds the value in the vector at a distance of up to 1024 positions, since this class is non-recurring it adds the value of the class to the vector of the class (doing the same scheme of bellman equation to keep the value without exploding)
the result of this is that the sum of the Adler, due to the failure of conflict by recurrent sequences, ends up in conjunction with frequency and randomness, generating a prediction capacity based on class + frequency that matches the newest technologies, I have been making an effort in this point and in creating fractal layers, where a single parameter updates the entire layer by the equation, central parameters update the equations. This allows you to run models the size of the lamma on the cpu, for example. Other points that I have studied are obtaining the result using 100% random networks, which if you want I can explain better how they work, they are not based on data to learn, it is a kind of one armed bandit that knows less than them, and that has had very good results. but it seems that because it is a blackbox that aims at results and not how to obtain them, people avoid wanting to be interested in the subject kk

0 replies

SharkWipf · 2023-08-21T23:55:43Z

SharkWipf
Aug 21, 2023

I just noticed this discussion, and wanted to point out that Unlimiformer as of recently supports Llama-2 natively, so if any implementation details were unclear before, there's functional reference now.

0 replies

shi-kejian · 2023-09-08T01:16:46Z

shi-kejian
Sep 8, 2023

https://github.com/abertsch72/unlimiformer

They just added support!

1 reply

Green-Sky Sep 8, 2023
Collaborator

(support for llama2)

Unlimiformer #1357

Uh oh!

xaedes May 7, 2023 Collaborator

Replies: 12 comments · 9 replies

Uh oh!

Green-Sky May 7, 2023 Collaborator

Uh oh!

Uh oh!

Uh oh!

SlyEcho May 8, 2023 Collaborator

Uh oh!

Uh oh!

Uh oh!

ggerganov May 8, 2023 Maintainer

Uh oh!

Uh oh!

SlyEcho May 8, 2023 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SlyEcho May 9, 2023 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xaedes May 20, 2023 Collaborator Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Green-Sky Sep 8, 2023 Collaborator

xaedes
May 7, 2023
Collaborator

Replies: 12 comments 9 replies

Green-Sky
May 7, 2023
Collaborator

SlyEcho
May 8, 2023
Collaborator

ggerganov
May 8, 2023
Maintainer

SlyEcho May 8, 2023
Collaborator

SlyEcho May 9, 2023
Collaborator

xaedes
May 20, 2023
Collaborator Author

Green-Sky Sep 8, 2023
Collaborator