My understanding is that unless --batch-size matches the prompt length, the model will not evaluate all of the prompt tokens and may therefore not consider all the information in the prompt when generating a response. In one particular use case, I feed the model a list of news summaries and want an overall conclusion based on all of them. If I use a --batch-size shorter than the prompt length, the model will not actually look at all the summaries; instead it will slide a --batch-size window across the prompt and summarise only the tokens that fit inside that window. If this is correct, then --batch-size is highly problematic: it would in effect be no different from using embeddings and a search over a vector store, just without actually creating and indexing the embeddings.
When a token is evaluated, the result is stored in the KV cache, and all previously evaluated tokens are considered when generating a new token. It doesn't matter whether you evaluate the prompt one token at a time or in batches of any size; the resulting KV cache is the same.
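A minimal sketch of the idea, assuming toy dimensions and stand-in key/value projections (plain NumPy, not llama.cpp's actual implementation): the batch size only controls how many tokens are appended to the KV cache per step, so the final cache, and therefore what later tokens attend to, is identical whether the prompt is processed one token at a time or in larger batches.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompt = 8, 16                      # hypothetical toy sizes
W_k = rng.normal(size=(d_model, d_model))      # stand-in key projection
W_v = rng.normal(size=(d_model, d_model))      # stand-in value projection
prompt = rng.normal(size=(n_prompt, d_model))  # pretend token embeddings

def evaluate(tokens, batch_size):
    """Append keys/values to a KV cache, batch_size tokens at a time."""
    k_cache, v_cache = [], []
    for start in range(0, len(tokens), batch_size):
        chunk = tokens[start:start + batch_size]  # current batch of tokens
        k_cache.append(chunk @ W_k)               # keys for this batch
        v_cache.append(chunk @ W_v)               # values for this batch
    return np.vstack(k_cache), np.vstack(v_cache)

# Evaluate the same prompt token-by-token and in batches of 4.
k_one, v_one = evaluate(prompt, batch_size=1)
k_big, v_big = evaluate(prompt, batch_size=4)

# The caches match, so attention over them (and thus generation) is the same.
assert np.allclose(k_one, k_big) and np.allclose(v_one, v_big)
print("KV caches identical regardless of batch size")
```

In other words, --batch-size is a throughput/memory knob for prompt evaluation, not a window over which information the model can see.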