Assistance Needed: Reason for significantly lower context length limit with LlamaIndex vs appending raw text? #10607
Replies: 2 comments 1 reply
-
🤖 Hey @ronan-zelta, good to see you back with us! Diving into another challenge, I see.

The issue you're encountering seems to be related to the token limit of the `ChatMemoryBuffer`. The error you're seeing suggests that the total number of tokens in your chat history and prompt is exceeding the `token_limit` of the `ChatMemoryBuffer`. To resolve this issue, you could try the following:
Here's an example of how you could increase the `token_limit`:

```python
from llama_index.core.memory import ChatMemoryBuffer

# Increase the token limit of the chat memory buffer
memory = ChatMemoryBuffer.from_defaults(token_limit=1000)  # or any value that suits your needs
```

Remember that the `token_limit` you choose still needs to fit within your model's context window. I hope this helps! If you have any more questions or need further clarification, feel free to ask.
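For completeness, here's a minimal sketch of how such a buffer could be wired into a chat engine; the `index` variable and the `chat_mode` value are assumptions rather than details from this thread:

```python
from llama_index.core.memory import ChatMemoryBuffer

# Assumption: `index` is an existing VectorStoreIndex built from the .txt files.
memory = ChatMemoryBuffer.from_defaults(token_limit=100_000)  # roomy limit for a 128K-context model

chat_engine = index.as_chat_engine(
    chat_mode="context",  # retrieval-augmented chat that injects retrieved nodes as context
    memory=memory,
)
response = chat_engine.chat("What do the documents say about X?")
print(response)
```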
-
@ronan-zelta hmm, I think you might need to set the token limit for the memory buffer higher
-
TL;DR: While transitioning a RAG-based chatbot to Llama Index, I ran into a token limit error with `similarity_top_k` at 500. Reducing it to 80 avoids the error, but it's unclear why Llama Index fits fewer context datapoints than appending raw text did. As far as I can tell there isn't significant metadata being passed to the prompt. Would appreciate any insights on managing token limits with Llama Index and debugging the text passed to LLM prompts. Thank you! 🙏
I am currently transitioning my RAG-based chatbot app from a manual, custom RAG approach to Llama Index.

The data points I am embedding are fairly short strings (1-3 sentences each, none exceeding 100 words). I have not broken my documents down into smaller chunks, because each of these strings is stored in its own .txt file. This approach could probably be improved, but for now I just want to focus on the main issue (to do with context length).
My previous approach involved embedding a user provided query and finding the 500 most similar embeddings, then appending their corresponding strings to a prompt, along with a UID associated with each string for use later. This approach worked quite well and I didn't have any problem with hitting the 128K token context limit.
I've been trying to transition to Llama Index, but I'm having issues with hitting the token limit. I created a custom retriever and set the `similarity_top_k` value to 500 (see the full retriever object below).
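For reference, a retriever along these lines would typically look something like the following (a simplified sketch rather than the full retriever object referenced above; the `index` variable is assumed):

```python
from llama_index.core.retrievers import VectorIndexRetriever

# Simplified sketch: `index` is assumed to be an existing VectorStoreIndex.
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=500,  # number of nearest-neighbour nodes returned per query
)
```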
But when I try calling `chat_engine.chat`, I'm getting the following error. I keep seeing this error unless I reduce `similarity_top_k` from 500 down to 80.

My understanding is that Llama Index passes some metadata to the prompt along with the main retrieved data. But I don't think this is accounting for much of the context length, because as far as I can tell only the relatively short file path is getting passed to the LLM, as per inspection of `source_nodes` below.

Would really appreciate it if anyone could point me in the right direction as to why the number of context datapoints that can fit in a prompt is so much lower using Llama Index vs appending raw text to the prompt. Or if anyone could help me with debugging exactly what text is getting passed to the LLM prompt?
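One way to check this is to render the retrieved nodes the way the LLM sees them, and to log the raw prompts; a minimal sketch, assuming `retriever` is the retriever configured with `similarity_top_k=500`:

```python
from llama_index.core import set_global_handler
from llama_index.core.schema import MetadataMode

# Print every LLM input/output to stdout so the exact prompt text is visible.
set_global_handler("simple")

# Assumption: `retriever` is the custom retriever with similarity_top_k=500.
nodes = retriever.retrieve("example query")

# Render each node as it would be serialized for the LLM
# (text plus any metadata not listed in excluded_llm_metadata_keys).
total_chars = 0
for node_with_score in nodes:
    llm_text = node_with_score.node.get_content(metadata_mode=MetadataMode.LLM)
    total_chars += len(llm_text)

print(f"{len(nodes)} nodes, ~{total_chars} characters of retrieved context")
```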
I've debugged my `source_nodes` and made sure that all of my text data has been added to the vector index as expected, and it all looks good. I am using `llm=OpenAI(model="gpt-4-0125-preview")` everywhere possible to ensure that this isn't caused by Llama Index defaulting to GPT-3.5 with its shorter context window.

Here are some definitions of the other Llama Index objects I'm using, for clarification.
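For reference, a global way to pin the model (rather than passing `llm=...` to each component) looks something like this; a sketch, assuming the standard OpenAI LLM integration:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Pin the LLM globally so nothing silently falls back to the default GPT-3.5 model
# with its much smaller context window.
Settings.llm = OpenAI(model="gpt-4-0125-preview")
```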