Assistance Needed: Reason for significantly lower context length limit with LlamaIndex vs appending raw text? #10607
Replies: 2 comments 1 reply
-
🤖 Hey @ronan-zelta, good to see you back with us! Diving into another challenge, I see.

The issue you're encountering seems to be related to the token limit of the `ChatMemoryBuffer`. The error you're seeing suggests that the total number of tokens in your chat history and prompt is exceeding the `token_limit` of the `ChatMemoryBuffer`. To resolve this issue, you could try the following:
Here's an example of how you could increase the `token_limit`:

```python
from llama_index.core.memory import ChatMemoryBuffer

# Increase the token limit of the chat memory buffer
memory = ChatMemoryBuffer.from_defaults(token_limit=1000)  # or any value that suits your needs
```

Remember that the `token_limit` you choose still needs to fit within your model's context window. I hope this helps! If you have any more questions or need further clarification, feel free to ask.
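For completeness, here's a minimal sketch of how such a buffer could be wired into a chat engine; the `index` variable and the `chat_mode` value are assumptions rather than details from this thread:

```python
from llama_index.core.memory import ChatMemoryBuffer

# Assumption: `index` is an existing VectorStoreIndex built from the .txt files.
memory = ChatMemoryBuffer.from_defaults(token_limit=100_000)  # roomy limit for a 128K-context model

chat_engine = index.as_chat_engine(
    chat_mode="context",  # retrieval-augmented chat that injects retrieved nodes as context
    memory=memory,
)
response = chat_engine.chat("What do the documents say about X?")
print(response)
```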
-
@ronan-zelta hmm, I think you might need to set the token limit for the memory buffer higher
-
TL;DR: While transitioning a RAG-based chatbot to Llama Index, I ran into a token limit error with `similarity_top_k` at 500. Reducing it to 80 avoids the error, but it's unclear why Llama Index fits fewer context datapoints than appending raw text did. As far as I can tell there isn't significant metadata being passed to the prompt. Would appreciate any insights on managing token limits with Llama Index and debugging the text passed to LLM prompts. Thank you! 🙏
I am currently transitioning my RAG-based chatbot app from a manual, custom RAG approach to Llama Index.

The data points I am embedding are fairly short strings (1-3 sentences each, none exceeding 100 words). I have not broken my documents down into smaller chunks, because each of these strings is stored in its own .txt file. This approach could probably be improved, but for now I just want to focus on the main issue (to do with context length).
My previous approach involved embedding a user provided query and finding the 500 most similar embeddings, then appending their corresponding strings to a prompt, along with a UID associated with each string for use later. This approach worked quite well and I didn't have any problem with hitting the 128K token context limit.
I've been trying to transition to Llama Index, but I'm having issues with hitting the token limit. I created a custom retriever and set the `similarity_top_k` value to 500 (see the full retriever object below).
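For reference, a retriever along these lines would typically look something like the following (a simplified sketch rather than the full retriever object referenced above; the `index` variable is assumed):

```python
from llama_index.core.retrievers import VectorIndexRetriever

# Simplified sketch: `index` is assumed to be an existing VectorStoreIndex.
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=500,  # number of nearest-neighbour nodes returned per query
)
```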
But when I try calling `chat_engine.chat`, I'm getting the following error. I keep seeing this error unless I reduce `similarity_top_k` from 500 down to 80.

My understanding is that Llama Index passes some metadata to the prompt along with the main retrieved data. But I don't think this is accounting for much of the context length, because as far as I can tell only the relatively short file path is getting passed to the LLM, as per inspection of `source_nodes` below.

Would really appreciate it if anyone could point me in the right direction as to why the number of context datapoints that can fit in a prompt is so much lower using Llama Index vs appending raw text to the prompt. Or if anyone could help me with debugging exactly what text is getting passed to the LLM prompt?
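One way to check this is to render the retrieved nodes the way the LLM sees them, and to log the raw prompts; a minimal sketch, assuming `retriever` is the retriever configured with `similarity_top_k=500`:

```python
from llama_index.core import set_global_handler
from llama_index.core.schema import MetadataMode

# Print every LLM input/output to stdout so the exact prompt text is visible.
set_global_handler("simple")

# Assumption: `retriever` is the custom retriever with similarity_top_k=500.
nodes = retriever.retrieve("example query")

# Render each node as it would be serialized for the LLM
# (text plus any metadata not listed in excluded_llm_metadata_keys).
total_chars = 0
for node_with_score in nodes:
    llm_text = node_with_score.node.get_content(metadata_mode=MetadataMode.LLM)
    total_chars += len(llm_text)

print(f"{len(nodes)} nodes, ~{total_chars} characters of retrieved context")
```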
I've debugged my `source_nodes` and made sure that all of my text data has been added to the vector index as expected, and it all looks good. I am using `llm=OpenAI(model="gpt-4-0125-preview")` everywhere possible to ensure that this isn't caused by Llama Index defaulting to GPT-3.5 with its shorter context window.

Here are some definitions of the other Llama Index objects I'm using, for clarification.
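For reference, a global way to pin the model (rather than passing `llm=...` to each component) looks something like this; a sketch, assuming the standard OpenAI LLM integration:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Pin the LLM globally so nothing silently falls back to the default GPT-3.5 model
# with its much smaller context window.
Settings.llm = OpenAI(model="gpt-4-0125-preview")
```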