
Fix BedrockConverse token count in usage to match OpenAI's format #18383


Open
ArmandBriere wants to merge 2 commits into main

Conversation

ArmandBriere commented Apr 6, 2025

Description

Motivation: I'm currently using LlamaIndex with Bedrock and Langfuse. Llama3 token count tracking is not working with BedrockConverse; I made a fix and explain it below.

I updated the raw output of the BedrockConverse base class, which had the wrong usage format.

Looking at the Langfuse traces, we can see that the previous implementation was simply appending the prompt_tokens to the end of the traces.

This is due to the use of additional_kwargs in the ChatResponse object.

Renaming the keys to prompt_tokens, completion_tokens, and total_tokens directly in the response['usage'] dictionary fixes the issue and follows OpenAI's format.
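
In short, Bedrock Converse reports usage as inputTokens/outputTokens/totalTokens, and the fix renames those keys to OpenAI's names. A minimal sketch of the idea (the helper name here is illustrative, not the actual _get_response_token_counts implementation):

def to_openai_usage(bedrock_usage: dict) -> dict:
    """Rename Bedrock Converse usage keys to OpenAI's format (illustrative sketch)."""
    return {
        "prompt_tokens": bedrock_usage.get("inputTokens", 0),
        "completion_tokens": bedrock_usage.get("outputTokens", 0),
        "total_tokens": bedrock_usage.get("totalTokens", 0),
    }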

Here is some Python code that runs BedrockConverse with Langfuse tracing to observe the impact:

from typing import Any, Sequence
from langfuse.llama_index import LlamaIndexInstrumentor
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.llms.bedrock_converse import BedrockConverse
from llama_index.core.program import LLMTextCompletionProgram
from pydantic import BaseModel
from llama_index.llms.bedrock_converse.utils import (
    messages_to_converse_messages,
    converse_with_retry,
)

from llama_index.core.base.llms.types import (
    ChatMessage,
    ChatResponse,
    MessageRole,
)
from llama_index.core.llms.callbacks import (
    llm_chat_callback,
)

# Setup Langfuse Instrumentor to track everything from LlamaIndex
instrumentor: LlamaIndexInstrumentor = LlamaIndexInstrumentor(
    host="http://localhost:3000",
)

# Start the instrumentor
instrumentor.start()


# Pydantic model for the output
class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: list[Song]


class SerialSeven(BaseModel):
    """Data model for the serial number."""

    serial_number: list[int]


class CustomBedrockConverse(BedrockConverse):
    """Custom Bedrock Converse class to override wrong token count."""

    @llm_chat_callback()
    def chat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponse:
        # convert Llama Index messages to AWS Bedrock Converse messages
        converse_messages, system_prompt = messages_to_converse_messages(messages)
        all_kwargs = self._get_all_kwargs(**kwargs)

        # invoke LLM in AWS Bedrock Converse with retry
        response = converse_with_retry(
            client=self._client,
            messages=converse_messages,
            system_prompt=system_prompt,
            max_retries=self.max_retries,
            stream=False,
            guardrail_identifier=self.guardrail_identifier,
            guardrail_version=self.guardrail_version,
            trace=self.trace,
            **all_kwargs,
        )

        content, tool_calls, tool_call_ids, status = self._get_content_and_tool_calls(
            response
        )

        dict_response = dict(response)
        # Add Bedrock's token count to usage dict to match OpenAI's format
        dict_response["usage"] = self._get_response_token_counts(dict_response)

        return ChatResponse(
            message=ChatMessage(
                role=MessageRole.ASSISTANT,
                content=content,
                additional_kwargs={
                    "tool_calls": tool_calls,
                    "tool_call_id": tool_call_ids,
                    "status": status,
                },
            ),
            raw=dict_response,
            additional_kwargs=self._get_response_token_counts(dict(response)),
        )


llm = CustomBedrockConverse(
    model="meta.llama3-8b-instruct-v1:0",
)

prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""

program: LLMTextCompletionProgram = LLMTextCompletionProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=True,
    llm=llm,
)


# Define a function to run the workflow
def run_workflow():
    # Create a Langfuse trace
    with instrumentor.observe():
        output = program(movie_name="The shining")

    return output


if __name__ == "__main__":
    print(run_workflow())

    instrumentor.flush()

Running this code threw the following warning:

python main.py
Usage object must have either {input, output, total, unit} or {promptTokens, completionTokens, totalTokens}
Traceback (most recent call last):
  File "~/llm-playground/.venv/lib/python3.12/site-packages/langfuse/client.py", line 2722, in update
    "usage": _convert_usage_input(usage) if usage is not None else None,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/llm-playground/.venv/lib/python3.12/site-packages/langfuse/utils/__init__.py", line 103, in _convert_usage_input
    raise ValueError(
ValueError: Usage object must have either {input, output, total, unit} or {promptTokens, completionTokens, totalTokens}
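
For context, the error message shows the two usage shapes Langfuse accepts (illustrative values; key names are taken from the error above):

# Either OpenAI-style keys...
usage = {"promptTokens": 42, "completionTokens": 18, "totalTokens": 60}
# ...or the generic form (the "TOKENS" unit value is an assumption)
usage = {"input": 42, "output": 18, "total": 60, "unit": "TOKENS"}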

Here is the Langfuse trace output for the execution above:

[image: Langfuse trace output for the execution above]

The new implementation removes the ValueError from above and fixes the usage values:

[image: Langfuse trace with the fix applied]

Langfuse trace example with OpenAI client:

[image: Langfuse trace with the OpenAI client]

No code was removed, to ensure backward compatibility.

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 6, 2025
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Apr 9, 2025
ArmandBriere (Author)

I'm unable to test the streaming change, since streaming isn't working with Langfuse either before or after the change.
I've tried to debug this using the code below: when calling the stream function, the callback used is not wrapped_async_llm_chat as I would have guessed, but the wrapped_llm_chat decorator. This results in the complete absence of the Assistant logs in Langfuse.

import uuid

from dotenv import load_dotenv
from langfuse.llama_index import LlamaIndexInstrumentor
from llama_index.core.base.llms.types import ChatMessage, ChatResponseGen
from llama_index.llms.bedrock_converse import BedrockConverse

load_dotenv()

# Setup Langfuse Instrumentor to track everything from LlamaIndex
instrumentor: LlamaIndexInstrumentor = LlamaIndexInstrumentor()


# Bedrock model name
model = "meta.llama3-8b-instruct-v1:0"

# Initialize Bedrock LLM
llm = BedrockConverse(
    model=model,
    max_retries=3,
)


def stream_llm_on_bedrock():
    """Stream the LLM output."""
    session_id: str = uuid.uuid4().hex
    user_id: str = uuid.uuid4().hex

    with instrumentor.observe(session_id=session_id, user_id=user_id):
        response: ChatResponseGen = llm.stream_chat(
            messages=[
                ChatMessage(role="user", content="Hello"),
            ]
        )

        for r in response:
            print(r.delta, end="", flush=True)


# Example usage
if __name__ == "__main__":
    # Start the instrumentor
    instrumentor.start()

    stream_llm_on_bedrock()

    # Flush the instrumentor to send all data to Langfuse
    instrumentor.flush()

I might open a new issue for this problem if you don't have a clear fix off the top of your head.

logan-markewich (Collaborator)

stream_chat is a sync function, so using the sync wrapped_llm_chat makes sense.

It could be that Langfuse just doesn't handle token counts on streaming responses?

@@ -367,6 +371,10 @@ def stream_chat(
             **all_kwargs,
         )
 
+        dict_response = dict(response)
+        # Add Bedrock's token count to usage dict to match OpenAI's format
+        dict_response["usage"] = self._get_response_token_counts(dict_response)
Collaborator

You'll actually need to get the usage on each chunk in the stream? I think? You might need to debug what's being emitted in the chunk stream below to get the usage properly.

Collaborator

It can't know the complete usage until the stream is finished, for example.
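
For what it's worth, with Bedrock's ConverseStream API the token totals only arrive in the trailing metadata event, so a streaming wrapper can only fill in usage once the stream ends. A rough sketch of that idea (event shapes follow the ConverseStream API; the generator itself is illustrative, not the actual stream_chat implementation):

def iter_stream_with_usage(response: dict):
    """Yield text deltas and capture usage from the trailing metadata event (sketch)."""
    usage: dict = {}
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            # text chunks arrive incrementally
            yield event["contentBlockDelta"]["delta"].get("text", "")
        elif "metadata" in event:
            # token counts are only known here, after all content events
            u = event["metadata"].get("usage", {})
            usage = {
                "prompt_tokens": u.get("inputTokens", 0),
                "completion_tokens": u.get("outputTokens", 0),
                "total_tokens": u.get("totalTokens", 0),
            }
    # usage can now be attached to the final ChatResponse's raw dict / additional_kwargs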
