
[Bug]: Incorrect Usage Aggregation for Anthropic Streaming with Caching #10240


Open
mdonaj opened this issue Apr 23, 2025 · 11 comments · May be fixed by #10284

mdonaj commented Apr 23, 2025

What happened?

The response from Anthropic reports cache usage stats, but the response from LiteLLM comes back with no cache information.

I've debugged it locally; this might be helpful to figure it out. The function:

def calculate_usage(
    self,
    chunks: List[Union[Dict[str, Any], ModelResponse]],
    model: str,
    completion_output: str,
    messages: Optional[List] = None,
    reasoning_tokens: Optional[int] = None,
) -> Usage:

iterates over the usage in the chunks, and the last chunk overwrites the values set by the first chunk.

-> chunks[0].usage.model_dump()
{'completion_tokens': 1, 'prompt_tokens': 4, 'total_tokens': 5, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}, 'cache_creation_input_tokens': 11822, 'cache_read_input_tokens': 0}

-> chunks[-1].usage.model_dump()
{'completion_tokens': 205, 'prompt_tokens': 0, 'total_tokens': 205, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}

The logic that follows in the loop is:

if usage_chunk_dict["cache_read_input_tokens"] is not None:
    cache_read_input_tokens = usage_chunk_dict["cache_read_input_tokens"]

So the end result is that the last chunk overwrites the cache info from the first chunk.

Anthropic sends the input and cache usage stats in the first chunk and the output usage in the last chunk.
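
To make the failure concrete, here is a minimal sketch of that aggregation pattern using the two chunk dumps above (illustrative only, not the actual LiteLLM loop; the variable and field names mirror the snippets in this report):

# Minimal repro of the overwrite behavior (illustrative).
chunk_usages = [
    # first chunk: Anthropic reports the cache stats here
    {"prompt_tokens": 4, "completion_tokens": 1,
     "cache_creation_input_tokens": 11822, "cache_read_input_tokens": 0},
    # last chunk: only completion tokens; the cache fields come back as 0
    {"prompt_tokens": 0, "completion_tokens": 205,
     "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0},
]

cache_creation_input_tokens = None
cache_read_input_tokens = None
for usage_chunk_dict in chunk_usages:
    # an "is not None" check lets the last chunk's zeros clobber the first chunk's values
    if usage_chunk_dict["cache_creation_input_tokens"] is not None:
        cache_creation_input_tokens = usage_chunk_dict["cache_creation_input_tokens"]
    if usage_chunk_dict["cache_read_input_tokens"] is not None:
        cache_read_input_tokens = usage_chunk_dict["cache_read_input_tokens"]

print(cache_creation_input_tokens)  # 0 -- the 11822 from the first chunk has been lost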

Relevant log output

Are you a ML Ops Team?

No

What LiteLLM version are you on ?

v1.67.0-stable

Twitter / LinkedIn details

https://www.linkedin.com/in/maciej-donajski/

mdonaj added the bug label Apr 23, 2025
krrishdholakia self-assigned this Apr 24, 2025
@krrishdholakia (Contributor)

Would the suggested fix here be to check for a non-zero value? @mdonaj
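
(For illustration, a sketch of what that non-zero check could look like:)

for usage_chunk_dict in chunk_usages:  # chunk_usages: hypothetical name for the per-chunk usage dicts
    cache_read = usage_chunk_dict.get("cache_read_input_tokens")
    # only take the value when the chunk actually carries it
    if cache_read is not None and cache_read > 0:
        cache_read_input_tokens = cache_read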


mdonaj commented Apr 24, 2025

I can try it tomorrow and open a PR if that works. It seems pragmatic and quick 🥸

@krrishdholakia (Contributor)

That would be great, thank you @mdonaj


reymondzzzz commented Apr 24, 2025

I have the same issue.


jbellis commented Apr 24, 2025

Seeing a similar issue with DeepSeek.


mdonaj commented Apr 24, 2025

I've pushed #10284. I am not sure it solves the DeepSeek case, @jbellis; I only tried Anthropic.

@reymondzzzz

@mdonaj it's almost right, but cache_read_input_tokens is included in prompt_tokens. E.g.:

"cache_read_input_tokens": 5336
"prompt_tokens": 5418

but the raw response has:
cache_read_input_tokens: 5336
input_tokens: 82
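
(A quick check of those numbers: 82 + 5336 = 5418, so prompt_tokens in the LiteLLM response appears to be input_tokens plus cache_read_input_tokens from the raw Anthropic response.)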


mdonaj commented Apr 25, 2025

Hmm, I know what you mean. I looked at old logs that I have (v1.59.x), and the usage coming back from LiteLLM was:

  "usage": {
    "total_tokens": 698,
    "prompt_tokens": 4,
    "completion_tokens": 694,
    "prompt_tokens_details": null,
    "cache_read_input_tokens": 22507,
    "completion_tokens_details": null,
    "cache_creation_input_tokens": 18
  }

and prompt_tokens did not include the cache_read_input_tokens.

But my PR should not have changed this behavior; it must be something else. Good catch.

edit:

I can see it being added here: https://github.com/BerriAI/litellm/blob/0cc0da37f3764bf1db3d07f2369b41844bba8dc3/litellm/llms/anthropic/chat/transformation.py#L638:675

@krrishdholakia (Contributor)

I believe prompt_tokens should be the total input tokens, which is then broken down in prompt_tokens_details by text_tokens, cached_tokens, etc.

This is then used in the token counting logic as well - https://github.com/BerriAI/litellm/blob/main/litellm/litellm_core_utils/llm_cost_calc/utils.py#L159
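
Under that convention, the usage from the example above would look roughly like this (a hand-written dict for illustration, not actual LiteLLM output; completion_tokens is made up):

usage = {
    "prompt_tokens": 5418,            # total input = 82 text tokens + 5336 cache reads
    "completion_tokens": 42,          # hypothetical value
    "prompt_tokens_details": {
        "text_tokens": 82,            # non-cached input
        "cached_tokens": 5336,        # the cache_read_input_tokens
    },
    "cache_read_input_tokens": 5336,
    "cache_creation_input_tokens": 0,
}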


mdonaj commented Apr 25, 2025

That makes sense. I do not remember the exact version numbers, but in the past it was like that, @krrishdholakia; later it was not included, and now it is included again.

The period when it was not included (the log I attached above) looks like a regression. We have since moved to calculating pricing internally, and I know my team was recalculating the cost to adjust for that regression.

We should now migrate back to subtracting cache reads and writes from prompt_tokens. Anthropic prices cache writes, cache reads, and regular input tokens differently, and they are all part of prompt_tokens.


krrishdholakia commented Apr 25, 2025

Hey @mdonaj

Cache writes are not included in prompt tokens - just cache_read_input_tokens, since a cache write is not input to the model, just the content that gets stored:

prompt_tokens += cache_read_input_tokens

We handle cost tracking for all three in generic_cost_per_token (this allows it to work for Bedrock, Vertex, etc.).
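
For reference, the resulting cost split could be sketched like this (illustrative rates and logic only; sketch_prompt_cost is a made-up helper, not the actual generic_cost_per_token implementation):

def sketch_prompt_cost(usage: dict, input_rate: float,
                       cache_read_rate: float, cache_write_rate: float) -> float:
    # Illustrative split: cache reads are already folded into prompt_tokens,
    # while cache writes are billed on top of them.
    cache_read = usage.get("cache_read_input_tokens") or 0
    cache_write = usage.get("cache_creation_input_tokens") or 0
    text_tokens = usage["prompt_tokens"] - cache_read  # non-cached input tokens
    return (
        text_tokens * input_rate
        + cache_read * cache_read_rate
        + cache_write * cache_write_rate
    )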
