[Bug]: Incorrect Usage Aggregation for Anthropic Streaming with Caching #10240
Comments
Would the suggested fix here be to check for a non-zero value? @mdonaj
I can try it tomorrow and drop a PR if that works. It seems pragmatic and quick 🥸
That would be great, thank you @mdonaj
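A minimal sketch of the guarded assignment such a non-zero check could amount to (the helper name and shapes below are hypothetical, not LiteLLM's actual code):

```python
# Hypothetical helper: keep the previously seen value unless the incoming
# chunk actually reports a non-zero one, so a trailing chunk with zeros
# cannot wipe out cache stats reported by an earlier chunk.
def merge_usage_field(current, incoming):
    return incoming if incoming else current

# First chunk carries the cache stats, the last chunk does not.
cache_read = merge_usage_field(None, 5336)     # -> 5336
cache_read = merge_usage_field(cache_read, 0)  # -> still 5336
```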
I have the same issue.
Seeing a similar issue with Deepseek.
@mdonaj it's almost good, but cache_read_input_tokens is included in the prompt tokens: LiteLLM reports "cache_read_input_tokens": 5336, but the raw response:
Hmm, I know what you mean. I looked at old logs that I have (v1.59.x), and the usage coming back from litellm was:
and
But my PR should not have changed this behavior. It must be another thing. Good catch. Edit: I can see that it is added here https://github.com/BerriAI/litellm/blob/0cc0da37f3764bf1db3d07f2369b41844bba8dc3/litellm/llms/anthropic/chat/transformation.py#L638:675
I believe prompt tokens should be the total input tokens - which is then broken down in the prompt_token_details by text_tokens, cached tokens, etc. This is then used in the token counting logic as well - https://github.com/BerriAI/litellm/blob/main/litellm/litellm_core_utils/llm_cost_calc/utils.py#L159
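For illustration, the breakdown described above might look like the following (field names follow the OpenAI-style usage shape and are an assumption; the 5336 figure comes from the comment above, the total is made up):

```python
# Illustrative usage object: prompt_tokens is the total input, and the
# details object splits it into cached vs. freshly processed text tokens.
usage = {
    "prompt_tokens": 5400,  # total input tokens (made-up figure)
    "prompt_tokens_details": {
        "cached_tokens": 5336,  # served from the prompt cache
        "text_tokens": 64,      # non-cached input
    },
}

details = usage["prompt_tokens_details"]
assert details["cached_tokens"] + details["text_tokens"] == usage["prompt_tokens"]
```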
It makes sense. I do not remember version numbers, but in the past it was like that @krrishdholakia, then later it was not included, and now it is included again. The period where it was not included (the log that I attached before) seems like a regression. We moved to calculating pricing internally since then, and I know my team was recalculating the cost and adjusting for the regression in the past. We should now migrate back to subtracting cache reads and writes from prompt tokens. The pricing on Anthropic is different for cache writes, cache reads, and input tokens, which are all part of prompt tokens.
Hey @mdonaj, cache writes are not included in prompt tokens - just cache_read_input_tokens, as a cache write is not input to the model, just the stuff they stored.
We handle cost tracking for all 3 in generic_cost_per_token (which allows it to work for Bedrock, Vertex, etc.).
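A hedged sketch of cost tracking along those lines (the function below is illustrative, not the real generic_cost_per_token signature; rates would come from the model cost map):

```python
# Illustrative: each of the three token kinds is billed at its own rate.
def estimate_prompt_cost(
    text_tokens: int,
    cache_read_tokens: int,
    cache_creation_tokens: int,
    input_price: float,        # USD per non-cached input token
    cache_read_price: float,   # usually cheaper than input_price
    cache_write_price: float,  # usually more expensive than input_price
) -> float:
    return (
        text_tokens * input_price
        + cache_read_tokens * cache_read_price
        + cache_creation_tokens * cache_write_price
    )
```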
What happened?
The response from Anthropic reports cache usage stats. The response from LiteLLM comes back with no cache information.
I've debugged it locally; this might help figure it out. The function:
It iterates over the usage in the chunks, and the last chunk overwrites the values set by the first chunk.
The logic that follows in the loop is:
So the end result is that the last chunk overwrites the cache info from the first chunk.
Anthropic sends the input-side cache usage stats in the first chunk and the output usage in the last.
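A stripped-down reproduction of that failure mode (the chunk values are made up, apart from the 5336 cache-read figure quoted earlier; the real aggregation lives in LiteLLM's streaming handler):

```python
# The first chunk carries the cache stats; the final chunk carries the
# output token count and zeros for the cache fields.
chunks = [
    {"cache_read_input_tokens": 5336, "cache_creation_input_tokens": 0, "output_tokens": 0},
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 0, "output_tokens": 211},
]

usage = {}
for chunk in chunks:
    # Unconditional assignment: whatever the last chunk says wins,
    # wiping out the cache numbers only the first chunk reported.
    usage.update(chunk)

print(usage)  # cache_read_input_tokens ends up 0 instead of 5336
```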
Relevant log output
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
v1.67.0-stable
Twitter / LinkedIn details
https://www.linkedin.com/in/maciej-donajski/