Token Usage for Ollama #19422
I am working on a project using LLMs with LangChain and Ollama and found a way to track the output statistics that Ollama reports. The idea is to create a class that extends BaseCallbackHandler from LangChain and implements the on_llm_end() method. on_llm_end() receives a parameter named response (a LangChain LLMResult object) that carries the statistics from Ollama. Printing the LLMResult object shows results like the following (a similar output, with the statistics as JSON key-value pairs, is shown here, from Langfuse).
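For reference, here is a rough sketch of the kind of fields Ollama reports in generation_info. The field names follow Ollama's generate API; the values below are invented purely for illustration, and the exact set of fields may vary with the Ollama version:

```python
# Illustrative only: field names follow Ollama's generate API,
# the values are made up for this example.
generation_info = {
    'model': 'llama2:7b-chat-q4_0',
    'done': True,
    'total_duration': 4_935_886_000,      # ns, includes model load time
    'load_duration': 534_986_000,         # ns
    'prompt_eval_count': 26,              # tokens in the prompt
    'prompt_eval_duration': 107_345_000,  # ns
    'eval_count': 298,                    # tokens generated
    'eval_duration': 4_289_432_000,       # ns
}
```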
These statistics can be accessed through the generation_info of the entries in the generations attribute of the LLMResult object. A minimal working example is below; I use a deque to collect the statistics for convenient access.

```python
from langchain_community.llms import Ollama

llm = Ollama(model='llama2:7b-chat-q4_0')

# define a callback for collecting token usage
from langchain_core.callbacks.base import BaseCallbackHandler
from langchain_core.outputs.llm_result import LLMResult
from collections import deque


class TokenUsageCallbackHandler(BaseCallbackHandler):
    def __init__(self, deque: deque = None):
        super().__init__()
        self.deque = deque

    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        print('Response in callback')
        print(response)
        print()
        generation = response.generations[0][0]
        gen_info = generation.generation_info
        # get token usage (prompt tokens + generated tokens)
        token_usage = gen_info.get('prompt_eval_count', 0) + gen_info.get('eval_count', 0)
        # get the time cost (local machine); instead of total_duration we sum
        # prompt_eval_duration and eval_duration to exclude the load duration
        # (e.g. the time spent loading the model onto the GPU)
        time_costed = gen_info.get('prompt_eval_duration', 1e-10) + gen_info.get('eval_duration', 1e-10)  # in ns; small fallback value when the fields are missing
        # create an object to store the token usage and time cost
        token_usage_obj = {
            'token_usage': token_usage,
            'time_costed': time_costed,
        }
        # append the object to the deque
        self.deque.append(token_usage_obj)


common_deque = deque()
chain_config = {
    "callbacks": [TokenUsageCallbackHandler(common_deque)],
}

# example: calling the llm directly
response = llm.invoke("Hello, how are you?", config=chain_config)
token_usage_obj = common_deque.popleft()
print(response)
print(token_usage_obj)

# example: calling a chain object
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [("human", "Explain {concept} to me.")]
)
chain = prompt | llm
response = chain.invoke({"concept": "Large Language Models"}, config=chain_config)
# get the token usage object from the deque
token_usage_obj = common_deque.popleft()
print(response)
print(token_usage_obj)
```

Running both examples prints the responses together with the token-usage objects collected by the callback.
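Since the durations reported by Ollama are in nanoseconds, the collected entries can also be turned into a rough throughput figure. This is just a minimal sketch on top of the token_usage_obj dicts produced by the handler above:

```python
# Illustrative helper: convert a collected entry into tokens per second.
# Ollama reports durations in nanoseconds.
def tokens_per_second(token_usage_obj: dict) -> float:
    seconds = token_usage_obj['time_costed'] / 1e9  # ns -> s
    return token_usage_obj['token_usage'] / seconds

# e.g. summarise everything collected so far:
# throughputs = [tokens_per_second(obj) for obj in common_deque]
```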
A list of callback events is given in the official Langchain docs, and the API reference of BaseCallbackHandler is here. I hope it helps. (edit: enable the python syntax highlighting)
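One more note: if you prefer not to pass config on every call, the handler can also be attached when constructing the LLM, since LangChain language models accept a callbacks argument. A small sketch, reusing the TokenUsageCallbackHandler from above:

```python
# Attach the handler at construction time so it applies to every call,
# without passing config on each invoke.
common_deque = deque()
llm = Ollama(
    model='llama2:7b-chat-q4_0',
    callbacks=[TokenUsageCallbackHandler(common_deque)],
)
response = llm.invoke("Hello, how are you?")
print(common_deque.popleft())
```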
Feature request
The current API seems not to allow keeping track of token usage while using Ollama.
ref: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/ollama.py
Motivation
Token usage is important for comparing LLM efficiency.
Proposal (If applicable)
No response