Adding CountToken to Gemini #2137

Open · wants to merge 9 commits into main

Conversation

kauabh (Contributor) commented Jul 5, 2025

Gemini provides an endpoint to count tokens: https://ai.google.dev/api/tokens#method:-models.counttokens.
I think it would be useful and would address some of the concerns in issue #1794 (at least for Gemini).
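For reference, counting tokens with the google-genai client is a single call along these lines (a minimal sketch; the model name is just an example and the client is assumed to pick up an API key from the environment):

```python
from google import genai

client = genai.Client()  # assumes an API key is configured in the environment

response = client.models.count_tokens(
    model='gemini-2.0-flash',  # example model name
    contents='The quick brown fox jumps over the lazy dog.',
)
print(response.total_tokens)
```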

@DouweM Wanted to check whether this would be helpful. If so, and if the approach is right, could you share some pointers on adding it to usage_limits for Gemini? Happy to work on other models too, if this one makes it through.

kauabh added 9 commits July 6, 2025 04:27
Gemini provides an endpoint to count tokens before sending a response
https://ai.google.dev/api/tokens#method:-models.counttokens
Added type adapter
Removed extra assignment
Removed white space
DouweM (Contributor) commented Jul 7, 2025

@kauabh I agree that if a model API has a method to count tokens, it would be nice to expose that on the Model class.

But I don't think we should automatically use it whenever UsageLimits(request_tokens_limit=...) is set: unlike OpenAI's tiktoken (mentioned in #1794), which can run locally, this adds an extra request and the overhead and latency that come with it. So if we'd like to give users the option to better enforce request_tokens_limit by making a separate count-tokens request ahead of the actual LLM request, that should be opt-in via some flag on UsageLimits, with appropriate warnings in the docs about the extra overhead.
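For example (a sketch only; the field name count_tokens_before_request is hypothetical and the dataclass is abridged to the relevant fields):

```python
from dataclasses import dataclass

@dataclass
class UsageLimits:
    request_tokens_limit: int | None = None
    total_tokens_limit: int | None = None
    # Hypothetical opt-in flag: when True, make a separate count-tokens
    # request before each model request to enforce request_tokens_limit,
    # accepting the extra latency of one additional API call per request.
    count_tokens_before_request: bool = False
```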

That check would need to be implemented here, just before we call model.request, once we have the messages, model settings, and model request params ready:

```python
async def _make_request(
    self, ctx: GraphRunContext[GraphAgentState, GraphAgentDeps[DepsT, NodeRunEndT]]
) -> CallToolsNode[DepsT, NodeRunEndT]:
    if self._result is not None:
        return self._result  # pragma: no cover

    model_settings, model_request_parameters = await self._prepare_request(ctx)
    model_request_parameters = ctx.deps.model.customize_request_parameters(model_request_parameters)
    message_history = await _process_message_history(
        ctx.state.message_history, ctx.deps.history_processors, build_run_context(ctx)
    )
    model_response = await ctx.deps.model.request(message_history, model_settings, model_request_parameters)
    ctx.state.usage.incr(_usage.Usage())

    return self._finish_handling(ctx, model_response)
```
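To illustrate where that opt-in check could slot in, just before the model.request call above (a sketch only: count_tokens_before_request and count_tokens are the hypothetical names from this discussion, and usage_limits is assumed to be reachable from ctx.deps):

```python
    # Hypothetical sketch, inserted before `model.request` in _make_request:
    usage_limits = ctx.deps.usage_limits
    if usage_limits is not None and usage_limits.count_tokens_before_request:
        # One extra round-trip to the provider's count-tokens endpoint.
        counted = await ctx.deps.model.count_tokens(
            message_history, model_settings, model_request_parameters
        )
        usage_limits.check_tokens(counted)  # raise if over request_tokens_limit

    model_response = await ctx.deps.model.request(message_history, model_settings, model_request_parameters)
```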

This would require a method that exists on every model, so it'd be implemented as a method on the base Model class whose default implementation raises NotImplementedError(...); only models whose API has a count-tokens method would override it with a concrete implementation.
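In sketch form (the method name count_tokens and its exact signature and return type are assumptions; it mirrors Model.request so implementations can reuse the same inputs):

```python
# On the existing abstract base class in pydantic_ai.models:
class Model(ABC):
    ...

    async def count_tokens(
        self,
        messages: list[ModelMessage],
        model_settings: ModelSettings | None,
        model_request_parameters: ModelRequestParameters,
    ) -> usage.Usage:
        """Count the tokens the given request would use, without generating a response.

        Only models whose API exposes a count-tokens endpoint override this.
        """
        raise NotImplementedError(f'Token counting is not supported by {self.__class__.__name__}')
```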

As for that concrete implementation, I recommend adding it to GoogleModel rather than GeminiModel, since you can use the google-genai library directly there, and reducing duplication with the request-preparation logic in _generate_content as much as possible.
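A minimal sketch of the GoogleModel side, assuming the google-genai client and model name are available on the instance and reusing (or factoring out) the message-mapping used by _generate_content — the helper name _map_messages and the attribute names here are assumptions:

```python
class GoogleModel(Model):
    ...

    async def count_tokens(
        self,
        messages: list[ModelMessage],
        model_settings: ModelSettings | None,
        model_request_parameters: ModelRequestParameters,
    ) -> usage.Usage:
        # Reuse the same contents-preparation as _generate_content so the
        # counted request matches the one that will actually be sent.
        contents = await self._map_messages(messages)  # assumed shared helper
        response = await self.client.aio.models.count_tokens(
            model=self._model_name,
            contents=contents,
        )
        return usage.Usage(request_tokens=response.total_tokens)
```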
