[Feature]: Gemini 2.5 Flash - Vertex AI to be added to LiteLLM #10121
Comments
We really need Gemini 2.5 Flash!
Also, can we have the thinking budget parameter added, to enable / disable thinking mode?
Gemini-2.5-flash is supported by #10141, but there is currently no way to disable thinking for this model.
Question: should litellm's default be with thinking or without?
…isabled allows billing to work correctly Fixes #10121
@krrishdholakia default should be without. Definitely.
* fix(vertex_and_google_ai_studio_gemini.py): allow thinking budget = 0. Fixes #10121
* fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens. Addresses #10141 (comment)
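For anyone following along, here is a minimal sketch of what disabling thinking could look like after this fix, assuming the Anthropic-style `thinking` parameter is what gets mapped to Gemini's thinking budget (the commits above suggest a budget of 0 is now accepted); the model ID and GCP values are placeholders:

```python
# Sketch: call Gemini 2.5 Flash on Vertex AI through LiteLLM with the thinking
# budget set to 0, i.e. thinking disabled. Model ID, project, and location are
# illustrative placeholders; substitute your own values.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-2.5-flash-preview-04-17",   # illustrative model ID
    messages=[{"role": "user", "content": "Summarize Hamlet in one sentence."}],
    vertex_project="my-gcp-project",   # or rely on ADC / GOOGLE_APPLICATION_CREDENTIALS
    vertex_location="us-central1",
    thinking={"type": "enabled", "budget_tokens": 0},   # budget of 0 = no thinking tokens
)
print(response.choices[0].message.content)
```

The same `thinking` value should also work under a model's `litellm_params` in the proxy YAML if you want it enforced server-side rather than per request.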
Thanks @Classic298 - will wait to collect feedback from the group, and then we can make the change accordingly (this should be controllable either way)
Thanks for the merge. By the way @krrishdholakia, on a totally unrelated note: it would be really cool if there were a feature to automatically create a Vertex AI context cache (now that the minimum size has been reduced to 4096) and automatically use the created context cache ID for all requests to Vertex AI. With the minimum at 4096, it becomes much easier to save money by caching system prompts. Something like this in the YAML config file, under `litellm_settings:`, and for the specific Gemini models where you want to use the cache, a parameter like `use_vertex_cache: true`. Would something like this be worthy of opening a feature request?
Regarding whether thinking is on by default or not, I think it makes the most sense to default to the same behavior as Google's APIs, which is thinking on. I'm not saying that I agree with Google's choice, but if litellm changes the default behavior then it is something that would need to be well documented. I really wish that Google had just made them two separate "models", given that the pricing is different and it is a binary on-or-off thing.
+1 for following vendor settings as the default
Hey @Classic298, would you even need additional settings for that? Wouldn't it just be created / checked the way it is for Google AI Studio? https://docs.litellm.ai/docs/providers/gemini#context-caching
Hey @krrishdholakia, Google Vertex does not have any automatic caching (as AI Studio seemingly does, which I just learned from you). You need to manually create a cache, give it a TTL, and on all subsequent requests reference the cache's ID so that the cached tokens are prepended to your request to the LLM (a system prompt, for example).
It should perhaps be noted that Logan Kilpatrick, one of the heads of Gemini over at Google, has said they are considering automatic caching, but they aren't working on it yet. This is how caching works on Vertex AI.
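To make that flow concrete, here is a rough sketch of the manual steps being described, using the `vertexai` SDK directly (project, location, model ID, and prompts are illustrative; this is roughly what an automatic-caching feature in LiteLLM would have to wrap):

```python
# Rough sketch of explicit context caching on Vertex AI: create a cache with a
# TTL once, then reference its ID on subsequent requests so the cached system
# prompt is reused instead of being resent. All IDs and strings are illustrative.
import datetime

import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")

# 1) Create the cache once, with a TTL. The cached content has to meet the
#    provider's minimum token size.
cached = caching.CachedContent.create(
    model_name="gemini-2.5-flash-preview-04-17",        # illustrative model ID
    system_instruction="...a long system prompt worth caching...",
    ttl=datetime.timedelta(hours=1),
)
print(cached.name)  # resource name / cache ID to reference on later requests

# 2) On subsequent requests, reference the cache instead of resending the prompt.
model = GenerativeModel.from_cached_content(cached_content=cached)
response = model.generate_content("What does the system prompt say about refunds?")
print(response.text)

# 3) Other processes can re-attach to the same cache by its resource name.
cached_again = caching.CachedContent(cached_content_name=cached.name)
model_again = GenerativeModel.from_cached_content(cached_content=cached_again)
```

Whether LiteLLM could create, reuse, and refresh such a cache transparently per model is essentially what the `use_vertex_cache: true` idea above would amount to.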
@krrishdholakia I noticed that with this setup, no matter what prompt you give to Gemini, it will never reason (reasoning tokens used is always zero). Am I doing something wrong? I am following the guidelines from the docs.
I also tried a yaml config setup, but I never got it to work.
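Since the exact configs aren't shown above, here is an illustrative debugging sketch, assuming the `thinking` parameter mapping from #10141: call the model directly with an explicit nonzero budget and check whether any reasoning tokens are reported in the usage (the same params would go under `litellm_params` in the proxy YAML):

```python
# Debugging sketch (illustrative values): request a nonzero thinking budget and
# inspect the returned usage to see whether any reasoning tokens are reported.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-2.5-flash-preview-04-17",    # illustrative model ID
    messages=[{"role": "user", "content": "How many primes are there below 50? Reason it out."}],
    vertex_project="my-gcp-project",
    vertex_location="us-central1",
    thinking={"type": "enabled", "budget_tokens": 1024},  # nonzero budget -> thinking allowed
)

usage = response.usage
# completion_tokens_details / reasoning_tokens may not be populated on every
# version, so read them defensively.
details = getattr(usage, "completion_tokens_details", None)
reasoning = getattr(details, "reasoning_tokens", None) if details else None
print("completion tokens:", usage.completion_tokens)
print("reasoning tokens reported:", reasoning)  # always 0/None would reproduce the report above
```

If this direct call does produce reasoning tokens but the proxy config does not, that would point at the config rather than the model mapping.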
The Feature
Add Gemini 2.5 Flash (Vertex AI) to LiteLLM, with optional reasoning configurable via parameters in the proxy.
Motivation, pitch
It's a new, fast, and inexpensive model from Google.
Are you a ML Ops Team?
No
Twitter / LinkedIn details
No response