
[Feature]: Gemini 2.5 Flash - Vertex AI to be added to LiteLLM #10121


Closed
Classic298 opened this issue Apr 17, 2025 · 14 comments · Fixed by #10198 · May be fixed by #10189 or #10199
Comments

@Classic298
Contributor

The Feature

Add Gemini 2.5 Flash to Vertex AI with optional reasoning configurable via parameters in the proxy.

Motivation, pitch

It's a new, fast, and inexpensive model by Google.

Are you a ML Ops Team?

No

Twitter / LinkedIn details

No response

Classic298 added the enhancement (New feature or request) label on Apr 17, 2025
@adamwuyu

We really need Gemini 2.5 Flash!

@mohamedScikitLearn

Also, can we have the budget parameter added to enable / disable thinking mode?

@emc2314

emc2314 commented Apr 21, 2025

Gemini-2.5-flash is supported by #10141, but there is currently no way to disable thinking for this model.

@mohamedScikitLearn

Hey @emc2314 (and whoever might be interested), there is a workaround for using gemini-2.5-flash (without thinking) through LiteLLM.

You can use the OpenRouter integration in LiteLLM (see here); your model should then be: google/gemini-2.5-flash-preview
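
For anyone who wants a concrete starting point, here is a minimal sketch of that workaround through the LiteLLM Python SDK (assuming an OPENROUTER_API_KEY environment variable and LiteLLM's usual openrouter/ model prefix; the prompt is just a placeholder):

```python
# Hypothetical sketch: route gemini-2.5-flash-preview through OpenRouter via LiteLLM.
# Assumes OPENROUTER_API_KEY is set in the environment.
import litellm

response = litellm.completion(
    model="openrouter/google/gemini-2.5-flash-preview",
    messages=[{"role": "user", "content": "Summarize this issue in one sentence."}],
)
print(response.choices[0].message.content)
```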

@krrishdholakia
Contributor

Question: should litellm's default be with thinking or without?

@Classic298
Contributor Author

@krrishdholakia default should be without. Definitely.

krrishdholakia added a commit that referenced this issue Apr 22, 2025
* fix(vertex_and_google_ai_studio_gemini.py): allow thinking budget = 0

Fixes #10121

* fix(vertex_and_google_ai_studio_gemini.py): handle nuance in counting exclusive vs. inclusive tokens

Addresses #10141 (comment)
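
For illustration, building on the fix above (a thinking budget of 0 is now allowed), a request that disables thinking could be sketched roughly like this, assuming the thinking parameter shown later in this thread is passed through to Vertex AI and that your Vertex credentials are already configured:

```python
# Rough sketch: turn off Gemini 2.5 Flash reasoning by setting the thinking budget to 0,
# which the fix above allows. Assumes Vertex AI credentials are configured
# (e.g. GOOGLE_APPLICATION_CREDENTIALS) and that vertex_location matches your setup.
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-2.5-flash-preview-04-17",
    vertex_location="us-central1",
    messages=[{"role": "user", "content": "Hello!"}],
    thinking={"type": "enabled", "budget_tokens": 0},  # budget 0 = no reasoning tokens
)
print(response.choices[0].message.content)
```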
@krrishdholakia
Contributor

Thanks @Classic298 - will wait to collect feedback from the group, and then we can make the change accordingly (this should be controllable either way)

cc: @awesie @emc2314 @cheahjs

@Classic298
Contributor Author

Classic298 commented Apr 22, 2025

Thanks for the merge.

Btw @krrishdholakia, on a totally unrelated note: it would be so cool if there were a feature to automatically create a Vertex AI context cache (now that the minimum size was reduced to 4096) and automatically use the created context cache ID for all requests to Vertex AI.

Now that the minimum size is 4096, it would actually be much easier to save money by caching system prompts.

Something like this in the yaml config file:

litellm_settings:
  vertex_cache:
    cache_content: string
    cache_ttl: 24h
    <variable_name>: somehow define here when the cache should get created. Daily, weekly, on certain weekdays only?

and for specific gemini models where you want to use the cache, have this as a parameter?

use_vertex_cache: true

Would something like this be worthy of opening a feature request?

@awesie

awesie commented Apr 22, 2025

Regarding whether thinking is on by default or not, I think it makes the most sense to default to the same behavior as Google's APIs, which is thinking on. I'm not saying that I agree with Google's choice, but if litellm changes the default behavior then it is something that would need to be well documented. I really wish that Google had just made them as two separate "models" given that the pricing is different and it is a binary on-or-off thing.

@jbellis

jbellis commented Apr 22, 2025

+1 for following vendor settings as the default

@krrishdholakia
Contributor

Hey @Classic298

Would you even need additional settings for that? Wouldn't it just be created / checked like it is for Google AI Studio - https://docs.litellm.ai/docs/providers/gemini#context-caching
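
For context, the linked LiteLLM docs describe a cache_control-style pattern for Gemini context caching; a rough, paraphrased sketch (not an exact copy of the docs; the model name and prompt text are placeholders) looks something like this:

```python
# Rough sketch of the cache_control-style context caching pattern from the linked
# LiteLLM Gemini docs (paraphrased). The repeated system text stands in for a long
# shared prefix; real caching requires the provider's minimum cached token count.
from litellm import completion

response = completion(
    model="gemini/gemini-2.5-flash-preview-04-17",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Here is the full text of a long shared system prompt. " * 400,
                    "cache_control": {"type": "ephemeral"},  # marks this block for caching
                }
            ],
        },
        {"role": "user", "content": "What are the key terms in this agreement?"},
    ],
)
print(response.choices[0].message.content)
```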

@Classic298
Contributor Author

Classic298 commented Apr 22, 2025

Hey @krrishdholakia

Google Vertex does not have any automatic caching (as AI Studio seemingly does, which I just learned from you).

You need to manually create a cache, give it a TTL, and reference the cache's ID on all subsequent requests so that the cached tokens (such as a system prompt, for example) are added as a prefix to your request to the LLM.

https://ai.google.dev/gemini-api/docs/caching?lang=python

https://ai.google.dev/gemini-api/docs/pricing
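
A rough sketch of that create-then-reference flow, using the google-genai SDK from the linked docs (the model name, TTL, and prompt text here are placeholders, and the cached content has to meet the provider's minimum token count):

```python
# Sketch of the manual Gemini/Vertex caching flow described above, using the
# google-genai SDK from the linked docs. Values are placeholders.
from google import genai
from google.genai import types

client = genai.Client()  # picks up API key / Vertex settings from the environment

long_system_prompt = "You are a helpful assistant. " * 2000  # large shared prefix to cache

# 1. Create the cache with a TTL (roughly the 24h cache_ttl from the proposal above).
cache = client.caches.create(
    model="gemini-2.5-flash-preview-04-17",
    config=types.CreateCachedContentConfig(
        system_instruction=long_system_prompt,
        ttl="86400s",
    ),
)

# 2. Reference the cache's ID on every subsequent request so the cached tokens
#    are prepended to the prompt.
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Answer using the cached system prompt.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)

# 3. If the cache runs out (TTL expires), create a new one and reference that instead.
```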

@Classic298
Contributor Author

Classic298 commented Apr 22, 2025

It should perhaps be noted that Logan Kilpatrick, one of the heads of Gemini over at Google, has said they are considering automatic caching, but they aren't working on it yet.

And this is how caching works on Vertex AI:
1. Create the cache
2. Reference it on subsequent requests
3. If the cache runs out: create a new cache
4. Reference the new cache on subsequent requests

@Classic298
Contributor Author

Classic298 commented Apr 23, 2025

@krrishdholakia I noticed that with this setup:

No matter what prompt you give to Gemini, it never reasons (the reasoning token count is always zero).

Am I doing something wrong? I am following the guidelines from the docs.

  - model_name: gemini-2.5-flash-reasoning
    litellm_params:
      vertex_location: "us-central1"
      model: vertex_ai/gemini-2.5-flash-preview-04-17
      max_tokens: 65535
      thinking: {"type": "enabled", "budget_tokens": 24576}

I also tried a yaml config setup like:

      thinking: 
        type: enabled
        budget_tokens: 24576

But I can never get it to work.
