Vertex AI Context Caching for Gemini 2.5 Pro #2581
11 comments · 8 replies
-
This would be immensely useful.
-
Please!
-
Great idea.
-
Yes, that would be very helpful! Without caching, the costs are just ridiculous.
-
This must be the top priority task for the Roo developers, as all the good models are pretty costly these days.
-
Gemini 2.5 Pro is great, but very expensive when you are a heavy user. Caching is very much needed.
-
Yes, this is a super high priority, as it directly impacts cost in a major way, and it would dramatically increase my usage of Roo if supported. They also just announced updates to the cache that lower the minimum to 4k tokens, making the savings even larger than they were. I know that you have to self-manage the cache with Google, and that's complicated and probably why it's not yet implemented, but if Roo nailed this it would cook.
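To illustrate the self-management burden mentioned above: with Google's explicit caching, the client has to create the cache, keep refreshing its TTL while the chat is alive, and delete it afterwards. A minimal sketch of that lifecycle, assuming the google-genai Python SDK pointed at Vertex AI (the project ID, model name, TTLs, and cache contents are all placeholders):

```python
from google import genai
from google.genai import types

# Placeholder project and location; requires Vertex AI credentials.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Create an explicit cache for a large, stable prompt prefix. Note that
# Vertex enforces a minimum token count for cached content, so real
# contents would need to be much larger than this placeholder.
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=["<large shared prompt prefix>"],
        ttl="300s",  # short TTL, since cache storage is billed per token-hour
    ),
)

# The client is responsible for keeping the cache alive while it is in use...
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="300s"),
)

# ...and for deleting it afterwards so storage charges stop.
client.caches.delete(name=cache.name)
```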
-
https://openrouter.ai/docs/features/prompt-caching#google-gemini |
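For context, the linked OpenRouter docs cover both implicit caching (applied automatically for Gemini 2.5 models) and explicit caching via Anthropic-style cache_control breakpoints. A rough sketch of the explicit form as a plain chat-completions request; the API key, model slug, and contents are placeholders, and the exact breakpoint semantics should be checked against the docs above:

```python
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "google/gemini-2.5-pro",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "<large reusable context>",
                        # Breakpoint marking the prefix OpenRouter should cache.
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            },
            {"role": "user", "content": "Answer using the cached context."},
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```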
-
This was a feature announced in this week's update, but only for 2.5 preview, not for 2.5 exp.
-
Hi! I am currently participating in Google Summer of Code at DeepMind, and I will be focusing on exactly this feature over the upcoming weeks. Happy to help :)
-
This has been implemented. If you would like any adjustments to the current implementation, please submit a detailed feature proposal at https://github.com/RooCodeInc/Roo-Code/issues.
-
The Vertex AI API seems to now support context caching for Gemini 2.5 Pro: https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview
Given the cost of high-token-count chats, this seems pretty important. Would love to see it implemented in Roo-Code. Thanks to everyone who has contributed to this repo; it really is fantastic :)
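For reference, the create-and-reuse flow from that overview looks roughly like the following, assuming the google-genai Python SDK in Vertex AI mode (project, location, and contents are placeholders). The cached prefix is billed at a discounted rate whenever subsequent requests reference it:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Cache the large, reusable part of the prompt (system instructions plus
# reference material). Vertex enforces a minimum cached token count, so the
# placeholder strings below are illustrative only.
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a coding assistant for a large repo.",
        contents=["<large reference documents>"],
        ttl="3600s",  # keep the cache alive for one hour
    ),
)

# Later requests pass the cache by name instead of resending the prefix;
# only the new tokens are charged at the full input rate.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the cached reference documents.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```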