  Scheduler-side: runs in the scheduler, binds metadata, which
  is used by the worker-side to load/save KV cache.
  get_num_new_matched_tokens() - get number of new tokens
- that exist in the remote KV cache
+ that exist in the remote KV cache. Might be called multiple
+ times for a given request and should be side-effect free.
  update_state_after_alloc() - update KVConnector state after
  temporary buffer alloc by the CacheManager.
+ request_finished() - called when a request is finished, with
+ the computed kv cache blocks for the request.
+ Returns whether KV cache should be freed now or will be
+ freed asynchronously and optionally returns KV transfer
+ params.

  Worker-side: runs in each worker, loads/saves KV cache to/from
  the Connector based on the metadata.

  save_kv_layer() - starts saving KV for layer i (maybe async)
  wait_for_save() - blocks until all saves are done
+
+ get_finished() - called with ids of finished requests, returns
+ ids of requests that have completed async sending/recving.
  """

  import enum
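The scheduler-side contract described in this docstring can be illustrated with a minimal sketch. `ToySchedulerConnector`, its constructor argument, and the `remote_block_ids` key are hypothetical names invented for illustration; this is not the connector base class touched by the diff, and the real method signatures may differ. It only shows a lookup that stays side-effect free across repeated calls, and a `request_finished()` that returns a flag (assumed here to be `True` when the blocks will be freed asynchronously) plus optional transfer params.

```python
from __future__ import annotations

from typing import Any, Optional


class ToySchedulerConnector:
    """Hypothetical stand-in for a scheduler-side connector (illustration only)."""

    def __init__(self, remote_hits: dict[str, int]):
        # request id -> number of tokens held in a pretend remote KV cache
        self._remote_hits = remote_hits
        # requests whose blocks are still being sent asynchronously
        self._pending_sends: set[str] = set()

    def get_num_new_matched_tokens(self, req_id: str, num_computed: int) -> int:
        # Pure lookup: nothing is mutated, so repeated calls return the same value.
        return max(self._remote_hits.get(req_id, 0) - num_computed, 0)

    def request_finished(
        self, req_id: str, block_ids: list[int]
    ) -> tuple[bool, Optional[dict[str, Any]]]:
        # Assumption: True means "do not free yet, blocks are freed asynchronously".
        if not block_ids:
            return False, None
        self._pending_sends.add(req_id)
        return True, {"remote_block_ids": list(block_ids)}
```

Keeping all bookkeeping in `request_finished()` rather than in the lookup is what makes it safe for the scheduler to call the lookup more than once per request.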
@@ -184,7 +193,8 @@ def get_finished(
  finished generating tokens.

  Returns:
- ids of requests that have finished asynchronous transfer,
+ ids of requests that have finished asynchronous transfer
+ (requests that previously returned True from request_finished()),
  tuple of (sending/saving ids, recving/loading ids).
  The finished saves/sends req ids must belong to a set provided in a
  call to this method (this call or a prior one).
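One way to honor the rule that finished send ids must come from a set passed to this method (in this call or an earlier one) is sketched below. `ToyWorkerTransfers` and its `mark_*` helpers are hypothetical; how a real worker learns that a transfer completed is not shown in the diff and is simply simulated here.

```python
from __future__ import annotations


class ToyWorkerTransfers:
    """Hypothetical worker-side bookkeeping for async transfers (illustration only)."""

    def __init__(self) -> None:
        self._eligible: set[str] = set()      # ids reported as finished generating
        self._done_sending: set[str] = set()  # ids whose async send has completed
        self._done_recving: set[str] = set()  # ids whose async load has completed

    def mark_send_complete(self, req_id: str) -> None:
        self._done_sending.add(req_id)

    def mark_recv_complete(self, req_id: str) -> None:
        self._done_recving.add(req_id)

    def get_finished(self, finished_req_ids: set[str]) -> tuple[set[str], set[str]]:
        # Remember every id the caller has ever reported as finished.
        self._eligible |= finished_req_ids
        # Only report sends for ids seen in this call or a prior one.
        sent = self._done_sending & self._eligible
        self._done_sending -= sent
        self._eligible -= sent
        # Report completed loads once, then forget them.
        recved, self._done_recving = self._done_recving, set()
        return sent, recved
```

A send that completes before the request is reported as finished is held back until the id shows up in `finished_req_ids`, so each id is returned at most once.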
@@ -215,7 +225,8 @@ def get_num_new_matched_tokens(
  - The number of tokens that can be loaded from the
  external KV cache beyond what is already computed.
  - `True` if external KV cache tokens will be loaded
- asynchronously (between scheduler steps).
+ asynchronously (between scheduler steps). Must be
+ 'False' if the first element is 0.
  """
  pass
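The added constraint on the return value (the async flag must be `False` when the token count is 0) can be expressed as a small check. `check_matched_tokens_result` is a hypothetical helper written for this illustration, not a function from the codebase in the diff.

```python
from __future__ import annotations


def check_matched_tokens_result(result: tuple[int, bool]) -> tuple[int, bool]:
    """Validate a (num_tokens, load_async) pair against the documented contract."""
    num_tokens, load_async = result
    if num_tokens < 0:
        raise ValueError("number of matched tokens cannot be negative")
    if num_tokens == 0 and load_async:
        # Nothing to load, so there is nothing to load asynchronously.
        raise ValueError("load_async must be False when no tokens are matched")
    return result
```

A connector implementation could run its return value through a check like this in tests, so that a `(0, True)` result is caught before the scheduler ever sees it.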
@@ -225,6 +236,18 @@ def update_state_after_alloc(self, request: "Request",
  num_external_tokens: int):
  """
  Update KVConnector state after block allocation.
+
+ If get_num_new_matched_tokens previously returned True for a
+ request, this function may be called twice for that same request -
+ first when blocks are allocated for the connector tokens to be
+ asynchronously loaded into, and second when any additional blocks
+ are allocated, after the load/transfer is complete.
+
+ Args:
+ request (Request): the request object.
+ blocks (KVCacheBlocks): the blocks allocated for the request.
+ num_external_tokens (int): the number of tokens that will be
+ loaded from the external KV cache.
  """
  pass
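Because this hook may run twice for a request that loads asynchronously, an implementation probably wants to schedule the load only once. The sketch below shows one way to do that. `ToyAllocTracker` is hypothetical, it takes plain ids and block lists rather than the `Request` and `KVCacheBlocks` objects named in the docstring, and it assumes the second call should not schedule another load (the diff does not say what `num_external_tokens` is on that call).

```python
from __future__ import annotations


class ToyAllocTracker:
    """Hypothetical connector state that tolerates a repeated alloc callback."""

    def __init__(self) -> None:
        # request id -> block ids the async load will write into
        self.loads_scheduled: dict[str, list[int]] = {}

    def update_state_after_alloc(
        self, req_id: str, block_ids: list[int], num_external_tokens: int
    ) -> None:
        # Schedule a load only on the first call that carries external tokens;
        # a later call for the same request leaves the scheduled load untouched.
        if num_external_tokens > 0 and req_id not in self.loads_scheduled:
            self.loads_scheduled[req_id] = list(block_ids)
```

Keying the state on the request id keeps the second invocation a no-op for the load itself, whatever arguments it arrives with.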