* Add P/D support, respond accordingly to doRemotePrefill and doRemoteDecode fields
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Add test for kvcache transfer time command line parameter.
Update config_test to use a function to create configuration same as defined in the config yaml file
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* Update readme file
change command line argument name to 'kv-cache-transfer-latency'
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* fixes according PR's comments
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* added comments for fields
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* fix utils_test - initialize random before
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
* fixes in readme according the PR review
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
---------
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
README.md (4 additions & 3 deletions)
@@ -29,11 +29,11 @@ The simulator supports two modes of operation:
 - `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
 - `random` mode: the response is randomly chosen from a set of pre-defined sentences.

-Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.
+Timing of the response is defined by the `time-to-first-token` and `inter-token-latency` parameters. In case P/D is enabled for a request, `kv-cache-transfer-latency` will be used instead of `time-to-first-token`.

-For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream.
+For a request with `stream=true`: `time-to-first-token`or `kv-cache-transfer-latency`defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream.

-For a requst with `stream=false`: the response is returned after delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`
+For a requst with `stream=false`: the response is returned after delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))` or `<kv-cache-transfer-latency> + (<inter-token-latency> * (<number_of_output_tokens> - 1))` in P/D case

 It can be run standalone or in a Pod for testing under packages such as Kind.

@@ -99,6 +99,7 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 - `random`: returns a sentence chosen at random from a set of pre-defined sentences
 - `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
 - `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
+- `kv-cache-transfer-latency`: time for KV-cache transfer from a remote vLLM (in milliseconds), by default zero. Usually much shorter than `time-to-first-token`
 - `seed`: random seed for operations (if not set, current Unix time in nanoseconds is used)

 In addition, as we are using klog, the following parameters are available:
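
The timing rule described in the README change can be illustrated with a small sketch. This is not the simulator's code: the function names `first_token_delay_ms` and `total_delay_ms` and the `pd_enabled` flag are hypothetical, chosen only to mirror the three README parameters, and rejecting requests with fewer than one output token is an assumption made here so the formula stays non-negative.

```python
def first_token_delay_ms(time_to_first_token: int,
                         kv_cache_transfer_latency: int,
                         pd_enabled: bool) -> int:
    """Delay before the first token, in milliseconds.

    Per the README text, kv-cache-transfer-latency replaces
    time-to-first-token when P/D is enabled for the request.
    """
    return kv_cache_transfer_latency if pd_enabled else time_to_first_token


def total_delay_ms(n_output_tokens: int,
                   time_to_first_token: int = 0,
                   inter_token_latency: int = 0,
                   kv_cache_transfer_latency: int = 0,
                   pd_enabled: bool = False) -> int:
    """Total delay for a stream=false response, in milliseconds.

    Implements <first-token delay> + <inter-token-latency> * (n - 1).
    All three latency parameters default to zero, as in the README.
    """
    if n_output_tokens < 1:
        # Assumption of this sketch: at least one output token is produced.
        raise ValueError("n_output_tokens must be at least 1")
    first = first_token_delay_ms(time_to_first_token,
                                 kv_cache_transfer_latency,
                                 pd_enabled)
    return first + inter_token_latency * (n_output_tokens - 1)


if __name__ == "__main__":
    # Regular request: 2000 ms to first token, 100 ms per further token.
    print(total_delay_ms(10, time_to_first_token=2000, inter_token_latency=100))
    # P/D request: the KV-cache transfer latency replaces time-to-first-token.
    print(total_delay_ms(10, time_to_first_token=2000, inter_token_latency=100,
                         kv_cache_transfer_latency=100, pd_enabled=True))
```

With these illustrative values, the regular request waits 2000 + 100 * 9 = 2900 ms, while the P/D request waits 100 + 100 * 9 = 1000 ms, which matches the README's note that the KV-cache transfer is usually much shorter than the time to first token.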