
Commit f1575de

Authored by tlrmchlsmth, ApostaC, robertgshaw2-redhat, Robert Shaw, and mgoin
[P/D Disagg] Direct NIXL Connector (#60)
* [Update] LMcache connector v1 implementation Signed-off-by: ApostaC <yihua98@uchicago.edu> * [Add] examples for disaggregated prefill Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] extra information about evns Signed-off-by: ApostaC <yihua98@uchicago.edu> * Initial stubs for P/D scheduling changes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Updates Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Rs branch (#3) * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Rs branch (#5) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Remove Unneeded Arguments (#7) * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * cleanup Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> --------- Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Improve disagg-example.sh (#8) - fix spelling - CUDA_VISIBLE_DEVICES should be set externally Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * added connector Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com 
<robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * update Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * remove Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * seems to load properly Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Revert "updated" This reverts commit 97316d9. 
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * added Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * diffs for local dev on macos Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * update Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updaed Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * Checkpoint. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * Cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * WIP Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated on scheduler side Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * Hacking away Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * cleanup Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * ensure request removed from running list Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * Runs E2E. Garbage output. 
Crashes on 2nd request Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * rename files Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * update Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> * Second request no longer crashes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Remove gpu_model_runner hacks Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Clean up Justfile Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Bugfix] Stale finished requests in EMPTY_MODEL_RUNNER_OUTPUT Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * justfile edits Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Fixes - lm_eval gsm8k has correctness Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * "just delete the assert" Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * fixup precommit issues Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Fixes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * updated (#12) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Add Accuracy Test (#13) * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: 
rshaw@neuralmagic.com <robertgshaw2@gmail.com> --------- Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Preemption Bugfixes (#15) * stash fixed double free issue Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * fixed issue Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> --------- Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated (#16) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * Fix Bad Merge | Fix Memory Leak in Upstream (#18) * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * fix merge Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> --------- Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> * clean up justfile, examples 
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * more cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * more cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * more cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * more cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * More cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * more cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * more cleanup, precommit fixes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * More cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * run_accuracy_test.sh UX Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * squash warnings Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * pre-commit Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Add get_finished to base kv connector Signed-off-by: mgoin <mgoin64@gmail.com> * revert test.txt Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * review comments Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> --------- Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <rshaw@neuralmagic.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: ApostaC <yihua98@uchicago.edu> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>
1 parent a928424 commit f1575de

26 files changed (+1821, -66 lines)

tests/v1/kv_connector/__init__.py

Whitespace-only changes.
+45
@@ -0,0 +1,45 @@
#!/bin/bash

set -xe

# Model to run.
MODEL_NAME=Qwen/Qwen3-0.6B

# Find the git repository root directory.
GIT_ROOT=$(git rev-parse --show-toplevel)

# Kill all background jobs on Ctrl+C (SIGINT), SIGTERM, or script exit.
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT

# Wait (up to 1200 s) until the vLLM server on the given port answers HTTP requests.
wait_for_server() {
  local port=$1
  timeout 1200 bash -c "
    until curl -s localhost:${port}/v1/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}

# Prefill instance.
CUDA_VISIBLE_DEVICES=0 NIXL_ROLE="SENDER" vllm serve $MODEL_NAME \
    --port 8100 \
    --enforce-eager \
    --disable-log-requests \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &

# Decode instance.
CUDA_VISIBLE_DEVICES=1 NIXL_ROLE="RECVER" vllm serve $MODEL_NAME \
    --port 8200 \
    --enforce-eager \
    --disable-log-requests \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &

# Wait until the prefill and decode instances are ready.
wait_for_server 8100
wait_for_server 8200

# Proxy server.
python ${GIT_ROOT}/tests/v1/kv_connector/toy_proxy_server.py --port 8192 &

# Run the lm_eval accuracy test through the proxy.
python -m pytest -s -x ${GIT_ROOT}/tests/v1/kv_connector/test_accuracy.py
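The `wait_for_server` helper only checks that something answers HTTP on the port: `curl -s` without `--fail` exits successfully on any HTTP status, so "ready" here means "accepting connections". A minimal standard-library Python sketch of the same polling loop follows; the Python port and its `timeout_s` / `poll_interval_s` parameters are illustrative assumptions, not part of this commit.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(port: int, timeout_s: float = 1200.0,
                    poll_interval_s: float = 1.0) -> bool:
    """Return True once localhost:{port} answers HTTP, False on timeout.

    Hypothetical Python equivalent of the shell helper in the script.
    """
    url = f"http://localhost:{port}/v1/completions"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A plain GET, like `curl -s` in the script.
            with urllib.request.urlopen(url, timeout=poll_interval_s):
                return True
        except urllib.error.HTTPError:
            # Any HTTP status (e.g. a 405 for a GET on a POST-only route)
            # means the server is listening, matching curl without --fail.
            return True
        except OSError:
            # Connection refused or timed out: not up yet, retry.
            time.sleep(poll_interval_s)
    return False
```

Returning `True` on `HTTPError` is the deliberate choice here: a GET against a POST-only completions route is expected to fail with an HTTP error, and that error is itself evidence that the server is up.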
+28
@@ -0,0 +1,28 @@
# SPDX-License-Identifier: Apache-2.0
import lm_eval

MODEL_NAME = "Qwen/Qwen3-0.6B"
NUM_CONCURRENT = 100
TASK = "gsm8k"
FILTER = "exact_match,strict-match"
# Absolute margin: the test passes if |measured - EXPECTED_VALUE| < RTOL.
RTOL = 0.03
EXPECTED_VALUE = 0.41


def test_accuracy():
    """Run the end-to-end accuracy test."""

    # Requests go through the toy proxy on port 8192, not directly to vLLM.
    model_args = (f"model={MODEL_NAME},"
                  f"base_url=http://localhost:8192/v1/completions,"
                  f"num_concurrent={NUM_CONCURRENT},tokenized_requests=False")

    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=model_args,
        tasks=TASK,
    )

    measured_value = results["results"][TASK][FILTER]
    assert (measured_value - RTOL < EXPECTED_VALUE
            and measured_value + RTOL > EXPECTED_VALUE
            ), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"
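Despite its name, `RTOL` is applied as an absolute margin: the assertion passes exactly when `measured_value` lies in the open interval (`EXPECTED_VALUE - RTOL`, `EXPECTED_VALUE + RTOL`), roughly (0.38, 0.44). The sketch below restates that check and contrasts it with a genuinely relative tolerance via `math.isclose`. Both helper names are hypothetical.

```python
import math

EXPECTED_VALUE = 0.41
TOL = 0.03  # called RTOL in test_accuracy.py, but used as an absolute margin


def passes_accuracy_gate(measured: float,
                         expected: float = EXPECTED_VALUE,
                         tol: float = TOL) -> bool:
    """Same condition as the test's assert.

    (measured - tol < expected) and (measured + tol > expected)
    is equivalent to |measured - expected| < tol.
    """
    return abs(measured - expected) < tol


def passes_relative_gate(measured: float,
                         expected: float = EXPECTED_VALUE,
                         rtol: float = TOL) -> bool:
    """For comparison only: a true relative tolerance scales with the values."""
    return math.isclose(measured, expected, rel_tol=rtol, abs_tol=0.0)


print(passes_accuracy_gate(0.43), passes_relative_gate(0.43))
```

With the absolute reading, a score of 0.43 passes (it is 0.02 away); a 3% relative tolerance around 0.41 would only allow about 0.013 and would reject it.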
@@ -0,0 +1,39 @@
# SPDX-License-Identifier: Apache-2.0

from vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector import (
    NixlConnectorMetadata)

from .utils import create_request, create_scheduler, create_vllm_config


def test_scheduler_worker_interface():

    vllm_config = create_vllm_config()
    scheduler = create_scheduler(vllm_config)

    # 2 Full Blocks and 1 Half Block.
    BLOCK_SIZE = vllm_config.cache_config.block_size
    NUM_EXTERNAL_FULL_BLOCKS = 2
    NUM_TOKENS = int(BLOCK_SIZE * (NUM_EXTERNAL_FULL_BLOCKS + 0.5))

    request = create_request(request_id=1,
                             num_tokens=NUM_TOKENS,
                             do_remote_prefill=True)
    request_id = request.request_id

    scheduler.add_request(request)

    # Remote Prefill, triggers NixlConnectorMetadata.
    scheduler_output = scheduler.schedule()
    kv_connector_metadata = scheduler_output.kv_connector_metadata
    assert kv_connector_metadata is not None
    assert isinstance(kv_connector_metadata, NixlConnectorMetadata)

    assert len(kv_connector_metadata.requests) == 1
    assert request_id in kv_connector_metadata.requests
    req_meta = kv_connector_metadata.requests[request_id]

    # The block IDs sent to the worker must match the scheduler's allocation.
    for block_id, block in zip(
            req_meta.local_block_ids,
            scheduler.kv_cache_manager.req_to_blocks[request_id]):
        assert block_id == block.block_id
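Both scheduler tests size the request at 2.5 blocks so that it spans two full KV-cache blocks plus one partially filled block. The arithmetic is sketched below; `block_layout` is a hypothetical helper, and the block size of 16 in the example call is an assumption (the tests read the real value from `vllm_config.cache_config.block_size`).

```python
def block_layout(block_size: int, num_full_blocks: int = 2):
    """Mirror the NUM_TOKENS arithmetic used in the tests."""
    num_tokens = int(block_size * (num_full_blocks + 0.5))
    full_blocks = num_tokens // block_size
    # Ceiling division without floats: a partly filled block still
    # occupies a whole KV-cache block.
    total_blocks = -(-num_tokens // block_size)
    tokens_in_last_block = num_tokens - full_blocks * block_size
    return num_tokens, full_blocks, total_blocks, tokens_in_last_block


# block_size=16 is an assumed value for illustration.
print(block_layout(16))  # (40, 2, 3, 8)
```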
@@ -0,0 +1,92 @@
# SPDX-License-Identifier: Apache-2.0
import copy

from vllm.v1.outputs import EMPTY_MODEL_RUNNER_OUTPUT
from vllm.v1.request import FinishReason, RequestStatus

from .utils import (assert_scheduler_empty, create_model_runner_output,
                    create_request, create_scheduler, create_vllm_config)


def test_basic_lifecycle():
    """Test lifecycle of a Remote Decode request."""

    vllm_config = create_vllm_config()
    scheduler = create_scheduler(vllm_config)

    # 2 Full Blocks and 1 Half Block.
    BLOCK_SIZE = vllm_config.cache_config.block_size
    NUM_EXTERNAL_FULL_BLOCKS = 2
    NUM_TOKENS = int(BLOCK_SIZE * (NUM_EXTERNAL_FULL_BLOCKS + 0.5))

    request = create_request(request_id=1,
                             num_tokens=NUM_TOKENS,
                             do_remote_decode=True)

    scheduler.add_request(request)
    request_id = request.request_id

    # STEP (1): Prefill.
    # (1a): schedule()
    scheduler_output = scheduler.schedule()
    assert len(scheduler.running) == 1
    assert len(scheduler_output.scheduled_new_reqs) == 1

    # (1b): execute_model()
    model_runner_output = create_model_runner_output(reqs=[request])

    # (1c): update_from_output()
    engine_core_outputs = scheduler.update_from_output(scheduler_output,
                                                       model_runner_output)

    # Ensure the request is finished after one token.
    assert request.is_finished()
    assert request.status == RequestStatus.FINISHED_REMOTE_DECODE
    output = engine_core_outputs.outputs[0]
    assert output.finish_reason == FinishReason.REMOTE_DECODE
    assert output.kv_transfer_params is not None

    # Request freed in Scheduler and in Persistent Batch ...
    assert request_id in scheduler.finished_req_ids
    assert len(scheduler.running) == 0
    assert len(scheduler.waiting) == 0

    # ... but blocks should not be freed.
    blocks = scheduler.kv_cache_manager.req_to_blocks[request_id]
    for block in blocks:
        assert block.ref_cnt == 1

    # STEP (2): Send Finished to PB.
    # (2a): schedule() - pass finished request to PB.
    scheduler_output = scheduler.schedule()
    assert len(scheduler.running) == 0
    assert len(scheduler_output.finished_req_ids) == 1
    assert request_id in scheduler_output.finished_req_ids
    assert len(scheduler_output.scheduled_new_reqs) == 0
    assert len(scheduler_output.scheduled_cached_reqs) == 0
    assert len(scheduler.finished_req_ids) == 0

    # (2b): execute_model()
    model_runner_output = EMPTY_MODEL_RUNNER_OUTPUT

    # (2c): update_from_output()
    scheduler.update_from_output(scheduler_output, model_runner_output)

    # STEP (3): Finished sending.
    # (3a): schedule() - nothing is scheduled; the request's blocks are
    # still held until the worker reports the KV transfer as finished.
    scheduler_output = scheduler.schedule()
    assert len(scheduler.running) == 0
    assert len(scheduler_output.finished_req_ids) == 0
    assert len(scheduler_output.scheduled_new_reqs) == 0
    assert len(scheduler_output.scheduled_cached_reqs) == 0
    assert len(scheduler.finished_req_ids) == 0

    # (3b): execute_model()
    model_runner_output = copy.deepcopy(EMPTY_MODEL_RUNNER_OUTPUT)
    model_runner_output.finished_sending = [request_id]

    # (3c): update_from_output()
    scheduler.update_from_output(scheduler_output, model_runner_output)

    # Confirm we do not have any memory leaks after req lifecycle.
    assert_scheduler_empty(scheduler)
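Steps (1) to (3) boil down to one rule: a remote-decode request leaves the running queue after its single prefill step, but its KV blocks are only released once the worker reports the request in `finished_sending`. The toy model below illustrates that rule in isolation. `ToyScheduler` and `Block` are hypothetical stand-ins written for this sketch, not vLLM classes, and they ignore everything except block ownership.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0


@dataclass
class ToyScheduler:
    """Hypothetical model of deferred block freeing for remote decode."""
    free_blocks: list = field(
        default_factory=lambda: [Block(i) for i in range(8)])
    req_to_blocks: dict = field(default_factory=dict)
    status: dict = field(default_factory=dict)
    finished_req_ids: set = field(default_factory=set)

    def prefill(self, req_id: str, num_blocks: int) -> None:
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        for b in blocks:
            b.ref_cnt = 1
        self.req_to_blocks[req_id] = blocks
        # The prefill side is done after one step: the request finishes,
        # but its blocks stay allocated so the decode side can pull them.
        self.status[req_id] = "FINISHED_REMOTE_DECODE"
        self.finished_req_ids.add(req_id)

    def schedule(self) -> set:
        # Hand finished request IDs to the worker exactly once.
        out, self.finished_req_ids = self.finished_req_ids, set()
        return out

    def update_from_output(self, finished_sending) -> None:
        # Blocks are freed only when the worker reports the send as done.
        for req_id in finished_sending:
            for b in self.req_to_blocks.pop(req_id):
                b.ref_cnt -= 1
                if b.ref_cnt == 0:
                    self.free_blocks.append(b)
            del self.status[req_id]

    def is_empty(self) -> bool:
        return not (self.req_to_blocks or self.status
                    or self.finished_req_ids)
```

Calling `update_from_output` with an empty `finished_sending` list, as step (2) effectively does, leaves the blocks allocated; only the step (3) report returns them to the free pool.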
