Commit 2c50275

Authored by tlrmchlsmth, ApostaC, robertgshaw2-redhat, and Robert Shaw
[P/D Disagg] NIXL MLA (#70)
* [Update] LMcache connector v1 implementation Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [Add] examples for disaggregated prefill Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [add] extra information about evns Signed-off-by: ApostaC <yihua98@uchicago.edu>
* Initial stubs for P/D scheduling changes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Updates Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Rs branch (#3)
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Rs branch (#5) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Remove Unneeded Arguments (#7)
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * cleanup Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  ---------
  Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Improve disagg-example.sh (#8)
  - fix spelling
  - CUDA_VISIBLE_DEVICES should be set externally
  Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* added connector Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* update Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* remove Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* seems to load properly Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Revert "updated" This reverts commit 97316d9.
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* added Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* diffs for local dev on macos Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* update Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updaed Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* Checkpoint. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* Cleanup Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* WIP Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated on scheduler side Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* Hacking away Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* cleanup Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* ensure request removed from running list Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* Runs E2E. Garbage output. Crashes on 2nd request Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* rename files Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* updated Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* update Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
* Second request no longer crashes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Remove gpu_model_runner hacks Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Clean up Justfile Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Bugfix] Stale finished requests in EMPTY_MODEL_RUNNER_OUTPUT Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* justfile edits Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Update Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Fixes - lm_eval gsm8k has correctness Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* "just delete the assert" Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* fixup precommit issues Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Fixes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* updated (#12) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Add Accuracy Test (#13)
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  ---------
  Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Preemption Bugfixes (#15)
  * stash fixed double free issue Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * fixed issue Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updatrd Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  ---------
  Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated (#16) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Fix Bad Merge | Fix Memory Leak in Upstream (#18)
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * fix merge Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  * updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
  ---------
  Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* cleanup code Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* cleanup code Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* stash Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updatted Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* revert Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* more spurious changes Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* updated Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
* Support MLA in NIXL connector Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* WIP adding tests Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* wip Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Fixes Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

---------
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
1 parent 8beac5e commit 2c50275

File tree

4 files changed: +182 -87 lines changed

tests/entrypoints/llm/test_accuracy.py
+2

@@ -16,6 +16,7 @@
 MODEL_NAMES = [
     "Qwen/Qwen2-1.5B-Instruct",
     "google/gemma-3-1b-it",
+    "deepseek-ai/deepseek-vl2-tiny",
 ]
 NUM_CONCURRENT = 500
 TASK = "gsm8k"
@@ -24,6 +25,7 @@
 EXPECTED_VALUES = {
     "Qwen/Qwen2-1.5B-Instruct": 0.58,
     "google/gemma-3-1b-it": 0.25,
+    "deepseek-ai/deepseek-vl2-tiny": 0.4,
 }

+129 -66

@@ -1,9 +1,11 @@
 #!/bin/bash
-
 set -xe

-# Model to run.
-MODEL_NAME=Qwen/Qwen3-0.6B
+# Models to run
+MODELS=(
+  # "Qwen/Qwen3-0.6B"
+  "deepseek-ai/deepseek-vl2-tiny"
+)

 # Number of prefill and decode instances to create
 NUM_PREFILL_INSTANCES=${NUM_PREFILL_INSTANCES:-1} # Default to 1
@@ -24,86 +26,147 @@ wait_for_server() {
   done" && return 0 || return 1
 }

-# Arrays to store all hosts and ports
-PREFILL_HOSTS=()
-PREFILL_PORTS=()
-DECODE_HOSTS=()
-DECODE_PORTS=()
+# Function to clean up previous instances
+cleanup_instances() {
+  echo "Cleaning up any running vLLM instances..."
+  pkill -f "vllm serve" || true
+  sleep 2
+}

-# Start prefill instances
-for i in $(seq 0 $((NUM_PREFILL_INSTANCES-1))); do
-  # Calculate GPU ID - we'll distribute across available GPUs
-  GPU_ID=$((i % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
-  # Calculate port number (base port + instance number)
-  PORT=$((8100 + i))
-  # Calculate side channel port
-  SIDE_CHANNEL_PORT=$((5559 + i))
+# Handle to get model-specific arguments for deepseek
+get_model_args() {
+  local model_name=$1
+  local extra_args=""

-  echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"
+  if [[ "$model_name" == "deepseek-ai/deepseek-vl2-tiny" ]]; then
+    extra_args="--hf_overrides '{\"architectures\": [\"DeepseekVLV2ForCausalLM\"]}' --trust-remote-code"
+  fi

-  CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $MODEL_NAME \
-    --port $PORT \
-    --enforce-eager \
-    --disable-log-requests \
-    --gpu-memory-utilization 0.2 \
-    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &
+  echo "$extra_args"
+}

-  # Store host and port for proxy configuration
-  PREFILL_HOSTS+=("localhost")
-  PREFILL_PORTS+=($PORT)
-done

-# Start decode instances
-for i in $(seq 0 $((NUM_DECODE_INSTANCES-1))); do
-  # Calculate GPU ID - we'll distribute across available GPUs, starting from after prefill GPUs
-  GPU_ID=$(((i + NUM_PREFILL_INSTANCES) % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
-  # Calculate port number (base port + instance number)
-  PORT=$((8200 + i))
-  # Calculate side channel port
-  SIDE_CHANNEL_PORT=$((5659 + i))
+# Function to run tests for a specific model
+run_tests_for_model() {
+  local model_name=$1
+  echo "================================"
+  echo "Testing model: $model_name"
+  echo "================================"
+
+  # Get model-specific arguments
+  local model_args=$(get_model_args "$model_name")
+
+  # Arrays to store all hosts and ports
+  PREFILL_HOSTS=()
+  PREFILL_PORTS=()
+  DECODE_HOSTS=()
+  DECODE_PORTS=()

-  echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"
+  # Start prefill instances
+  for i in $(seq 0 $((NUM_PREFILL_INSTANCES-1))); do
+    # Calculate GPU ID - we'll distribute across available GPUs
+    GPU_ID=$((i % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
+    # Calculate port number (base port + instance number)
+    PORT=$((8100 + i))
+    # Calculate side channel port
+    SIDE_CHANNEL_PORT=$((5559 + i))

-  CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $MODEL_NAME \
+    echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"
+
+    # Build the command with or without model-specific args
+    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
    --port $PORT \
    --enforce-eager \
    --disable-log-requests \
    --gpu-memory-utilization 0.2 \
-    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &
+    --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
+
+    if [ -n "$model_args" ]; then
+      FULL_CMD="$BASE_CMD $model_args"
+    else
+      FULL_CMD="$BASE_CMD"
+    fi
+
+    eval "$FULL_CMD &"
+
+    # Store host and port for proxy configuration
+    PREFILL_HOSTS+=("localhost")
+    PREFILL_PORTS+=($PORT)
+  done
+
+  # Start decode instances
+  for i in $(seq 0 $((NUM_DECODE_INSTANCES-1))); do
+    # Calculate GPU ID - we'll distribute across available GPUs, starting from after prefill GPUs
+    GPU_ID=$(((i + NUM_PREFILL_INSTANCES) % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
+    # Calculate port number (base port + instance number)
+    PORT=$((8200 + i))
+    # Calculate side channel port
+    SIDE_CHANNEL_PORT=$((5659 + i))
+
+    echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"
+
+    # Build the command with or without model-specific args
+    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
+    --port $PORT \
+    --enforce-eager \
+    --disable-log-requests \
+    --gpu-memory-utilization 0.2 \
+    --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

-  # Store host and port for proxy configuration
-  DECODE_HOSTS+=("localhost")
-  DECODE_PORTS+=($PORT)
-done
+    if [ -n "$model_args" ]; then
+      FULL_CMD="$BASE_CMD $model_args"
+    else
+      FULL_CMD="$BASE_CMD"
+    fi

-# Wait for all instances to start
-for PORT in "${PREFILL_PORTS[@]}"; do
-  echo "Waiting for prefill instance on port $PORT to start..."
-  wait_for_server $PORT
-done
+    eval "$FULL_CMD &"

-for PORT in "${DECODE_PORTS[@]}"; do
-  echo "Waiting for decode instance on port $PORT to start..."
-  wait_for_server $PORT
-done
+    # Store host and port for proxy configuration
+    DECODE_HOSTS+=("localhost")
+    DECODE_PORTS+=($PORT)
+  done
+
+  # Wait for all instances to start
+  for PORT in "${PREFILL_PORTS[@]}"; do
+    echo "Waiting for prefill instance on port $PORT to start..."
+    wait_for_server $PORT
+  done

-# Build the command for the proxy server with all the hosts and ports
-PROXY_CMD="python ${GIT_ROOT}/tests/v1/kv_connector/toy_proxy_server.py --port 8192"
+  for PORT in "${DECODE_PORTS[@]}"; do
+    echo "Waiting for decode instance on port $PORT to start..."
+    wait_for_server $PORT
+  done

-# Add all prefill hosts and ports
-PROXY_CMD+=" --prefiller-hosts ${PREFILL_HOSTS[@]}"
-PROXY_CMD+=" --prefiller-ports ${PREFILL_PORTS[@]}"
+  # Build the command for the proxy server with all the hosts and ports
+  PROXY_CMD="python ${GIT_ROOT}/tests/v1/kv_connector/toy_proxy_server.py --port 8192"

-# Add all decode hosts and ports
-PROXY_CMD+=" --decoder-hosts ${DECODE_HOSTS[@]}"
-PROXY_CMD+=" --decoder-ports ${DECODE_PORTS[@]}"
+  # Add all prefill hosts and ports
+  PROXY_CMD+=" --prefiller-hosts ${PREFILL_HOSTS[@]}"
+  PROXY_CMD+=" --prefiller-ports ${PREFILL_PORTS[@]}"

-# Start the proxy server
-echo "Starting proxy server with command: $PROXY_CMD"
-$PROXY_CMD &
+  # Add all decode hosts and ports
+  PROXY_CMD+=" --decoder-hosts ${DECODE_HOSTS[@]}"
+  PROXY_CMD+=" --decoder-ports ${DECODE_PORTS[@]}"

-# Wait for the proxy to start
-sleep 5
+  # Start the proxy server
+  echo "Starting proxy server with command: $PROXY_CMD"
+  $PROXY_CMD &
+
+  # Wait for the proxy to start
+  sleep 5
+
+  # Run lm eval for this model
+  echo "Running tests for $model_name"
+  TEST_MODEL=$model_name python -m pytest -s -x ${GIT_ROOT}/tests/v1/kv_connector/test_accuracy.py
+
+  # Clean up before running next model
+  cleanup_instances
+  sleep 3
+}
+
+# Run tests for each model
+for model in "${MODELS[@]}"; do
+  run_tests_for_model "$model"
+done

-# Run lm eval.
-python -m pytest -s -x ${GIT_ROOT}/tests/v1/kv_connector/test_accuracy.py
+echo "All tests completed!"
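To make the instance layout in the script above concrete: prefill instance i serves HTTP on port 8100+i with NIXL side-channel port 5559+i, decode instance i uses 8200+i and 5659+i, and GPUs are assigned round-robin with decode instances placed after the prefill ones. A minimal Python sketch of that arithmetic (the GPU and instance counts below are illustrative assumptions, not values from the script):

# Sketch of the script's port/GPU assignment; counts are assumed for illustration.
NUM_GPUS = 4
NUM_PREFILL_INSTANCES = 2
NUM_DECODE_INSTANCES = 2

for i in range(NUM_PREFILL_INSTANCES):
    # Prefill: HTTP ports from 8100, side channels from 5559, GPUs round-robin.
    print(f"prefill {i}: gpu={i % NUM_GPUS}, port={8100 + i}, side_channel={5559 + i}")

for i in range(NUM_DECODE_INSTANCES):
    # Decode: GPUs continue after the prefill instances; ports from 8200.
    gpu = (i + NUM_PREFILL_INSTANCES) % NUM_GPUS
    print(f"decode {i}: gpu={gpu}, port={8200 + i}, side_channel={5659 + i}")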
tests/v1/kv_connector/test_accuracy.py
+23 -7

@@ -1,32 +1,40 @@
 # SPDX-License-Identifier: Apache-2.0
+import os
+
 import lm_eval
 import openai

 BASE_URL = "http://localhost:8192/v1"
-MODEL_NAME = "Qwen/Qwen3-0.6B"
 NUM_CONCURRENT = 100
 TASK = "gsm8k"
 FILTER = "exact_match,strict-match"
 RTOL = 0.03
-EXPECTED_VALUE = 0.41
+
+# Model-specific expected values
+EXPECTED_VALUES = {
+    "Qwen/Qwen3-0.6B": 0.41,
+    "deepseek-ai/deepseek-vl2-tiny": 0.20,
+}

 SIMPLE_PROMPT = "The best part about working on vLLM is that I got to meet so many people across various different organizations like UCB, Google, and Meta which means",  # noqa: E501

+# Get model name from environment variable
+MODEL_NAME = os.environ.get("TEST_MODEL", "Qwen/Qwen3-0.6B")
+

 def run_simple_prompt():
     client = openai.OpenAI(api_key="EMPTY", base_url=BASE_URL)
     completion = client.completions.create(model=MODEL_NAME,
                                            prompt=SIMPLE_PROMPT)

     print("-" * 50)
-    print("Completion results:")
+    print(f"Completion results for {MODEL_NAME}:")
     print(completion)
     print("-" * 50)


 def test_accuracy():
     """Run the end to end accuracy test."""
-
     run_simple_prompt()

     model_args = (f"model={MODEL_NAME},"
@@ -40,6 +48,14 @@ def test_accuracy():
     )

     measured_value = results["results"][TASK][FILTER]
-    assert (measured_value - RTOL < EXPECTED_VALUE
-            and measured_value + RTOL > EXPECTED_VALUE
-            ), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"
+    expected_value = EXPECTED_VALUES.get(MODEL_NAME)
+
+    if expected_value is None:
+        print(f"Warning: No expected value found for {MODEL_NAME}. "
+              "Skipping accuracy check.")
+        print(f"Measured value: {measured_value}")
+        return
+
+    assert (measured_value - RTOL < expected_value
+            and measured_value + RTOL > expected_value
+            ), f"Expected: {expected_value} | Measured: {measured_value}"
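The updated test selects its model from the TEST_MODEL environment variable and accepts any lm_eval score within RTOL of that model's expected value, skipping the check for unknown models. A hedged sketch of that selection-and-tolerance logic (the measured score below is a placeholder, not a real result):

import os

# Values copied from the diff above; measured_value is a made-up placeholder.
EXPECTED_VALUES = {
    "Qwen/Qwen3-0.6B": 0.41,
    "deepseek-ai/deepseek-vl2-tiny": 0.20,
}
RTOL = 0.03

model_name = os.environ.get("TEST_MODEL", "Qwen/Qwen3-0.6B")
measured_value = 0.42  # placeholder score
expected_value = EXPECTED_VALUES.get(model_name)

if expected_value is None:
    # Unknown model: report and skip, mirroring the test's warning path.
    print(f"Warning: no expected value for {model_name}; skipping check.")
else:
    # Equivalent to: expected - RTOL < measured < expected + RTOL.
    assert abs(measured_value - expected_value) < RTOL

The shell script drives this via TEST_MODEL=$model_name python -m pytest ..., so the same test file covers every entry in MODELS.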

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
+28 -14

@@ -331,20 +331,32 @@ def _nixl_handshake(self, host: str, port: int):
     def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
         """Register the KV Cache data in nixl."""

-        first_layer_name = next(iter(kv_caches))
-        first_kv_cache = kv_caches[first_layer_name]
+        first_layer_name, first_kv_cache = next(iter(kv_caches.items()))
+        kv_elem_size = first_kv_cache.element_size()
+
+        # TODO(tms): Find a more robust way to detect and handle MLA
+        use_mla = len(first_kv_cache.shape) == 3
+        if use_mla:
+            # MLA case.
+            self.num_blocks = first_kv_cache.shape[0]
+            block_rank = 2  # [block_size, latent_dim]
+            block_shape = first_kv_cache.shape[-block_rank:]
+        else:
+            # [2 (k and v), num_blocks, ...]
+            self.num_blocks = first_kv_cache.shape[1]
+            block_rank = 3  # [block_size, kv_heads, head_dim]
+            block_shape = first_kv_cache.shape[-block_rank:]

-        # [2 (k and v), num_blocks, ...]
-        # TODO(tms): num_blocks will be in a different spot for MLA.
-        num_blocks = first_kv_cache.shape[1]
-        kv_elem_size = first_kv_cache[0].element_size()
         # TODO(tms): self.block_len needs to be per-layer for sliding window,
         # hybrid attn, etc
-        self.block_len = kv_elem_size * math.prod(first_kv_cache.shape[-3:])
-
-        logger.debug("Per layer kv cache size: %s", first_kv_cache[0].shape)
-        self.num_blocks = num_blocks
-        self.dst_num_blocks[self.engine_id] = num_blocks
+        self.block_len = kv_elem_size * math.prod(block_shape)
+
+        logger.debug("Registering KV_Caches. use_mla: %s, shape %s", use_mla,
+                     first_kv_cache.shape)
+        logger.debug("num_blocks: %s, block_shape: %s", self.num_blocks,
+                     block_shape)
+        logger.debug("Per layer kv cache size: %s", first_kv_cache.shape)
+        self.dst_num_blocks[self.engine_id] = self.num_blocks
         self.kv_caches = kv_caches
         kv_caches_base_addr = []
         caches_data = []
@@ -355,10 +367,12 @@ def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
         # are non-contiguous (it's not locally guaranteed that they will be)
         # Disadvantage is that the encoded NixlAgentMetadata is now larger
         # (roughly 8KB vs 5KB).
-        for layer_name in kv_caches:
-            for cache in kv_caches[layer_name]:
+        for cache_or_caches in kv_caches.values():
+            # Normalize to always be a list of caches
+            cache_list = [cache_or_caches] if use_mla else cache_or_caches
+            for cache in cache_list:
                 base_addr = cache.data_ptr()
-                region_len = num_blocks * self.block_len
+                region_len = self.num_blocks * self.block_len
                 caches_data.append((base_addr, region_len, self.rank, ""))
                 kv_caches_base_addr.append(base_addr)
         self.kv_caches_base_addr[self.engine_id] = kv_caches_base_addr
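The heart of the change is detecting MLA by tensor rank: an MLA KV cache is one 3-D tensor per layer (a single latent cache, no separate K and V), so num_blocks sits at dim 0 and a block spans [block_size, latent_dim]; a standard cache stacks K and V at dim 0, putting num_blocks at dim 1 with blocks of [block_size, kv_heads, head_dim]. A small sketch of that shape logic, with made-up dimensions (not vLLM's actual sizes):

import math
import torch

# Illustrative shapes only; real tensors come from vLLM's cache engine.
mla_cache = torch.empty(8, 16, 576)            # [num_blocks, block_size, latent_dim]
standard_cache = torch.empty(2, 8, 16, 4, 64)  # [2 (k and v), num_blocks, block_size, kv_heads, head_dim]

for cache in (mla_cache, standard_cache):
    use_mla = cache.dim() == 3
    num_blocks = cache.shape[0] if use_mla else cache.shape[1]
    block_rank = 2 if use_mla else 3
    block_shape = cache.shape[-block_rank:]
    # block_len is the per-block byte length that gets registered with NIXL.
    block_len = cache.element_size() * math.prod(block_shape)
    print(f"use_mla={use_mla} num_blocks={num_blocks} block_len={block_len}")

This is also why the registration loop normalizes kv_caches.values(): with MLA each value is a single tensor (one region per layer), while the standard layout iterates the K and V halves and registers each as its own region.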
