
Commit c9a82db

Clean up v0.9.1 code
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Parent: 3034e5a

24 files changed: 201 additions, 733 deletions


.github/workflows/accuracy_test.yaml

Lines changed: 0 additions & 1 deletion
@@ -38,7 +38,6 @@ on:
         options:
           - main
           - v0.9.2
-          - v0.9.1
           - v0.7.3
       vllm-ascend-version:
         description: 'vllm-ascend version:'

.github/workflows/vllm_ascend_test_pd.yaml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ jobs:
         vllm_verison: [
           # revert me when V1 disaggregation prefill is merged in main
           # main,
-          v0.9.1
+          v0.9.2
         ]
     name: vLLM Ascend prefilling decoding disaggregation test
     runs-on: linux-arm64-npu-static-8

docs/source/developer_guide/feature_guide/patch.md

Lines changed: 6 additions & 6 deletions
@@ -20,11 +20,11 @@ In `vllm_ascend/patch`, you can see the code structure as follows:
 vllm_ascend
 ├── patch
 │ ├── platform
-│ │ ├── patch_0_9_1
+│ │ ├── patch_0_9_2
 │ │ ├── patch_common
 │ │ ├── patch_main
 │ ├── worker
-│ │ ├── patch_0_9_1
+│ │ ├── patch_0_9_2
 │ │ ├── patch_common
 │ │ ├── patch_main
 └───────────
@@ -38,15 +38,15 @@ vllm_ascend

 In both **platform** and **worker** folder, there are several patch modules. They are used for patching different version of vLLM.

-- `patch_0_9_1`: This module is used for patching vLLM 0.9.1. The version is always the nearest version of vLLM. Once vLLM is released, we will drop this patch module and bump to a new version. For example, `patch_0_9_2` is used for patching vLLM 0.9.2.
+- `patch_0_9_2`: This module is used for patching vLLM 0.9.2. The version is always the nearest version of vLLM. Once vLLM is released, we will drop this patch module and bump to a new version. For example, `patch_0_9_2` is used for patching vLLM 0.9.2.
 - `patch_main`: This module is used for patching the code in vLLM main branch.
-- `patch_common`: This module is used for patching both vLLM 0.9.1 and vLLM main branch.
+- `patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM main branch.

 ## How to write a patch

 Before writing a patch, following the principle above, we should patch the least code. If it's necessary, we can patch the code in either **platform** and **worker** folder. Here is an example to patch `distributed` module in vLLM.

-1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.9.1 and main of vLLM.
+1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both 0.9.2 and main of vLLM.
 2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
 3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
 4. Write your patch code in the new file. Here is an example:
@@ -79,4 +79,4 @@ Before writing a patch, following the principle above, we should patch the least

 ## Limitation
 1. In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only support patch the code in Main process and Worker process by default. If you want to patch the code runs in EngineCore process, you should patch EngineCore process entirely during setup, the entry code is here `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
-2. If you are running an edited vLLM code, the version of the vLLM may be changed automatically. For example, if you runs an edited vLLM based on v0.9.1, the version of vLLM may be change to v0.9.2xxx, in this case, the patch for v0.9.1 in vLLM Ascend would not work as expect, because that vLLM Ascend can't distinguish the version of vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, then the patch for v0.9.1 should work.
+2. If you are running an edited vLLM code, the version of the vLLM may be changed automatically. For example, if you runs an edited vLLM based on v0.9.2, the version of vLLM may be change to v0.9.2xxx, in this case, the patch for v0.9.2 in vLLM Ascend would not work as expect, because that vLLM Ascend can't distinguish the version of vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, then the patch for v0.9.2 should work.

tests/e2e/multicard/test_offline_inference_distributed.py

Lines changed: 0 additions & 22 deletions
@@ -73,28 +73,6 @@ def test_models_distributed_DeepSeek_multistream_moe():
         vllm_model.generate_greedy(example_prompts, max_tokens)


-@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE": "1"})
-def test_models_distributed_topk() -> None:
-    example_prompts = [
-        "vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.",
-        "Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.",
-        "Compare and contrast artificial intelligence with human intelligence in terms of processing information.",
-    ]
-    dtype = "half"
-    sampling_params = SamplingParams(max_tokens=5,
-                                     temperature=0.0,
-                                     top_k=50,
-                                     top_p=0.9)
-
-    with VllmRunner(
-            "deepseek-ai/DeepSeek-V2-Lite",
-            dtype=dtype,
-            tensor_parallel_size=4,
-            distributed_executor_backend="mp",
-    ) as vllm_model:
-        vllm_model.generate(example_prompts, sampling_params)
-
-
 @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
 def test_models_distributed_DeepSeek_dbo():
     example_prompts = ["The president of the United States is"] * 41

tests/e2e/singlecard/core/ascend_scheduler/test_ascend_scheduler.py

Lines changed: 16 additions & 53 deletions
@@ -16,7 +16,6 @@
 from vllm.v1.structured_output import StructuredOutputManager

 from vllm_ascend.core.scheduler import AscendScheduler
-from vllm_ascend.utils import vllm_version_is

 EOS_TOKEN_ID = 50256

@@ -140,9 +139,7 @@ def create_requests(num_requests: int,
             multi_modal_placeholders=mm_position,
             multi_modal_hashes=None,
             eos_token_id=EOS_TOKEN_ID,
-            **({
-                "pooling_params": None
-            } if not vllm_version_is("0.9.1") else {}),
+            pooling_params={},
         )
         requests.append(request)
     return requests
@@ -201,10 +198,7 @@ def test_schedule(enable_prefix_caching: Optional[bool],
     # Test initial scheduling
     output = scheduler.schedule()
     assert len(output.scheduled_new_reqs) == len(requests)
-    if vllm_version_is("0.9.1"):
-        assert len(output.scheduled_cached_reqs) == 0
-    else:
-        assert output.scheduled_cached_reqs.num_reqs == 0
+    assert output.scheduled_cached_reqs.num_reqs == 0
     assert len(output.finished_req_ids) == 0
     # Verify all requests are scheduled.
     for req_id, num_tokens in output.num_scheduled_tokens.items():
@@ -241,10 +235,7 @@ def test_schedule_concurrent_partial_requests(enable_prefix_caching: bool):

     output = scheduler.schedule()
     assert len(output.scheduled_new_reqs) == 3
-    if vllm_version_is("0.9.1"):
-        assert len(output.scheduled_cached_reqs) == 0
-    else:
-        assert output.scheduled_cached_reqs.num_reqs == 0
+    assert output.scheduled_cached_reqs.num_reqs == 0
     assert len(output.finished_req_ids) == 0

     # The first request is scheduled partially - 400.
@@ -264,20 +255,15 @@ def test_schedule_concurrent_partial_requests(enable_prefix_caching: bool):
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])
     scheduler.update_from_output(output, model_runner_output)

     # Schedule the next step. All three requests are running.
     # Processed the remaining prefills of the first and second requests.
     output1 = scheduler.schedule()
     assert len(scheduler.running) == 3
     assert len(output1.scheduled_new_reqs) == 0
-    if vllm_version_is("0.9.1"):
-        assert len(output1.scheduled_cached_reqs) == 3
-    else:
-        assert output1.scheduled_cached_reqs.num_reqs == 3
+    assert output1.scheduled_cached_reqs.num_reqs == 3
     assert len(output1.finished_req_ids) == 0
     assert output1.num_scheduled_tokens[requests[0].request_id] == 400
     assert output1.num_scheduled_tokens[requests[1].request_id] == 400
@@ -293,18 +279,13 @@ def test_schedule_concurrent_partial_requests(enable_prefix_caching: bool):
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(output1, model_runner_output)
     output2 = scheduler.schedule()
     assert len(scheduler.running) == 3
     assert len(output2.scheduled_new_reqs) == 0
-    if vllm_version_is("0.9.1"):
-        assert len(output2.scheduled_cached_reqs) == 3
-    else:
-        assert output2.scheduled_cached_reqs.num_reqs == 3
+    assert output2.scheduled_cached_reqs.num_reqs == 3
     assert len(output2.finished_req_ids) == 0
     assert output2.num_scheduled_tokens[requests[0].request_id] == 1
     assert output2.num_scheduled_tokens[requests[1].request_id] == 1
@@ -351,9 +332,7 @@ def test_stop_via_update_from_output():
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(scheduler_output, model_output)

@@ -402,9 +381,7 @@ def test_stop_via_update_from_output():
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(scheduler_output, model_output)

@@ -452,9 +429,7 @@ def test_stop_via_update_from_output():
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(scheduler_output, model_output)

@@ -497,9 +472,7 @@ def test_stop_via_update_from_output():
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(scheduler_output, model_output)

@@ -549,9 +522,7 @@ def test_schedule_concurrent_batches(enable_prefix_caching: Optional[bool],
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(scheduler_output0, model_runner_output)

@@ -569,9 +540,7 @@ def test_schedule_concurrent_batches(enable_prefix_caching: Optional[bool],
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     scheduler.update_from_output(scheduler_output1, model_runner_output)

@@ -622,9 +591,7 @@ def test_schedule_spec_decoding_stats(spec_tokens, output_tokens, expected):
         spec_token_ids=spec_tokens,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     engine_core_outputs = scheduler.update_from_output(output,
                                                        model_runner_output)
@@ -664,9 +631,7 @@ def test_schedule_spec_decoding_stats(spec_tokens, output_tokens, expected):
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])

     engine_core_outputs = scheduler.update_from_output(output,
                                                        model_runner_output)
@@ -695,9 +660,7 @@ def make_output(scheduler: AscendScheduler):
         spec_token_ids=None,
         logprobs=None,
         prompt_logprobs_dict={},
-        **({
-            "pooler_output": []
-        } if not vllm_version_is("0.9.1") else {}))
+        pooler_output=[])


 def assert_scheduler_empty(scheduler: AscendScheduler):
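The hunks above all apply the same cleanup: every `vllm_version_is("0.9.1")` branch collapses into the single 0.9.2 code path (`pooling_params={}`, `pooler_output=[]`, `scheduled_cached_reqs.num_reqs`). A condensed before/after sketch follows; the wrapper function is invented for illustration, while the branch bodies mirror the diff above.

# Condensed illustration of the version gate removed throughout this file;
# the wrapper function is hypothetical, the branch contents come from the diff.
from vllm_ascend.utils import vllm_version_is  # import dropped by this commit


def extra_runner_output_kwargs():
    if vllm_version_is("0.9.1"):      # old 0.9.1 path, deleted in this commit
        return {}
    return {"pooler_output": []}      # 0.9.2 path, now passed unconditionally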

tests/e2e/singlecard/sample/test_rejection_sampler.py

Lines changed: 18 additions & 41 deletions
@@ -4,12 +4,12 @@
 import pytest
 import torch
 import torch.nn.functional as F
+from vllm.v1.sample.logits_processor import LogitsProcessorManager
 from vllm.v1.sample.metadata import SamplingMetadata
 from vllm.v1.spec_decode.metadata import SpecDecodeMetadata

 from vllm_ascend.sample.rejection_sampler import (PLACEHOLDER_TOKEN_ID,
                                                   AscendRejectionSampler)
-from vllm_ascend.utils import vllm_version_is

 DEVICE = "npu"

@@ -50,46 +50,23 @@ def create_sampling_metadata(
         temperature = None
     else:
         assert temperature is not None
-    if vllm_version_is("0.9.1"):
-        return SamplingMetadata(
-            temperature=temperature,
-            all_greedy=all_greedy,
-            all_random=not all_greedy,
-            top_p=top_p,
-            top_k=top_k,
-            min_p=torch.empty(1, ),
-            generators=generators,
-            max_num_logprobs=0,
-            no_penalties=False,
-            prompt_token_ids=None,
-            frequency_penalties=torch.tensor([]),
-            presence_penalties=torch.tensor([]),
-            repetition_penalties=torch.tensor([]),
-            output_token_ids=[],
-            min_tokens={},
-            logit_bias=[None],
-            allowed_token_ids_mask=None,
-            bad_words_token_ids={},
-        )
-    else:
-        from vllm.v1.sample.logits_processor import LogitsProcessorManager
-
-        return SamplingMetadata(temperature=temperature,
-                                all_greedy=all_greedy,
-                                all_random=not all_greedy,
-                                top_p=top_p,
-                                top_k=top_k,
-                                generators=generators,
-                                max_num_logprobs=0,
-                                no_penalties=False,
-                                prompt_token_ids=None,
-                                frequency_penalties=torch.tensor([]),
-                                presence_penalties=torch.tensor([]),
-                                repetition_penalties=torch.tensor([]),
-                                output_token_ids=[],
-                                allowed_token_ids_mask=None,
-                                bad_words_token_ids={},
-                                logitsprocs=LogitsProcessorManager())
+
+    return SamplingMetadata(temperature=temperature,
+                            all_greedy=all_greedy,
+                            all_random=not all_greedy,
+                            top_p=top_p,
+                            top_k=top_k,
+                            generators=generators,
+                            max_num_logprobs=0,
+                            no_penalties=False,
+                            prompt_token_ids=None,
+                            frequency_penalties=torch.tensor([]),
+                            presence_penalties=torch.tensor([]),
+                            repetition_penalties=torch.tensor([]),
+                            output_token_ids=[],
+                            allowed_token_ids_mask=None,
+                            bad_words_token_ids={},
+                            logitsprocs=LogitsProcessorManager())


 ########################### Tests for Greedy Sampling ###################

tests/e2e/singlecard/test_embedding.py

Lines changed: 0 additions & 4 deletions
@@ -19,12 +19,10 @@
 from collections.abc import Sequence
 from typing import Optional

-import pytest
 from modelscope import snapshot_download  # type: ignore[import-untyped]

 from tests.conftest import HfRunner
 from tests.utils import check_embeddings_close, matryoshka_fy
-from vllm_ascend.utils import vllm_version_is


 def run_embedding_correctness_test(
@@ -51,8 +49,6 @@ def test_dummy():
     assert True


-@pytest.mark.skipif(vllm_version_is("0.9.1"),
-                    reason="vLLM 0.9.1 does not support embed task for v1")
 def test_embed_models_correctness(hf_runner, vllm_runner):
     queries = ['What is the capital of China?', 'Explain gravity']

