[aclgraph] implement NPUPiecewiseBackend to enable aclgraph #836


Draft: wants to merge 1 commit into main

Conversation

MengqingCao
Collaborator

What this PR does / why we need it?

Implement NPUPiecewiseBackend to enable aclgraph.
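
At a high level, the piecewise backend splits the torch.compile FX graph at the attention ops listed in splitting_ops (see the compilation config in the log below), captures the remaining pieces as ACL graphs for a fixed set of batch sizes, and replays a captured graph whenever an incoming batch matches one of those sizes, falling back to eager execution otherwise. Below is a minimal sketch of that per-size dispatch idea; the class and argument names are illustrative only and are not the actual NPUPiecewiseBackend API added by this PR:

```python
from typing import Callable, Dict, Sequence

import torch


class PiecewiseRunner:
    """Illustrative sketch of per-batch-size graph dispatch.

    Not the actual NPUPiecewiseBackend: the real backend wraps the FX graph
    pieces produced by torch.compile and captures/replays ACL graphs, while
    here capture_fn is just an opaque callable standing in for that step.
    """

    def __init__(
        self,
        eager_fn: Callable[[torch.Tensor], torch.Tensor],
        capture_fn: Callable[[int], Callable[[torch.Tensor], torch.Tensor]],
        capture_sizes: Sequence[int],
    ) -> None:
        self.eager_fn = eager_fn      # plain (uncaptured) execution of the graph piece
        self.capture_fn = capture_fn  # captures one graph for a given batch size
        self.capture_sizes = sorted(capture_sizes)
        self.captured: Dict[int, Callable[[torch.Tensor], torch.Tensor]] = {}

    def warmup(self) -> None:
        # Capture one graph per configured batch size; the
        # "cudagraph_capture_sizes" list in the engine config below plays
        # this role for ACL graphs.
        for size in self.capture_sizes:
            self.captured[size] = self.capture_fn(size)

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        replay = self.captured.get(x.shape[0])
        if replay is not None:
            return replay(x)      # replay the pre-captured graph
        return self.eager_fn(x)   # fall back to eager for uncaptured shapes
```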

How was this patch tested?

Tested locally, because aclgraph cannot be enabled by default yet. The full run log follows.
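
For reference, a minimal sketch of the kind of offline-inference script used for the local test; the model name and prompts are taken from the log below, while the sampling parameters and max_tokens are assumptions and may differ from the actual examples/offline_inference_npu.py:

```python
import os

# Enable the V1 engine so the piecewise / ACL graph path is exercised.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling settings are illustrative; the example script may use different values.
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

# enforce_eager is left at its default (False) so the model is compiled
# and ACL graphs are captured during warmup.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```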

(atb) xxx@xxx-docker:~/code/vllm-ascend$ VLLM_USE_V1=1 python examples/offline_inference_npu.py 
INFO 05-13 12:53:12 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 05-13 12:53:12 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 05-13 12:53:15 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-13 12:53:15 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-13 12:53:15 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-13 12:53:15 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-13 12:53:15 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-13 12:53:15 [__init__.py:44] plugin ascend loaded.
INFO 05-13 12:53:15 [__init__.py:44] plugin ascend loaded.
INFO 05-13 12:53:15 [__init__.py:239] Platform plugin ascend is activated
WARNING 05-13 12:53:18 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
INFO 05-13 12:53:21 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 05-13 12:53:21 [__init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 05-13 12:53:21 [__init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 05-13 12:53:21 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 05-13 12:53:21 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-13 12:53:21 [__init__.py:44] plugin ascend_enhanced_model loaded.
INFO 05-13 12:53:21 [__init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 05-13 12:53:21 [registry.py:393] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 05-13 12:53:21 [registry.py:393] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 05-13 12:53:21 [registry.py:393] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 05-13 12:53:21 [registry.py:393] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 05-13 12:53:21 [registry.py:393] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 05-13 12:53:41 [config.py:761] This model supports multiple tasks: {'score', 'reward', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 05-13 12:53:41 [arg_utils.py:1546] Detected VLLM_USE_V1=1 with npu. Usage should be considered experimental. Please report any issues on Github.
INFO 05-13 12:53:41 [config.py:1859] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-13 12:53:41 [config.py:2068] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-13 12:53:41 [platform.py:142] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 05-13 12:53:41 [utils.py:140] Calculated maximum supported batch sizes for ACL graph: 76
INFO 05-13 12:53:41 [utils.py:166] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes
INFO 05-13 12:53:43 [core.py:61] Initializing a V1 LLM engine (v0.8.5.dev545+g376786fac.d20250509) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["all"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.unified_ascend_attention_with_output"], "use_inductor": false, "compile_sizes": [], "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}
WARNING 05-13 12:53:44 [utils.py:2595] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffcf81373d0>
INFO 05-13 12:53:53 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-13 12:53:54 [model_runner_v1.py:936] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
INFO 05-13 12:53:56 [backends.py:41] Using EagerAdaptor
INFO 05-13 12:53:58 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 05-13 12:53:58 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.99it/s]

INFO 05-13 12:53:59 [default_loader.py:278] Loading weights took 0.28 seconds
INFO 05-13 12:53:59 [model_runner_v1.py:942] Loading model weights took 0.9281 GB
INFO 05-13 12:54:05 [backends.py:461] Using cache directory: /home/xxx/.cache/vllm/torch_compile_cache/08b3ae930b/rank_0_0 for vLLM's torch.compile
INFO 05-13 12:54:05 [backends.py:471] Dynamo bytecode transform time: 5.42 s
INFO 05-13 12:54:07 [backends.py:173] Compiling a graph for general shape takes 1.44 s
INFO 05-13 12:54:14 [monitor.py:33] torch.compile takes 6.86 s in total
INFO 05-13 12:54:15 [worker_v1.py:165] Available memory: 57321098444.8, total memory: 65464696832
INFO 05-13 12:54:15 [kv_cache_utils.py:639] GPU KV cache size: 4,664,704 tokens
INFO 05-13 12:54:15 [kv_cache_utils.py:642] Maximum concurrency for 32,768 tokens per request: 142.36x
INFO 05-13 12:55:01 [model_runner_v1.py:1097] Graph capturing finished in 46 secs, took 0.14 GiB
INFO 05-13 12:55:01 [core.py:163] init engine (profile, create kv cache, warmup model) took 61.73 seconds
INFO 05-13 12:55:01 [core_client.py:442] Core engine process 0 ready.
Adding requests: 100%|███████████████████████████████████████████████████| 4/4 [00:00<00:00, 193.29it/s]
Processed prompts: 100%|█| 4/4 [00:01<00:00,  2.74it/s, est. speed input: 15.08 toks/s, output: 274.14 t
Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 17 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the gene that I have. I have been told that I will not be able to have children because of the gene that I have. I have been told that I will not be able to have children because of the gene'
Prompt: 'The president of the United States is', Generated text: ' a very important person. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country'
Prompt: 'The capital of France is', Generated text: ' Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France, on the banks of the Seine River. It is situated on the Île de la Cité, which is a small island in the center of the city. The city is surrounded by the Seine River, which flows through the city. The city is also surrounded by the Pyrenees mountains, which are located to the north of the city. The city'
Prompt: 'The future of AI is', Generated text: ' in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of'
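
A side note on the sizes reported above: the cudagraph_capture_sizes list in the engine config appears to follow vLLM's usual pattern of 1, 2, 4 and then multiples of 8 up to max_capture_size, which is where the "67 sizes" figure in the ACL graph log line comes from. A quick, purely illustrative way to reproduce the list shown in the log:

```python
# Reproduce the capture-size list from the engine config above (illustrative only).
max_capture_size = 512
sizes = sorted([1, 2, 4] + list(range(8, max_capture_size + 1, 8)), reverse=True)
print(sizes)       # [512, 504, ..., 8, 4, 2, 1] as in cudagraph_capture_sizes
print(len(sizes))  # 67, matching "with 67 sizes" in the ACL graph batch-size log line
```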

Signed-off-by: MengqingCao <cmq0113@163.com>