Description
Your current environment
The output of `python collect_env.py`, plus the rest of the environment in case it proves useful:
+ python collect_env.py
DEBUG 01-27 18:16:52 __init__.py:26] No plugins for group vllm.platform_plugins found.
INFO 01-27 18:16:52 __init__.py:183] Automatically detected platform rocm.
Collecting environment information...
PyTorch version: 2.6.0a0+git8d4926e
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42133-1b9c17779
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version: version 3.31.4
Libc version: glibc-2.35
Python version: 3.12.8 (main, Dec 4 2024, 08:54:12) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42133
MIOpen runtime version: 3.3.0
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7542 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2900.0000
CPU min MHz: 1500.0000
BogoMIPS: 5799.35
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 32 MiB (64 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0a0+git8d4926e
[pip3] torchvision==0.19.1a0+6194369
[pip3] transformers==4.48.1
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect
ROCM Version: 6.3.42133-1b9c17779
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev822+g16366ee8b
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
GPU0 GPU1
GPU0 0 40
GPU1 40 0
================================= Hops between two GPUs ==================================
GPU0 GPU1
GPU0 0 2
GPU1 2 0
=============================== Link Type between two GPUs ===============================
GPU0 GPU1
GPU0 0 PCIE
GPU1 PCIE 0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
================================== End of ROCm SMI Log ===================================
TORCH_USE_HIP_DSA=1
NCCL_P2P_DISABLE=1
NCCL_DEBUG=TRACE
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_TRACE_FUNCTION=1
PYTORCH_ROCM_ARCH=gfx90a;gfx942
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/opt/rocm/lib:/usr/local/lib:
VLLM_LOGGING_LEVEL=DEBUG
VLLM_USE_TRITON_FLASH_ATTN=0
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
+ echo '=== Launching vLLM in background with real-time logs ==='
+ echo 'Running vLLM with model=Qwen/QwQ-32B-preview'
=== Launching vLLM in background with real-time logs ===
Running vLLM with model=Qwen/QwQ-32B-preview
+ vllm serve Qwen/QwQ-32B-preview --host=0.0.0.0 --port=8000 --max-model-len=2048 --enforce-eager --dtype=half --tensor-parallel-size 2 --disable-custom-all-reduce --trust-remote-code
DEBUG 01-27 18:17:13 __init__.py:26] No plugins for group vllm.platform_plugins found.
INFO 01-27 18:17:13 __init__.py:183] Automatically detected platform rocm.
INFO 01-27 18:17:14 api_server.py:768] vLLM API server version 0.6.4.post2.dev822+g16366ee8b
INFO 01-27 18:17:14 api_server.py:769] args: Namespace(subparser='serve', model_tag='Qwen/QwQ-32B-preview', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/QwQ-32B-preview', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, enable_sleep_mode=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f9ef9a439c0>)
DEBUG 01-27 18:17:14 __init__.py:26] No plugins for group vllm.general_plugins found.
DEBUG 01-27 18:17:14 api_server.py:176] Multiprocessing frontend to use ipc:///tmp/ddef63dd-382b-4290-a14d-1a4d58fb1945 for IPC Path.
INFO 01-27 18:17:14 api_server.py:195] Started engine process with PID 855
WARNING 01-27 18:17:16 config.py:2325] Casting torch.bfloat16 to torch.float16.
DEBUG 01-27 18:17:18 __init__.py:26] No plugins for group vllm.platform_plugins found.
INFO 01-27 18:17:19 __init__.py:183] Automatically detected platform rocm.
DEBUG 01-27 18:17:19 __init__.py:26] No plugins for group vllm.general_plugins found.
WARNING 01-27 18:17:22 config.py:2325] Casting torch.bfloat16 to torch.float16.
INFO 01-27 18:17:38 config.py:528] This model supports multiple tasks: {'score', 'embed', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 01-27 18:17:38 config.py:1335] Defaulting to use mp for distributed inference
INFO 01-27 18:17:38 config.py:1365] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
WARNING 01-27 18:17:38 rocm.py:109] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 01-27 18:17:38 config.py:664] Async output processing is not supported on the current platform type cuda.
INFO 01-27 18:17:42 config.py:528] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 01-27 18:17:42 config.py:1335] Defaulting to use mp for distributed inference
INFO 01-27 18:17:42 config.py:1365] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
WARNING 01-27 18:17:42 rocm.py:109] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 01-27 18:17:42 config.py:664] Async output processing is not supported on the current platform type cuda.
INFO 01-27 18:17:42 llm_engine.py:232] Initializing an LLM engine (v0.6.4.post2.dev822+g16366ee8b) with config: model='Qwen/QwQ-32B-preview', speculative_config=None, tokenizer='Qwen/QwQ-32B-preview', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/QwQ-32B-preview, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING 01-27 18:17:43 multiproc_worker_utils.py:298] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-27 18:17:43 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 01-27 18:17:43 logger.py:201] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 01-27 18:17:43 logger.py:205] Trace frame log is saved to /tmp/root/vllm/vllm-instance-c1065/VLLM_TRACE_FUNCTION_for_process_855_thread_139818853159104_at_2025-01-27_18:17:43.312488.log
DEBUG 01-27 18:17:47 __init__.py:26] No plugins for group vllm.platform_plugins found.
INFO 01-27 18:17:47 __init__.py:183] Automatically detected platform rocm.
(VllmWorkerProcess pid=1318) INFO 01-27 18:17:48 multiproc_worker_utils.py:227] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1318) WARNING 01-27 18:17:48 logger.py:201] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=1318) INFO 01-27 18:17:48 logger.py:205] Trace frame log is saved to /tmp/root/vllm/vllm-instance-c1065/VLLM_TRACE_FUNCTION_for_process_1318_thread_140020751671488_at_2025-01-27_18:17:48.631413.log
(VllmWorkerProcess pid=1318) DEBUG 01-27 18:17:48 __init__.py:26] No plugins for group vllm.general_plugins found.
DEBUG 01-27 18:17:51 client.py:188] Waiting for output from MQLLMEngine.
DEBUG 01-27 18:18:01 client.py:188] Waiting for output from MQLLMEngine.
DEBUG 01-27 18:18:11 client.py:188] Waiting for output from MQLLMEngine.
The full runtime log is attached for additional context:
full-log.txt
Model Input Dumps
No response
🐛 Describe the bug
Hello ROCm Team,
I've been trying to run vLLM on a multi-GPU cluster; for the purposes of this bug report I'll share a case with two AMD MI210 GPUs (ROCm environment). We have a Kubernetes cluster where each node provides two AMD GPUs, and everything works fine on a single GPU. However, as soon as I enable multi-GPU (--tensor-parallel-size 2), vLLM hangs and never fully starts the server unless I set:
export NCCL_P2P_DISABLE=1
With NCCL_P2P_DISABLE=1, the multi‐GPU server initializes successfully, but this obviously forces GPU‐to‐GPU traffic to go through host memory (slower than direct peer‐to‐peer).
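The ROCm SMI topology above already shows the two MI210s connected only over PCIe. As a rough, hedged way to gauge what staging through host memory costs, the standard ROCm tools below can be used to compare device-to-device against host-to-device copy bandwidth (rocm_bandwidth_test ships separately in the rocm-bandwidth-test package; the exact output format varies by ROCm release):
# Confirm the link type between the two GPUs (reports PCIE on this node)
rocm-smi --showtopo
# The default run prints copy bandwidth for device and CPU/GPU pairs;
# comparing GPU-to-GPU against CPU-to-GPU numbers gives a rough proxy
# for the overhead of routing traffic through host memory.
rocm_bandwidth_test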
Symptoms
- Single GPU: Works well with most tested models.
- Two GPUs (--tensor-parallel-size 2): the service hangs at the RCCL init stage. There are no further logs, and the “Started server process” message never appears (a standalone RCCL check is sketched right after this list).
- Workaround: NCCL_P2P_DISABLE=1 unblocks the hang and I can run multi‐GPU inference, but throughput is presumably slower.
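To check whether the hang is specific to vLLM or reproduces inside RCCL itself, a standalone two-GPU all-reduce is a useful isolation step. This is a minimal sketch, assuming rccl-tests (https://github.com/ROCm/rccl-tests) has been built against the same ROCm 6.3 install per its README; the binary location depends on how it was built.
# Two-GPU all-reduce, message sizes from 8 B to 128 MB. If this also hangs
# unless NCCL_P2P_DISABLE=1 is set, the problem is in the RCCL/P2P layer
# rather than in vLLM itself.
./all_reduce_perf -b 8 -e 128M -f 2 -g 2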
Test Matrix & Observations
I tested multiple environment variables (HIP_FORCE_P2P_HOST, HSA_ENABLE_SDMA, etc.) plus variations of flash-attention, with and without a CU mask, and so on. Some combinations work on many 7B–20B models, but none allowed multi-GPU to start without disabling P2P.
I also see kernel logs indicating that iommu=pt is not set and that kernel.numa_balancing is 1. We currently can't change either of those on the node's kernel, and they may be the underlying cause of the hang. Either way, the net effect is that multi-GPU gets stuck unless P2P is turned off.
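For reference, both kernel settings can be read without modifying anything; the would-be fixes at the end are included only as a hedged note, since we cannot apply them on this cluster.
# Check whether iommu=pt is on the kernel command line
grep -o 'iommu=[^ ]*' /proc/cmdline
# Check automatic NUMA balancing (1 = enabled, 0 = disabled)
sysctl kernel.numa_balancing
# If the host kernel could be changed (it cannot be here), the usual
# adjustments would be:
#   sysctl -w kernel.numa_balancing=0
#   adding iommu=pt to GRUB_CMDLINE_LINUX and rebooting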
Below is a snippet of my job YAML (Kubernetes) that includes:
export HIP_VISIBLE_DEVICES=0,1
export ROCR_VISIBLE_DEVICES=0,1
export NCCL_P2P_DISABLE=1 # The only way to get multi-GPU working
vllm serve ... \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce
And it starts up fine, whereas without NCCL_P2P_DISABLE=1 it just hangs forever at RCCL init.
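If it helps the investigation, we can capture a narrower debug run on request. The sketch below (RCCL honors the standard NCCL_* debug variables) restricts the otherwise very verbose trace to init/transport-selection output and lets the hang happen once without disabling P2P; the vllm flags are the same as in the launch above.
# Diagnostic run: same launch as above, but without NCCL_P2P_DISABLE and
# with RCCL's logging narrowed to init/transport decisions
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
vllm serve Qwen/QwQ-32B-preview \
  --tensor-parallel-size 2 --disable-custom-all-reduce \
  --enforce-eager --dtype=half --max-model-len=2048 --trust-remote-code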
Questions
- Recommended Settings
  - Are there official or recommended environment variables to ensure multi-GPU AMD ROCm works with vLLM, especially when host kernel parameters like iommu=pt and kernel.numa_balancing=0 cannot be modified?
  - Is there a "one size fits all" set of environment variables that you suggest for AMD GPUs to avoid P2P or similar issues?
  - Is there an official or recommended setup (including kernel changes)?
- Kernel-Level Requirements
  - Must we set iommu=pt and disable NUMA auto-balancing on the host for stable multi-GPU, or are there known overrides in RCCL/vLLM for multi-GPU ROCm that achieve stable operation without those kernel changes?
- Performance vs. Stability
  - Is there any guidance on which env vars or vLLM flags best preserve performance if we can't set iommu=pt? We'd prefer not to rely on NCCL_P2P_DISABLE=1 if there's a less costly fallback.
Thanks in advance for any advice. The single-GPU experience is great; multi-GPU is the only stumbling block for us right now. If you have any insights or updated best-practice docs, please let me know!
Logs & Additional Data
- We have run a fairly broad matrix of environment variables: HIP_FORCE_P2P_HOST, HSA_ENABLE_SDMA, ROCM_DISABLE_CU_MASK, TORCH_USE_HIP_DSA, VLLM_USE_TRITON_FLASH_ATTN, etc.
- In all cases, multi-GPU only works if we fully disable GPU P2P.
- Single GPU usage is rock‐solid with all tested models.
Let us know if you have recommended kernel settings or any user‐space environment variables that typically solve this in AMD ROCm + vLLM.
Thank you!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.