[Serve] Detailed Analysis of Errors Related to 'Ray does not allocate any GPUs on the driver node' && 'No CUDA GPUs are available' #51242
Comments
Hi, so I did some digging and here is the explanation for your observation, and ultimately some suggested workarounds. The reason that TP=1 breaks with the existing example code is that vLLM will fall back to the uniproc executor instead of the Ray distributed backend.
This is not entirely true. In the Serve LLM APIs, we use the Ray distributed backend regardless of TP. This is the main reason the Serve API works out of the box in both the TP=1 and TP>1 cases. I highly recommend switching over to that instead of using the existing vLLM Serve example (which we'll soon remove to reduce confusion about the entry path now that we have the Serve LLM APIs). There are more subtle details for vLLM v1 which we will handle as part of our Data and Serve LLM APIs, so the Ray + vLLM deployments will work out of the box.
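For reference, here is a minimal sketch of that path following the Serve LLM quickstart pattern; the model id, autoscaling settings, and engine kwargs below are placeholders to adapt to your cluster:

# Minimal sketch of the Serve LLM API path (model id, autoscaling, and engine
# kwargs are placeholders, not a recommendation for any specific setup).
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llm",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # accelerator_type="A10G",  # optionally pin a GPU type
    # Works for both TP=1 and TP>1 because the Ray distributed backend is used
    # regardless of tensor parallel size.
    engine_kwargs=dict(tensor_parallel_size=2),
)

# Build an OpenAI-compatible app and deploy it.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)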
Thanks for your correction! You're absolutely right: the executor type is indeed a key consideration. However, if we have a Ray cluster with a CPU-only head node and additional worker nodes equipped with GPUs, the replica actor might be scheduled on the head node, leading to the same issue. That's exactly when I first encountered the "No CUDA GPUs" error. Maybe we should add an "accelerator" resource spec to the first bundle in the PG to ensure the replica actor is scheduled onto a GPU node. Additionally, I noticed that up until vLLM 0.7.3, the LLM engine enforces the creation of a uniproc executor when TP=1, even if the backend is explicitly set. This behavior is discussed in this PR. As for why I have to use Ray Serve: my GPU platform runs on a heavily modified version of Ray to support an IPv6 network within our Kubernetes cluster. Because of this, I'm constrained to using the older version of Ray provided by my platform for multi-node tasks 😂. I still want to find a workaround to make it run if possible! The experiment results shown here were run on the latest Ray, with a single node.
If you use STRICT_PACK (or even PACK) for the placement group, the replica and all the vLLM workers will be forced onto the same node, which means the Serve replica should not get scheduled on the head node if possible (docs). In the case of PACK, bundles only land on a different node when the resources on a single node cannot be satisfied.
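To make the difference concrete, here is a minimal sketch using raw Ray placement groups (illustrative only; Serve creates the placement group for you when you pass placement_group_bundles and placement_group_strategy):

# Illustrative only: Serve normally creates the placement group for you.
import ray
from ray.util.placement_group import placement_group

ray.init()
bundles = [{"CPU": 1}] + [{"GPU": 1}] * 4

# STRICT_PACK: all bundles must fit on a single node, so the first (CPU-only)
# bundle can never end up alone on a GPU-less head node.
pg_strict = placement_group(bundles, strategy="STRICT_PACK")

# PACK: best-effort co-location; bundles spill to other nodes only when a
# single node cannot satisfy all of them.
pg_pack = placement_group(bundles, strategy="PACK")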
Yes this was a bug that was breaking our solution as well. It should be fixed after that PR.
Understood why a Ray upgrade is not possible. Push for an upgrade :) We have a lot of awesome stuff cooking on both the Data and Serve LLM stack that you might not want to recreate. The roadmap will be posted soon.
Thanks for your clarification again, and sorry for forgetting about the placement strategy :) I think Ray LLM is a great API and it always works well with vLLM; the issue mentioned above does not affect it, since Ray LLM directly calls the constructor instead of …
Here are also some new findings about Ray Serve (I had to trace every function call to …). My environment somehow depends on …, and adding … made it work. The code implementation is below:
# RayDistributedExecutor: works even for TP = 1 when the executor class is passed explicitly
@serve.deployment(
    num_replicas=1,
    max_ongoing_requests=128,
)
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        ...
    ):
        # at the beginning of __init__
        import os
        os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
        ...
        from vllm.executor.ray_distributed_executor import RayDistributedExecutor
        # pass the executor class itself; the string "ray" is not enough for tp = 1
        engine_args.distributed_executor_backend = RayDistributedExecutor
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        ...

def build_app(cli_args: Dict[str, str]) -> serve.Application:
    ...
    pg_resources = [{"CPU": 1}] + [{"GPU": 1}] * tp
    return VLLMDeployment.options(
        placement_group_bundles=pg_resources,
        placement_group_strategy="STRICT_PACK",
    ).bind(
        ...
# MultiprocessingDistributedExecutor: always works
@serve.deployment(
    num_replicas=1,
    max_ongoing_requests=128,
)
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        ...
    ):
        from vllm.executor.mp_distributed_executor import MultiprocessingDistributedExecutor
        engine_args.distributed_executor_backend = MultiprocessingDistributedExecutor
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        ...

def build_app(cli_args: Dict[str, str]) -> serve.Application:
    ...
    pg_resources = [{"CPU": 1, "GPU": tp}]
    return VLLMDeployment.options(
        ray_actor_options={"num_gpus": tp, "num_cpus": 1},
        placement_group_bundles=pg_resources,
        placement_group_strategy="STRICT_PACK",
    ).bind(
        ...
@huiyeruzhou I think a better compromise, instead of setting CUDA_VISIBLE_DEVICES manually, is to change the placement group bundles so that every bundle carries both a CPU and a GPU.
In other words, you bundle CPU and GPUs together. This way the replica would still claim a CPU from one of these bundles, which comes with a GPU, so Ray would not unset CUDA_VISIBLE_DEVICES within the context of VLLMDeployment. So I suppose this should fix the behavior you are seeing with "No CUDA GPUs" found? The key idea is that the relationship between actors and bundles is many-to-one. So this could work ^
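A minimal sketch of that bundling, reusing the VLLMDeployment class and tp variable from the examples above:

# Sketch: one CPU and one GPU per bundle, so the replica's CPU claim always
# lands in a bundle that also carries a GPU (assumes VLLMDeployment and tp
# from the examples above).
pg_resources = [{"CPU": 1, "GPU": 1}] * tp

app = VLLMDeployment.options(
    placement_group_bundles=pg_resources,
    placement_group_strategy="STRICT_PACK",
).bind(
    ...
)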
Hello! I apologize for the delayed response.
This approach holds true in most common scenarios. However, vLLM is a rather special case: it creates a dummy worker to handle resource management for the driver worker. If the driver has already been allocated a GPU, the creation of this extra worker will cause the code to get stuck.
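A rough illustration of the resource accounting behind that hang (the numbers are illustrative only, not vLLM's actual bookkeeping):

# Illustrative accounting only, not vLLM's actual bookkeeping.
tp = 4
pg_gpu_bundles = tp              # the placement group reserves one GPU per rank
driver_gpus = 1                  # the replica/driver already holds a GPU bundle
dummy_worker_gpus = 1            # the extra dummy worker vLLM creates for the driver
remaining_worker_gpus = tp - 1   # the other ranks

requested = driver_gpus + dummy_worker_gpus + remaining_worker_gpus  # tp + 1
assert requested > pg_gpu_bundles  # one GPU more than the PG can provide -> stuck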
In our production environment, the head node of the Ray cluster was set to 0 GPUs, and the Serve code that uses the LLM API returns an error.
@rainmaple what vLLM version are you using? I think the traceback that you pasted suggests it's vllm > 0.8? Related issue (at least part 1 of that thread is exactly the issue you are talking about). This PR adds support for that version of vLLM; it was merged yesterday, so you should be able to use the nightly build to see if you still have the issue.
I use vllm==0.8.2.
Does it mean that using vLLM serve directly, without Ray Serve, in a production environment (a Ray cluster whose head node has 0 GPUs) may fail due to this design? @huiyeruzhou
I did not quite understand the question. Did you try vllm==0.7.2? If it doesn't work, what is the error?
Hmm, just to confirm: according to the analysis above, is a GPU on the head node necessary when using vLLM without Ray Serve? With 0 GPUs on the head node, serving the model with vLLM directly failed at device type inference, so I turned to the Ray Serve LLM API. Would setting the env var "USE_VLLM_V1": "0" help, or would vllm==0.7.2 fail directly because of the TP=1 limitation mentioned above? @kouroshHakha
Yes, without Ray Serve, vLLM assumes that the GPU type and CUDA platform can be determined on the driver (which could be the head node). With Ray Serve, we orchestrate things so that this requirement is satisfied.
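A quick way to check whether a given node can satisfy that assumption before running plain vllm serve on it:

# Run on the node that will act as the vLLM driver (e.g. the head node when
# using vllm serve without Ray Serve): if this fails, device-type inference
# will fail in the same way.
import torch

assert torch.cuda.is_available(), "No CUDA device visible on this node"
print(torch.cuda.get_device_name(0))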
Your response is much appreciated.
@kouroshHakha I was also following the documentation quickstart example to refactor our Ray Serve + vLLM implementation to Ray Serve LLM, and so far it has been challenging. After navigating my way through this issue, I was able to complete the deployment without LLMRouter, but when I added LLMRouter to get the OpenAI-compatible server, the deployment failed with the following error -
I have a combination of V100s and A100s, and this exact cluster works fine with the older vLLM implementation. I was wondering if there are additional configurations required to handle heterogeneous GPUs. Here are my complete configs and terminal output -
# LLMConfig
llm_config_dict = {
"model_loading_config": {
"model_id": args.get("model", "meta-llama/Llama-3.1-8B-Instruct"),
"model_source": "meta-llama/Llama-3.1-8B-Instruct",
},
"engine_kwargs": {
# .. vLLM engine arguments
},
"deployment_config": {
"ray_actor_options": {},
"autoscaling_config": {
"min_replicas": 1,
"max_replicas": 1
}
},
"runtime_env": {
"pip": ["httpx", "ray[llm,serve]==2.44.1", "vllm==0.7.2"],
"env_vars": {
"USE_VLLM_V1": "0",
"HF_TOKEN": os.getenv("HF_TOKEN")
}
}
}
configs = LLMConfig(**llm_config_dict)
bundles=[
{"CPU": 1, "GPU": 1}
for _ in range(int(args["tensor_parallel_size"]))
]
deployment = LLMServer.as_deployment(
configs.get_serve_options(name_prefix="vLLM:"),
).options(
placement_group_bundles=bundles,
placement_group_strategy="SPREAD"
).bind(configs)
app = LLMRouter.as_deployment().bind(llm_deployments=[deployment])
return app
This has blocked us from moving forward. I would really appreciate any help with this issue.
@nitingoyal0996 can you open a new issue for this? Also, why are you setting up the placement group yourself? Let's continue this conversation in the new issue.
@kouroshHakha here is the new issue - |
What happened + What you expected to happen
When deploying platforms based on the Ray framework, such as Ray Serve and Ray LLM, together with vLLM's OpenAI server, the errors "No CUDA GPUs are available" or "Ray does not allocate any GPUs on the driver node" have become recurring issues.
In this issue, I will provide a detailed analysis of these problems, along with a brief solution and experimental records. I sincerely invite developers from the Ray and vLLM communities to participate in the discussion, point out any shortcomings, and share your suggestions!
Quick Troubleshoot
For older versions of vLLM, I have also provided a hack to temporarily resolve this issue. Please refer to: Ray Issue #51154.
For Ray LLM and Ray Serve documentation: a proper configuration for TP=1 involves modifying the build_app function in the example code from the Ray Serve documentation by replacing the following content.
Introduction
The issue can be summarized simply: the framework design of vLLM does not fully accommodate LLMEngine running within a placement group. The process that creates RayDistributedExecutor, which serves as the entry point, must have access to a GPU while not occupying GPU resources within Ray. This conflicts with the typical configuration of Ray Serve. Additionally, since vLLM always requests a whole number of GPUs when world_size > 1, it is not possible to work around this limitation by allocating fractional GPUs.
Regardless of whether LLM (offline inference) or OpenAIServingCompletion (online deployment) is used, both are considered entry points. The class responsible for managing the specific processes during initialization is called an Executor. The Executor itself creates a local driver worker to use the GPU and also spawns a dummy actor to reserve resources in the placement group.
However, when integrating this framework into Ray Serve, several issues arise:
- The Executor itself also runs within an actor and uses the first bundle of the placement group.
- If that bundle is not assigned a GPU, CUDA_VISIBLE_DEVICES will be an empty string, leading to the "No CUDA GPUs are available" error when trying to call set_device.
- If that bundle is assigned a GPU, vLLM still creates a dummy_driver_worker that occupies a GPU, which causes the total number of requested workers to exceed the placement group capacity.
- Since vLLM requests only whole GPUs when world_size > 1, we cannot work around this limitation by assigning fractional GPUs.
A deadlock!
Experiments
Due to specific features of the code, there are actually two executable scenarios. I will first present the experimental table and then analyze each case one by one.
Analysis
In the existing code, there are actually two scenarios where execution is possible:
Case 1: Default Configuration (TP > 1 & No GPU Assigned)
Even if Ray does not allocate any GPUs to the replica actor (i.e., the RayDistributedExecutor within the Serve framework), CUDA_VISIBLE_DEVICES will still not be empty.
This happens because of this line of code, which calls self.driver_worker and modifies the environment variables of the current process.
As a result, in the default configuration the code functions correctly, allowing a process to access GPUs without directly occupying them.
Case 2: TP = 1 Changes the Behavior
When TP = 1, vLLM switches to using UniprocExecutor, as seen in this line of code.
In this case, if CUDA_VISIBLE_DEVICES is empty, it will cause an error, as UniprocExecutor does not inherit the same environment variable handling as the multi-process setup.
Supplementary Notes on Ray Serve and Ray LLM
After an initial review of the source code and conducting simple experiments, I believe that the new and old APIs of Ray Serve are fundamentally the same, except for the addition of a router and deeper integration with vLLM.
The core interaction between Ray and vLLM still revolves around the placement group (PG) allocation during deployment.
Therefore, these two approaches are essentially equivalent:
- Integrating vllm.entrypoints.openai.serving_completion into Ray Serve.
- Using the ray[llm] library for deployment.
library for deployment.Related Issues
Based on my preliminary review, the following issues are all related to the analysis presented here:
Versions / Dependencies
vllm>=0.7.2
ray[serve,llm,default] -U
Reproduction script
Demo code is provided in the comments of this issue.
Issue Severity
High: It blocks me from completing my task.