
[Serve] Detailed Analysis of Errors Related to 'Ray does not allocate any GPUs on the driver node' && 'No CUDA GPUs are available' #51242

Open
huiyeruzhou opened this issue Mar 11, 2025 · 18 comments
Labels: bug (Something that is supposed to be working; but isn't), llm, serve (Ray Serve Related Issue)


huiyeruzhou commented Mar 11, 2025

What happened + What you expected to happen

When deploying platforms based on the Ray framework, such as Ray Serve and Ray LLM, together with vLLM's OpenAI server, the errors "No CUDA GPUs are available" or "Ray does not allocate any GPUs on the driver node" have become recurring issues.
In this issue, I will provide a detailed analysis of these problems, along with a brief solution and experimental records. I sincerely invite developers from the Ray and vLLM communities to participate in the discussion, point out any shortcomings, and share their suggestions!

Quick Troubleshoot

[image]

For older versions of vLLM, I have also provided a hack to temporarily resolve this issue. Please refer to: Ray Issue #51154.

For Ray LLM and Ray Serve documentation:

A proper configuration for TP=1 involves modifying the build_app function in the example code from the Ray Serve documentation by applying the following change:

    pg_resources = []
-    pg_resources.append({"CPU": 1})  # for the deployment replica
    for i in range(tp):
        pg_resources.append({"CPU": 1, accelerator: 1})  # for the vLLM actors

    # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
    # the same Ray node.
    return VLLMDeployment.options(
+       ray_actor_options={"num_gpus": 1,"num_cpus": 1},
        placement_group_bundles=pg_resources, placement_group_strategy="STRICT_PACK"
    ).bind(

Introduction

The issue can be summarized simply: the framework design of vLLM does not fully accommodate LLMEngine running within a placement group.

The process that creates RayDistributedExecutor, which serves as the entry point, must have access to a GPU while not occupying GPU resources within Ray. This conflicts with the typical configuration of Ray Serve. Additionally, since vLLM always requests a whole number of GPUs when world_size > 1, it is not possible to work around this limitation by allocating fractional GPUs.

[image]

Regardless of whether you use LLM (offline inference) or OpenAIServingCompletion (online deployment), both are considered entry points. The class responsible for managing the specific processes during initialization is called an Executor. The Executor itself creates a local actor to use the GPU and also spawns a dummy actor to reserve resources in the placement group.

[image]

However, when integrating this framework into Ray, several issues arise:

  • In Ray, the Executor itself also runs within an Actor and uses the first bundle of the placement group.
    • If no GPU resources are assigned to it, CUDA_VISIBLE_DEVICES will be an empty string, leading to the "No CUDA GPUs are available" error when trying to call set_device.
    • On the other hand, if we do allocate a GPU to it, vLLM will still use a dummy_driver_worker that occupies a GPU, which causes the total number of requested workers to exceed the placement group capacity.
    • Since vLLM does not allocate resources based on bundles but instead forces each worker to use exactly one GPU when world_size > 1, we cannot work around this limitation by assigning fractional GPUs.

A Deadlock!


Experiments

Due to specific features of the code, there are actually two scenarios that can run successfully. I will first present the experimental table and then analyze each case one by one.

| vLLM Version | Placement Group Configuration | TP | Status | Notes |
| --- | --- | --- | --- | --- |
| vLLM 0.7.3 | [{'CPU':1} + {'GPU':1} * TP] | >1 | ✅ Works | Replica actor has no GPU but gains access via update_environment_variables |
| vLLM 0.7.3 | [{'GPU':1} * TP] | >1 | ❌ Fails | Extra worker creation causes a deadlock due to the loop in ray_distributed_executor.py#L187 |
| vLLM 0.7.3 | [{'CPU':1} + {'GPU':1} * TP] | 1 | ❌ Fails | Replica actor has no GPU, and the Executor can no longer "borrow" CUDA_VISIBLE_DEVICES |
| vLLM 0.7.3 | [{'GPU':1} * TP] | 1 | ✅ Works | Replica actor has no GPU, but uniproc_executor avoids dummy worker creation |
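
Written out in Python (for TP = 2), the two placement-group layouts compared in the table are:

# The two bundle layouts from the table above, written out for tp = 2.
tp = 2
cpu_plus_gpu_bundles = [{"CPU": 1}] + [{"GPU": 1}] * tp   # "[{'CPU':1} + {'GPU':1} * TP]"
gpu_only_bundles = [{"GPU": 1}] * tp                      # "[{'GPU':1} * TP]"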

Analysis

In the existing code, there are actually two scenarios where execution is possible:

  1. TP > 1 without explicitly assigning GPUs (this is the default setting in Ray Serve). This explains why the issue has not become a critical blocker—under the current configuration, execution is still possible.
  2. TP = 1 with GPU assignment (as mentioned earlier, using an appropriate configuration combined with Ray Serve to resolve the issue).

Case 1: Default Configuration (TP > 1 & No GPU Assigned)

[image]

Even if Ray does not allocate any GPUs to the Replica Actor (i.e., the RayDistributedExecutor within the Serve framework), CUDA_VISIBLE_DEVICES will still not be empty.

This happens because of this line of code, which calls self.driver_worker and modifies the environment variables of the current process.

As a result, in the default configuration, the code functions correctly, allowing a process to access GPUs without directly occupying them.
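
A minimal sketch of that "borrowing" mechanism (names simplified; this is an assumption about the call chain, not vLLM's exact code):

# Simplified illustration: the driver-side worker reports which GPU IDs the
# placement group reserved, and the executor exports them into the current
# process's environment, so the Serve replica can see GPUs that Ray never
# allocated to it.
import os

def update_environment_variables(envs: dict) -> None:
    os.environ.update(envs)

reserved_gpu_ids = ["0", "1"]  # e.g. gathered from the Ray workers in the PG
update_environment_variables({"CUDA_VISIBLE_DEVICES": ",".join(reserved_gpu_ids)})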

Case 2: TP = 1 Changes the Behavior

When TP = 1, vLLM switches to using UniprocExecutor, as seen in this line of code.

In this case, if CUDA_VISIBLE_DEVICES is empty, it will cause an error, as UniprocExecutor does not inherit the same environment variable handling as the multi-process setup.
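
The failure mode itself is easy to reproduce outside Ray; a minimal sketch (assuming a machine that does have GPUs):

# Ray sets CUDA_VISIBLE_DEVICES="" for actors that were allocated no GPUs; an
# empty (but set) value hides every device from the CUDA runtime.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.device_count())  # 0
# torch.cuda.set_device(0)        # would raise: RuntimeError: No CUDA GPUs are available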

[image]


Supplementary Notes on Ray Serve and Ray LLM

After an initial review of the source code and conducting simple experiments, I believe that the new and old APIs of Ray Serve are fundamentally the same, except for the addition of a router and deeper integration with vLLM.

The core interaction between Ray and vLLM still revolves around the placement group (PG) allocation during deployment.

Therefore, these two approaches are essentially equivalent:

  1. Manually integrating vllm.entrypoints.openai.serving_completion into Ray Serve.
  2. Using the ray[llm] library for deployment.

Related Issues

Based on my preliminary review, the following issues are all related to the analysis presented here:

Versions / Dependencies

vllm>=0.7.2
ray[serve,llm,default] -U

Reproduction script

Demo code is provided in the comments below.

Issue Severity

High: It blocks me from completing my task.

@kouroshHakha (Contributor)

Hi, so I did some digging and here is the explanation for your observation and ultimately some suggested workarounds:

The reason that TP=1 breaks with the existing example code is that vllm will fall back to UniProcExecutor when TP=1. This executor will need the resources to be provisioned at the serve replica, which you already noted. To fix this using the same pattern as TP > 1, you can force vllm to use ray distributed backend by passing engine_args.distributed_executor_backend = "ray".
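
For example, a minimal sketch of that suggestion (the model name is a placeholder; note the follow-up below that on vLLM <= 0.7.3 this setting can still be overridden when TP=1):

# Force the Ray distributed backend even for tensor_parallel_size=1, so vLLM
# does not fall back to UniProcExecutor (which expects the GPU to be
# provisioned on the Serve replica itself).
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",          # placeholder model (assumption)
    tensor_parallel_size=1,
    distributed_executor_backend="ray",
)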

> After an initial review of the source code and conducting simple experiments, I believe that the new and old APIs of Ray Serve are fundamentally the same, except for the addition of a router and deeper integration with vLLM.
> The core interaction between Ray and vLLM still revolves around the placement group (PG) allocation during deployment. Therefore, these two approaches are essentially equivalent.

This is not entirely true. In the serve llm APIs, we use the Ray distributed backend regardless of TP. This is the main reason why the serve API works out of the box in both the TP=1 and TP>1 contexts. I highly recommend switching over to that instead of using the existing vllm serve example (which we'll soon remove to reduce confusion about the entry path, now that we have the serve llm APIs). There are more subtle details for vllm-v1 which we will handle as part of our data and serve llm APIs, so Ray + vLLM deployments will work out of the box.

@kouroshHakha kouroshHakha self-assigned this Mar 12, 2025
@kouroshHakha kouroshHakha added llm and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 12, 2025

huiyeruzhou commented Mar 12, 2025

Thanks for your correction! You’re absolutely right—the executor type is indeed a key consideration. However, if we have a Ray cluster with a CPU-only head node and additional worker nodes equipped with GPUs, the replica actor might be scheduled on the head node, leading to the same issue. That’s exactly when I first encountered the "No CUDA GPUs" error.

Maybe we should add an "accelerator" resource spec to the first bundle in the PG to ensure the replica actor is scheduled onto a GPU node.

Additionally, I noticed that up until vLLM 0.7.3, the LLM engine enforces the creation of a uniproc executor when TP=1, even if the backend is explicitly set. This behavior is discussed in this PR.

As for why I have to use Ray Serve: my GPU platform runs on a heavily modified version of Ray to support an IPv6 network within our Kubernetes cluster. Because of this, I'm constrained to using the lower version of Ray provided by my platform for multi-node tasks 😂. I still want to find a workaround to make it run if possible!

The experiment results shown here, however, were obtained on the latest Ray with a single node.


kouroshHakha commented Mar 12, 2025

> Thanks for your correction! You're absolutely right—the executor type is indeed a key consideration. However, if we have a Ray cluster with a CPU-only head node and additional worker nodes equipped with GPUs, the replica actor might be scheduled on the head node, leading to the same issue. That's exactly when I first encountered the "No CUDA GPUs" error.

If you use STRICT_PACK (or even PACK) for the placement group, the replica and all the vLLM workers are forced onto the same node, which means the serve replica should not get scheduled on the head node if possible (docs). With PACK, bundles are only placed on a different node when the resources cannot be satisfied on a single node.
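
For reference, a minimal stand-alone sketch of that co-location guarantee (outside Serve; assumes a cluster where some single node has at least 1 CPU and 2 GPUs):

# With STRICT_PACK, every bundle must fit on one node, so the CPU-only bundle
# used by the Serve replica lands on the same GPU node as the worker bundles.
import ray
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"CPU": 1}] + [{"GPU": 1}] * 2, strategy="STRICT_PACK")
ray.get(pg.ready())  # blocks until a single node can satisfy all bundles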

> Additionally, I noticed that up until vLLM 0.7.3, the LLM engine enforces the creation of a uniproc executor when TP=1, even if the backend is explicitly set. This behavior is discussed in vllm-project/vllm#12934.

Yes this was a bug that was breaking our solution as well. It should be fixed after that PR.

> As for why I have to use Ray Serve: my GPU platform runs on a heavily modified version of Ray to support an IPv6 network within our Kubernetes cluster. Because of this, I'm constrained to using the lower version of Ray provided by my platform for multi-node tasks 😂. I still want to find a workaround to make it run if possible!

Understood why ray upgrade is not possible. Push for an upgrade :) We have a lot of awesome stuff cooking on both the data and serve llm stack which you might not want to recreate. Roadmap will get posted soon.


huiyeruzhou commented Mar 14, 2025

Thanks for your clarification again, and sorry for forgetting about the placement strategy :) I think Ray LLM is a great API and it always works well with vLLM; the issue mentioned above does not affect it, since Ray LLM directly calls the constructor instead of from_engine_args.

Here are also some new findings about Ray Serve. (I had to trace every function call to torch.cuda to locate this....)
The current RayDistributedExecutor relies on resetting CUDA_VISIBLE_DEVICES, which means torch._C._cuda_getDeviceCount() should not be called before the engine is built, since the function always returns the same value as on its first call. This is because of a static variable used in the C++ source code.

My environment somehow depends on the tensordict library, which calls torch.cuda.is_available() when it is imported, and therefore the reset of CUDA_VISIBLE_DEVICES is ignored : (
torch.cuda.is_available() and torch.cuda.device_count() are two functions that freeze the effect of CUDA_VISIBLE_DEVICES; there may be more functions in _C with this side effect.
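
A rough sketch of the pitfall (an illustration, not from the original report; the exact caching behavior varies across PyTorch versions):

# Once the CUDA device count has been queried, later changes to
# CUDA_VISIBLE_DEVICES in the same process may be ignored.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # what Ray sets for a GPU-less actor

import torch
torch.cuda.is_available()                 # initializes and caches device state

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # too late: the earlier result sticks
print(torch.cuda.device_count())          # often still 0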

Adding os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7' at the start of the VLLMDeployment class's __init__ makes it work well with RayDistributedExecutor in almost any environment. It may have some side effects, but the mechanism is clear now.
Another way is to specify {'GPU': tp} and use MultiprocessingDistributedExecutor instead, which frees Ray from holding GPUs for vLLM's executor.
I think these two approaches can be a final solution for Ray Serve with vLLM > 0.7.0.
For vLLM <= 0.7.2, we must specify the executor class explicitly; "ray" will be ignored when TP = 1.

Here is the code implementation:

# Option 1: RayDistributedExecutor
@serve.deployment(
    num_replicas=1,
    max_ongoing_requests=128,
)
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        engine_args,
        ...
    ):
        # At the very beginning of __init__, before anything touches torch.cuda:
        import os
        os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
        ...
        from vllm.executor.ray_distributed_executor import RayDistributedExecutor
        # Pass the class itself; the string "ray" is not enough for tp = 1 on vLLM <= 0.7.2
        engine_args.distributed_executor_backend = RayDistributedExecutor
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

...

def build_app(cli_args: Dict[str, str]) -> serve.Application:
    ...
    pg_resources = [{"CPU": 1}] + [{"GPU": 1}] * tp
    return VLLMDeployment.options(
        placement_group_bundles=pg_resources,
        placement_group_strategy="STRICT_PACK",
    ).bind(
    ...

# Option 2: MultiprocessingDistributedExecutor (always works)
@serve.deployment(
    num_replicas=1,
    max_ongoing_requests=128,
)
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        engine_args,
        ...
    ):
        from vllm.executor.mp_distributed_executor import MultiprocessingDistributedExecutor
        engine_args.distributed_executor_backend = MultiprocessingDistributedExecutor
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        ...

def build_app(cli_args: Dict[str, str]) -> serve.Application:
    ...
    pg_resources = [{"CPU": 1, "GPU": tp}]
    return VLLMDeployment.options(
        ray_actor_options={"num_gpus": tp, "num_cpus": 1},
        placement_group_bundles=pg_resources,
        placement_group_strategy="STRICT_PACK",
    ).bind(
    ...


kouroshHakha commented Mar 14, 2025

@huiyeruzhou I think a better compromise instead of setting os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7' in VLLMDeployment would be to use

pg_resources=[{"CPU": 1, "GPU": 1} for _ in range(tp)]

In other words, you bundle CPUs and GPUs together. This way the replica would still claim a CPU from one of these bundles, which comes with a GPU. Therefore Ray would not unset CUDA_VISIBLE_DEVICES within the context of VLLMDeployment. So I suppose this should fix the caching behavior you are seeing with "No CUDA GPUs are available"?

The key idea is that the relationship between actors and bundles is many-to-one. So this could work ^

@huiyeruzhou (Author)

Hello! I apologize for the delayed response.

> pg_resources = [{"CPU": 1, "GPU": 1} for _ in range(tp)]
>
> The key idea is that the relationship between actors and bundles is many-to-one. So this could work ^

This approach holds true in most common scenarios. However, vLLM is a rather special case. It creates a dummy worker to handle resource management for the driver worker. If the driver has already been allocated a GPU, the creation of this extra worker will cause the code to get stuck.

@rainmaple

In our production environment, the head node in the Ray cluster is set to 0 GPUs. The serve code that uses the LLM API then fails with the following error (traceback below, followed by the serve script):

INFO 04-02 17:25:34 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
(pid=2159, ip=192.168.193.78) INFO 04-02 17:25:27 [__init__.py:239] Automatically detected platform cuda. [repeated 2x across cluster]
(ServeController pid=2922) ERROR 2025-04-02 17:25:34,553 controller 2922 -- Exception in Replica(id='m37qpvvp', deployment='vLLM:qwen-1_5b', app='default'), the replica will be stopped.
(ServeController pid=2922) Traceback (most recent call last):
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/deployment_state.py", line 695, in check_ready
(ServeController pid=2922)     ) = ray.get(self._ready_obj_ref)
(ServeController pid=2922)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
(ServeController pid=2922)     return fn(*args, **kwargs)
(ServeController pid=2922)            ^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=2922)     return func(*args, **kwargs)
(ServeController pid=2922)            ^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2782, in get
(ServeController pid=2922)     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeController pid=2922)                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 929, in get_objects
(ServeController pid=2922)     raise value.as_instanceof_cause()
(ServeController pid=2922) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:default:vLLM:qwen-1_5b.initialize_and_get_metadata() (pid=2112, ip=192.168.193.78, actor_id=9fc066631e58962eb4f0662a02000000, repr=<ray.serve._private.replica.ServeReplica:default:vLLM:qwen-1_5b object at 0x7f03469c65d0>)
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
(ServeController pid=2922)     return self.__get_result()
(ServeController pid=2922)            ^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(ServeController pid=2922)     raise self._exception
(ServeController pid=2922)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 967, in initialize_and_get_metadata
(ServeController pid=2922)     await self._replica_impl.initialize(deployment_config)
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 696, in initialize
(ServeController pid=2922)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=2922) RuntimeError: Traceback (most recent call last):
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 673, in initialize
(ServeController pid=2922)     self._user_callable_asgi_app = await asyncio.wrap_future(
(ServeController pid=2922)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1365, in initialize_callable
(ServeController pid=2922)     await self._call_func_or_gen(
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1328, in _call_func_or_gen
(ServeController pid=2922)     result = await result
(ServeController pid=2922)              ^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 447, in __init__
(ServeController pid=2922)     await asyncio.wait_for(self._start_engine(), timeout=ENGINE_START_TIMEOUT_S)
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
(ServeController pid=2922)     return fut.result()
(ServeController pid=2922)            ^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 492, in _start_engine
(ServeController pid=2922)     await self.engine.start()
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 297, in start
(ServeController pid=2922)     self.engine = await self._start_engine()
(ServeController pid=2922)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 309, in _start_engine
(ServeController pid=2922)     if MQLLMEngineClient.is_unsupported_config(engine_args):
(ServeController pid=2922)        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 139, in is_unsupported_config
(ServeController pid=2922)     return vllm_config.parallel_config.pipeline_parallel_size > 1
(ServeController pid=2922)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=2922) AttributeError: 'AsyncEngineArgs' object has no attribute 'parallel_config'
(ServeController pid=2922) ERROR 2025-04-02 17:25:34,556 controller 2922 -- Failed to update the deployments ['vLLM:qwen-1_5b'].
(download_model_files pid=2159, ip=192.168.193.78) INFO 2025-04-02 17:25:28,269 serve 2159 -- No cloud storage mirror configured
(ServeController pid=2922) INFO 2025-04-02 17:25:34,659 controller 2922 -- Replica(id='m37qpvvp', deployment='vLLM:qwen-1_5b', app='default') is stopped.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/scripts.py", line 543, in run
    serve.run(app, blocking=should_block, name=name, route_prefix=route_prefix)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/api.py", line 625, in run
    handle = _run(
             ^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/api.py", line 535, in _run
    return _run_many(
           ^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/api.py", line 513, in _run_many
    return client.deploy_applications(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/client.py", line 52, in check
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/client.py", line 311, in deploy_applications
    self._wait_for_application_running(app.name)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/client.py", line 235, in _wait_for_application_running
    raise RuntimeError(
RuntimeError: Deploying application default failed: Failed to update the deployments ['vLLM:qwen-1_5b'].
2025-04-02 17:25:35,378 ERR scripts.py:586 -- Received unexpected error, see console logs for more details. Shutting down...
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-1.5b",
        model_source="/oss-data/models/DeepSeek-R1-Distill-Qwen-1.5B",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    #accelerator_type="T4",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=1,
        pipeline_parallel_size=2
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])

@rainmaple

@kouroshHakha

@kouroshHakha (Contributor)

@rainmaple what vLLM version are you using? I think the traceback that you pasted suggests it's vllm > 0.8.

Related issue (at least part 1 of that thread is exactly the issue you are talking about):
https://discuss.ray.io/t/ray-serve-llm-example-in-document-cannot-work/22199/3

This PR adds support for that version of vllm. It was merged yesterday. So you should be able to use nightly to see if you still have the issue.
#51726


rainmaple commented Apr 3, 2025

I use vllm==0.8.2.
With a 0-GPU head node, running vllm serve on the model directly failed because the device type could not be inferred, so I turned to the Ray Serve LLM API. Would setting the env var "USE_VLLM_V1": "0" help? Or does vllm==0.7.2 fail directly due to the TP=1 limits mentioned above? @kouroshHakha

@rainmaple

> The process that creates RayDistributedExecutor, which serves as the entry point, must have access to a GPU while not occupying GPU resources within Ray. This conflicts with the typical configuration of Ray Serve.

Does this mean that running vllm serve directly, without Ray Serve, in a production environment (a Ray cluster whose head node has 0 GPUs) may fail due to this design? @huiyeruzhou

@kouroshHakha (Contributor)

> With a 0-GPU head node, running vllm serve on the model directly failed because the device type could not be inferred, so I turned to the Ray Serve LLM API. Would setting the env var "USE_VLLM_V1": "0" help? Or does vllm==0.7.2 fail directly due to the TP=1 limits mentioned above? @kouroshHakha

I did not quite understand the question. Did you try vllm==0.7.2? If it doesn't work, what is the error?


rainmaple commented Apr 3, 2025 via email


kouroshHakha commented Apr 3, 2025

> Emm, just to confirm: according to the analysis mentioned above, is a GPU on the head node necessary when running vLLM without Ray Serve?

Yes. Without Ray Serve, vLLM assumes that the GPU type and CUDA platform can be determined on the driver (which could be the head node). With Ray Serve, we orchestrate things so that this requirement is satisfied.


rainmaple commented Apr 3, 2025 via email


nitingoyal0996 commented Apr 15, 2025

@kouroshHakha I was also following the documentation quick start example to refactor our Ray[Serve] + vLLM implementation to Ray[Serve,LLM] and so far it has been challenging.

After navigating my way through this issue, I was able to complete the deployment without LLMRouter. But to use the OpenAI-compatible server, I tried adding LLMRouter, which failed the deployment with the following error:

(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:52,176 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 99ad0f04-9cd0-46ef-ac0a-2c0e59465dee -- CALL llm_config OK 429.5ms
INFO 2025-04-15 10:19:53,924 serve 27 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,618 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Received streaming request 6f652b14-4b89-4d4b-9ad4-cef817e8b260
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,672 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Request 6f652b14-4b89-4d4b-9ad4-cef817e8b260 started. Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Cutting Knowledge Date: December 2023
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Today Date: 26 Jul 2024
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) <|eot_id|><|start_header_id|>user<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:20:35 engine.py:275] Added request 6f652b14-4b89-4d4b-9ad4-cef817e8b260.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:51 model_runner.py:1562] Graph capturing finished in 8 secs, took 0.17 GiB
(_EngineBackgroundProcess pid=19245) /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
(_EngineBackgroundProcess pid=19245) *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245)     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497:     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) Fatal Python error: Aborted

I have a combination of V100s and A100s, and this exact cluster works fine with the older vLLM implementation. I was wondering if there are additional configurations required to handle heterogeneous GPUs.

Here are my complete configs and terminal output -

# LLMConfig
llm_config_dict = {
    "model_loading_config": {
        "model_id": args.get("model", "meta-llama/Llama-3.1-8B-Instruct"),
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",
    },
    "engine_kwargs": {
        # .. vLLM engine arguments
    },
    "deployment_config": {
        "ray_actor_options": {},
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1
        }
    },
    "runtime_env": {
        "pip": ["httpx", "ray[llm,serve]==2.44.1", "vllm==0.7.2"],
        "env_vars": {
            "USE_VLLM_V1": "0",
            "HF_TOKEN": os.getenv("HF_TOKEN")
        }
    }
}
configs = LLMConfig(**llm_config_dict)

bundles=[
    {"CPU": 1, "GPU": 1} 
    for _ in range(int(args["tensor_parallel_size"]))
]

deployment = LLMServer.as_deployment(
    configs.get_serve_options(name_prefix="vLLM:"),
).options(
    placement_group_bundles=bundles,
    placement_group_strategy="SPREAD"
).bind(configs)

app = LLMRouter.as_deployment().bind(llm_deployments=[deployment])

return app
# Terminal Logs
(base) nitingoyal:~$ docker run --network host -v ~/storage/tmp/ray:/tmp/ray -e RAY_ADDRESS=████████████:3002 serve:latest

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2025-04-15 10:18:35,401 INFO scripts.py:494 -- Running import path: 'serve:build_app'.
INFO 04-15 10:18:38 __init__.py:190] Automatically detected platform cuda.
2025-04-15 10:18:39,518 INFO worker.py:1520 -- Using address ████████████:3002 set in the environment variable RAY_ADDRESS
2025-04-15 10:18:39,518 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: ████████████:3002...
2025-04-15 10:18:39,529 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at http://████████████:5678 
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,157 proxy ████████████ -- Proxy starting on node 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81 (HTTP port: 8000).
INFO 2025-04-15 10:18:42,264 serve 27 -- Started Serve in namespace "serve".
INFO 2025-04-15 10:18:42,281 serve 27 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,243 proxy ████████████ -- Got updated endpoints: {}.
(ServeController pid=18919) INFO 2025-04-15 10:18:42,388 controller 18919 -- Deploying new version of Deployment(name='vLLM:meta-llama--Llama-3_1-8B-Instruct', app='default') (initial target replicas: 1).
(ServeController pid=18919) INFO 2025-04-15 10:18:42,391 controller 18919 -- Deploying new version of Deployment(name='LLMRouter', app='default') (initial target replicas: 2).
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,400 proxy ████████████ -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,421 proxy ████████████ -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7f5394515820>.
(ServeController pid=18919) INFO 2025-04-15 10:18:42,502 controller 18919 -- Adding 1 replica to Deployment(name='vLLM:meta-llama--Llama-3_1-8B-Instruct', app='default').
(ServeController pid=18919) INFO 2025-04-15 10:18:42,507 controller 18919 -- Adding 2 replicas to Deployment(name='LLMRouter', app='default').
(ServeReplica:default:LLMRouter pid=2550, ip=████████████) INFO 04-15 10:18:46 __init__.py:190] Automatically detected platform cuda.
(ProxyActor pid=2630, ip=████████████) INFO 2025-04-15 10:18:48,219 proxy ████████████ -- Proxy starting on node 06553f83498eebaad508fa44d5b1912cc9ec51c786558c089bbe92c7 (HTTP port: 8000).
(ProxyActor pid=2630, ip=████████████) INFO 2025-04-15 10:18:48,274 proxy ████████████ -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=2630, ip=████████████) INFO 2025-04-15 10:18:48,287 proxy ████████████ -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7fdd1e476810>.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:18:51,153 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- No cloud storage mirror configured
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:18:51,153 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- Downloading the tokenizer for meta-llama/Llama-3.1-8B-Instruct
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) WARNING 04-15 10:18:58 config.py:2386] Casting torch.bfloat16 to torch.float16.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 04-15 10:18:50 __init__.py:190] Automatically detected platform cuda. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 04-15 10:19:05 config.py:542] This model supports multiple tasks: {'embed', 'score', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 04-15 10:19:05 config.py:1556] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:06,348 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Getting the server ready ...
(pid=19245) INFO 04-15 10:19:10 __init__.py:190] Automatically detected platform cuda.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:11 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":64}, use_cached_outputs=True, 
(_EngineBackgroundProcess pid=19245) INFO 2025-04-15 10:19:11,502 serve 19245 -- Clearing the current platform cache ...
(_EngineBackgroundProcess pid=19245) WARNING 04-15 10:19:12 ray_utils.py:180] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(_EngineBackgroundProcess pid=19245) WARNING 04-15 10:19:12 ray_utils.py:180] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 06553f83498eebaad508fa44d5b1912cc9ec51c786558c089bbe92c7. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:12 ray_distributed_executor.py:149] use_ray_spmd_worker: False
(_EngineBackgroundProcess pid=19245) Connecting to existing Ray cluster at address: ████████████:3002...
(_EngineBackgroundProcess pid=19245) Calling ray.init() again after it has already been called.
(ServeController pid=18919) WARNING 2025-04-15 10:19:12,539 controller 18919 -- Deployment 'vLLM:meta-llama--Llama-3_1-8B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=18919) WARNING 2025-04-15 10:19:12,540 controller 18919 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(pid=2711, ip=████████████) INFO 04-15 10:19:15 __init__.py:190] Automatically detected platform cuda.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:16,399 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
(pid=19315) INFO 04-15 10:19:16 __init__.py:190] Automatically detected platform cuda.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:18 cuda.py:230] Using Flash Attention backend.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:18 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:18 cuda.py:227] Using XFormers backend.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:19 utils.py:950] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:19 pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=2711, ip=████████████) WARNING 04-15 10:19:19 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:19 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='████████████', local_reader_ranks=[], buffer_handle=None, local_subscribe_port=None, remote_subscribe_port=43343)
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:20 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:08,  2.78s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.81s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:01,  1.89s/it]
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:27,451 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:28 model_runner.py:1115] Loading model weights took 7.5123 GB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 utils.py:950] Found nccl from library libnccl.so.2
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 pynccl.py:69] vLLM is using nccl==2.21.5
(_EngineBackgroundProcess pid=19245) WARNING 04-15 10:19:19 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:20 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.31s/it]
(_EngineBackgroundProcess pid=19245) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:38,495 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:39 worker.py:267] Memory profiling takes 9.60 seconds
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:39 worker.py:267] the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.90) = 35.44GiB
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:39 worker.py:267] model weights take 7.51GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 0.52GiB; the rest of the memory reserved for KV Cache is 27.13GiB.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:29 model_runner.py:1115] Loading model weights took 7.5123 GB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:40 executor_base.py:110] # CUDA blocks: 20867, # CPU blocks: 4096
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:40 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 163.02x
(ServeController pid=18919) WARNING 2025-04-15 10:19:42,631 controller 18919 -- Deployment 'vLLM:meta-llama--Llama-3_1-8B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=18919) WARNING 2025-04-15 10:19:42,632 controller 18919 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:43 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:   0%|          | 0/11 [00:00<?, ?it/s]
Capturing CUDA graph shapes:   9%|▉         | 1/11 [00:01<00:11,  1.19s/it]
Capturing CUDA graph shapes:  18%|█▊        | 2/11 [00:01<00:08,  1.10it/s]
Capturing CUDA graph shapes:  27%|██▋       | 3/11 [00:02<00:06,  1.22it/s]
Capturing CUDA graph shapes:  36%|███▋      | 4/11 [00:03<00:05,  1.30it/s]
Capturing CUDA graph shapes:  45%|████▌     | 5/11 [00:04<00:04,  1.32it/s]
Capturing CUDA graph shapes:  55%|█████▍    | 6/11 [00:04<00:03,  1.38it/s]
Capturing CUDA graph shapes:  64%|██████▎   | 7/11 [00:05<00:02,  1.40it/s]
Capturing CUDA graph shapes:  73%|███████▎  | 8/11 [00:06<00:02,  1.45it/s]
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:49,547 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
Capturing CUDA graph shapes:  82%|████████▏ | 9/11 [00:06<00:01,  1.49it/s]
Capturing CUDA graph shapes:  91%|█████████ | 10/11 [00:07<00:00,  1.53it/s]
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:50 model_runner.py:1562] Graph capturing finished in 7 secs, took 0.17 GiB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:39 worker.py:267] Memory profiling takes 9.74 seconds
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:39 worker.py:267] the current vLLM instance can use total_gpu_memory (31.73GiB) x gpu_memory_utilization (0.90) = 28.56GiB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:39 worker.py:267] model weights take 7.51GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 0.52GiB; the rest of the memory reserved for KV Cache is 20.38GiB.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:43 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:51 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 21.23 seconds
Capturing CUDA graph shapes: 100%|██████████| 11/11 [00:07<00:00,  1.38it/s]
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:51,593 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Server is ready.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:51,594 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- Started vLLM engine.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:52,160 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 fc373110-5b2a-4f6a-bd8d-e1da77d4bcf3 -- CALL llm_config OK 415.5ms
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:52,176 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 99ad0f04-9cd0-46ef-ac0a-2c0e59465dee -- CALL llm_config OK 429.5ms
INFO 2025-04-15 10:19:53,924 serve 27 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,618 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Received streaming request 6f652b14-4b89-4d4b-9ad4-cef817e8b260
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,672 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Request 6f652b14-4b89-4d4b-9ad4-cef817e8b260 started. Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Cutting Knowledge Date: December 2023
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Today Date: 26 Jul 2024
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) <|eot_id|><|start_header_id|>user<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:20:35 engine.py:275] Added request 6f652b14-4b89-4d4b-9ad4-cef817e8b260.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:51 model_runner.py:1562] Graph capturing finished in 8 secs, took 0.17 GiB
(_EngineBackgroundProcess pid=19245) /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
(_EngineBackgroundProcess pid=19245) *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245)     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497:     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) Fatal Python error: Aborted
(_EngineBackgroundProcess pid=19245) 
(_EngineBackgroundProcess pid=19245) Stack (most recent call first):
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 216 in make_llir
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 318 in <lambda>
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/compiler/compiler.py", line 282 in compile
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 662 in run
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 345 in <lambda>
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py", line 827 in context_attention_fwd
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/paged_attn.py", line 213 in forward_prefix
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 573 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 307 in unified_attention
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/_ops.py", line 1116 in __call__
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 201 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 203 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 279 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 365 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172 in __call__
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 541 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1719 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 413 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/utils.py", line 2220 in run_method
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 566 in execute_method
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 401 in _driver_execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 275 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 408 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1386 in step
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 209 in engine_step
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 200 in run_engine_loop
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 137 in start
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 242 in start
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 945 in main_loop
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 320 in <module>
(_EngineBackgroundProcess pid=19245) 
(_EngineBackgroundProcess pid=19245) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, uvloop.loop, ray._raylet, grpc._cython.cygrpc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, markupsafe._speedups, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, msgspec._core, PIL._imagingft, _cffi_backend, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zstandard.backend_c, pyarrow._json, vllm.cumem_allocator, sentencepiece._sentencepiece, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, 
scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, lz4._version, lz4.frame._frame, cuda_utils, __triton_launcher (total: 159)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffc00c37315c503285c649d48b25000000 Worker ID: 0807e63b47fe35f390c29e38062af140187c0bd495c31ff2bb4e0610 Node ID: 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81 Worker IP address: ████████████ Worker port: 10203 Worker PID: 19245 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

This has blocked us from moving forward. I would really appreciate any help with this issue.

@kouroshHakha (Contributor)

@nitingoyal0996 can you open a new issue on this?

Also why are you setting up the placement group yourself? Let's continue this convo in the new issue.

@nitingoyal0996

@kouroshHakha here is the new issue -

#52377
