Commit 8978adf

[Doc][KubeRay] Add a doc to explain why some worker Pods are not ready in RayService (#51095)
Signed-off-by: Cheng-Yeh Chung <kenchung285@gmail.com>
1 parent 2765db7 commit 8978adf

File tree

3 files changed: +196 -0 lines changed

doc/source/cluster/kubernetes/user-guides.md

Lines changed: 2 additions & 0 deletions

@@ -6,6 +6,7 @@
 :hidden:

 Deploy Ray Serve Apps <user-guides/rayservice>
+user-guides/rayservice-no-ray-serve-replica
 user-guides/rayservice-high-availability
 user-guides/observability
 user-guides/upgrade-guide
@@ -38,6 +39,7 @@ at the {ref}`introductory guide <kuberay-quickstart>` first.
 :::

 * {ref}`kuberay-rayservice`
+* {ref}`kuberay-rayservice-no-ray-serve-replica`
 * {ref}`kuberay-rayservice-ha`
 * {ref}`kuberay-observability`
 * {ref}`kuberay-upgrade-guide`

doc/source/cluster/kubernetes/user-guides/rayservice-no-ray-serve-replica.md

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@

(kuberay-rayservice-no-ray-serve-replica)=

# RayService worker Pods aren't ready

This guide explores a specific scenario in KubeRay's RayService API where a Ray worker Pod remains in an unready state due to the absence of a Ray Serve replica.

To better understand this section, you should be familiar with the following Ray Serve components: the [Ray Serve replica and ProxyActor](https://docs.ray.io/en/latest/serve/architecture.html#high-level-view).

The ProxyActor is responsible for forwarding incoming requests to the corresponding Ray Serve replicas.
Hence, if a Ray Pod without a running ProxyActor receives requests, those requests fail.
KubeRay's readiness probe fails on such Pods, rendering them unready so that they don't receive any traffic.

By default, Ray Serve creates a ProxyActor only on Ray Pods that have at least one running Ray Serve replica.

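Proxy placement is governed by the Serve config rather than by KubeRay. As a minimal sketch (this field isn't spelled out in the sample RayService YAML used later in this guide), the top-level `proxy_location` field of the Serve config controls where ProxyActors run; the default, `EveryNode`, starts a proxy only on nodes that host at least one Serve replica, which is the behavior this guide relies on.

```yaml
# Sketch only: the default proxy placement written out explicitly.
# `EveryNode` starts one ProxyActor per Ray node that has at least one Serve replica;
# the other options are `HeadOnly` and `Disabled`.
proxy_location: EveryNode
applications:
  - name: simple_app
    import_path: ray-operator.config.samples.ray-serve.single_deployment_dag:DagNode
```
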
To illustrate, the following example serves one simple Ray Serve app using RayService.

## Step 1: Create a Kubernetes cluster with Kind

```sh
kind create cluster --image=kindest/node:v1.26.0
```

## Step 2: Install the KubeRay operator

Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository.

## Step 3: Install a RayService

```sh
curl -O https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.no-ray-serve-replica.yaml
kubectl apply -f ray-service.no-ray-serve-replica.yaml
```

Look at the Ray Serve configuration `serveConfigV2` embedded in the RayService YAML. The application named `simple_app` defines a single deployment in `deployments` with two options of interest:
* `num_replicas`: Controls the number of replicas that handle requests to this deployment. Set to 1 so that the app has only one Ray Serve replica in total.
* `max_replicas_per_node`: Controls the maximum number of replicas on a single Ray Pod.

See the [Ray Serve documentation](https://docs.ray.io/en/master/serve/configure-serve-deployment.html) for more details.
```yaml
serveConfigV2: |
  applications:
    - name: simple_app
      import_path: ray-operator.config.samples.ray-serve.single_deployment_dag:DagNode
      route_prefix: /basic
      runtime_env:
        working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
      deployments:
        - name: BaseService
          num_replicas: 1
          max_replicas_per_node: 1
          ray_actor_options:
            num_cpus: 0.1
```

Look at the head Pod configuration `rayClusterConfig:headGroupSpec` embedded in the RayService YAML.
It sets the logical CPU resources of the head Pod to 0 by passing the option `num-cpus: "0"` to `rayStartParams`, which prevents Ray Serve replicas from running on the head Pod.
See [rayStartParams](https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md) for more details.
```yaml
headGroupSpec:
  rayStartParams:
    num-cpus: "0"
  template: ...
```

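To double-check this once the RayService from Step 3 is up and its Pods are running, you can list each Ray node's CPU resource from inside the head Pod. The snippet below is an optional sanity check, not part of the original guide; substitute your own head Pod name from `kubectl get pods`.

```sh
# Print each Ray node's logical CPU count. The head node should report 0,
# so Ray Serve replicas (which each need 0.1 CPU here) can't be scheduled on it.
kubectl exec -it <head-pod-name> -- python -c \
  'import ray; ray.init(address="auto"); print({n["NodeManagerAddress"]: n["Resources"].get("CPU", 0) for n in ray.nodes()})'
```
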
## Step 4: Why is one worker Pod not ready?

```sh
# Step 4.1: Wait until the RayService is ready to serve requests.
kubectl describe rayservices.ray.io rayservice-no-ray-serve-replica

# [Example output]
# Conditions:
#   Last Transition Time:  2025-03-18T14:14:43Z
#   Message:               Number of serve endpoints is greater than 0
#   Observed Generation:   1
#   Reason:                NonZeroServeEndpoints
#   Status:                True
#   Type:                  Ready
#   Last Transition Time:  2025-03-18T14:12:03Z
#   Message:               Active Ray cluster exists and no pending Ray cluster
#   Observed Generation:   1
#   Reason:                NoPendingCluster
#   Status:                False
#   Type:                  UpgradeInProgress

# Step 4.2: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                                              READY   STATUS    RESTARTS   AGE
# rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt       1/1     Running   0          2m21s
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l   1/1     Running   0          2m21s
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk   0/1     Running   0          2m20s

# Step 4.3: Check the events of the unready worker Pod.
kubectl describe pods {YOUR_UNREADY_WORKER_POD_NAME}

# [Example output]
# Events:
#   Type     Reason     Age                   From               Message
#   ----     ------     ----                  ----               -------
#   Normal   Scheduled  3m4s                  default-scheduler  Successfully assigned default/rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk to kind-control-plane
#   Normal   Pulled     3m3s                  kubelet            Container image "rayproject/ray:2.41.0" already present on machine
#   Normal   Created    3m3s                  kubelet            Created container wait-gcs-ready
#   Normal   Started    3m3s                  kubelet            Started container wait-gcs-ready
#   Normal   Pulled     2m57s                 kubelet            Container image "rayproject/ray:2.41.0" already present on machine
#   Normal   Created    2m57s                 kubelet            Created container ray-worker
#   Normal   Started    2m57s                 kubelet            Started container ray-worker
#   Warning  Unhealthy  78s (x19 over 2m43s)  kubelet            Readiness probe failed: success
```

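To see exactly what check is failing, you can inspect the readiness probe that KubeRay injects into the Ray container of the worker Pod. This optional inspection step isn't part of the original guide; it assumes the Ray container is named `ray-worker`, matching the events shown in Step 4.3, and the exact probe command may differ across KubeRay versions.

```sh
# Print the readiness probe of the worker Pod's Ray container.
kubectl get pod {YOUR_UNREADY_WORKER_POD_NAME} \
  -o jsonpath='{.spec.containers[?(@.name=="ray-worker")].readinessProbe}'
```
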
Look at the output of Step 4.2. One worker Pod is running and ready, while the other is running but not ready.
Starting from Ray 2.8, a Ray worker Pod that doesn't have any Ray Serve replica doesn't have a Proxy actor either.
Starting from KubeRay v1.1.0, KubeRay adds a readiness probe to every worker Pod's Ray container to check whether the worker Pod has a Proxy actor.
If the worker Pod lacks a Proxy actor, the readiness probe fails, rendering the worker Pod unready, so it doesn't receive any traffic.

With the `spec.serveConfigV2` shown above, Ray Serve creates only one Ray Serve replica and schedules it on one of the worker Pods.
The worker Pod that hosts the Ray Serve replica also gets a Proxy actor, so its readiness probe passes and KubeRay marks it as ready.
The other worker Pod has neither a Ray Serve replica nor a Proxy actor, so KubeRay marks it as unready.

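One way to confirm this from the Ray side, rather than from Kubernetes, is to list the Proxy actors with the `ray list actors` command of the Ray State API. This is an optional check that isn't part of the original guide; the head Pod name below comes from the Step 4.2 example output, so substitute your own.

```sh
# List Proxy actors from inside the head Pod. The node hosting the single Ray Serve
# replica has a ProxyActor, while the unready worker Pod's node has none.
kubectl exec -it rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt -- \
  ray list actors --filter "class_name=ProxyActor"
```
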
## Step 5: Verify the status of the Serve apps

```sh
kubectl port-forward svc/rayservice-no-ray-serve-replica-head-svc 8265:8265
```

See [rayservice-troubleshooting.md](kuberay-raysvc-troubleshoot) for more details on RayService observability.

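With the port-forward above still running, you can also query the Serve REST API on the dashboard port instead of opening the dashboard UI. This is an optional alternative, not part of the original guide.

```sh
# Fetch the Serve application details through the Ray dashboard port.
# The JSON response should show the `simple_app` application with a single
# replica of the BaseService deployment.
curl -s localhost:8265/api/serve/applications/
```
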
The following is a screenshot example of the Serve page in the Ray dashboard.
Note that a `ray::ServeReplica::simple_app::BaseService` and a `ray::ProxyActor` are running on one of the worker Pods, while neither a Ray Serve replica nor a Proxy actor is running on the other. KubeRay marks the former as ready and the latter as unready.
![Ray Serve Dashboard](../images/rayservice-no-ray-serve-replica-dashboard.png)

## Step 6: Send requests to the Serve apps through the Kubernetes serve service

The Kubernetes service `rayservice-no-ray-serve-replica-serve-svc` routes traffic among all the worker Pods that have Ray Serve replicas.
Although one worker Pod is unready, the service can still route traffic to the ready worker Pod that runs the Ray Serve replica. Therefore, users can still send requests to the app and receive responses from it.

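Because Kubernetes only adds ready Pods to a Service's endpoints, you can verify this routing behavior directly. The check below is optional and not part of the original guide.

```sh
# List the endpoints backing the serve service. The IP of the unready worker Pod
# is absent, so it never receives traffic from this service.
kubectl get endpoints rayservice-no-ray-serve-replica-serve-svc
```
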
```sh
# Step 6.1: Run a curl Pod.
# If you already have a curl Pod, you can use `kubectl exec -it <curl-pod> -- sh` to access the Pod.
kubectl run curl --image=radial/busyboxplus:curl -i --tty

# Step 6.2: Send a request to the simple_app.
curl -X POST -H 'Content-Type: application/json' rayservice-no-ray-serve-replica-serve-svc:8000/basic
# [Expected output]: hello world
```

## Step 7: In-place update for Ray Serve apps

Update the `num_replicas` for the app from `1` to `2` in `ray-service.no-ray-serve-replica.yaml`. This change reconfigures the existing RayCluster.

```sh
# Step 7.1: Update the num_replicas of the app from 1 to 2.
# [ray-service.no-ray-serve-replica.yaml]
# deployments:
#   - name: BaseService
#     num_replicas: 2
#     max_replicas_per_node: 1
#     ray_actor_options:
#       num_cpus: 0.1

# Step 7.2: Apply the updated RayService config.
kubectl apply -f ray-service.no-ray-serve-replica.yaml

# Step 7.3: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                                              READY   STATUS    RESTARTS   AGE
# rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt       1/1     Running   0          46m
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l   1/1     Running   0          46m
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk   1/1     Running   0          46m
```

After the reconfiguration, KubeRay submits the updated config to the head Pod, and Ray Serve creates an additional Ray Serve replica to match the new `num_replicas`. Because `max_replicas_per_node` is `1`, the new Ray Serve replica runs on the worker Pod that previously had none. That Pod then gets a Proxy actor, its readiness probe passes, and KubeRay marks it as ready.

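As a final optional check, not part of the original guide, you can confirm the new replica count from inside the head Pod with the Serve CLI. The head Pod name below comes from the earlier example output, so substitute your own.

```sh
# `serve status` should now report the simple_app application as RUNNING,
# with the BaseService deployment at 2 replicas.
kubectl exec -it rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt -- serve status
```
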
## Step 8: Clean up the Kubernetes cluster

```sh
# Delete the RayService.
kubectl delete -f ray-service.no-ray-serve-replica.yaml

# Uninstall the KubeRay operator.
helm uninstall kuberay-operator

# Delete the curl Pod.
kubectl delete pod curl
```

## Next steps

* See the [RayService troubleshooting guide](kuberay-raysvc-troubleshoot) if you encounter any issues.
* See [Examples](kuberay-examples) for more RayService examples.
