Kueue mutating webhooks drop fields in KubeRay resources #2878

Closed
andrewsykim opened this issue Aug 22, 2024 · 9 comments · Fixed by #3132
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@andrewsykim
Member

What happened:

When using Kueue with KubeRay, new KubeRay fields that are not recognized by Kueue are dropped during defaulting. Here's an example using kind and a RayJob:

Create cluster:

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋

Install Kueue:

$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.8.0/manifests.yaml
namespace/kueue-system serverside-applied
...
...

Install the latest release candidate of KubeRay:

$ helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0
NAME: kuberay-operator
LAST DEPLOYED: Thu Aug 22 16:03:11 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Deploy Kueue resources:

$ kubectl apply -f kueue-resources.yaml
resourceflavor.kueue.x-k8s.io/default-flavor created
clusterqueue.kueue.x-k8s.io/cluster-queue created
localqueue.kueue.x-k8s.io/local-queue created

Create a RayJob with a reference to the local queue:

$ cat rayjob.yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: image-resize
  labels:
    kueue.x-k8s.io/queue-name: local-queue
spec:
  backoffLimit: 2
  shutdownAfterJobFinishes: true
  entrypoint: python ray-operator/config/samples/ray-data-image-resize/ray_data_image_resize.py
  runtimeEnvYAML: |
    pip:
      - torch
      - torchvision
      - numpy
      - google-cloud-storage
    working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
    env_vars:
      BUCKET_NAME: ray-images
      BUCKET_PREFIX: images
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.34.0'
    headGroupSpec:
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.34.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "2"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  memory: "4Gi"
    workerGroupSpecs:
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.34.0
                resources:
                  limits:
                    cpu: "2"
                    memory: "4Gi"
                  requests:
                    cpu: "2"
                    memory: "4Gi"
$ kubectl apply -f rayjob.yaml
rayjob.ray.io/image-resize created

Note that the RayJob sets spec.backoffLimit, a new field introduced in KubeRay v1.2.0. In this example it is set to 2, but Kueue defaults the field to 0:

$ kubectl get rayjob image-resize -o yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  ...
  labels:
    kueue.x-k8s.io/queue-name: local-queue
  name: image-resize
  ...
spec:
  backoffLimit: 0
  entrypoint: python ray-operator/config/samples/ray-data-image-resize/ray_data_image_resize.py
  rayClusterSpec:
  ...

What you expected to happen:

I believe we saw similar behavior while transitioning KubeRay from v0.6.0 to v1.0.0, but I thought it was specific to the v1alpha1 -> v1 upgrade, when we deleted fields in v1alpha1.

I would not expect Kueue to drop new fields introduced in v1 APIs of KubeRay.

How to reproduce it (as minimally and precisely as possible):

See steps above.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always): v0.8.0
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@andrewsykim added the kind/bug label on Aug 22, 2024
@andrewsykim
Member Author

The immediate fix would be to bump the KubeRay dependency in Kueue to a version that includes all the new fields, but this doesn't seem like a scalable approach. The defaulting behavior seems a bit unusual to me; maybe it's something specific to controller-runtime webhooks?
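
For concreteness, the bump being discussed amounts to pinning the KubeRay ray-operator module in Kueue's go.mod to a release whose API types already declare the new fields. A rough sketch of the excerpt (the version shown is illustrative, not a recommendation of a specific release):

// go.mod excerpt (illustrative): pull in KubeRay API types that already
// declare newer RayJob fields such as spec.backoffLimit.
require (
	github.com/ray-project/kuberay/ray-operator v1.2.0
)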

@alculquicondor
Contributor

In v0.8 we fixed the job reconcilers to use Patch instead of Update to avoid this problem (#2501).

But we didn't check the webhooks, so that's probably the problem.
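
For reference, the reconciler-side change in #2501 comes down to the difference sketched below: a typed Update re-serializes the whole object from the Go struct compiled into Kueue, wiping any field that struct doesn't declare, whereas a merge patch only sends the paths that actually changed. This is an illustrative sketch, not Kueue's actual code; the helper name and the mutation shown are made up for the example:

// Illustrative sketch (not Kueue's actual code): why Patch preserves fields
// that the compiled API types don't know about, while Update can wipe them.
package example

import (
	"context"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func unsuspendRayJob(ctx context.Context, c client.Client, rayJob *rayv1.RayJob) error {
	original := rayJob.DeepCopy()
	rayJob.Spec.Suspend = false // example mutation made by the reconciler

	// c.Update(ctx, rayJob) would send the whole object as serialized from
	// the rayv1.RayJob struct built into this binary, so any server-side
	// field the struct doesn't declare (e.g. a newer spec field) is dropped.

	// A merge patch computed against the original copy only sends the
	// changed paths, leaving unknown server-side fields untouched.
	return c.Patch(ctx, rayJob, client.MergeFrom(original))
}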

@alculquicondor
Contributor

Looking at GKE audit logs, the only API updates come from the ray operator. Thus, the only possible culprit is the webhook.

Thinking about this further:
Yes, the webhook uses patch, which is good.
However, the patch is built from the difference between the Raw object (which will have the field) and the marshalled object (which doesn't have the field).

PatchResponseFromRaw(req.Object.Raw, marshalled)

So it's a bug in controller-runtime.

I'll try to fix it there, but while we wait for a controller-runtime release, our only option is to update the APIs in Kueue.
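
To make the failure mode concrete, here is roughly the shape of a controller-runtime mutating webhook handler that ends in PatchResponseFromRaw. This is an illustrative sketch rather than Kueue's exact webhook (the type name and the defaulting step are made up); the point is that the decode/marshal round trip goes through the compiled RayJob struct, so the generated patch removes anything that struct doesn't declare:

// Illustrative sketch of the controller-runtime webhook pattern described above
// (type name and defaulting step are not Kueue's exact code).
package example

import (
	"context"
	"encoding/json"
	"net/http"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type rayJobDefaulter struct {
	decoder admission.Decoder
}

func (d *rayJobDefaulter) Handle(ctx context.Context, req admission.Request) admission.Response {
	job := &rayv1.RayJob{}
	// Decoding req.Object.Raw into the typed struct silently discards any
	// JSON field the compiled rayv1.RayJob type does not declare, e.g.
	// spec.backoffLimit when built against pre-1.2 KubeRay API types.
	if err := d.decoder.Decode(req, job); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	job.Spec.Suspend = true // example defaulting step

	marshalled, err := json.Marshal(job)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	// The JSON patch is computed from the difference between req.Object.Raw
	// (which still carries the field) and marshalled (which no longer does),
	// so the resulting patch removes the field on the API server.
	return admission.PatchResponseFromRaw(req.Object.Raw, marshalled)
}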

@alculquicondor
Contributor

Fix is up: kubernetes-sigs/controller-runtime#2931

@tenzen-y
Member

/assign @alculquicondor

@andrewsykim
Member Author

As a short-term fix, should we update the KubeRay version to v1.2? cc @astefanutti

@mbobrovskyi
Contributor

> As a short-term fix, should we update the KubeRay version to v1.2? cc @astefanutti

We already did that in #2960.

@andrewsykim
Member Author

Ah I missed that, thank you!
