Contributor

@madmecodes madmecodes commented Aug 5, 2025

✏️ Summary of Changes

Updated the old PR #3052 (Istio sidecar memory optimization) to work with the current manifests structure:

Changes:

  • Added sidecar-prune-egress.yaml to common/istio/istio-install/overlays/insecure/
  • Updated kustomization.yaml to include the new sidecar config
  • Implements global egress restriction (`./*`, `kubeflow/*`, `istio-system/*`)
  • Adds notebook-controller exception for full cluster access
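
For illustration, the two Sidecar resources described in the list above might look roughly like the following. This is a minimal sketch, not the contents of the merged sidecar-prune-egress.yaml: the resource names, the `app: notebook-controller` label, and the choice of the `kubeflow` namespace for the exception are assumptions. It relies on standard Istio behavior, where a Sidecar without a `workloadSelector` in the root namespace (`istio-system` by default) applies mesh-wide.

```yaml
# Sketch only: names and the workload label are assumptions,
# not copied from the merged sidecar-prune-egress.yaml.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default                # assumed name
  namespace: istio-system      # root namespace, so this acts as the mesh-wide default
spec:
  egress:
  - hosts:
    - "./*"                    # services in the workload's own namespace
    - "kubeflow/*"             # services in the kubeflow namespace
    - "istio-system/*"         # services in the istio-system namespace
---
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: notebook-controller    # assumed name
  namespace: kubeflow          # assumed namespace for the exception
spec:
  workloadSelector:
    labels:
      app: notebook-controller # assumed label; check the actual Deployment
  egress:
  - hosts:
    - "*/*"                    # all services in all namespaces
```

A workload-specific Sidecar takes precedence over the mesh-wide default, which is how an exception of this kind can coexist with the global restriction.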

Benefits: (as discussed in #3052)

  • Reduces sidecar memory from ~1GB to much less by limiting service discovery scope
  • Saves TBs of memory in large clusters (proven in Roblox production)
  • Uses modern overlay pattern instead of modifying base

📦 Dependencies

#3052

✅ Contributor Checklist

  • I have tested these changes with kustomize. See Installation Prerequisites.
  • All commits are signed-off to satisfy the DCO check.
  • I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue (see 1. and 2.).

You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

@madmecodes
Contributor Author

/retest

@madmecodes
Contributor Author

@juliusvonkohout

@madmecodes
Contributor Author

/retest

@madmecodes
Contributor Author

/retest

@juliusvonkohout
Member

@kunal-511 can you help investigate the test failure?

@juliusvonkohout
Member

juliusvonkohout commented Aug 19, 2025

@kunal-511 @madmecodes this is from the GHA logs. I see CrashLoopBackOff, so this is probably something you need to debug locally. Also, the test should just wait for all pods in the user namespace, including the master, not just the worker.

Name:             pytorch-simple-master-0
Namespace:        kubeflow-user-example-com
Priority:         0
Service Account:  default
Node:             kind-worker/172.18.0.2
Start Time:       Tue, 19 Aug 2025 15:11:33 +0000
Labels:           sidecar.istio.io/inject=false
                  training.kubeflow.org/job-name=pytorch-simple
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=pytorchjob-controller
                  training.kubeflow.org/replica-index=0
                  training.kubeflow.org/replica-type=master
Annotations:      <none>
Status:           Running
IP:               10.244.2.36
IPs:
  IP:           10.244.2.36
Controlled By:  PyTorchJob/pytorch-simple
Containers:
  pytorch:
    Container ID:  containerd://a92eb6fa1e726dfca8f9eba8601cdc2c68c7d2572926a5fae04290847ce24557
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      docker.io/kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
    Port:          23456/TCP
    Host Port:     0/TCP
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    136
      Started:      Tue, 19 Aug 2025 15:15:00 +0000
      Finished:     Tue, 19 Aug 2025 15:15:01 +0000
    Ready:          False
    Restart Count:  5
    Environment:
      PYTHONUNBUFFERED:    1
      MASTER_PORT:         23456
      PET_MASTER_PORT:     23456
      MASTER_ADDR:         pytorch-simple-master-0
      PET_MASTER_ADDR:     pytorch-simple-master-0
      WORLD_SIZE:          2
      RANK:                0
      PET_NODE_RANK:       0
      PET_NPROC_PER_NODE:  auto
      PET_NNODES:          2
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nsmjg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-nsmjg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  4m5s                 default-scheduler  Successfully assigned kubeflow-user-example-com/pytorch-simple-master-0 to kind-worker
  Normal   Pulled     3m38s                kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 26.133s (26.133s including waiting). Image size: 1285228786 bytes.
  Normal   Pulled     3m33s                kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 588ms (588ms including waiting). Image size: 1285228786 bytes.
  Normal   Pulled     3m20s                kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 561ms (561ms including waiting). Image size: 1285228786 bytes.
  Normal   Pulled     2m53s                kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 584ms (584ms including waiting). Image size: 1285228786 bytes.
  Normal   Pulled     2m4s                 kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 559ms (559ms including waiting). Image size: 1285228786 bytes.
  Normal   Pulling    37s (x6 over 4m4s)   kubelet            Pulling image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727"
  Normal   Created    37s (x6 over 3m38s)  kubelet            Created container: pytorch
  Normal   Started    37s (x6 over 3m38s)  kubelet            Started container pytorch
  Normal   Pulled     37s                  kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 575ms (575ms including waiting). Image size: 1285228786 bytes.
  Warning  BackOff    8s (x16 over 3m32s)  kubelet            Back-off restarting failed container pytorch in pod pytorch-simple-master-0_kubeflow-user-example-com(0efd83d5-7b47-4d51-aede-a454bb4806cd)


Name:             pytorch-simple-worker-0
Namespace:        kubeflow-user-example-com
Priority:         0
Service Account:  default
Node:             kind-worker2/172.18.0.3
Start Time:       Tue, 19 Aug 2025 15:11:32 +0000
Labels:           sidecar.istio.io/inject=false
                  training.kubeflow.org/job-name=pytorch-simple
                  training.kubeflow.org/operator-name=pytorchjob-controller
                  training.kubeflow.org/replica-index=0
                  training.kubeflow.org/replica-type=worker
Annotations:      <none>
Status:           Pending
IP:               10.244.1.44
IPs:
  IP:           10.244.1.44
Controlled By:  PyTorchJob/pytorch-simple
Init Containers:
  init-pytorch:
    Container ID:  containerd://3b7de682c87ff93158dd55d822a034f6b3595c3006cd172f38ba5d264e35c09f
    Image:         alpine:3.10
    Image ID:      docker.io/library/alpine@sha256:451eee8bedcb2f029756dc3e9d73bab0e7943c1ac55cff3a4861c52a0fdd3e98
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      err=1;for i in $(seq 100); do if nslookup pytorch-simple-master-0; then err=0 && break; fi;echo waiting for master; sleep 2; done; exit $err
    State:          Running
      Started:      Tue, 19 Aug 2025 15:11:35 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  20Mi
    Requests:
      cpu:        50m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sw55l (ro)
Containers:
  pytorch:
    Container ID:  
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      
    Port:          23456/TCP
    Host Port:     0/TCP
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      PYTHONUNBUFFERED:    1
      MASTER_PORT:         23456
      PET_MASTER_PORT:     23456
      MASTER_ADDR:         pytorch-simple-master-0
      PET_MASTER_ADDR:     pytorch-simple-master-0
      WORLD_SIZE:          2
      RANK:                1
      PET_NODE_RANK:       1
      PET_NPROC_PER_NODE:  auto
      PET_NNODES:          2
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sw55l (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-sw55l:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  4m5s  default-scheduler  Successfully assigned kubeflow-user-example-com/pytorch-simple-worker-0 to kind-worker2
  Normal  Pulling    4m4s  kubelet            Pulling image "alpine:3.10"
  Normal  Pulled     4m2s  kubelet            Successfully pulled image "alpine:3.10" in 2.001s (2.001s including waiting). Image size: 2801976 bytes.
  Normal  Created    4m2s  kubelet            Created container: init-pytorch
  Normal  Started    4m2s  kubelet            Started container init-pytorch
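
As a rough sketch of the suggestion above to wait for all pods in the user namespace, a GitHub Actions step could look like the following. The step name and timeout are assumptions, and this is not taken from the repository's workflows.

```yaml
# Hypothetical GitHub Actions step; the step name and timeout are assumptions.
- name: Wait for all pods in the user namespace
  run: |
    kubectl wait --for=condition=Ready pods --all \
      --namespace kubeflow-user-example-com \
      --timeout=300s
```

Pods that have already run to completion do not report Ready, so a real step may need to exclude them or wait on the job resource instead.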

@juliusvonkohout
Member

juliusvonkohout commented Aug 19, 2025

Istio injection is disabled for these pods, so that should be fine. You can also fork and test.

kubeflow-user-example-com   4m18s       Normal    Completed                      job/grid-7vbf6d55                                                Job completed
kubeflow-user-example-com   4m18s       Normal    JobSucceeded                   trial/grid-7vbf6d55                                              Job grid-7vbf6d55 has succeeded
kubeflow-user-example-com   4m18s       Normal    JobDeleted                     trial/grid-7vbf6d55                                              Job grid-7vbf6d55 has been deleted
kubeflow-user-example-com   4m1s        Normal    SuccessfulCreateService        pytorchjob/pytorch-simple                                        Created service: pytorch-simple-worker-0
kubeflow-user-example-com   4m1s        Normal    SuccessfulCreatePod            pytorchjob/pytorch-simple                                        Created pod: pytorch-simple-master-0
kubeflow-user-example-com   4m1s        Normal    Scheduled                      pod/pytorch-simple-worker-0                                      Successfully assigned kubeflow-user-example-com/pytorch-simple-worker-0 to kind-worker2
kubeflow-user-example-com   4m1s        Normal    SuccessfulCreatePod            pytorchjob/pytorch-simple                                        Created pod: pytorch-simple-worker-0
kubeflow-user-example-com   4m1s        Normal    JobDeleted                     trial/grid-t7sq4qtq                                              Job grid-t7sq4qtq has been deleted
kubeflow-user-example-com   4m1s        Normal    JobSucceeded                   trial/grid-t7sq4qtq                                              Job grid-t7sq4qtq has succeeded
kubeflow-user-example-com   4m1s        Normal    SuccessfulCreateService        pytorchjob/pytorch-simple                                        Created service: pytorch-simple-master-0
kubeflow-user-example-com   4m1s        Normal    Scheduled                      pod/pytorch-simple-master-0                                      Successfully assigned kubeflow-user-example-com/pytorch-simple-master-0 to kind-worker
kubeflow-user-example-com   4m1s        Normal    Completed                      job/grid-t7sq4qtq                                                Job completed
kubeflow-user-example-com   4m          Normal    Pulling                        pod/pytorch-simple-worker-0                                      Pulling image "alpine:3.10"
kubeflow-user-example-com   33s         Normal    Pulling                        pod/pytorch-simple-master-0                                      Pulling image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727"
kubeflow-user-example-com   3m58s       Normal    Started                        pod/pytorch-simple-worker-0                                      Started container init-pytorch
kubeflow-user-example-com   3m58s       Normal    Created                        pod/pytorch-simple-worker-0                                      Created container: init-pytorch
kubeflow-user-example-com   3m58s       Normal    Pulled                         pod/pytorch-simple-worker-0                                      Successfully pulled image "alpine:3.10" in 2.001s (2.001s including waiting). Image size: 2801976 bytes.
kubeflow-user-example-com   3m42s       Warning   Unhealthy                      pod/grid-grid-8c9df6b6-l9zsg                                     Readiness probe failed: timeout: failed to connect service "10.244.2.33:6789" within 1s: context deadline exceeded
kubeflow-user-example-com   33s         Normal    Started                        pod/pytorch-simple-master-0                                      Started container pytorch
kubeflow-user-example-com   3m34s       Normal    Pulled                         pod/pytorch-simple-master-0                                      Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 26.133s (26.133s including waiting). Image size: 1285228786 bytes.
kubeflow-user-example-com   33s         Normal    Created                        pod/pytorch-simple-master-0                                      Created container: pytorch
kubeflow-user-example-com   31s         Normal    ExitedWithCode                 pytorchjob/pytorch-simple                                        Pod: kubeflow-user-example-com.pytorch-simple-master-0 exited with code 136
kubeflow-user-example-com   31s         Warning   Error                          pytorchjob/pytorch-simple                                        Error pod pytorch-simple-master-0 container pytorch exitCode: 136 terminated message:
kubeflow-user-example-com   3m29s       Normal    Pulled                         pod/pytorch-simple-master-0                                      Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 588ms (588ms including waiting). Image size: 1285228786 bytes.
kubeflow-user-example-com   4s          Warning   BackOff                        pod/pytorch-simple-master-0                                      Back-off restarting failed container pytorch in pod pytorch-simple-master-0_kubeflow-user-example-com(0efd83d5-7b47-4d51-aede-a454bb4806cd)
kubeflow-user-example-com   3m16s       Normal    Pulled                         pod/pytorch-simple-master-0                                      Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 561ms (561ms including waiting). Image size: 1285228786 bytes.
kubeflow-user-example-com   3m16s       Warning   CrashLoopBackOff               pytorchjob/pytorch-simple                                        Error pod pytorch-simple-master-0 container pytorch waiting message: back-off 10s restarting failed container=pytorch pod=pytorch-simple-master-0_kubeflow-user-example-com(0efd83d5-7b47-4d51-aede-a454bb4806cd)
kubeflow-user-example-com   3m3s        Warning   CrashLoopBackOff               pytorchjob/pytorch-simple                                        Error pod pytorch-simple-master-0 container pytorch waiting message: back-off 20s restarting failed container=pytorch pod=pytorch-simple-master-0_kubeflow-user-example-com(0efd83d5-7b47-4d51-aede-a454bb4806cd)

madmecodes and others added 8 commits August 20, 2025 16:25
…ore scalable

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
…ess.yaml

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
@juliusvonkohout
Member

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@juliusvonkohout juliusvonkohout changed the title from "Limit Istio Sidecar Scope to reduce memory" to "Limit Istio Sidecar Scope to reduce memory usage" on Aug 20, 2025
@google-oss-prow google-oss-prow bot merged commit 397fe88 into kubeflow:master Aug 20, 2025
32 checks passed
andyatmiami added a commit to andyatmiami/kubeflow-manifests that referenced this pull request Aug 22, 2025
related: kubeflow#3206

With the work to Limit Istio Sidecar Scope to reduce memory usage, the `kubeflow` namespace now needs to be present **prior** to the installation of Istio.

While this actual work was already merged, a minor doc update was missed in the `Install Individual Components` section of the `README` - which had the `kubeflow` `namespace` getting installed much later in the process.

Anyone following the doc would get an error about `kubeflow` `namespace` not existing while trying to install `Istio`.

This PR simply moves the `namespace` install section prior to Istio.

Signed-off-by: Andy Stoneberg <astonebe@redhat.com>
google-oss-prow bot pushed a commit that referenced this pull request Aug 22, 2025
* doc: fix ordering of individual install steps

related: #3206

Signed-off-by: Andy Stoneberg <astonebe@redhat.com>

* Update README.md

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>

---------

Signed-off-by: Andy Stoneberg <astonebe@redhat.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Co-authored-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>