Limit Istio Sidecar Scope to reduce memory and make cluster more scalable #3052
[APPROVALNOTIFIER] This PR is NOT APPROVED. It still needs approval from an approver for each of the changed files.
* istio proxy version 1.24.3
  Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
* Update install.yaml
  Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
---------
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Steve Han <stevehan@roblox.com>
Signed-off-by: Steve Han <stevehan2001@gmail.com> Signed-off-by: Steve Han <stevehan@roblox.com>
58bbc19 to c021c15
Thank you for the PR. /ok-to-test
spec:
  egress:
  - hosts:
    - "./*" # use mTLS within the namespace
Does that block traffic to things that are not on the service mesh by default? For example, talking from your JupyterLab to a deployment in the same namespace with istio-injection disabled, egress to a non-Istio native Kubernetes service in another namespace, or just the internet in general. That must still be allowed. Maybe we could just block communication between Kubeflow profile namespaces by default.
Yes, all traffic is still allowed. This rule only stops mTLS and egress rules from being enforced for destinations outside the namespace, and I don't think we have any such rules anyway. The "./*" entry still enables mTLS within the same namespace. See the Istio documentation for details. I would love to hear from a Kubeflow Istio person about this idea.
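To make the "traffic is still allowed" point concrete, here is a minimal sketch of a namespace-scoped Sidecar with the outbound policy spelled out. It is not taken from this PR: the namespace `my-profile` and the explicit `outboundTrafficPolicy` block are assumptions for illustration. `ALLOW_ANY` is normally the mesh-wide default, in which case destinations the sidecar does not know about are passed through as plain traffic rather than blocked.

```yaml
# Illustrative sketch only, not the manifest from this PR.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-profile        # hypothetical user namespace
spec:
  egress:
  - hosts:
    - "./*"                    # only services in this namespace are pushed to the sidecar
  outboundTrafficPolicy:
    mode: ALLOW_ANY            # assumed setting: unknown destinations are passed through, not blocked
```

If a mesh were instead configured with `REGISTRY_ONLY`, traffic to hosts outside the `egress` list would be rejected, so which mode the cluster uses is worth confirming before relying on this behavior.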
We have many mTLS rules for the "kubeflow" namespace. Please check the destination rules in the manifests first. We should not prune the existing explicit ones.
These destination rules all point to other services in the "kubeflow" namespace, which this policy allows. If you want, we can also make this policy apply only to user namespaces (kubeflow-*) by using a Kyverno policy.
There are "kubeflow", knative-serving, istio-system, oauth2-proxy, etc., while the user namespaces are named arbitrarily. They could be called thisismycoolusernamespace. Also, Kyverno is not yet part of the default platform, and we need to work with the tools that are there by default.
"These destination rules all point to other services in the "kubeflow" namespace, which this policy allows": if that is true, it might work. But please add a test to this PR to verify the functionality (connection and mTLS).
@tarekabouzeid do you mind testing this?
Pull Request Template for Kubeflow Manifests
✏️ Summary of Changes
Adds a new Istio Sidecar resource that limits each sidecar's egress visibility so that unnecessary services are excluded.
We (Roblox) have been running Kubeflow in production for a long time, and we are noticing that Istio sidecar memory is now almost 1 GB because of the number of services in the cluster that have to be cached in each sidecar. This adds up to over 2 TB of memory in total. This change limits the caching of cluster services in each sidecar, which helps the cluster scale.
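As a rough illustration of the kind of resource described above, a mesh-wide default could look like the sketch below. Only the `egress.hosts` entry `"./*"` comes from the diff in this PR; the `apiVersion`, the resource name `default`, and the assumption that `istio-system` is the mesh root namespace are illustrative and may differ from the actual manifest.

```yaml
# Minimal sketch, assuming istio-system is the mesh root namespace.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system      # assumption: a Sidecar here acts as the default for namespaces without their own
spec:
  egress:
  - hosts:
    - "./*"                    # use mTLS within the namespace (from the PR diff)
```

With a resource like this, each proxy would only receive configuration for services in its own namespace, which is where the memory saving would come from.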
This change can save terabytes of memory and spare our DNS services. But I want to ask the community whether there is any Istio-enabled egress communication from Kubeflow pods that we haven't considered. As far as we know:
Communication with Notebook and Pipeline backends goes through the ingress gateway instead of directly inside the cluster, so it is not affected.
Communication with KServe models goes through the cluster ingress gateway.
All other CRD-based workloads don't need any egress communication.
🐛 Related Issues
knative/serving#12917: We are facing this issue, where each sidecar queries DNS to resolve the cluster ingress gateway IP, essentially DDoSing our DNS. Removing the ExternalName service for the cluster ingress gateway from the sidecars would resolve this problem.
✅ Contributor Checklist
Slack message link: https://cloud-native.slack.com/archives/C073W572LA2/p1741893411623659