Conversation

@jholt96 (Contributor) commented Sep 6, 2025

Pull Request Template for Kubeflow Manifests

✏️ Summary of Changes

This change fixes artifact retrieval in Kubeflow Pipelines when the default Istio Sidecar resource is enabled. The initial default Sidecar configuration blocks the ml-pipeline-ui pod from proxying users' artifact requests to the ml-pipeline-ui-artifact pods in profile-created namespaces. This is because, in Kubeflow 1.9+, the AuthorizationPolicy created in each profile namespace restricts requests based on the Istio principal and the service account of the ml-pipeline-ui pod. Without this change, users get "RBAC: Access Denied".
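For context, the per-namespace policy described above looks roughly like the following sketch; the resource name and profile namespace here are illustrative, not the exact manifest generated by Kubeflow:

```yaml
# Illustrative sketch of the AuthorizationPolicy that Kubeflow 1.9+ creates
# in profile namespaces: only requests carrying the mTLS principal of the
# ml-pipeline-ui service account are allowed. If the default Sidecar
# configuration strips that identity from proxied artifact requests, the
# rule matches nothing and users see "RBAC: Access Denied".
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: ml-pipeline-ui-artifact   # name is illustrative
  namespace: my-profile           # placeholder profile namespace
spec:
  selector:
    matchLabels:
      app: ml-pipeline-ui-artifact
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/kubeflow/sa/ml-pipeline-ui
```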

📦 Dependencies

None

🐛 Related Issues

Link any issues that are resolved or affected by this PR.

✅ Contributor Checklist

  • I have tested these changes with kustomize. See Installation Prerequisites.
  • All commits are signed-off to satisfy the DCO check.
  • I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue (see 1. and 2.).

You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

@jholt96 force-pushed the master branch 4 times, most recently from f91b089 to 581e81c, on September 6, 2025 at 20:54
@droctothorpe

Thanks for this contribution, @jholt96!

/approve
/lgtm

CC @juliusvonkohout

@juliusvonkohout (Member) commented Sep 9, 2025

@jholt96 ml-pipeline-ui-artifact has been removed, and you now have proper zero-overhead namespaces. It is already merged in kubeflow/pipelines and will soon be the default in kubeflow/manifests as well. https://medium.com/@hpotpose26/kubeflow-pipelines-embraces-seaweedfs-9a7e022d5571. See

`- name: Install Pipelines with SeaweedFS`

What we actually need is to extend our tests to also fetch artifacts in

`python3 tests/pipeline_v2_test.py run_pipeline "${TOKEN}" "${KF_PROFILE}"`

Can you adjust the Python file to also fetch artifacts from the run? CC @akagami-harsh for helping out in a separate PR against the master branch of kubeflow/manifests, and ping me. Just get main.logs or so and print it in the terminal. @droctothorpe, we need a KFP release from the KFP master branch with SeaweedFS; then we synchronize and it is done. It is a large change, so some small friction is expected.
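One possible shape for that test extension, as a hedged sketch: a helper that walks a KFP v2 run-detail response and collects output-artifact URIs for the test to fetch and print. The response schema assumed here (`run_details.task_details[].outputs`) is an assumption and must be checked against the real API payload:

```python
# Sketch only: collect output-artifact URIs from a KFP v2 run-detail
# response (as fetched by the existing pipeline_v2_test.py helpers), so
# the test can download and print them (e.g. main.logs). The nested keys
# used below are hypothetical, not verified against the KFP REST API.
def collect_artifact_uris(run_detail: dict) -> list:
    uris = []
    for task in run_detail.get("run_details", {}).get("task_details", []):
        for artifact_list in task.get("outputs", {}).values():
            for artifact in artifact_list.get("artifacts", []):
                uri = artifact.get("uri")
                if uri:
                    uris.append(uri)
    return uris
```

The existing run_pipeline flow could call this on the run-detail JSON and then fetch each URI through whatever S3 client the test environment provides.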

@droctothorpe

@juliusvonkohout in the seaweed implementation, how does the FE access artifacts in external object stores that only end user namespaces have access to?

@jholt96 (Contributor, Author) commented Sep 9, 2025

@juliusvonkohout Happy to help with the test, and the zero-overhead namespaces sound awesome.

  1. Does that mean the ml-pipeline-ui pods no longer initiate any egress requests to pods in user namespaces? If they still do, then this change is necessary to ensure AuthorizationPolicies work properly.

  2. For those that have not upgraded, and even those on the latest release, I think this change is still worth committing to this repo, since the issue will exist until they migrate to SeaweedFS. I can move it to an overlay directory if that makes more sense to you.

IMO it is at least worth documenting, since this issue is hard to diagnose if people are unaware of the Sidecar object implementation and what it means for Istio mTLS.

@juliusvonkohout (Member)

> @juliusvonkohout Happy to help with the test, and the zero-overhead namespaces sound awesome.
>
> 1. Does that mean the ml-pipeline-ui pods no longer initiate any egress requests to pods in user namespaces? If they still do, then this change is necessary to ensure AuthorizationPolicies work properly.

No, they do not need access anymore. With 500 users/namespaces on your Kubeflow cluster you can save terabytes of memory for Istio sidecars, and 1000 (500 × 2) pods, if you use the master branch with SeaweedFS.

> 2. For those that have not upgraded, and even those on the latest release, I think this change is still worth committing to this repo, since the issue will exist until they migrate to SeaweedFS. I can move it to an overlay directory if that makes more sense to you.
>
> IMO it is at least worth documenting, since this issue is hard to diagnose if people are unaware of the Sidecar object implementation and what it means for Istio mTLS.

We can make SeaweedFS the default here, as I did with one commit here.

@juliusvonkohout (Member) commented Sep 9, 2025

> @juliusvonkohout in the seaweed implementation, how does the FE access artifacts in external object stores that only end user namespaces have access to?

I do not know what you mean by FE, but I will try to answer the general question.

The S3 storage in the Kubeflow namespace should handle it, even if it just acts as a gateway to AWS/GCP; you can configure it as a gateway. If there is storage that is only accessible from the user's namespace but is somehow considered Kubeflow storage, that breaks the architecture. The artifact proxy has many CVEs and is very bad architecture in general; relying on it would be massive technical debt and a graduation blocker. We should have one cluster-administrator-defined S3 storage for Kubeflow where artifacts are stored. That can be SeaweedFS directly, or SeaweedFS in gateway mode to GCP/AWS/Azure/MinIO etc., but it is one place that pipeline-run caching etc. also relies on, and where administrators set lifecycle policies for users. You can still pull input data for your KFP pipelines from anywhere else, but the derivatives computed in KFP should either be stored and managed within Kubeflow's S3 endpoint under the administrator's policies, or be considered out of scope for Kubeflow.

Per-namespace separate storage systems cannot be properly or reliably managed and accessed by Kubeflow; that would be end-user territory. Namespaces and external buckets can be deleted, secrets and ConfigMaps in user namespaces can be modified, etc. The artifact proxy, with its many CVEs, allows a lot of exploits, costs resources, bloats the service mesh, and can simply be stopped by the user. A network connection to an external, user-provided bucket somewhere on the internet is also not guaranteed. That is nothing we can rely on, enforce policies for, support caching with, or guarantee security for. For such external storage we can only show the URL to the bucket/artifact in the UI; everything else would be best effort and unreliable.

@jholt96 (Contributor, Author) commented Sep 9, 2025

> @juliusvonkohout Happy to help with the test, and the zero-overhead namespaces sound awesome.
>
> 1. Does that mean the ml-pipeline-ui pods no longer initiate any egress requests to pods in user namespaces? If they still do, then this change is necessary to ensure AuthorizationPolicies work properly.
>
> No, they do not need access anymore. With 500 users/namespaces on your Kubeflow cluster you can save terabytes of memory for Istio sidecars, and 1000 (500 × 2) pods, if you use the master branch with SeaweedFS.
>
> 2. For those that have not upgraded, and even those on the latest release, I think this change is still worth committing to this repo, since the issue will exist until they migrate to SeaweedFS. I can move it to an overlay directory if that makes more sense to you.
>
> IMO it is at least worth documenting, since this issue is hard to diagnose if people are unaware of the Sidecar object implementation and what it means for Istio mTLS.
>
> We can make SeaweedFS the default here, as I did with one commit here.

That is a great change, and I am excited for the CVE remediation alone lol. I still think the Sidecar configuration needs to be documented in the Istio README, as it essentially hides the fact that it breaks mTLS for any use case not covered by the egress rules. We had to modify ours not only for ml-pipeline-ui-artifact but also for some other workloads where the Sidecar configuration was breaking mTLS rules in our AuthorizationPolicies. I added a commit example for the documentation.

I can create a new PR that updates the test and then adds that documentation, if that sounds good to you.
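To illustrate the failure mode being described (a hedged sketch, not the actual manifest in this repo): a restrictive mesh-wide default `Sidecar` limits each workload's egress to its own namespace plus istio-system, so cross-namespace calls such as ml-pipeline-ui to ml-pipeline-ui-artifact fall outside the mesh's mTLS-identified routing, and destination AuthorizationPolicies that match on principals deny them:

```yaml
# Illustrative mesh-wide default Sidecar. Any egress destination not
# listed under `hosts` bypasses the mesh's mTLS-identified routing,
# which silently breaks principal-based AuthorizationPolicies downstream.
apiVersion: networking.istio.io/v1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system
spec:
  egress:
    - hosts:
        - "./*"            # own namespace only
        - "istio-system/*" # control plane and gateways
```

The fix is then either a broader egress list or a per-workload Sidecar covering the namespaces the workload legitimately calls.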

@juliusvonkohout (Member)

> @juliusvonkohout Happy to help with the test, and the zero-overhead namespaces sound awesome.
>
> 1. Does that mean the ml-pipeline-ui pods no longer initiate any egress requests to pods in user namespaces? If they still do, then this change is necessary to ensure AuthorizationPolicies work properly.
>
> No, they do not need access anymore. With 500 users/namespaces on your Kubeflow cluster you can save terabytes of memory for Istio sidecars, and 1000 (500 × 2) pods, if you use the master branch with SeaweedFS.
>
> 2. For those that have not upgraded, and even those on the latest release, I think this change is still worth committing to this repo, since the issue will exist until they migrate to SeaweedFS. I can move it to an overlay directory if that makes more sense to you.
>
> IMO it is at least worth documenting, since this issue is hard to diagnose if people are unaware of the Sidecar object implementation and what it means for Istio mTLS.
>
> We can make SeaweedFS the default here, as I did with one commit here.
>
> That is a great change, and I am excited for the CVE remediation alone lol. I still think the Sidecar configuration needs to be documented in the Istio README, as it essentially hides the fact that it breaks mTLS for any use case not covered by the egress rules. We had to modify ours not only for ml-pipeline-ui-artifact but also for some other workloads where the Sidecar configuration was breaking mTLS rules in our AuthorizationPolicies. I added a commit example for the documentation.
>
> I can create a new PR that updates the test and then adds that documentation, if that sounds good to you.

Yes, please. You can get help from @akagami-harsh on Slack for enabling SeaweedFS by default right now, as in this PR. I think we have to move one NetworkPolicy. @akagami-harsh, you can also fork this one here and create a separate PR to use SeaweedFS by default right now.

@juliusvonkohout (Member)

/ok-to-test

@droctothorpe commented Sep 9, 2025

@juliusvonkohout, our end users have specific buckets tied to their namespaces that only they have access to. I think this is a fairly common model. Isolating end users to namespace-specific S3 buckets improves multi-tenant isolation and security and ensures that sensitive end-user data is not stored in a shared place. It also grants end users the ability to access and life-cycle manage their buckets directly. This is precisely why the artifact proxy was implemented. If the seaweed implementation breaks that functionality, I suspect we have a significant regression on our hands, and it's likely that other users will be impacted as well.

I understand the reluctance to use the FE (frontend) server for the proxy. I share it. What if we made the artifact proxy deployment in end-user namespaces conditional, so that users like you who don't need it can disable it? Medium-term, we can work on implementing a stripped-down proxy in Golang that can also be disabled.

Did the seaweedfs PR strip the artifact proxy handlers out of the node server?

CC @zazulam

@juliusvonkohout (Member) commented Sep 9, 2025

> @juliusvonkohout, our end users have specific buckets tied to their namespaces that only they have access to. I think this is a fairly common model. Isolating end users to namespace-specific S3 buckets improves multi-tenant isolation and security and ensures that sensitive end-user data is not stored in a shared place. It also grants end users the ability to access and life-cycle manage their buckets directly. This is precisely why the artifact proxy was implemented. If the seaweed implementation breaks that functionality, I suspect we have a significant regression on our hands, and it's likely that other users will be impacted as well.
>
> I understand the reluctance to use the FE (frontend) server for the proxy. I share it. What if we made the artifact proxy deployment in end-user namespaces conditional, so that users like you who don't need it can disable it? Medium-term, we can work on implementing a stripped-down proxy in Golang that can also be disabled.
>
> Did the seaweedfs PR strip the artifact proxy handlers out of the node server?
>
> CC @zazulam

SeaweedFS should also work with the artifact proxy, but we should definitely make the proxy opt-in instead of opt-out, for security by default. You can re-enable the artifact proxy (which is not covered by tests) in your cluster right now by adjusting sync.py and the ml-pipeline-ui deployment. By default, the old deployments should also continue to exist. "Isolating end users to namespace-specific S3 buckets improves multi-tenant isolation and security and ensures that sensitive end-user data is not stored in a shared place" is solved by SeaweedFS now, but yes, the lifecycle of artifacts created by KFP is then governed by the administrator. For me that makes more sense. It is fine if users have their own buckets, but then they should push long-term artifacts explicitly to per-namespace buckets, not have them unreliably half-managed by Kubeflow. Also, as a side note, all of this should only affect the UI, not actual pipeline runs. If this is no longer true, please point me to a piece of code where the artifact proxy is used for more than UI functionality. So as of now we are only talking about displaying artifacts from user-controlled buckets outside of Kubeflow differently in the UI.

@droctothorpe

You're correct in that it only impacts the UI.

> Seaweedfs should also work with the artifact proxy

Can you elaborate on this pattern? What would it look like, practically speaking? Or do you just mean it would work alongside the artifact proxy, meaning some UI endpoints route through SeaweedFS while others route through the artifact proxy?

@juliusvonkohout (Member) commented Sep 10, 2025

> You're correct in that it only impacts the UI.
>
> Seaweedfs should also work with the artifact proxy
>
> Can you elaborate on this pattern? What would it look like, practically speaking? Or do you just mean it would work alongside the artifact proxy, meaning some UI endpoints route through SeaweedFS while others route through the artifact proxy?

It is just the same as before with MinIO and the artifact proxy, so an ugly mess. The request goes from the user to the UI, to the user's namespace, to the artifact proxy, and then, with the secrets in the user's namespace, to the bucket: either back to the Kubeflow namespace to the S3 storage (MinIO/SeaweedFS), or to a bucket outside the Kubeflow space. And then back to the UI in the Kubeflow namespace. So that violates the architecture, on top of the CVEs you do not want in an enterprise environment. Without the proxy, it is just UI to S3.

Just displaying the artifact URL in the UI (status quo) when it is an external, user-controlled bucket sounds like the better and safer option to me. Also, external buckets are not pure Kubernetes/Kubeflow that we can properly test in our CI/CD; that is customization on the consuming company's end.

@juliusvonkohout (Member)

Please rebase after #3240 is merged

@droctothorpe

I don't think it makes sense to use the artifact proxy alongside MinIO. The proxy is useful specifically when you're not relying on in-cluster storage. I think we can decouple MinIO support from artifact proxy support, and I think we can implement a stripped-down proxy that doesn't rely on the frontend image.

@droctothorpe

I don't understand how S3 requests are succeeding at all without an artifact proxy given this handler, which attempts to route requests through the artifact proxy. Have you had a chance to manually validate S3 (or GCS or ABS for that matter) access with the artifact proxy disabled?

…y and add documentation on the sidecar object configuration

Signed-off-by: Josh Holt <jholt96@live.com>
Signed-off-by: Josh Holt <jholt96@live.com>
@droctothorpe

Moving discussion to here: kubeflow/pipelines#12239

@juliusvonkohout (Member)

> I don't think it makes sense to use the artifact proxy alongside MinIO. The proxy is useful specifically when you're not relying on in-cluster storage. I think we can decouple MinIO support from artifact proxy support, and I think we can implement a stripped-down proxy that doesn't rely on the frontend image.

I could imagine companies using both, and heterogeneous use cases per department.

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
@juliusvonkohout (Member)

/ok-to-test

Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Sep 12, 2025
juliusvonkohout and others added 2 commits September 12, 2025 11:56
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: Josh Holt <jholt96@live.com>
@juliusvonkohout (Member)

@jholt96 @akagami-harsh can you fix the tests? Did you rebase to the latest master?

@jholt96 (Contributor, Author) commented Sep 12, 2025

> @jholt96 @akagami-harsh can you fix the tests? Did you rebase to the latest master?

It has been rebased to the latest master.

@jholt96 (Contributor, Author) commented Sep 12, 2025

Ah, there was an issue with camelCase; it should be good now, @juliusvonkohout.

@jholt96 (Contributor, Author) commented Sep 12, 2025

The apiVersion change in the AuthorizationPolicy for KServe was also broken. Not sure why the Dex job is failing, but the KServe one has been fixed.

@juliusvonkohout (Member)

/lgtm
/approve

@google-oss-prow (bot)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droctothorpe, juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@juliusvonkohout (Member)

Thank you for the PR.

@google-oss-prow google-oss-prow bot merged commit 2368901 into kubeflow:master Sep 12, 2025
30 of 31 checks passed