Validation Checklist
- Is this a Kubeflow issue?
- Are you posting in the right repository?
- Did you follow the installation guide (https://github.com/kubeflow/manifests?tab=readme-ov-file)?
- Is the issue report properly structured and detailed with version numbers?
- Is this for Kubeflow development?
- Would you like to work on this issue?
- You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is #kubeflow-platform.
Version
master
Describe your issue
Environment
- Kubeflow: v1.9.0-rc.2
- Kubernetes: v1.27.4
- Platform: On-premise Kubernetes cluster
- Kubeflow Pipelines (KFP): v2.2.0
- KFP SDK: v2.8.0
- OS: Ubuntu 22.04
- Deployment: Using Kubeflow Manifests (without any specific distribution) from the master branch. The v1.9.0-rc.2 version was deployed (each deployed component version was manually verified against the release notes).
Description
Kubeflow was upgraded from v1.8.1 stable to v1.9.0-rc.2 (since no stable release is available yet for v1.9) as a clean redeployment, in order to use the latest KFP v2.2.0. When a pipeline was run via the UI, it resulted in the following error:
Cannot get MLMD objects from Metadata store. Cannot find context with {"typeName":"system.PipelineRun" "contextName":"cc1bbc51-426f-4192-843a-bf4b94535a5b"}: Cannot find specified context
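For reference, the `contextName` embedded in that error is the KFP run UUID, which MLMD fails to resolve as a `system.PipelineRun` context. A small sketch (the helper name `parse_mlmd_error` is hypothetical, not part of KFP) to pull both lookup keys out of the error text:

```python
import re

# Hypothetical helper: extract the MLMD lookup keys (typeName / contextName)
# from the KFP UI error message. The contextName is the KFP run UUID that
# MLMD cannot find as a system.PipelineRun context.
def parse_mlmd_error(message: str) -> dict:
    type_name = re.search(r'"typeName":"([^"]+)"', message)
    context_name = re.search(r'"contextName":"([^"]+)"', message)
    return {
        "typeName": type_name.group(1) if type_name else None,
        "contextName": context_name.group(1) if context_name else None,
    }

err = ('Cannot get MLMD objects from Metadata store. Cannot find context with '
       '{"typeName":"system.PipelineRun" '
       '"contextName":"cc1bbc51-426f-4192-843a-bf4b94535a5b"}: '
       'Cannot find specified context')
print(parse_mlmd_error(err))
```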
The same pipeline executed successfully without any issues or errors on the previous stable Kubeflow v1.8.1. As a sanity check, a sample pipeline from the documentation and an existing tutorial pipeline available within the KFP UI were also run; both resulted in the same error.
Upon inspection of the embedded MySQL pod, the pipeline and run records exist in the mlpipeline database as follows:
pipelines table
mysql> USE mlpipeline;
mysql> SELECT uuid, name, status from pipelines;
| uuid | name | status |
|--------------------------------------|-------------------------------------|--------|
| 645a4823-8e01-432b-b6b1-75776d14c805 | [Tutorial] DSL - Control structures | READY |
run_details table
mysql> SELECT uuid, displayname, pipelinecontextid, pipelineid, conditions from run_details;
| uuid | displayname | pipelinecontextid | pipelineid | conditions |
|--------------------------------------|----------------------------------------------------|-------------------|--------------------------------------|------------|
| cc1bbc51-426f-4192-843a-bf4b94535a5b | Run of [Tutorial] DSL - Control structures (be550) | 0 | 645a4823-8e01-432b-b6b1-75776d14c805 | Failed |
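To check the other side of the lookup, one can query the metadb database for the corresponding MLMD context. A minimal sketch (the helper name `mlmd_context_query` is hypothetical; the `Context`/`Type` table and `type_id` column names follow the MLMD MySQL schema) that builds the query to run manually inside the MySQL pod:

```python
# Hypothetical helper: build the SQL used to check whether MLMD recorded a
# system.PipelineRun context for a given KFP run UUID. Run the resulting
# query against the metadb database inside the MySQL pod.
def mlmd_context_query(run_uuid: str) -> str:
    return (
        "SELECT c.id, c.name, t.name AS type_name "
        "FROM Context c JOIN Type t ON c.type_id = t.id "
        "WHERE t.name = 'system.PipelineRun' "
        f"AND c.name = '{run_uuid}';"
    )

print(mlmd_context_query("cc1bbc51-426f-4192-843a-bf4b94535a5b"))
```

In the failing state described below, this query returns no rows for the run UUID.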
However, the execution-run context for the same pipeline was not created or referenced in the metadb database. Analysis of the pods in the kubeflow namespace showed that the ml-pipeline-api-server container within the ml-pipeline pod uses the mlpipeline database as backend storage for the pipeline component, while the metadata-controller pod uses the metadb database as backend storage for the MLMD store. It appears that the MLMD store in metadb cannot find or access the pipeline-run context corresponding to the record in the mlpipeline database. The connection to the MySQL pod is healthy, and the respective PVC is mounted, available, and accessible.
Note: The above description applies to any pipeline.
Expected Behavior
Pipeline run should succeed without any issues.
Current Behavior
When a pipeline run is triggered from the UI, a system-dag-driver pod is created in the Kubeflow user namespace and runs to completion successfully. The KFP execution pod for the pipeline components is then created and fails immediately, resulting in the above error.
Steps to reproduce the issue
- Install Kubeflow v1.9.0-rc.2 using Kubeflow Manifests.
- Copy the pipeline code, or use the already existing tutorial pipeline from the UI, and create a run from it.
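For the install step, the deployment followed the single-command install from the kubeflow/manifests README (the retry loop is as published there; adjust the checkout path for your environment):

```shell
# Clone the manifests at the release-candidate tag and apply everything with
# kustomize, retrying until CRDs and webhooks settle (per the manifests README).
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.0-rc.2
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 20
done
```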
Additional Context
v1.9.0-rc.2's release notes state that it supports Kubernetes v1.27 - v1.29, but the README at this RC's release tag states that it targets Kubernetes v1.29+, which is a little confusing. My questions are:
- What are the supported Kubernetes versions for KF v1.9?
- Is the above issue a known bug in this RC version which will be patched in v1.9 stable release?
- Is anyone else impacted by this issue or are there any solutions available?
Related Issues
- Cannot get MLMD objects from Metadata store when running v2 pipeline #8733
- Error running pipelines with pod labels or annotation in pipeline steps added using kfp-kubernetes #10868
Put here any screenshots or videos (optional)
No response