[backend] Cannot get MLMD objects from Metadata store. Cannot find context (Version 1.9.0-rc.2) #2800

@Jithsaavvy

Version

master

Describe your issue

Environment

  • Kubeflow: v1.9.0-rc.2
  • Kubernetes: v1.27.4
  • Platform: On-premise Kubernetes cluster
  • Kubeflow Pipelines (KFP) - v2.2.0
  • KFP SDK - v2.8.0
  • OS: Ubuntu 22.04
  • Deployment:
    • Using Kubeflow Manifests (without any specific distribution) from the master branch.
    • v1.9.0-rc.2 version was deployed (manually checked with every deployed component versions using release notes).

Description

Kubeflow was upgraded from v1.8.1 (stable) to v1.9.0-rc.2 as a clean redeployment, in order to use the latest KFP v2.2.0 (no stable v1.9 release is available yet). When I attempted to run a pipeline via the UI, it failed with the following error:

Cannot get MLMD objects from Metadata store. Cannot find context with {"typeName":"system.PipelineRun" "contextName":"cc1bbc51-426f-4192-843a-bf4b94535a5b"}: Cannot find specified context
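
The `contextName` in this error is the run's UUID (it matches the `uuid` column of the `run_details` row shown further down), so it can be pulled out of the message and used as a lookup key when checking the databases. A minimal sketch, assuming the message keeps the shape quoted above; the helper name `extract_context` is hypothetical and not part of KFP or MLMD:

```python
import re

ERROR = (
    'Cannot get MLMD objects from Metadata store. Cannot find context with '
    '{"typeName":"system.PipelineRun" "contextName":"cc1bbc51-426f-4192-843a-bf4b94535a5b"}: '
    'Cannot find specified context'
)

# A regex is used instead of json.loads because the quoted payload is not
# guaranteed to be valid JSON (the message above has no comma between fields).
_FIELD = re.compile(r'"(typeName|contextName)"\s*:\s*"([^"]*)"')


def extract_context(message: str) -> dict:
    """Return the typeName/contextName pairs found in an MLMD error message."""
    return {key: value for key, value in _FIELD.findall(message)}


info = extract_context(ERROR)
print(info["typeName"], info["contextName"])
```

The extracted `contextName` is the value to search for in both the `mlpipeline` and `metadb` databases.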

The same pipeline ran successfully on the previous stable release, Kubeflow v1.8.1. As a sanity check, I also ran a sample pipeline from the documentation and a tutorial pipeline that ships with the KFP UI; both failed with the same error.

Inspecting the embedded MySQL pod shows that the pipeline and its run were recorded in the mlpipeline database, as follows:

pipelines table

mysql> USE mlpipeline;
mysql> SELECT uuid, name, status from pipelines;
| uuid                                 | name                                | status |
|--------------------------------------|-------------------------------------|--------|
| 645a4823-8e01-432b-b6b1-75776d14c805 | [Tutorial] DSL - Control structures | READY  |

run_details table

mysql> SELECT uuid, displayname, pipelinecontextid, pipelineid, conditions from run_details;

| uuid                                 | displayname                                        | pipelinecontextid | pipelineid                           | conditions |
|--------------------------------------|----------------------------------------------------|-------------------|--------------------------------------|------------|
| cc1bbc51-426f-4192-843a-bf4b94535a5b | Run of [Tutorial] DSL - Control structures (be550) | 0                 | 645a4823-8e01-432b-b6b1-75776d14c805 | Failed     |

However, the run context for the same pipeline was not created or referenced in the metadb database. Inspecting the pods in the kubeflow namespace shows that the ml-pipeline-api-server container in the ml-pipeline pod uses the mlpipeline database as backend storage for pipelines, while the metadata-controller pod uses the metadb database as backend storage for the MLMD store. It appears that the MLMD side cannot find the context corresponding to the run recorded in mlpipeline, or something similar. The connection to the MySQL pod is stable, and the corresponding PVC is mounted, available, and accessible.
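
The cross-check described above amounts to this: for every row in `run_details`, look for a context of type `system.PipelineRun` whose name is the run's `uuid`. The sketch below reproduces that logic against an in-memory SQLite database so it can be run anywhere. Several things here are assumptions: SQLite stands in for the MySQL pod, the `Type` and `Context` tables are a simplified guess at the MLMD schema (only the columns needed for the join), and `runs_missing_context` is a hypothetical helper.

```python
import sqlite3

RUN_ID = "cc1bbc51-426f-4192-843a-bf4b94535a5b"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for mlpipeline.run_details (columns trimmed).
    CREATE TABLE run_details (uuid TEXT PRIMARY KEY, displayname TEXT, conditions TEXT);
    -- Simplified, assumed stand-ins for the MLMD tables in metadb.
    CREATE TABLE Type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Context (id INTEGER PRIMARY KEY, type_id INTEGER, name TEXT);
    """
)
conn.execute(
    "INSERT INTO run_details VALUES (?, ?, ?)",
    (RUN_ID, "Run of [Tutorial] DSL - Control structures (be550)", "Failed"),
)
conn.execute("INSERT INTO Type (id, name) VALUES (1, 'system.PipelineRun')")
# No Context row is inserted, mirroring the state reported in this issue.


def runs_missing_context(db):
    """Return run UUIDs that have no matching system.PipelineRun context."""
    rows = db.execute(
        """
        SELECT r.uuid
        FROM run_details AS r
        LEFT JOIN (
            SELECT c.name
            FROM Context AS c
            JOIN Type AS t ON t.id = c.type_id
            WHERE t.name = 'system.PipelineRun'
        ) AS ctx ON ctx.name = r.uuid
        WHERE ctx.name IS NULL
        ORDER BY r.uuid
        """
    ).fetchall()
    return [uuid for (uuid,) in rows]


print(runs_missing_context(conn))
```

In the real deployment the two sides live in different databases (`mlpipeline` and `metadb`), so the same check would need database-qualified table names or two separate queries.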

Note: The above description applies to any pipeline.

Expected Behavior

Pipeline run should succeed without any issues.

Current Behavior

When a pipeline run is triggered from the UI, a system-dag-driver pod is created in the KF user namespace and runs to completion successfully. After that, the KFP execution pod for the pipeline component is created and fails immediately with the above error.

Steps to reproduce the issue

  1. Install Kubeflow v1.9.0-rc.2 using Kubeflow Manifests.
  2. Copy the pipeline code or use the already existing tutorial pipeline from the UI and create a run from it.

Additional Context

The v1.9.0-rc.2 release notes state that it supports Kubernetes v1.27 - 1.29. However, the README at this RC's release tag states that it targets Kubernetes v1.29+, which is a little confusing. My questions are:

  1. What are the supported Kubernetes versions for KF v1.9?
  2. Is the above issue a known bug in this RC version which will be patched in v1.9 stable release?
  3. Is anyone else impacted by this issue or are there any solutions available?
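
To make the first question concrete, the small sketch below checks the cluster version from the Environment section (v1.27.4) against both statements: the release-notes range (v1.27 - 1.29) and the README target (v1.29+). The function names are hypothetical, and the comparison only looks at the major and minor version numbers.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v1.27.4' or '1.29' into a (major, minor) tuple."""
    parts = text.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


CLUSTER = "v1.27.4"  # Kubernetes version from the Environment section


def in_release_notes_range(version: str) -> bool:
    """True if the version falls in v1.27 - 1.29 (per the release notes)."""
    return (1, 27) <= parse_version(version) <= (1, 29)


def meets_readme_target(version: str) -> bool:
    """True if the version is v1.29 or newer (per the README)."""
    return parse_version(version) >= (1, 29)


print(in_release_notes_range(CLUSTER), meets_readme_target(CLUSTER))
```

For this cluster the two checks disagree: it is inside the release-notes range but below the README target, which is why the answer matters for this report.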

Related Issues

  1. Cannot get MLMD objects from Metadata store when running v2 pipeline #8733
  2. Error running pipelines with pod labels or annotation in pipeline steps added using kfp-kubernetes #10868
