Adding out of the box support to TrainJob #2560

ram4444 · 2025-07-26T03:43:44Z

What this PR does / why we need it:

Adds out-of-the-box support for TrainJob and provides an example.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Docs included if any changes are user facing

andreyvelich

Thanks for this effort @ram4444!

Please also update RBAC: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/rbac.yaml
Primary pod labels, as described here (https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template): https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go#L117
Success and Failure conditions: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go#L110

/assign @kubeflow/kubeflow-trainer-team @astefanutti @franciscojavierarceo @helenxie-bit

andreyvelich · 2025-07-26T15:53:15Z

examples/v1beta1/kubeflow-training-operator/trainjob-pytorch.yaml

Can you move this example to examples/v1beta1/kubeflow-trainer/trainjob-pytorch.yaml?

andreyvelich · 2025-07-26T15:55:50Z

examples/v1beta1/kubeflow-training-operator/trainjob-pytorch.yaml

+    #    min: 1
+    #    max: 5
+  trialTemplate:
+    primaryContainerName: pytorch


This should be node, right?

I don't quite understand your question.
This is copied from pytorchjob-mnist.yaml, and I kept it the same.

The container name for torch-distributed runtime is node: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/torch_distributed.yaml#L22

You can read about API description for trialParameters in this doc: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#configure-trial-template-specification

I have done it. Please state somewhere in the docs that this field should refer to the container name in the ClusterTrainingRuntime, since I thought it was a custom container name defined by the user.

andreyvelich · 2025-07-26T15:56:15Z

examples/v1beta1/kubeflow-training-operator/trainjob-pytorch.yaml

+      - name: arg
+        description: An additional argument for the training model
+        reference: arg


This can be removed

OK, I will remove it.

Should I create another PR or put everything on top of these 2 commits?

BTW Do you have any docker image for pytorch-deepspeed_train_t5 so that I can create another example?

Should I create another PR or put everything on top of these 2 commits?

You can make the appropriate changes in this PR.

BTW Do you have any docker image for pytorch-deepspeed_train_t5 so that I can create another example?

We don't have a Docker image, since we create it using the Kubeflow SDK: https://github.com/kubeflow/trainer/blob/master/examples/deepspeed/text-summarization/T5-Fine-Tuning.ipynb

andreyvelich · 2025-07-26T16:00:18Z

cc @kramaranya @szaher

juliusvonkohout · 2025-07-30T07:18:38Z

/ok-to-test

Once that is in, I can merge kubeflow/manifests#3199.

andreyvelich · 2025-08-04T22:15:42Z

@ram4444 Please can you update the controller code according to this: #2560 (review) ?

ram4444 · 2025-08-05T20:41:29Z

@ram4444 Please can you update the controller code according to this: #2560 (review) ?

Hi,

Is it simply adding entry to
pkg/apis/controller/experiments/v1beta1/constants.go

KubeflowJobKinds = map[string]bool{
	"TFJob":      true,
	"PyTorchJob": true,
	"XGBoostJob": true,
	"MPIJob":     true,
	"TrainJob":   true,
}

?

andreyvelich · 2025-08-05T22:13:09Z

@ram4444 Please can you update the controller code according to this: #2560 (review) ?

Hi,

Is it simply adding entry to pkg/apis/controller/experiments/v1beta1/constants.go
KubeflowJobKinds = map[string]bool{
	"TFJob":      true,
	"PyTorchJob": true,
	"XGBoostJob": true,
	"MPIJob":     true,
	"TrainJob":   true,
}
?

No, you have to update other places as I mentioned in this comment: #2560 (review)

ram4444 · 2025-08-06T16:50:16Z

@ram4444 Please can you update the controller code according to this: #2560 (review) ?

Hi,
Is it simply adding entry to pkg/apis/controller/experiments/v1beta1/constants.go
KubeflowJobKinds = map[string]bool{
	"TFJob":      true,
	"PyTorchJob": true,
	"XGBoostJob": true,
	"MPIJob":     true,
	"TrainJob":   true,
}
?
No, you have to update other places as I mentioned in this comment: #2560 (review)

I still don't have a clear idea of it. Could you explain more?

andreyvelich · 2025-08-06T20:18:15Z

I still don't have a clear idea of it. Could you explain more?

Could you read this doc which explains how CRDs within Katib Trial work: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template ?
We should update the default values for the Success and Failure conditions and for
PrimaryPodLabels, which identify the master training pod.
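
To illustrate what PrimaryPodLabels is for: the labels select which pod of a multi-pod job is treated as the primary one. A minimal Go sketch of such a label-subset check follows. The helper name `isPrimaryPod` and the sample labels are assumptions made for illustration; this is not code copied from the Katib repository.

```go
package main

import "fmt"

// isPrimaryPod reports whether podLabels contains every key/value pair
// listed in primaryLabels. Hypothetical helper, for illustration only.
// Note: an empty primaryLabels map matches every pod.
func isPrimaryPod(podLabels, primaryLabels map[string]string) bool {
	for key, want := range primaryLabels {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Made-up label, standing in for whatever the job controller sets.
	primary := map[string]string{"example.com/role": "master"}

	master := map[string]string{"example.com/role": "master", "app": "train"}
	worker := map[string]string{"example.com/role": "worker", "app": "train"}

	fmt.Println(isPrimaryPod(master, primary)) // true
	fmt.Println(isPrimaryPod(worker, primary)) // false
}
```

If the default labels do not match any pod created by the job, no pod is recognized as primary, which is why the defaults have to be chosen per job kind.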

ram4444 · 2025-08-06T20:52:18Z

I have gone through the code and the doc, but I don't understand what should be added or changed at the specified lines (110 and 117).

Should I add an else branch for TrainJob (though I am not sure what the default SuccessCondition, FailureCondition, and PrimaryPodLabels should be)?

func (e *Experiment) setDefaultTrialTemplate() {
	t := e.Spec.TrialTemplate

	// Set default values for Job and Kubeflow Training Job if TrialSpec is not nil
	if t != nil && t.TrialSource.TrialSpec != nil {
		jobKind := t.TrialSource.TrialSpec.GetKind()
		if jobKind == consts.JobKindJob {
			if t.SuccessCondition == "" {
				t.SuccessCondition = DefaultJobSuccessCondition
			}
			if t.FailureCondition == "" {
				t.FailureCondition = DefaultJobFailureCondition
			}
		} else if _, ok := KubeflowJobKinds[jobKind]; ok {
			if t.SuccessCondition == "" {
				t.SuccessCondition = DefaultKubeflowJobSuccessCondition
			}
			if t.FailureCondition == "" {
				t.FailureCondition = DefaultKubeflowJobFailureCondition
			}
			// For Kubeflow Job also set default PrimaryPodLabels
			if len(t.PrimaryPodLabels) == 0 {
				t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
			}
		} else if jobKind == "TrainJob" {
			if t.SuccessCondition == "" {
				// t.SuccessCondition = DefaultKubeflowJobSuccessCondition
				// A different default value for the success condition
			}
			if t.FailureCondition == "" {
				// t.FailureCondition = DefaultKubeflowJobFailureCondition
				// A different default value for the failure condition
			}
			if t.PrimaryPodLabels == nil {
				// t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
				// A different default value for PrimaryPodLabels
			}
		}
	}
	e.Spec.TrialTemplate = t
}

andreyvelich · 2025-08-07T00:56:38Z

DefaultKubeflowJobSuccessCondition

Here are the values that we should use for TrainJob:

	DefaultTrainJobSuccessCondition = "status.conditions.#(type==\"Complete\")#|#(status==\"True\")#"
	DefaultTrainJobFailureCondition = "status.conditions.#(type==\"Failed\")#|#(status==\"True\")#"
	DefaultTrainJobPrimaryPodLabels = map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node"}
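
For readers unfamiliar with the condition syntax: these strings are GJSON-style paths that ask whether the resource has a condition of the given type whose status is "True". The Go sketch below expresses the same check with only `encoding/json`, to make the intent concrete. It does not use the GJSON library or Katib's own evaluation code; `hasTrueCondition` and the sample JSON are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the two fields of a status condition that the
// success/failure expressions look at.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// object captures only status.conditions from a resource's JSON form.
type object struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// hasTrueCondition reports whether raw contains a condition with the given
// type and status "True". Hypothetical helper, for illustration only.
func hasTrueCondition(raw []byte, condType string) (bool, error) {
	var obj object
	if err := json.Unmarshal(raw, &obj); err != nil {
		return false, err
	}
	for _, c := range obj.Status.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample status shaped like a completed job.
	raw := []byte(`{"status":{"conditions":[{"type":"Complete","status":"True","reason":"AllJobsCompleted"}]}}`)

	succeeded, err := hasTrueCondition(raw, "Complete")
	fmt.Println(succeeded, err) // true <nil>

	failed, _ := hasTrueCondition(raw, "Failed")
	fmt.Println(failed) // false
}
```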

ram4444 · 2025-08-07T01:57:03Z

Please consider whether we should put them in const.go to be consistent (or I could proceed as mentioned).

andreyvelich · 2025-08-07T15:14:06Z

Please consider whether we should put them in const.go to be consistent (or I could proceed as mentioned).

Yes, please add them into constants.go

andreyvelich

@ram4444 Would you be able to verify that this integration works on your local Kind cluster?
Since we don't have E2Es for TrainJob, it would be nice to verify it.
cc @Electronic-Waste @kubeflow/kubeflow-trainer-team @astefanutti

andreyvelich · 2025-08-08T23:10:09Z

pkg/apis/controller/experiments/v1beta1/experiment_defaults.go

 				t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
 			}
-		}
+		} else if jobKind == "TrainJob" {


Could you add TrainJob to the KubeflowJobKinds list as well, please?

The function would then look something like this:

func (e *Experiment) setDefaultTrialTemplate() {
	t := e.Spec.TrialTemplate

	// Set default values for Job and Kubeflow Training Job if TrialSpec is not nil
	if t != nil && t.TrialSource.TrialSpec != nil {
		jobKind := t.TrialSource.TrialSpec.GetKind()
		if jobKind == consts.JobKindJob {
			if t.SuccessCondition == "" {
				t.SuccessCondition = DefaultJobSuccessCondition
			}
			if t.FailureCondition == "" {
				t.FailureCondition = DefaultJobFailureCondition
			}
		} else if _, ok := KubeflowJobKinds[jobKind]; ok {
			if t.SuccessCondition == "" {
				if jobKind == "TrainJob" {
					t.SuccessCondition = DefaultTrainJobSuccessCondition
				} else {
					t.SuccessCondition = DefaultKubeflowJobSuccessCondition
				}
			}
			if t.FailureCondition == "" {
				if jobKind == "TrainJob" {
					t.FailureCondition = DefaultTrainJobFailureCondition
				} else {
					t.FailureCondition = DefaultKubeflowJobFailureCondition
				}
			}
			// For Kubeflow Job also set default PrimaryPodLabels
			if len(t.PrimaryPodLabels) == 0 {
				if jobKind == "TrainJob" {
					t.PrimaryPodLabels = DefaultTrainJobPrimaryPodLabels
				} else {
					t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
				}
			}
		}
	}
	e.Spec.TrialTemplate = t
}
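
One possible alternative to branching on jobKind inside every if is to keep the per-kind defaults in a lookup table. The sketch below is self-contained and uses simplified stand-in types: `trialTemplate`, `kindDefaults`, `defaultsByKind`, and `applyDefaults` are hypothetical names, not Katib's API. Only the TrainJob values come from this thread; the other kinds are omitted rather than guessed.

```go
package main

import "fmt"

// trialTemplate is a simplified stand-in for the fields being defaulted.
type trialTemplate struct {
	SuccessCondition string
	FailureCondition string
	PrimaryPodLabels map[string]string
}

// kindDefaults groups the default values for one job kind.
type kindDefaults struct {
	success          string
	failure          string
	primaryPodLabels map[string]string
}

// defaultsByKind holds only the TrainJob values quoted in this thread.
var defaultsByKind = map[string]kindDefaults{
	"TrainJob": {
		success:          `status.conditions.#(type=="Complete")#|#(status=="True")#`,
		failure:          `status.conditions.#(type=="Failed")#|#(status=="True")#`,
		primaryPodLabels: map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node"},
	},
}

// applyDefaults fills in unset fields of tmpl for the given kind and
// leaves user-provided values untouched. Unknown kinds are a no-op.
func applyDefaults(kind string, tmpl *trialTemplate) {
	d, ok := defaultsByKind[kind]
	if tmpl == nil || !ok {
		return
	}
	if tmpl.SuccessCondition == "" {
		tmpl.SuccessCondition = d.success
	}
	if tmpl.FailureCondition == "" {
		tmpl.FailureCondition = d.failure
	}
	if len(tmpl.PrimaryPodLabels) == 0 {
		// Copy the map so callers cannot mutate the shared defaults.
		tmpl.PrimaryPodLabels = make(map[string]string, len(d.primaryPodLabels))
		for k, v := range d.primaryPodLabels {
			tmpl.PrimaryPodLabels[k] = v
		}
	}
}

func main() {
	tmpl := &trialTemplate{FailureCondition: "custom"}
	applyDefaults("TrainJob", tmpl)
	fmt.Println(tmpl.SuccessCondition)
	fmt.Println(tmpl.FailureCondition) // custom
	fmt.Println(tmpl.PrimaryPodLabels)
}
```

With this shape, supporting another kind means adding one table entry instead of another branch in each if block.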

To test it on my local k8s cluster: since the Katib controller is installed by pulling an image via kustomize,
please let me know the command to build the image from scratch.

images:
  - name: ghcr.io/kubeflow/katib/katib-controller
    newName: ghcr.io/kubeflow/katib/katib-controller
    newTag: v0.18.0

You can build the controller as follows:

docker build . -f cmd/katib-controller/v1beta1/Dockerfile -t <MY_IMAGE>

ram4444 · 2025-08-12T23:56:54Z

@ram4444 Would you be able to verify that this integration works on your local Kind cluster?

Since we don't have E2Es for TrainJob, it would be nice to verify it.

cc @Electronic-Waste @kubeflow/kubeflow-trainer-team @astefanutti

Hi,

I am sorry to inform you that, due to a hardware issue (my 10-year-old homelab, the only server capable of running the whole of Kubeflow, is down), I am not able to test it at the moment.😩

I could commit my latest code to my own repo first. Please let me know if it is ok to proceed.

Ram

andreyvelich · 2025-08-13T01:52:51Z

I am sorry to inform you that, due to a hardware issue (my 10-year-old homelab, the only server capable of running the whole of Kubeflow, is down), I am not able to test it at the moment.

Sure, no problem, I can try to deploy it from my machine. Please push your latest changes.

Btw, you don't need to deploy the entire Kubeflow Platform, you can just deploy Katib + Trainer control plane to verify it.

juliusvonkohout · 2025-08-13T10:26:21Z

@ram4444 Would you be able to verify that this integration works on your local Kind cluster?
Since we don't have E2Es for TrainJob, it would be nice to verify it.
cc @Electronic-Waste @kubeflow/kubeflow-trainer-team @astefanutti

Hi,

I am sorry to inform you that, due to a hardware issue (my 10-year-old homelab, the only server capable of running the whole of Kubeflow, is down), I am not able to test it at the moment.😩

I could commit my latest code to my own repo first. Please let me know if it is ok to proceed.

Ram

You can run it with 4 GB or so. Just remove what you do not need. See https://github.com/kubeflow/manifests#kubeflow-components-versions

ram4444 · 2025-08-13T20:03:26Z

@andreyvelich @juliusvonkohout

I have committed the latest changes to my repo and built the Katib controller image, which is in my Docker Hub repo.

Deployment command:
kustomize build applications/katib/upstream/installs/katib-with-kubeflow | kubectl apply -f -

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  - ../katib-cert-manager
  # Kubeflow Katib components.
  - kubeflow-katib-roles.yaml
  - ui-virtual-service.yaml
  - istio-authorizationpolicy.yaml
images:
  #- name: ghcr.io/kubeflow/katib/katib-controller
  #  newName: ghcr.io/kubeflow/katib/katib-controller
  #  newTag: v0.18.0
  - name: dionysbiz/katib-controller
    newName: dionysbiz/katib-controller
    newTag: latest
  - name: ghcr.io/kubeflow/katib/katib-db-manager
    newName: ghcr.io/kubeflow/katib/katib-db-manager
    newTag: v0.18.0
  - name: ghcr.io/kubeflow/katib/katib-ui
    newName: ghcr.io/kubeflow/katib/katib-ui
    newTag: v0.18.0

The training operator produces the following error log:

{"level":"info","ts":"2025-08-13T20:05:37.062503964Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:198","msg":"Probe endpoints are configured on healthz and readyz"}
{"level":"error","ts":"2025-08-13T20:05:37.064450881Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:145","msg":"Could not initialize runtimes","error":"initializing runtime \"TrainingRuntime.trainer.kubeflow.org\": setting index on TrainingRuntime for TrainJob: no matches for kind \"TrainJob\" in version \"trainer.kubeflow.org/v2alpha1\"","stacktrace":"main.main\n\t/workspace/cmd/training-operator.v2alpha1/main.go:145\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"}
stream closed EOF for kubeflow-system/training-operator-v2-7b9949cc86-cq2zm (manager)

ram4444 · 2025-08-21T02:28:29Z

Should I replace the image with my own latest build or just use the original ghcr one?

andreyvelich · 2025-08-21T02:30:56Z

Should I replace the image with my own latest build or just use the original ghcr one?

Yes, please use your image: dionysbiz/katib-controller:latest

ram4444 · 2025-08-21T03:23:48Z

TrainJob

Name:         torch-distributed-example-526stpjk
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  trainer.kubeflow.org/v1alpha1
Kind:         TrainJob
Metadata:
  Creation Timestamp:  2025-08-21T02:48:55Z
  Generation:          1
  Owner References:
    API Version:           kubeflow.org/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Trial
    Name:                  torch-distributed-example-526stpjk
    UID:                   e2370d25-c5b1-4352-8f66-256b0350ff64
  Resource Version:        6966
  UID:                     8e5e65d5-89ad-4aa1-96f8-83140ac504ad
Spec:
  Managed By:  trainer.kubeflow.org/trainjob-controller
  Runtime Ref:
    API Group:  trainer.kubeflow.org
    Kind:       ClusterTrainingRuntime
    Name:       torch-distributed
  Suspend:      false
  Trainer:
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
      --lr=0.031100969364658455
      --momentum=0.5426043366290967
    Image:      ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest
    Num Nodes:  2
Status:
  Conditions:
    Last Transition Time:  2025-08-21T03:08:46Z
    Message:               jobset completed successfully
    Reason:                AllJobsCompleted
    Status:                True
    Type:                  Complete
Events:                    <none>

I still cannot find the metrics collector, and the Katib controller log shows the same error after the job has finished.
In addition, I cannot list the Experiment CRDs.
Please also note that the TrainJob is using ClusterTrainingRuntime, whereas it was TrainingRuntime before.

andreyvelich · 2025-08-21T10:36:31Z

I still cannot find the metrics collector, and the Katib controller log shows the same error after the job has finished.

Can you show the full log from the Katib controller?

In addition, I cannot list the Experiment CRDs.

What do you mean? I can see that the TrainJob you showed is owned by a Katib Trial, so the Experiment should be there.

ram4444 · 2025-08-21T22:06:22Z

I've just re-run the experiment, and here is the log from when it starts:

{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"Creating Suggestion","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"Suggestion created","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example"}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"Creating Suggestion","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"suggestion-controller","msg":"Creating Service","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example-random"}
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"CreateSuggestion failed","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"instance":"torch-distributed-example","error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/suggestion.(*General).GetOrCreateSuggestion\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/suggestion/suggestion.go:61\ngithub.com/ku
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion failed","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3,"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileSuggestions\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Get suggestions error","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).createTrials\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:350\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*Reconc
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Create trials error","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileTrials\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:334\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*Recon
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile experiment error","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controlle
{"level":"error","ts":"2025-08-21T21:37:50Z","msg":"Reconciler error","controller":"experiment-controller","object":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example","reconcileID":"9df30035-7a77-48ee-954c-f194366e27c8","error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/c
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"suggestion-controller","msg":"Creating Deployment","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example-random"}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"err":"Operation cannot be fulfilled on suggestions.kubeflow.org \"torch-distributed-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-client","msg":"Algorithm settings are validated","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-controller","msg":"Sync assignments","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"Suggestion Requests":3,"Suggestion Count":0}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-client","msg":"Getting suggestions","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"endpoint":"torch-distributed-example-random.kubeflow:6789","Number of current request parameters":3,"Number of response parameters":3}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-controller","msg":"Sync assignments","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"Created Trials","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"trialNames":["torch-distributed-example-x4x9j58j","torch-distributed-example-wqktfdwk","torch-distributed-example-jtgz2bhd"]}
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"},"kind":"TrainJob","name":"torch-distributed-example-x4x9j58j"}                                                                                                                                                                                                                                                                                                                     │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Trial status changed to Running","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}                                                                                                                                                                                                                                                                                                                                                                │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"},"kind":"TrainJob","name":"torch-distributed-example-wqktfdwk"}                                                                                                                                                                                                                                                                                                                     │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Trial status changed to Running","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}                                                                                                                                                                                                                                                                                                                                                                │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"},"kind":"TrainJob","name":"torch-distributed-example-jtgz2bhd"}                                                                                                                                                                                                                                                                                                                     │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Trial status changed to Running","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
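
Controller logs in this JSON-lines format get long quickly, so it can help to tally them per trial rather than read them line by line. The snippet below is only a rough sketch: `count_messages` and `sample` are hypothetical names made up for illustration (not part of Katib), and it assumes every log line is a single JSON object with an optional `Trial.name` field, possibly wrapped in `│` borders from a terminal UI.

```python
import json
from collections import Counter


def count_messages(lines):
    """Count (trial name, msg) pairs in JSON-formatted controller log lines.

    Hypothetical helper for illustration; not part of Katib.
    """
    counts = Counter()
    for raw in lines:
        # Drop surrounding whitespace and terminal box borders, if any.
        line = raw.strip().strip("│").strip()
        if not line.startswith("{"):
            continue  # skip blank lines and free-form text
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        trial = entry.get("Trial", {}).get("name")
        if trial is not None:
            counts[(trial, entry.get("msg"))] += 1
    return counts


# A few lines shaped like the controller output above (fields shortened).
sample = [
    '│ {"level":"info","logger":"suggestion-controller","msg":"Sync assignments","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"}} │',
    '│ {"level":"info","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}} │',
    '{"level":"info","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}',
    '{"level":"info","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}',
]

for (trial, msg), n in sorted(count_messages(sample).items()):
    print(f"{n:4d}  {trial}  {msg}")
```

Lines without a `Trial` field (suggestion and experiment controller entries) are skipped on purpose, so the tally only reflects trial-controller activity.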

Towards the end of the run, the trial-controller keeps logging the following message for all three trials:

{"level":"info","ts":"2025-08-21T22:01:16Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:17Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:17Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:17Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:18Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:18Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:18Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:19Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:19Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:19Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:20Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:20Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:20Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:21Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:21Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:21Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:22Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:22Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:22Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:23Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:23Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:23Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:24Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:24Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:24Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:25Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:25Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:25Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:26Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:26Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:26Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:27Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:27Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:27Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:28Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:28Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:28Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:29Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:29Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:29Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:30Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:30Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:30Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:31Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:31Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:31Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:32Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:32Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:32Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:33Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:33Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:33Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:34Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:34Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:34Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:35Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:35Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:35Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:36Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:36Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:36Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:37Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:37Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:37Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:38Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:38Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:38Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:39Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:39Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:39Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:40Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:40Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:40Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:41Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:41Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:41Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:42Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:42Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:42Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:43Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:43Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:43Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:44Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:44Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:44Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:45Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:45Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:45Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:46Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:46Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:46Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:47Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:47Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:47Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:48Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:48Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:48Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:49Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:49Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:49Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:50Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:50Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:50Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:51Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:51Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:51Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:52Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:52Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:53Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:54Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:54Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:54Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:55Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:55Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:55Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:56Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:56Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:56Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:57Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:57Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:57Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:58Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:58Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:58Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:59Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:59Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:59Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}

andreyvelich · 2025-08-21T23:45:18Z

We should also add batch.kubernetes.io/job-completion-index: 0 to the primaryPodLabels const.

ram4444 · 2025-08-22T00:00:52Z

So constant.go should be updated to:

DefaultTrainJobPrimaryPodLabels = map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node", "batch.kubernetes.io/job-completion-index": "0"}

I rebuilt the image with the added label but still got the same result:

Name:             torch-distributed-example-c599nw8t-node-0-0-hc855
Namespace:        kubeflow
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.19.0.2
Start Time:       Fri, 22 Aug 2025 02:10:31 +0100
Labels:           batch.kubernetes.io/controller-uid=e71dfdf7-2119-4a57-ac37-690db8bb23fb
                  batch.kubernetes.io/job-completion-index=0
                  batch.kubernetes.io/job-name=torch-distributed-example-c599nw8t-node-0
                  controller-uid=e71dfdf7-2119-4a57-ac37-690db8bb23fb
                  job-name=torch-distributed-example-c599nw8t-node-0
                  jobset.sigs.k8s.io/global-replicas=1
                  jobset.sigs.k8s.io/job-global-index=0
                  jobset.sigs.k8s.io/job-index=0
                  jobset.sigs.k8s.io/job-key=1e24e5259edfdec1fcb7914899d643c3a5abe72a
                  jobset.sigs.k8s.io/jobset-name=torch-distributed-example-c599nw8t
                  jobset.sigs.k8s.io/replicatedjob-name=node
                  jobset.sigs.k8s.io/replicatedjob-replicas=1
                  jobset.sigs.k8s.io/restart-attempt=0
Annotations:      batch.kubernetes.io/job-completion-index: 0
                  jobset.sigs.k8s.io/global-replicas: 1
                  jobset.sigs.k8s.io/job-global-index: 0
                  jobset.sigs.k8s.io/job-index: 0
                  jobset.sigs.k8s.io/job-key: 1e24e5259edfdec1fcb7914899d643c3a5abe72a
                  jobset.sigs.k8s.io/jobset-name: torch-distributed-example-c599nw8t
                  jobset.sigs.k8s.io/replicatedjob-name: node
                  jobset.sigs.k8s.io/replicatedjob-replicas: 1
                  jobset.sigs.k8s.io/restart-attempt: 0
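
For reference, the label comparison discussed above can be sketched as a plain subset match. This is only an illustration, assuming that a pod counts as primary when every primary pod label is present on it with the same value; `isPrimaryPod` is a hypothetical helper name, not a confirmed Katib function, and the maps are copied by hand from the constant and the pod description above.

```go
package main

import "fmt"

// isPrimaryPod reports whether every primary label is present on the pod
// with exactly the same value. Hypothetical helper, for illustration only.
func isPrimaryPod(podLabels, primaryLabels map[string]string) bool {
	for key, want := range primaryLabels {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Values taken from the proposed DefaultTrainJobPrimaryPodLabels constant.
	primary := map[string]string{
		"jobset.sigs.k8s.io/replicatedjob-name":    "node",
		"batch.kubernetes.io/job-completion-index": "0",
	}
	// A subset of the labels shown on the pod above.
	pod := map[string]string{
		"batch.kubernetes.io/job-completion-index": "0",
		"jobset.sigs.k8s.io/job-index":             "0",
		"jobset.sigs.k8s.io/replicatedjob-name":    "node",
	}
	fmt.Println("pod matches primary labels:", isPrimaryPod(pod, primary))
}
```

Under that assumption the labels on this pod already satisfy the constant, which suggests the missing metrics have a different cause than the label set.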

andreyvelich · 2025-08-22T22:44:00Z

Found an issue.
As described in this doc, you must give Katib controller permission to all nested resources that Trial creates: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template:~:text=Modify%20Katib%20controller%20ClusterRole%E2%80%99s%20rules%20with%20the%20new%20rule%20to%20give%20Katib%20access%20to%20all%20resources%20that%20are%20created%20by%20the%20Trial.%20To%20know%20more%20about%20ClusterRole%2C%20check%20the%20Kubernetes%20guide.

Katib needs these permissions to determine whether a given pod belongs to a Trial, which it checks here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/inject_webhook.go#L286-L290
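
To illustrate why a missing permission on a nested resource breaks this check, here is a minimal sketch that walks an ownership chain upward until it reaches a Trial. Everything in it is an assumption made for illustration: the `object` type, the `belongsToTrial` helper, the in-memory `store`, the `canGet` map standing in for RBAC, and the Job -> JobSet -> TrainJob -> Trial chain. It is not the code in inject_webhook.go.

```go
package main

import "fmt"

// object is a simplified stand-in for a Kubernetes object (hypothetical type).
type object struct {
	Kind  string
	Owner string // store key of the owning object, "" if there is none
}

// belongsToTrial follows owner links starting at key `start` and reports
// whether the chain reaches a Trial. canGet models RBAC: the kinds the
// controller is allowed to read. The loop is bounded by the store size so
// that a cyclic chain cannot spin forever.
func belongsToTrial(store map[string]object, canGet map[string]bool, start string) (bool, error) {
	key := start
	for depth := 0; key != "" && depth < len(store); depth++ {
		obj, ok := store[key]
		if !ok {
			return false, fmt.Errorf("object %q not found", key)
		}
		if !canGet[obj.Kind] {
			return false, fmt.Errorf("forbidden: cannot get %s %q", obj.Kind, key)
		}
		if obj.Kind == "Trial" {
			return true, nil
		}
		key = obj.Owner
	}
	return false, nil
}

func main() {
	// Assumed ownership chain for a TrainJob-based Trial.
	store := map[string]object{
		"job/example-node-0": {Kind: "Job", Owner: "jobset/example"},
		"jobset/example":     {Kind: "JobSet", Owner: "trainjob/example"},
		"trainjob/example":   {Kind: "TrainJob", Owner: "trial/example"},
		"trial/example":      {Kind: "Trial"},
	}
	canGet := map[string]bool{"Job": true, "TrainJob": true, "Trial": true}

	_, err := belongsToTrial(store, canGet, "job/example-node-0")
	fmt.Println("without JobSet access:", err)

	canGet["JobSet"] = true
	ok, err := belongsToTrial(store, canGet, "job/example-node-0")
	fmt.Println("with JobSet access:", ok, err)
}
```

In this model the walk stops at the first kind the controller cannot read, so the pod is never recognized as part of the Trial until that kind is added to the allowed set.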

Now Katib can optimize HPs on TrainJobs 🎉
cc @kubeflow/kubeflow-trainer-team @kramaranya @szaher @astefanutti

$ k get trial -n $NS
NAME                                 TYPE        STATUS   AGE
torch-distributed-example-2pnvtn72   Running     True     22s
torch-distributed-example-4nqzdhvw   Running     True     21s
torch-distributed-example-9v79g85p   Succeeded   True     5m2s
torch-distributed-example-ftw6rv7f   Running     True     28s
torch-distributed-example-jq9jz7mt   Succeeded   True     5m2s
torch-distributed-example-phsqstg5   Succeeded   True     5m2s

$ k get trainjob -n $NS
NAME                                 STATE   AGE
torch-distributed-example-2pnvtn72           25s
torch-distributed-example-4nqzdhvw           24s
torch-distributed-example-ftw6rv7f           31s

andreyvelich

Thanks for this great contribution @ram4444!
/lgtm
/approve

google-oss-prow · 2025-08-22T22:47:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ram4444 · 2025-08-22T23:30:50Z

@andreyvelich
Thanks for your work. If there is anything I can help with, please let me know.

andreyvelich · 2025-08-23T00:00:30Z

/lgtm

juliusvonkohout · 2025-08-23T08:34:28Z

Just ping me if you have a release to synchronize.

kramaranya · 2025-08-25T14:44:23Z

This looks great! This will be especially useful when we migrate Katib to Kubeflow SDK. Thank you for working on this!

google-oss-prow bot added the size/M label Jul 26, 2025

google-oss-prow bot requested review from Electronic-Waste and anencore94 July 26, 2025 03:44

ram4444 mentioned this pull request Jul 26, 2025

Please add user custom return value for TrainerClient().train run kubeflow/trainer#2749

Closed

andreyvelich reviewed Jul 26, 2025

google-oss-prow bot added size/L size/M and removed size/M size/L labels Jul 27, 2025

ram4444 mentioned this pull request Jul 28, 2025

TrainJob support to Katib Trial Templates kubeflow/manifests#3199

Merged

3 tasks

ram4444 force-pushed the master branch 3 times, most recently from 2062958 to cb319a2 Compare July 29, 2025 17:03

google-oss-prow bot added the ok-to-test label Jul 30, 2025

andreyvelich reviewed Aug 8, 2025

Grant JobSet permission to Katib controller

94a84af

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot added size/L and removed size/M labels Aug 22, 2025

Remove create/delete RBAC for TrainJob

a6ac9ca

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot added size/M and removed size/L labels Aug 22, 2025

andreyvelich reviewed Aug 22, 2025

google-oss-prow bot assigned andreyvelich Aug 22, 2025

google-oss-prow bot added the lgtm label Aug 22, 2025

google-oss-prow bot added approved size/L and removed lgtm size/M labels Aug 22, 2025

Fix docker build with libpcre2

ee46f13

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich force-pushed the master branch from 4ddb443 to ee46f13 Compare August 22, 2025 23:10

google-oss-prow bot added the lgtm label Aug 23, 2025

google-oss-prow bot merged commit c9528e7 into kubeflow:master Aug 23, 2025
80 of 81 checks passed
